---
title: Working with Data
teaching: 20
exercises: 10
questions:
- "How should I work with numeric data in Python?"
- "What's the recommended way to handle and analyse tabular data?"
- "How can I import tabular data for analysis in Python and export the results?"
objectives:
- "handle and summarise numeric data with Numpy."
- "filter values in their data based on a range of conditions."
- "load tabular data into a Pandas dataframe object."
- "describe what is meant by the data type of an array/series, and the impact this has on how the data is handled."
- "add and remove columns from a dataframe."
- "select, aggregate, and visualise data in a dataframe."
keypoints:
- "Specialised third-party libraries such as Numpy and Pandas provide powerful objects and functions that can help us analyse our data."
- "Pandas dataframe objects allow us to efficiently load and handle large tabular data."
- "Use the `pandas.read_csv` function and the `DataFrame.to_csv` method to read and write tabular data."
---

## plan

- Toby currently scheduled to lead this session
- Numpy
  - arrays
  - masking
  - aside about data types and potential hazards
  - reading data from a file (with note that more will come later on this topic)
  - link to existing image analysis material
- Pandas
  - when an array just isn't enough
  - DataFrames
  - re-use material from [Software Carpentry][swc-python-gapminder]?
    - ideally with a more relevant example dataset... [maybe a COVID one](https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data/resource/260bbbde-2316-40eb-aec3-7cd7bfc2f590)
  - include an aside about I/O - reading/writing files (pandas (the `.to_*()` methods and highlight some: `csv`, `json`, `feather`, `hdf`), numpy, `open()`, (?) bytes vs strings, (?) encoding)
- Finish with example of `df.plot()` to set the scene for plotting section

> ## Working with Filtered Data
> * On what date were the most cases reported in Germany so far?
> * What was the mean number of cases reported per day in Germany in April 2020?
> * Is this higher or lower than the mean for March 2020?
> * On how many days in March was the number of cases in Germany higher than the mean for April?
>
> > ## Solution
> > ~~~
> > # `c` is assumed to be the dataframe of case data loaded earlier in the episode
> > mask_germany = c['countryterritoryCode'] == 'DEU'
> > # idxmax() returns an index label, so look the row up with .loc rather than .iloc
> > id_max = c[mask_germany]['cases'].idxmax()
> >
> > print(c.loc[id_max]['dateRep'])
> >
> > mask_april = (c['year'] == 2020) & (c['month'] == 4)
> > mean_april = c[mask_germany & mask_april]['cases'].mean()
> >
> > mask_march = (c['year'] == 2020) & (c['month'] == 3)
> > mean_march = c[mask_germany & mask_march]['cases'].mean()
> >
> > print(mean_april)
> > print(mean_march)
> >
> > mask_higher_mean_april = (c['cases'] > mean_april)
> > selection = c[mask_germany & mask_march & mask_higher_mean_april]
> > nbr_days = len(selection)  # Assumes clean data: one row per day
> >
> > print(nbr_days)
> > ~~~
> > {: .language-python }
> {: .solution }
{: .challenge }

{% include links.md %}
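
The reading, masking and writing steps that this episode covers can be sketched end to end in a few lines. This is only a minimal illustration: the four-row dataset below is invented for the example (it is not real case data), and it is read from and written to in-memory text buffers rather than files so that the snippet is self-contained. The column names mirror those used in the challenge.

```python
import io

import pandas as pd

# Hypothetical miniature dataset, invented purely for illustration.
csv_text = """dateRep,year,month,cases,countryterritoryCode
01/03/2020,2020,3,10,DEU
02/03/2020,2020,3,30,DEU
01/04/2020,2020,4,20,DEU
01/04/2020,2020,4,99,FRA
"""

# pandas.read_csv accepts a path or any file-like object.
c = pd.read_csv(io.StringIO(csv_text))

# A boolean mask selects only the rows for Germany.
mask_germany = c['countryterritoryCode'] == 'DEU'
germany = c[mask_germany]

# DataFrame.to_csv writes the selection; index=False omits the row labels.
buffer = io.StringIO()
germany.to_csv(buffer, index=False)

# Reading the written text back recovers the filtered table.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip)
```

With real data, the `io.StringIO` objects would be replaced by file paths, for example `pd.read_csv('cases.csv')` and `germany.to_csv('germany.csv', index=False)`, where both filenames are placeholders.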