---
title: Working with Data
teaching: 20
exercises: 10
questions:
- "How should I work with numeric data in Python?"
- "What's the recommended way to handle and analyse tabular data?"
- "How can I import tabular data for analysis in Python and export the results?"
objectives:
- "handle and summarise numeric data with NumPy."
- "filter values in their data based on a range of conditions."
- "load tabular data into a Pandas dataframe object."
- "describe what is meant by the data type of an array/series, and the impact this has on how the data is handled."
- "add and remove columns from a dataframe."
- "select, aggregate, and visualise data in a dataframe."
keypoints:
- "Specialised third-party libraries such as NumPy and Pandas provide powerful objects and functions that can help us analyse our data."
- "Pandas dataframe objects allow us to efficiently load and handle large tabular data."
- "Use the `pandas.read_csv` function and the `DataFrame.to_csv` method to read and write tabular data."
---
## plan
- Toby currently scheduled to lead this session
- Numpy
  - arrays
  - masking
  - aside about data types and potential hazards
  - reading data from a file (with note that more will come later on this topic)
  - link to existing image analysis material
- Pandas
  - when an array just isn't enough
  - DataFrames - re-use material from [Software Carpentry][swc-python-gapminder]?
    - ideally with a more relevant example dataset... [maybe a COVID one](https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data/resource/260bbbde-2316-40eb-aec3-7cd7bfc2f590)
  - include an aside about I/O - reading/writing files (pandas (the `.to_*()` methods and highlight some: `csv`, `json`, `feather`, `hdf`), numpy, `open()`, (?) bytes vs strings, (?) encoding)
  - Finish with example of `df.plot()` to set the scene for plotting section
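
The masking and data type items in the plan could be illustrated with something like the following. This is only a sketch: the pixel values are made up for illustration and are not taken from the lesson's data.

```python
import numpy as np

# Hypothetical 8-bit pixel values (not taken from the lesson's image data).
pixels = np.array([10, 120, 200, 250], dtype=np.uint8)

# Masking: a boolean array selects the elements where the condition is True.
bright = pixels > 100
bright_pixels = pixels[bright]

# Hazard: uint8 can only hold 0-255, so adding 10 to 250 wraps around silently.
wrapped = pixels + np.uint8(10)

# Converting to a wider integer type first gives the result we would expect.
widened = pixels.astype(np.int64) + 10

print(bright_pixels)
print(wrapped)
print(widened)
```

The last value of `wrapped` is smaller than the value it started from, which is the kind of silent surprise the aside about data types is meant to warn about.
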
> The `nuclei` image contains a binary segmentation, i.e.:
> * 1 = nuclei
> * 0 = not nuclei
>
> 1. Find the median value of the `raw` image within the nuclei.
> 2. Create a new version of `raw` where all values outside the nuclei are 0.
> > ## Solution
> > ~~~
> > # 1
> > pixels_in_nuclei = raw[nuclei == 1]
> > print(np.median(pixels_in_nuclei))
> >
> > # 2
> > new_image = raw.copy()
> > new_image[nuclei == 0] = 0
> > plt.imshow(new_image, cmap='gray')
> > ~~~
> > {: .language-python }
> {: .solution }
{: .challenge }
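
The masking approach from this challenge can be tried without the lesson's image files. The sketch below uses two small made-up arrays in place of `raw` and `nuclei`, and swaps in `np.where` for step 2 as an alternative to copying and assigning. It is one possible approach, not the lesson's reference solution.

```python
import numpy as np

# Small made-up arrays standing in for the lesson's `raw` and `nuclei` images.
raw = np.array([[5, 9, 2],
                [7, 1, 8],
                [4, 6, 3]])
nuclei = np.array([[1, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]])

# 1: the boolean mask picks out only the pixels inside the nuclei.
pixels_in_nuclei = raw[nuclei == 1]
print(np.median(pixels_in_nuclei))

# 2: np.where keeps raw where the mask is True and uses 0 elsewhere.
# It returns a new array, so raw itself is left unchanged.
new_image = np.where(nuclei == 1, raw, 0)
print(new_image)
```
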
> 1. On what date were the most cases reported in Germany so far?
> 2. What was the mean number of cases reported per day in Germany in April 2020?
> 3. Is this higher or lower than the mean for March 2020?
> 4. On how many days in March was the number of cases in Germany higher than the mean for April?
> > ## Solution
> > ~~~
> > # 1
> > mask_germany = covid_cases['countryterritoryCode'] == 'DEU'
> > id_max = covid_cases[mask_germany]['cases'].idxmax()
> > # idxmax returns an index label, so look the row up with .loc
> > print(covid_cases.loc[id_max, 'dateRep'])
> >
> > # 2
> > mask_april = (covid_cases['year'] == 2020) & (covid_cases['month'] == 4)
> > mean_april = covid_cases[mask_germany & mask_april]['cases'].mean()
> > print(mean_april)
> >
> > # 3
> > mask_march = (covid_cases['year'] == 2020) & (covid_cases['month'] == 3)
> > mean_march = covid_cases[mask_germany & mask_march]['cases'].mean()
> > comparison = "higher" if mean_april > mean_march else "lower"
> > print("Mean cases per day was {} in April than in March 2020.".format(comparison))
> >
> > # 4
> > mask_higher_mean_april = covid_cases['cases'] > mean_april
> > selection = covid_cases[mask_germany & mask_march & mask_higher_mean_april]
> > nbr_days = len(selection)  # assumes one row per day for Germany (clean data)
> > print(nbr_days)
> > ~~~
> > {: .language-python }
> {: .solution }
{: .challenge }
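
One detail worth pointing out to learners is that `idxmax` returns an index label rather than a row position, which matters once a dataframe has been filtered. The sketch below shows the difference on a tiny made-up dataframe. The numbers are invented for illustration, and only the column names follow the COVID dataset used in the challenge.

```python
import pandas as pd

# Made-up numbers for illustration only; column names follow the challenge.
covid_cases = pd.DataFrame({
    "dateRep": ["01/04/2020", "02/04/2020", "01/04/2020", "02/04/2020"],
    "countryterritoryCode": ["FRA", "FRA", "DEU", "DEU"],
    "cases": [100, 300, 250, 150],
})

# Filtering keeps the original index labels (2 and 3 for the German rows).
germany = covid_cases[covid_cases["countryterritoryCode"] == "DEU"]

# idxmax returns the index label of the largest value, not its position.
id_max = germany["cases"].idxmax()

# .loc is label-based, so it finds the right row in the filtered dataframe.
print(germany.loc[id_max, "dateRep"])

# .iloc is position-based: germany has only two rows, so position 2 fails.
try:
    germany.iloc[id_max]
except IndexError as error:
    print("iloc failed:", error)
```

With the default integer index of a freshly loaded dataframe, labels and positions coincide, so the distinction only shows up after filtering, sorting, or setting a different index.
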