Verified Commit 2790642b authored by Toby Hodges's avatar Toby Hodges
Browse files

Merge branch 'master' into pandas-ex-4

parents 7da5de59 b64c9d1f
......@@ -34,6 +34,74 @@ keypoints:
- ideally with a more relevant example dataset... [maybe a COVID one](https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data/resource/260bbbde-2316-40eb-aec3-7cd7bfc2f590)
- include an aside about I/O - reading/writing files (pandas (the `.to_*()` methods and highlight some: `csv`, `json`, `feather`, `hdf`), numpy, `open()`, (?) bytes vs strings, (?) encoding)
- Finish with example of `df.plot()` to set the scene for plotting section
## Reading data to a numpy array
We'll use the popular image analysis package scikit-image, to read two example images into numpy
arrays.
~~~
from skimage.io import imread
raw = imread('cilliated_cell.png')
nuclei = imread('cilliated_cell_nuclei.png')
# if you want to see what these images look like - we can use matplotlib (more to come later!)
import matplotlib.pyplot as plt
plt.imshow(raw, cmap='gray')
plt.imshow(nuclei)
~~~
{: .language-python }
> ## Exploring Image Arrays
> * What are the dimensions of these arrays?
> * What data type are these arrays?
> * What is the minimum and maximum value of these arrays?
>
> > ## Solution
> > ~~~
> > print(raw.shape)
> > print(raw.dtype)
> > print(np.max(raw))
> > print(np.min(raw))
> > ~~~
> > {: .language-python }
> {: .solution }
{: .challenge }
> ## Working with Filtered Data
>
> 1. On what date were the most cases reported in Germany so far?
> 2. What was the mean number of cases reported per day in Germany in April 2020?
> 3. Is this higher or lower than the mean for March 2020?
> 4. On how many days in March was the number of cases in Germany higher than the mean for April?
>
> > ## Solution
> > ~~~
> > # 1
> > mask_germany = covid_cases['countryterritoryCode'] == 'DEU'
> > id_max = covid_cases[mask_germany]['cases'].idxmax()
> > print(covid_cases.iloc[id_max]['dateRep'])
> >
> > # 2
> > mask_april = (covid_cases['year'] == 2020) & (covid_cases['month'] == 4)
> > mean_april = covid_cases[mask_germany & mask_april]['cases'].mean()
> > print(mean_april)
> >
> > # 3
> > mask_march = (covid_cases['year'] == 2020) & (covid_cases['month'] == 3)
> > mean_march = covid_cases[mask_germany & mask_march]['cases'].mean()
> > print(mean_march)
> > print("Mean cases per day was {} in April than in March 2020.".
> > format(["lower", "higher"][mean_april > mean_march]))
> >
> > # 4
> > mask_higher_mean_april = (covid_cases['cases'] > mean_april)
> > selection = covid_cases[mask_germany & mask_march & mask_higher_mean_april]
> > nbr_days = len(selection) # Assume clean data
> > print(nbr_days)
> > ~~~
> > {: .language-python }
> {: .solution }
{: .challenge }
- where data includes a column with unique values, this can be set as the `index`.
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment