---
title: Working with Data
teaching: 20
exercises: 10
questions:
- "How should I work with numeric data in Python?"
- "What's the recommended way to handle and analyse tabular data?"
- "How can I import tabular data for analysis in Python and export the results?"
objectives:
- "handle and summarise numeric data with Numpy."
- "filter values in their data based on a range of conditions."
- "load tabular data into a Pandas dataframe object."
- "describe what is meant by the data type of an array/series, and the impact this has on how the data is handled."
- "add and remove columns from a dataframe."
- "select, aggregate, and visualise data in a dataframe."
keypoints:
- "Specialised third-party libraries such as Numpy and Pandas provide powerful objects and functions that can help us analyse our data."
- "Pandas dataframe objects allow us to efficiently load and handle large tabular data."
- "Use the `pandas.read_csv` and `pandas.write_csv` functions to read and write tabular data."
---
## Plan
- Toby currently scheduled to lead this session
- Numpy
- arrays
- masking
- aside about data types and potential hazards
- reading data from a file (with note that more will come later on this topic)
- link to existing image analysis material
- Pandas
- when an array just isn't enough
- DataFrames - re-use material from [Software Carpentry][swc-python-gapminder]?
- ideally with a more relevant example dataset... [maybe a COVID one](https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data/resource/260bbbde-2316-40eb-aec3-7cd7bfc2f590)
- include an aside about I/O - reading/writing files (pandas (the `.to_*()` methods and highlight some: `csv`, `json`, `feather`, `hdf`), numpy, `open()`, (?) bytes vs strings, (?) encoding)
- Finish with example of `df.plot()` to set the scene for plotting section
## Reading data to a numpy array
We'll use the popular image analysis package scikit-image
to read two example images into numpy arrays.
~~~
from skimage.io import imread
raw = imread('cilliated_cell.png')
nuclei = imread('cilliated_cell_nuclei.png')
# to see what these images look like, we can use matplotlib (more to come later!)
import matplotlib.pyplot as plt
plt.imshow(raw, cmap='gray')
plt.show()
plt.imshow(nuclei)
plt.show()
~~~
{: .language-python }
> ## Exploring Image Arrays
> * What are the dimensions of these arrays?
> * What data type are these arrays?
> * What are the minimum and maximum values of these arrays?
> > ## Solution
> > ~~~
> > import numpy as np
> >
> > print(raw.shape)
> > print(raw.dtype)
> > print(np.min(raw))
> > print(np.max(raw))
> > ~~~
> > {: .language-python }
> {: .solution }
{: .challenge }
> ## Using the Nuclei Mask
> The nuclei image contains a binary segmentation, i.e.:
> * 1 = nuclei
> * 0 = not nuclei
>
> 1. Find the median value of the raw image within the nuclei
> 2. Create a new version of raw where all values outside the nuclei are 0
> > ## Solution
> > ~~~
> > # 1
> > pixels_in_nuclei = raw[nuclei == 1]
> > print(np.median(pixels_in_nuclei))
> >
> > # 2
> > new_image = raw.copy()
> > new_image[nuclei == 0] = 0
> > plt.imshow(new_image, cmap='gray')
> > ~~~
> > {: .language-python }
> {: .solution }
{: .challenge }
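The plan above calls for an aside on data types and potential hazards. A minimal sketch of one pitfall worth highlighting, assuming 8-bit image data like the arrays above: unsigned integer arithmetic silently wraps around, so convert to a wider type (or to float) before doing arithmetic.
~~~
import numpy as np

pixels = np.array([250, 251, 252], dtype=np.uint8)
print(pixels + 10)                   # wraps around to [4 5 6]: uint8 can only hold 0-255
print(pixels.astype(np.int32) + 10)  # convert to a wider type first to get [260 261 262]
~~~
{: .language-python }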
## Reading data into a pandas dataframe
- load data
~~~
import pandas as pd

covid_cases = pd.read_csv("data/CovidCaseData_20200624.csv")
~~~
{: .language-python }
- display dataframe
- `head` & `tail`
- `shape`
~~~
print(f'covid_cases is a {type(covid_cases)} object with {covid_cases.shape[0]} rows and {covid_cases.shape[1]} columns')
print('covid_cases has the following columns:\n' + '\n'.join(covid_cases.columns))
~~~
{: .language-python }
- much easier to use `covid_cases.info()`
- get statistics on quantitative columns with `.describe()`
- sometimes this information isn't helpful (e.g. mean of year column)
- but good to see we don't have any missing data (count is identical for all columns)
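- for reference, a quick sketch of those inspection calls (output omitted):
~~~
covid_cases.head()      # first five rows
covid_cases.tail()      # last five rows
covid_cases.info()      # column names, dtypes, non-null counts, memory usage
covid_cases.describe()  # summary statistics for the numeric columns
~~~
{: .language-python }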
### Selecting Data in a Dataframe
- use `iloc` (position-based) & `loc` (label-based) to select specific cell(s) from a dataframe
~~~
print(covid_cases.iloc[24242,4])
print(covid_cases.iloc[24242,])
print(covid_cases.iloc[24242,:])
~~~
{: .language-python }
- you can use slices with `iloc` & `loc`
~~~
print(covid_cases.iloc[100:120,4:6])
print(covid_cases.iloc[100:120,:])
# select a whole row
print(covid_cases.loc[0,:])
# or
print(covid_cases.loc[0,])
# select a whole column
print(covid_cases.loc[:,'continentExp'])
# or
print(covid_cases['continentExp'])
# or(!)
print(covid_cases.continentExp)
~~~
{: .language-python }
- columns are returned as a pandas Series object
~~~
type(covid_cases['continentExp'])
~~~
{: .language-python }
- get the unique values of a series (as a numpy array):
~~~
pd.unique(covid_cases['continentExp'])
~~~
{: .language-python }
### Filtering Data in a Dataframe
- you've seen indexing by position and label, but with large datasets that are hard to explore by eye, it's much more useful to select/filter data based on values
~~~
covid_cases['continentExp'] == 'Europe'
~~~
{: .language-python }
- INSERT EXERCISE from !11 here
- results of filtering can be used in further operations
~~~
covid_cases[covid_cases['continentExp'] == 'Europe'].max()
~~~
{: .language-python }
> ## Exploring the Case Data
> 1. On what date were the most cases reported in Germany so far?
> 2. What was the mean number of cases reported per day in Germany in April 2020?
> 3. Is this higher or lower than the mean for March 2020?
> 4. On how many days in March was the number of cases in Germany higher than the mean for April?
> > ## Solution
> > ~~~
> > # 1
> > mask_germany = covid_cases['countryterritoryCode'] == 'DEU'
> > id_max = covid_cases[mask_germany]['cases'].idxmax()
> > print(covid_cases.loc[id_max, 'dateRep'])
> >
> > # 2 & 3
> > mask_april = (covid_cases['year'] == 2020) & (covid_cases['month'] == 4)
> > mean_april = covid_cases[mask_germany & mask_april]['cases'].mean()
> > mask_march = (covid_cases['year'] == 2020) & (covid_cases['month'] == 3)
> > mean_march = covid_cases[mask_germany & mask_march]['cases'].mean()
> > print("Mean cases per day was {} in April than in March 2020.".format(
> >     ["lower", "higher"][mean_april > mean_march]))
> >
> > # 4
> > mask_higher_mean_april = covid_cases['cases'] > mean_april
> > selection = covid_cases[mask_germany & mask_march & mask_higher_mean_april]
> > nbr_days = len(selection)  # assume clean data: one row per country per day
> > print(nbr_days)
> > ~~~
> > {: .language-python }
> {: .solution }
{: .challenge }
### Combining Dataframes
- where you have unique values in a column, can set this as the 'index'
~~~
asia_lockdowns = pd.read_csv('data/AsiaLockdowns.csv', index_col=0)
africa_lockdowns = pd.read_csv('data/AfricaLockdowns.csv', index_col=0)
~~~
{: .language-python }
- rows from two dataframes with the same columns can be easily stacked into a single df with `pd.concat`
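- a minimal sketch of that step (the exercise below will cover it in more detail; the `covid_lockdowns` name is an assumption here, chosen to match the later examples):
~~~
# stack the two lockdown tables into a single dataframe
covid_lockdowns = pd.concat([asia_lockdowns, africa_lockdowns])
~~~
{: .language-python }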
- INSERT EXERCISE from !21 here
- INSERT EXERCISE from !17 here
- there are blanks in the lockdown end date column - probably because these lockdowns haven't ended yet(?).
- let's try to set the end date to the latest date included in the case data
~~~
latest_date = covid_cases['dateRep'].max()
print(latest_date)
~~~
{: .language-python }
- whoa! something is wrong here!
- check the type of the data in the `"dateRep"` column
~~~
print(covid_cases['dateRep'].dtype)
~~~
{: .language-python }
- these values are being treated (and therefore sorted) as strings!
#### Working with Datetime Columns
- luckily, pandas provides an easy way to convert these to a datetime type, which can be properly handled
~~~
pd.to_datetime(covid_cases['dateRep'], dayfirst=True)
# dayfirst=True is needed because, by default, pandas parses dates as mm/dd/yyyy :(
covid_cases['dateRep'] = pd.to_datetime(covid_cases['dateRep'], dayfirst=True)
print(covid_cases['dateRep'].max())
~~~
{: .language-python }
- much better!
- now we can fill in those blank lockdown end dates
~~~
covid_lockdowns['End date'] = covid_lockdowns['End date'].fillna(covid_cases['dateRep'].max())
~~~
{: .language-python }
- INSERT EXERCISE from !22 here
- we saw earlier how to add rows to a df with `concat`
- you can also add additional columns to a dataframe
- treat it like a dictionary
~~~
covid_cases['casesPerMillion'] = covid_cases['cases'] / (covid_cases['popData2019']/1e6)
covid_cases.head()
~~~
{: .language-python }
- and you can add data from another dataframe
- but need to make the key column names match up first!
~~~
covid_lockdowns.index.name = 'countriesAndTerritories'
covid_cases.merge(covid_lockdowns, on="countriesAndTerritories")
~~~
{: .language-python }
- For more on combining series and dataframes, we recommend these two chapters of Jake Vanderplas' _Python Data Science Handbook_:
- [Combining Datasets: Concat and Append](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html)
- [Combining Datasets: Merge and Join](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html)
~~~
# store the merged (inner-joined) result for the following sections
combined = covid_cases.merge(covid_lockdowns, on="countriesAndTerritories")
~~~
{: .language-python }
- INSERT EXERCISE from !24 here
### Groupby & Split-Apply-Combine
- we'll keep working with the inner-joined `combined` df
- as a demonstration of the power of pandas, let's calculate the average time between the beginning of a country's lockdown, and the peak in new cases in that country
- do this with a rolling average of cases, to make the method less sensitive to noise in data
- the dataframe is currently sorted in reverse date order
- before we can calculate a rolling mean, we'll need to re-sort the data from longest-ago to most-recent:
~~~
combined = combined.sort_values(by=['countriesAndTerritories','dateRep'])
~~~
{: .language-python }
- now we can calculate our rolling mean of cases for each country
- this uses the split-apply-combine paradigm:
- first grouping by country,
- then calculating the rolling mean
- we get back a single Series with a MultiIndex of country and the original row label
~~~
rolling_mean_cases = combined.groupby('countriesAndTerritories')['cases'].rolling(7).mean()
~~~
{: .language-python }
- unfortunately, the index of our rolling mean series is incompatible with our main df, so we can't add it as a new column
~~~
combined['rolling mean'] = rolling_mean_cases  # raises an error: incompatible index
~~~
{: .language-python }
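- to see why, we can compare the two indexes (a quick check; output omitted):
~~~
print(rolling_mean_cases.index[:3])  # MultiIndex of (country, original row label)
print(combined.index[:3])            # plain integer index
~~~
{: .language-python }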
- we can use the `reset_index` method to drop the `"countriesAndTerritories"` level of the index
~~~
rolling_mean_cases = rolling_mean_cases.reset_index(level=0, drop=True)
~~~
{: .language-python }
- and now we can add the rolling means as a new column
~~~
combined['rolling mean'] = rolling_mean_cases
# plot rolling average for Germany
combined[combined['countriesAndTerritories']=='Germany'].set_index('dateRep')['rolling mean'].plot(kind='line')
# plot cumulative sum of cases for Germany
combined[combined['countriesAndTerritories']=='Germany'].set_index('dateRep')['cases'].cumsum().plot(kind='line')
~~~
{: .language-python }
- now we want to calculate the time difference between the start of lockdown and the peak of rolling mean cases in each country
- INSERT EXERCISE from !23 here
- set both series' indexes to the country names so the dates can be compared:
~~~
peak_dates.index = combined.loc[peak_rows]['countriesAndTerritories']
start_dates.index = combined.loc[peak_rows]['countriesAndTerritories']
print((peak_dates - start_dates).median())
~~~
{: .language-python }
- as a final flourish, let's create a boxplot showing the distribution of these time differences:
- because `plot(kind="box")` doesn't know how to work with timedelta values, we need to extract the number of days as an integer.
- we can do this with the `days` attribute of each timedelta, but we have to wrap it in a function so we can use `.apply`...
~~~
def get_days(t):
    return t.days

(peak_dates - start_dates).apply(get_days).plot(kind='box')
~~~
{: .language-python }
