-
Julian Bauer authoredJulian Bauer authored
title: Working with Data
teaching: 20
exercises: 10
questions:
- "How should I work with numeric data in Python?"
- "What's the recommended way to handle and analyse tabular data?"
- "How can I import tabular data for analysis in Python and export the results?"
objectives:
- "handle and summarise numeric data with Numpy."
- "filter values in their data based on a range of conditions."
- "load tabular data into a Pandas dataframe object."
- "describe what is meant by the data type of an array/series, and the impact this has on how the data is handled."
- "add and remove columns from a dataframe."
- "select, aggregate, and visualise data in a dataframe."
keypoints:
- "Specialised third-party libraries such as Numpy and Pandas provide powerful objects and functions that can help us analyse our data."
- "Pandas dataframe objects allow us to efficiently load and handle large tabular data."
- "Use the `pandas.read_csv` and `pandas.write_csv` functions to read and write tabular data."
plan
- Toby currently scheduled to lead this session
- Numpy
- arrays
- masking
- aside about data types and potential hazards
- reading data from a file (with note that more will come later on this topic)
- link to existing image analysis material
- Pandas
- when an array just isn't enough
- DataFrames - re-use material from [Software Carpentry][swc-python-gapminder]?
- ideally with a more relevant example dataset... maybe a COVID one
- include an aside about I/O - reading/writing files (pandas (the
.to_*()
methods and highlight some:csv
,json
,feather
,hdf
), numpy,open()
, (?) bytes vs strings, (?) encoding)
- Finish with example of
df.plot()
to set the scene for plotting section
Numpy
Reading data to a numpy array
We'll use the popular image analysis package scikit-image, to read two example images into numpy arrays.
from skimage.io import imread
raw = imread('cilliated_cell.png')
nuclei = imread('cilliated_cell_nuclei.png')
# if you want to see what these images look like - we can use matplotlib (more to come later!)
import matplotlib.pyplot as plt
plt.imshow(raw, cmap='gray')
plt.imshow(nuclei)
{: .language-python }
Exploring Image Arrays
- What are the dimensions of these arrays?
- What data type are these arrays?
- What is the minimum and maximum value of these arrays?
Solution
print(raw.shape) print(raw.dtype) print(np.max(raw)) print(np.min(raw))
{: .language-python } {: .solution } {: .challenge }
Masking arrays
The nuclei image contains a binary segmentation i.e.:
- 1 = nuclei
- 0 = not nuclei
- Find the median value of the raw image within the nuclei
- Create a new version of raw where all values outside the nuclei are 0
Solution
# 1 pixels_in_nuclei = raw[nuclei == 1] print(np.median(pixels_in_nuclei)) # 2 new_image = raw.copy() new_image[nuclei == 0] = 0 plt.imshow(new_image, cmap='gray')
{: .language-python } {: .solution } {: .challenge }
Pandas
- load data
import pandas as pd
covid_cases = pd.read_csv("data/CovidCaseData_20200624.csv")
{: .language-python }
-
display dataframe
-
head
&tail
covid_cases.head() covid_cases.tail()
{: .language-python }
shape
print(f'covid_cases is a {type(covid_cases)} object with {covid_cases.shape[0]} rows and {covid_cases.shape[1]} columns') print('covid_cases has the following columns:\n' + '\n'.join(covid_cases.columns))
{: .language-python }
-
-
much easier to use
covid_cases.info()
-
get statistics on quantitative columns with
.describe()
- sometimes this information isn't helpful (e.g. mean of year column)
- but good to see we don't have any missing data (count is identical for all columns)
Selecting Data in a Dataframe
- use
iloc
&loc
to select specific cell(s) from a dataframe
print(covid_cases.iloc[24242,4])
print(covid_cases.iloc[24242,])
print(covid_cases.iloc[24242,:])
{: .language-python }
- you can use slices with
iloc
&loc
print(covid_cases.iloc[100:120,4:6])
print(covid_cases.iloc[100:120,:])
# select a whole row
print(covid_cases.loc[0,:])
# or
print(covid_cases.loc[0,])
# select a whole column
print(covid_cases.loc[:,'continentExp'])
# or
print(covid_cases['continentExp'])
# or(!)
print(covid_cases.continentExp)
{: .language-python }
- columns are returned as a pandas Series object
type(covid_cases['continentExp'])
{: .language-python }
- get the unique values of a series (as a numpy array):
pd.unique(covid_cases['continentExp'])
{: .language-python }
Filtering Data in a Dataframe
- you've seen indexing, but with large datasets you can't easily explore, it's much more useful to select/filter data based on value
covid_cases['continentExp'] == 'Europe'
{: .language-python }
-
INSERT EXERCISE from !11 here
-
results of filtering can be used in further operations
covid_cases[covid_cases['continentExp'] == 'Europe'].max()
{: .language-python }
Working with Filtered Data
- On what date were the most cases reported in Germany so far?
- What was the mean number of cases reported per day in Germany in April 2020?
- Is this higher or lower than the mean for March 2020?
- On how many days in March was the number of cases in Germany higher than the mean for April?
Solution
# 1 mask_germany = covid_cases['countryterritoryCode'] == 'DEU' id_max = covid_cases[mask_germany]['cases'].idxmax() print(covid_cases.iloc[id_max]['dateRep']) # 2 mask_april = (covid_cases['year'] == 2020) & (covid_cases['month'] == 4) mean_april = covid_cases[mask_germany & mask_april]['cases'].mean() print(mean_april) # 3 mask_march = (covid_cases['year'] == 2020) & (covid_cases['month'] == 3) mean_march = covid_cases[mask_germany & mask_march]['cases'].mean() print(mean_march) print("Mean cases per day was {} in April than in March 2020.". format(["lower", "higher"][mean_april > mean_march])) # 4 mask_higher_mean_april = (covid_cases['cases'] > mean_april) selection = covid_cases[mask_germany & mask_march & mask_higher_mean_april] nbr_days = len(selection) # Assume clean data print(nbr_days)
{: .language-python } {: .solution } {: .challenge }
Combining Dataframes
- where you have unique values in a column, you can set this as the 'index'
asia_lockdowns = pd.read_csv('data/AsiaLockdowns.csv', index_col=0)
africa_lockdowns = pd.read_csv('data/AfricaLockdowns.csv', index_col=0)
{: .language-python }
-
rows from two dataframes with the same columns can be easily stacked into a single df with
pd.concat
-
INSERT EXERCISE from !21 here
-
INSERT EXERCISE from !17 here
-
there are blanks in the lockdown end date column - probably because these lockdowns haven't ended yet(?).
-
let's try to set the end date to the latest date included in the case data
latest_date = covid_cases['dateRep'].max()
print(latest_date)
{: .language-python }
- whoa! something is wrong here!
- check the type of the data in the
"dateRep"
column
print(covid_cases['dateRep'].dtype)
{: .language-python }
- these values are being treated (and therefore sorted) as strings!
Working with Datetime Columns
- luckily, pandas provides an easy way to convert these to a datetime type, which can be properly handled
pd.to_datetime(covid_cases['dateRep'], dayfirst=True)
# dayfirst=True is necessary because by default pandas reads mm/dd/yyyy dates :(
covid_cases['dateRep'] = pd.to_datetime(covid_cases['dateRep'], dayfirst=True)
print(covid_cases['dateRep'].max())
{: .language-python }
- much better!
- now we can fill in those blank lockdown end dates
covid_lockdowns['End date'] = covid_lockdowns['End date'].
fillna(covid_cases['dateRep'].max())
{: .language-python }
- INSERT EXERCISE from !22 here
- we saw earlier how to add rows to a df with
concat
- you can also add additional columns to a dataframe
- treat it like a dictionary
covid_cases['casesPerMillion'] = covid_cases['cases'] / (covid_cases['popData2019']/1e6)
covid_cases.head()
{: .language-python }
- and you can add data from another dataframe
- but need to make the key column names match up first!
covid_lockdowns.index.name='countriesAndTerritories'
covid_cases.merge(covid_lockdowns, on="countriesAndTerritories")
{: .language-python }
- For more on combining series and dataframes, we recommend these two chapters of Jake Vanderplas' Python Data Science Handbook:
combined = covid_cases.merge(covid_lockdowns, on="countriesAndTerritories")
{: .language-python }
- INSERT EXERCISE from !24 here
Groupby & Split-Apply-Combine
- we'll keep working with the inner-joined
combined
df - as a demonstration of the power of pandas, let's calculate the average time between the beginning of a country's lockdown, and the peak in new cases in that country
- do this with a rolling average of cases, to make the method less sensitive to noise in data
- the dataframe is currently sorted in reverse date order
- before we can calculate a rolling mean, we'll need to re-sort the data from longest-ago to most-recent:
combined = combined.sort_values(by=['countriesAndTerritories','dateRep'])
{: .language-python }
- now we can calculate our rolling mean of cases for each country
- this uses the split-apply-combine paradigm:
- first grouping by country,
- then calculating the rolling mean
- we get back a single Series, indexed by country
rolling_mean_cases = combined.groupby('countriesAndTerritories')['cases'].rolling(7).mean()
{: .language-python }
- unfortunately, the index of our rolling mean series is incompatible with our main df, so we can't add it as a new column
combined['rolling mean'] = rolling_mean_cases
{: .language-python }
- we can use the
reset_index
method to remove the"countriesAndTerritories"
index
rolling_mean_cases = rolling_mean_cases.reset_index(0, drop=True)
{: .language-python }
- and now we can add the rolling means as a new column
combined['rolling mean'] = rolling_mean_cases
# plot rolling average for Germany
combined[combined['countriesAndTerritories']=='Germany'].set_index('dateRep')['rolling mean'].plot(kind='line')
# plot cumulative sum of cases for Germany
combined[combined['countriesAndTerritories']=='Germany'].set_index('dateRep')['cases'].cumsum().plot(kind='line')
{: .language-python }
- now we want to calculate the time difference between the start of lockdown and the peak of rolling mean cases in each country
- INSERT EXERCISE from !23 here
- set the index so we can compare the dates:
peak_dates.index = combined.loc[peak_rows]['countriesAndTerritories']
start_dates.index = combined.loc[peak_rows]['countriesAndTerritories']
print((peak_dates - start_dates).median())
{: .language-python }
- as a final flourish, let's create a boxplot showing the distribution of these time differences:
- because
plot(kind="boxplot")
doesn't know how to work with datetime values, we need to extract the number of days as an integer. - can do this with the
days()
method of the datetime object, but have to write a function to get it so we can use.apply
...
def get_days(t):
return t.days
(peak_dates - start_dates).apply(get_days).plot(kind='box')
{: .language-python }
{% include links.md %}