---
title: Working with Data
teaching: 20
exercises: 10
questions:
- "How should I work with numeric data in Python?"
- "What's the recommended way to handle and analyse tabular data?"
- "How can I import tabular data for analysis in Python and export the results?"
objectives:
- "handle and summarise numeric data with Numpy."
- "filter values in their data based on a range of conditions."
- "load tabular data into a Pandas dataframe object."
- "describe what is meant by the data type of an array/series, and the impact this has on how the data is handled."
- "add and remove columns from a dataframe."
- "select, aggregate, and visualise data in a dataframe."
keypoints:
- "Specialised third-party libraries such as Numpy and Pandas provide powerful objects and functions that can help us analyse our data."
- "Pandas dataframe objects allow us to efficiently load and handle large tabular data."
- "Use the `pandas.read_csv` and `pandas.write_csv` functions to read and write tabular data."
---
## Plan
- Toby currently scheduled to lead this session
- Numpy
- arrays
- masking
- aside about data types and potential hazards
- reading data from a file (with note that more will come later on this topic)
- link to existing image analysis material
- Pandas
- when an array just isn't enough
- DataFrames - re-use material from [Software Carpentry][swc-python-gapminder]?
- ideally with a more relevant example dataset... [maybe a COVID one](https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data/resource/260bbbde-2316-40eb-aec3-7cd7bfc2f590)
- include an aside about I/O - reading/writing files (pandas (the `.to_*()` methods and highlight some: `csv`, `json`, `feather`, `hdf`), numpy, `open()`, (?) bytes vs strings, (?) encoding)
- Finish with example of `df.plot()` to set the scene for plotting section
## Reading data to a numpy array
We'll use the popular image analysis package scikit-image
to read two example images into numpy arrays.
~~~
from skimage.io import imread
raw = imread('cilliated_cell.png')
nuclei = imread('cilliated_cell_nuclei.png')
# to see what these images look like, we can use matplotlib (more to come later!)
import matplotlib.pyplot as plt
plt.imshow(raw, cmap='gray')
plt.show()
plt.imshow(nuclei)
plt.show()
~~~
{: .language-python }
> ## Exploring Image Arrays
> * What are the dimensions of these arrays?
> * What data type are these arrays?
> * What are the minimum and maximum values of these arrays?
> > ## Solution
> > ~~~
> > import numpy as np
> >
> > print(raw.shape)
> > print(raw.dtype)
> > print(np.min(raw))
> > print(np.max(raw))
> > ~~~
> > {: .language-python }
> {: .solution }
{: .challenge }
> ## Using the Nuclei Mask
> The nuclei image contains a binary segmentation, i.e.:
> * 1 = nuclei
> * 0 = not nuclei
>
> 1. Find the median value of the raw image within the nuclei
> 2. Create a new version of raw where all values outside the nuclei are 0
> > ## Solution
> > ~~~
> > # 1
> > pixels_in_nuclei = raw[nuclei == 1]
> > print(np.median(pixels_in_nuclei))
> >
> > # 2
> > new_image = raw.copy()
> > new_image[nuclei == 0] = 0
> > plt.imshow(new_image, cmap='gray')
> > ~~~
> > {: .language-python }
> {: .solution }
{: .challenge }
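The plan above calls for an aside on data types and potential hazards. A minimal sketch of one pitfall worth highlighting, assuming 8-bit image data like the arrays above: unsigned integer arithmetic silently wraps around, so convert to a wider type (or to float) before doing arithmetic.
~~~
import numpy as np

pixels = np.array([250, 251, 252], dtype=np.uint8)
print(pixels + 10)                   # wraps around to [4 5 6]: uint8 can only hold 0-255
print(pixels.astype(np.int32) + 10)  # convert to a wider type first to get [260 261 262]
~~~
{: .language-python }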
## Reading data into a pandas dataframe
- load data
~~~
import pandas as pd

covid_cases = pd.read_csv("data/CovidCaseData_20200624.csv")
~~~
{: .language-python }
- display dataframe
- `head` & `tail`
- `shape`
~~~
print(f'covid_cases is a {type(covid_cases)} object with {covid_cases.shape[0]} rows and {covid_cases.shape[1]} columns')
print('covid_cases has the following columns:\n' + '\n'.join(covid_cases.columns))
~~~
{: .language-python }
- much easier to use `covid_cases.info()`
- get statistics on quantitative columns with `.describe()`
- sometimes this information isn't helpful (e.g. mean of year column)
- but good to see we don't have any missing data (count is identical for all columns)
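- for reference, a quick sketch of those inspection calls (output omitted):
~~~
covid_cases.head()      # first five rows
covid_cases.tail()      # last five rows
covid_cases.info()      # column names, dtypes, non-null counts, memory usage
covid_cases.describe()  # summary statistics for the numeric columns
~~~
{: .language-python }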
### Selecting Data in a Dataframe
- use `iloc` (position-based) & `loc` (label-based) to select specific cell(s) from a dataframe
~~~
print(covid_cases.iloc[24242,4])
print(covid_cases.iloc[24242,])
print(covid_cases.iloc[24242,:])
~~~
{: .language-python }
- you can use slices with `iloc` & `loc`
~~~
print(covid_cases.iloc[100:120,4:6])
print(covid_cases.iloc[100:120,:])
# select a whole row
print(covid_cases.loc[0,:])
# or
print(covid_cases.loc[0,])
# select a whole column
print(covid_cases.loc[:,'continentExp'])
# or
print(covid_cases['continentExp'])
# or(!)
print(covid_cases.continentExp)
~~~
{: .language-python }
- columns are returned as a pandas Series object
~~~
type(covid_cases['continentExp'])
~~~
{: .language-python }
- get the unique values of a series (as a numpy array):
~~~
pd.unique(covid_cases['continentExp'])
~~~
{: .language-python }
### Filtering Data in a Dataframe
- you've seen indexing by position and label, but with large datasets that are hard to explore by eye, it's much more useful to select/filter data based on values
~~~
covid_cases['continentExp'] == 'Europe'
~~~
{: .language-python }
- INSERT EXERCISE from !11 here
- results of filtering can be used in further operations
~~~
covid_cases[covid_cases['continentExp'] == 'Europe'].max()
~~~
{: .language-python }
> ## Exploring the Case Data
> 1. On what date were the most cases reported in Germany so far?
> 2. What was the mean number of cases reported per day in Germany in April 2020?
> 3. Is this higher or lower than the mean for March 2020?
> 4. On how many days in March was the number of cases in Germany higher than the mean for April?
> > ## Solution
> > ~~~
> > # 1
> > mask_germany = covid_cases['countryterritoryCode'] == 'DEU'
> > id_max = covid_cases[mask_germany]['cases'].idxmax()
> > print(covid_cases.loc[id_max, 'dateRep'])
> >
> > # 2 & 3
> > mask_april = (covid_cases['year'] == 2020) & (covid_cases['month'] == 4)
> > mean_april = covid_cases[mask_germany & mask_april]['cases'].mean()
> > mask_march = (covid_cases['year'] == 2020) & (covid_cases['month'] == 3)
> > mean_march = covid_cases[mask_germany & mask_march]['cases'].mean()
> > print("Mean cases per day was {} in April than in March 2020.".format(
> >     ["lower", "higher"][mean_april > mean_march]))
> >
> > # 4
> > mask_higher_mean_april = covid_cases['cases'] > mean_april
> > selection = covid_cases[mask_germany & mask_march & mask_higher_mean_april]
> > nbr_days = len(selection)  # assume clean data: one row per country per day
> > print(nbr_days)
> > ~~~
> > {: .language-python }
> {: .solution }
{: .challenge }
### Combining Dataframes
- where you have unique values in a column, can set this as the 'index'
~~~
asia_lockdowns = pd.read_csv('data/AsiaLockdowns.csv', index_col=0)
africa_lockdowns = pd.read_csv('data/AfricaLockdowns.csv', index_col=0)
~~~
{: .language-python }
- rows from two dataframes with the same columns can be easily stacked into a single df with `pd.concat`
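- a minimal sketch of that step (the exercise below will cover it in more detail; the `covid_lockdowns` name is an assumption here, chosen to match the later examples):
~~~
# stack the two lockdown tables into a single dataframe
covid_lockdowns = pd.concat([asia_lockdowns, africa_lockdowns])
~~~
{: .language-python }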
- INSERT EXERCISE from !21 here
- INSERT EXERCISE from !17 here
- there are blanks in the lockdown end date column - probably because these lockdowns haven't ended yet(?).
- let's try to set the end date to the latest date included in the case data
~~~
latest_date = covid_cases['dateRep'].max()
print(latest_date)
~~~
{: .language-python }
- whoa! something is wrong here!
- check the type of the data in the `"dateRep"` column
~~~
print(covid_cases['dateRep'].dtype)
~~~
{: .language-python }
- these values are being treated (and therefore sorted) as strings!
#### Working with Datetime Columns
- luckily, pandas provides an easy way to convert these to a datetime type, which can be properly handled
~~~
pd.to_datetime(covid_cases['dateRep'], dayfirst=True)
# dayfirst=True is needed because, by default, pandas parses dates as mm/dd/yyyy :(
covid_cases['dateRep'] = pd.to_datetime(covid_cases['dateRep'], dayfirst=True)
print(covid_cases['dateRep'].max())
~~~
{: .language-python }
- much better!
- now we can fill in those blank lockdown end dates
~~~
covid_lockdowns['End date'] = covid_lockdowns['End date'].fillna(covid_cases['dateRep'].max())
~~~
{: .language-python }
- INSERT EXERCISE from !22 here
- we saw earlier how to add rows to a df with `concat`
- you can also add additional columns to a dataframe
- treat it like a dictionary
~~~
covid_cases['casesPerMillion'] = covid_cases['cases'] / (covid_cases['popData2019']/1e6)
covid_cases.head()
~~~
{: .language-python }
- and you can add data from another dataframe
- but need to make the key column names match up first!
~~~
covid_lockdowns.index.name = 'countriesAndTerritories'
covid_cases.merge(covid_lockdowns, on="countriesAndTerritories")
~~~
{: .language-python }
- For more on combining series and dataframes, we recommend these two chapters of Jake Vanderplas' _Python Data Science Handbook_:
- [Combining Datasets: Concat and Append](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html)
- [Combining Datasets: Merge and Join](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html)
~~~
# store the merged (inner-joined) result for the following sections
combined = covid_cases.merge(covid_lockdowns, on="countriesAndTerritories")
~~~
{: .language-python }
- INSERT EXERCISE from !24 here
### Groupby & Split-Apply-Combine
- we'll keep working with the inner-joined `combined` df
- as a demonstration of the power of pandas, let's calculate the average time between the beginning of a country's lockdown, and the peak in new cases in that country
- do this with a rolling average of cases, to make the method less sensitive to noise in data
- the dataframe is currently sorted in reverse date order
- before we can calculate a rolling mean, we'll need to re-sort the data from longest-ago to most-recent:
~~~
combined = combined.sort_values(by=['countriesAndTerritories','dateRep'])
~~~
{: .language-python }
- now we can calculate our rolling mean of cases for each country
- this uses the split-apply-combine paradigm:
- first grouping by country,
- then calculating the rolling mean
- we get back a single Series with a MultiIndex of country and the original row label
~~~
rolling_mean_cases = combined.groupby('countriesAndTerritories')['cases'].rolling(7).mean()
~~~
{: .language-python }
- unfortunately, the index of our rolling mean series is incompatible with our main df, so we can't add it as a new column
~~~
combined['rolling mean'] = rolling_mean_cases  # raises an error: incompatible index
~~~
{: .language-python }
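- to see why, we can compare the two indexes (a quick check; output omitted):
~~~
print(rolling_mean_cases.index[:3])  # MultiIndex of (country, original row label)
print(combined.index[:3])            # plain integer index
~~~
{: .language-python }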
- we can use the `reset_index` method to drop the `"countriesAndTerritories"` level of the index
~~~
rolling_mean_cases = rolling_mean_cases.reset_index(level=0, drop=True)
~~~
{: .language-python }
- and now we can add the rolling means as a new column
~~~
combined['rolling mean'] = rolling_mean_cases
# plot rolling average for Germany
combined[combined['countriesAndTerritories']=='Germany'].set_index('dateRep')['rolling mean'].plot(kind='line')
# plot cumulative sum of cases for Germany
combined[combined['countriesAndTerritories']=='Germany'].set_index('dateRep')['cases'].cumsum().plot(kind='line')
~~~
{: .language-python }
- now we want to calculate the time difference between the start of lockdown and the peak of rolling mean cases in each country
- INSERT EXERCISE from !23 here
- set both series' indexes to the country names so the dates can be compared:
~~~
peak_dates.index = combined.loc[peak_rows]['countriesAndTerritories']
start_dates.index = combined.loc[peak_rows]['countriesAndTerritories']
print((peak_dates - start_dates).median())
~~~
{: .language-python }
- as a final flourish, let's create a boxplot showing the distribution of these time differences:
- because `plot(kind="box")` doesn't know how to work with timedelta values, we need to extract the number of days as an integer.
- we can do this with the `days` attribute of each timedelta, but we have to wrap it in a function so we can use `.apply`...
~~~
def get_days(t):
    return t.days

(peak_dates - start_dates).apply(get_days).plot(kind='box')
~~~
{: .language-python }
