Skip to content
Snippets Groups Projects
title: Working with Data
teaching: 20
exercises: 10
questions:
- "How should I work with numeric data in Python?"
- "What's the recommended way to handle and analyse tabular data?"
- "How can I import tabular data for analysis in Python and export the results?"
objectives:
- "handle and summarise numeric data with Numpy."
- "filter values in their data based on a range of conditions."
- "load tabular data into a Pandas dataframe object."
- "describe what is meant by the data type of an array/series, and the impact this has on how the data is handled."
- "add and remove columns from a dataframe."
- "select, aggregate, and visualise data in a dataframe."
keypoints:
- "Specialised third-party libraries such as Numpy and Pandas provide powerful objects and functions that can help us analyse our data."
- "Pandas dataframe objects allow us to efficiently load and handle large tabular data."
- "Use the `pandas.read_csv` and `pandas.write_csv` functions to read and write tabular data."

plan

  • Toby currently scheduled to lead this session
  • Numpy
    • arrays
    • masking
    • aside about data types and potential hazards
    • reading data from a file (with note that more will come later on this topic)
    • link to existing image analysis material
  • Pandas
    • when an array just isn't enough
    • DataFrames - re-use material from [Software Carpentry][swc-python-gapminder]?
      • ideally with a more relevant example dataset... maybe a COVID one
      • include an aside about I/O - reading/writing files (pandas (the .to_*() methods and highlight some: csv, json, feather, hdf), numpy, open(), (?) bytes vs strings, (?) encoding)
    • Finish with example of df.plot() to set the scene for plotting section

Numpy

Reading data to a numpy array

We'll use the popular image analysis package scikit-image, to read two example images into numpy arrays.

from skimage.io import imread
raw = imread('cilliated_cell.png')
nuclei = imread('cilliated_cell_nuclei.png')
# if you want to see what these images look like - we can use matplotlib (more to come later!)
import matplotlib.pyplot as plt
plt.imshow(raw, cmap='gray')
plt.imshow(nuclei)

{: .language-python }

Exploring Image Arrays

  • What are the dimensions of these arrays?
  • What data type are these arrays?
  • What is the minimum and maximum value of these arrays?

Solution

print(raw.shape)
print(raw.dtype)
print(np.max(raw))
print(np.min(raw))

{: .language-python } {: .solution } {: .challenge }

Masking arrays

The nuclei image contains a binary segmentation i.e.:

  • 1 = nuclei
  • 0 = not nuclei
  1. Find the median value of the raw image within the nuclei
  2. Create a new version of raw where all values outside the nuclei are 0

Solution

# 1
pixels_in_nuclei = raw[nuclei == 1]
print(np.median(pixels_in_nuclei))

# 2
new_image = raw.copy()
new_image[nuclei == 0] = 0
plt.imshow(new_image, cmap='gray')

{: .language-python } {: .solution } {: .challenge }

Pandas

  • load data
import pandas as pd
covid_cases = pd.read_csv("data/CovidCaseData_20200624.csv")

{: .language-python }

  • display dataframe

    • head & tail
    covid_cases.head()
    covid_cases.tail()

    {: .language-python }

    • shape
    print(f'covid_cases is a {type(covid_cases)} object with {covid_cases.shape[0]} rows and {covid_cases.shape[1]} columns')
    print('covid_cases has the following columns:\n' + '\n'.join(covid_cases.columns))

    {: .language-python }

  • much easier to use covid_cases.info()

  • get statistics on quantitative columns with .describe()

    • sometimes this information isn't helpful (e.g. mean of year column)
    • but good to see we don't have any missing data (count is identical for all columns)

Selecting Data in a Dataframe

  • use iloc & loc to select specific cell(s) from a dataframe
print(covid_cases.iloc[24242,4])
print(covid_cases.iloc[24242,])
print(covid_cases.iloc[24242,:])

{: .language-python }

  • you can use slices with iloc & loc
print(covid_cases.iloc[100:120,4:6])
print(covid_cases.iloc[100:120,:])

# select a whole row
print(covid_cases.loc[0,:])
# or
print(covid_cases.loc[0,])

# select a whole column
print(covid_cases.loc[:,'continentExp'])
# or
print(covid_cases['continentExp'])
# or(!)
print(covid_cases.continentExp)

{: .language-python }

  • columns are returned as a pandas Series object
type(covid_cases['continentExp'])

{: .language-python }

  • get the unique values of a series (as a numpy array):
pd.unique(covid_cases['continentExp'])

{: .language-python }

Filtering Data in a Dataframe

  • you've seen indexing, but with large datasets you can't easily explore, it's much more useful to select/filter data based on value
covid_cases['continentExp'] == 'Europe'

{: .language-python }

  • INSERT EXERCISE from !11 here

  • results of filtering can be used in further operations

covid_cases[covid_cases['continentExp'] == 'Europe'].max()

{: .language-python }

Working with Filtered Data

  1. On what date were the most cases reported in Germany so far?
  2. What was the mean number of cases reported per day in Germany in April 2020?
  3. Is this higher or lower than the mean for March 2020?
  4. On how many days in March was the number of cases in Germany higher than the mean for April?

Solution

# 1
mask_germany = covid_cases['countryterritoryCode'] == 'DEU'
id_max = covid_cases[mask_germany]['cases'].idxmax()
print(covid_cases.iloc[id_max]['dateRep'])

# 2
mask_april = (covid_cases['year'] == 2020) & (covid_cases['month'] == 4)
mean_april = covid_cases[mask_germany & mask_april]['cases'].mean()
print(mean_april)

# 3
mask_march = (covid_cases['year'] == 2020) & (covid_cases['month'] == 3)
mean_march = covid_cases[mask_germany & mask_march]['cases'].mean()
print(mean_march)
print("Mean cases per day was {} in April than in March 2020.".
      format(["lower", "higher"][mean_april > mean_march]))

# 4
mask_higher_mean_april = (covid_cases['cases'] > mean_april)
selection = covid_cases[mask_germany & mask_march & mask_higher_mean_april]
nbr_days = len(selection)   # Assume clean data
print(nbr_days)

{: .language-python } {: .solution } {: .challenge }

Combining Dataframes

  • where you have unique values in a column, you can set this as the 'index'
asia_lockdowns = pd.read_csv('data/AsiaLockdowns.csv', index_col=0)
africa_lockdowns = pd.read_csv('data/AfricaLockdowns.csv', index_col=0)

{: .language-python }

  • rows from two dataframes with the same columns can be easily stacked into a single df with pd.concat

  • INSERT EXERCISE from !21 here

  • INSERT EXERCISE from !17 here

  • there are blanks in the lockdown end date column - probably because these lockdowns haven't ended yet(?).

  • let's try to set the end date to the latest date included in the case data

latest_date = covid_cases['dateRep'].max()
print(latest_date)

{: .language-python }

  • whoa! something is wrong here!
  • check the type of the data in the "dateRep" column
print(covid_cases['dateRep'].dtype)

{: .language-python }

  • these values are being treated (and therefore sorted) as strings!

Working with Datetime Columns

  • luckily, pandas provides an easy way to convert these to a datetime type, which can be properly handled
pd.to_datetime(covid_cases['dateRep'], dayfirst=True)
# dayfirst=True is necessary because by default pandas reads mm/dd/yyyy dates :(
covid_cases['dateRep'] = pd.to_datetime(covid_cases['dateRep'], dayfirst=True)
print(covid_cases['dateRep'].max())

{: .language-python }

  • much better!
  • now we can fill in those blank lockdown end dates
covid_lockdowns['End date'] = covid_lockdowns['End date'].
                                fillna(covid_cases['dateRep'].max())

{: .language-python }

  • INSERT EXERCISE from !22 here
  • we saw earlier how to add rows to a df with concat
  • you can also add additional columns to a dataframe
    • treat it like a dictionary
covid_cases['casesPerMillion'] = covid_cases['cases'] / (covid_cases['popData2019']/1e6)
covid_cases.head()

{: .language-python }

  • and you can add data from another dataframe
    • but need to make the key column names match up first!
covid_lockdowns.index.name='countriesAndTerritories'
covid_cases.merge(covid_lockdowns, on="countriesAndTerritories")

{: .language-python }

combined = covid_cases.merge(covid_lockdowns, on="countriesAndTerritories")

{: .language-python }

  • INSERT EXERCISE from !24 here

Groupby & Split-Apply-Combine

  • we'll keep working with the inner-joined combined df
  • as a demonstration of the power of pandas, let's calculate the average time between the beginning of a country's lockdown, and the peak in new cases in that country
    • do this with a rolling average of cases, to make the method less sensitive to noise in data
  • the dataframe is currently sorted in reverse date order
  • before we can calculate a rolling mean, we'll need to re-sort the data from longest-ago to most-recent:
combined = combined.sort_values(by=['countriesAndTerritories','dateRep'])

{: .language-python }

  • now we can calculate our rolling mean of cases for each country
  • this uses the split-apply-combine paradigm:
    • first grouping by country,
    • then calculating the rolling mean
    • we get back a single Series, indexed by country
rolling_mean_cases = combined.groupby('countriesAndTerritories')['cases'].rolling(7).mean()

{: .language-python }

  • unfortunately, the index of our rolling mean series is incompatible with our main df, so we can't add it as a new column
combined['rolling mean'] = rolling_mean_cases

{: .language-python }

  • we can use the reset_index method to remove the "countriesAndTerritories" index
rolling_mean_cases = rolling_mean_cases.reset_index(0, drop=True)

{: .language-python }

  • and now we can add the rolling means as a new column
combined['rolling mean'] = rolling_mean_cases

# plot rolling average for Germany
combined[combined['countriesAndTerritories']=='Germany'].set_index('dateRep')['rolling mean'].plot(kind='line')
# plot cumulative sum of cases for Germany
combined[combined['countriesAndTerritories']=='Germany'].set_index('dateRep')['cases'].cumsum().plot(kind='line')

{: .language-python }

  • now we want to calculate the time difference between the start of lockdown and the peak of rolling mean cases in each country
  • INSERT EXERCISE from !23 here
  • set the index so we can compare the dates:
peak_dates.index = combined.loc[peak_rows]['countriesAndTerritories']
start_dates.index = combined.loc[peak_rows]['countriesAndTerritories']
print((peak_dates - start_dates).median())

{: .language-python }

  • as a final flourish, let's create a boxplot showing the distribution of these time differences:
  • because plot(kind="boxplot") doesn't know how to work with datetime values, we need to extract the number of days as an integer.
  • can do this with the days() method of the datetime object, but have to write a function to get it so we can use .apply...
def get_days(t):
    return t.days

(peak_dates - start_dates).apply(get_days).plot(kind='box')

{: .language-python }

{% include links.md %}