title: Working with Data
teaching: 20
exercises: 10
questions:
- "How should I work with numeric data in Python?"
- "What's the recommended way to handle and analyse tabular data?"
- "How can I import tabular data for analysis in Python and export the results?"
objectives:
- "handle and summarise numeric data with Numpy."
- "filter values in their data based on a range of conditions."
- "load tabular data into a Pandas dataframe object."
- "describe what is meant by the data type of an array/series, and the impact this has on how the data is handled."
- "add and remove columns from a dataframe."
- "select, aggregate, and visualise data in a dataframe."
keypoints:
- "Specialised third-party libraries such as Numpy and Pandas provide powerful objects and functions that can help us analyse our data."
- "Pandas dataframe objects allow us to efficiently load and handle large tabular data."
- "Use the `pandas.read_csv` function and the `DataFrame.to_csv` method to read and write tabular data."

plan

  • Toby currently scheduled to lead this session
  • Numpy
    • arrays
    • masking
    • aside about data types and potential hazards
    • reading data from a file (with note that more will come later on this topic)
    • link to existing image analysis material
  • Pandas
    • when an array just isn't enough
    • DataFrames - re-use material from [Software Carpentry][swc-python-gapminder]?
      • ideally with a more relevant example dataset... maybe a COVID one
      • include an aside about I/O - reading/writing files (pandas (the .to_*() methods and highlight some: csv, json, feather, hdf), numpy, open(), (?) bytes vs strings, (?) encoding)
    • Finish with example of df.plot() to set the scene for plotting section
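The Numpy items in the plan (masking, and the aside about data types and potential hazards) could be illustrated along the following lines. This is only a minimal sketch: the array values and variable names are invented for illustration and are not taken from the lesson dataset.

```python
import numpy as np

# Hypothetical daily counts, invented purely for illustration.
counts = np.array([12, 0, 7, 30, 5, 18])

# A comparison produces a boolean array (the mask) with the same shape.
mask = counts > 10
high_counts = counts[mask]  # keeps only the entries where mask is True

# Masks combine element-wise with & (and), | (or) and ~ (not).
# The parentheses are required because & binds tighter than > and <.
mid_counts = counts[(counts > 5) & (counts < 20)]

# Data type hazard: fixed-width integers wrap around silently on overflow.
small = np.array([100, 120], dtype=np.int8)  # int8 holds -128..127
wrapped = small + small                      # 200 and 240 do not fit in int8

# Widening the data type before the arithmetic avoids the wrap-around.
safe = small.astype(np.int64) + small

print(high_counts, mid_counts)
print(wrapped, wrapped.dtype)
print(safe, safe.dtype)
```

Checking `array.dtype` before doing arithmetic is a cheap way to spot this kind of problem early.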

Working with Filtered Data

  1. On what date were the most cases reported in Germany so far?
  2. What was the mean number of cases reported per day in Germany in April 2020?
  3. Is this higher or lower than the mean for March 2020?
  4. On how many days in March was the number of cases in Germany higher than the mean for April?

Solution

# 1
mask_germany = covid_cases['countryterritoryCode'] == 'DEU'
id_max = covid_cases[mask_germany]['cases'].idxmax()
# idxmax() returns an index label, so look the row up with .loc rather than .iloc
print(covid_cases.loc[id_max, 'dateRep'])

# 2
mask_april = (covid_cases['year'] == 2020) & (covid_cases['month'] == 4)
mean_april = covid_cases[mask_germany & mask_april]['cases'].mean()
print(mean_april)

# 3
mask_march = (covid_cases['year'] == 2020) & (covid_cases['month'] == 3)
mean_march = covid_cases[mask_germany & mask_march]['cases'].mean()
print(mean_march)
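# Use a conditional expression: indexing a list with a NumPy boolean is not reliable
comparison = "higher" if mean_april > mean_march else "lower"
print("Mean cases per day was {} in April than in March 2020.".format(comparison))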

# 4
mask_higher_mean_april = (covid_cases['cases'] > mean_april)
selection = covid_cases[mask_germany & mask_march & mask_higher_mean_april]
nbr_days = len(selection)   # Assumes clean data, i.e. exactly one row per day for Germany
print(nbr_days)

{: .language-python } {: .solution } {: .challenge }
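To try the masking approach from the solution without downloading the full dataset, a small stand-in table can be built by hand. The sketch below reuses the column names from the solution, but every value in it is invented for illustration, and `date_of_max` and `days_above` are hypothetical names introduced here.

```python
import pandas as pd

# Toy stand-in for covid_cases: all values are invented for illustration;
# only the column names follow the solution above.
covid_cases = pd.DataFrame({
    "dateRep": ["01/03/2020", "02/03/2020", "03/03/2020",
                "01/04/2020", "02/04/2020", "01/04/2020"],
    "year": [2020, 2020, 2020, 2020, 2020, 2020],
    "month": [3, 3, 3, 4, 4, 4],
    "cases": [10, 50, 20, 30, 40, 999],
    "countryterritoryCode": ["DEU", "DEU", "DEU", "DEU", "DEU", "FRA"],
})

# Boolean masks, one per condition, combined later with &.
mask_germany = covid_cases["countryterritoryCode"] == "DEU"
mask_march = (covid_cases["year"] == 2020) & (covid_cases["month"] == 3)
mask_april = (covid_cases["year"] == 2020) & (covid_cases["month"] == 4)

# idxmax() returns an index label, so the row is looked up with .loc.
id_max = covid_cases.loc[mask_germany, "cases"].idxmax()
date_of_max = covid_cases.loc[id_max, "dateRep"]

# Mean of the 'cases' column over the rows selected by the combined masks.
mean_april = covid_cases.loc[mask_germany & mask_april, "cases"].mean()
mean_march = covid_cases.loc[mask_germany & mask_march, "cases"].mean()

# Summing a boolean mask counts its True values.
mask_above = covid_cases["cases"] > mean_april
days_above = int((mask_germany & mask_march & mask_above).sum())

print(date_of_max, mean_april, mean_march, days_above)
```

Summing the combined mask gives the same count as taking `len()` of the filtered dataframe, without building the intermediate selection.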

{% include links.md %}