title: Working with Data
teaching: 20
exercises: 10
questions:
- "How should I work with numeric data in Python?"
- "What's the recommended way to handle and analyse tabular data?"
- "How can I import tabular data for analysis in Python and export the results?"
objectives:
- "handle and summarise numeric data with NumPy."
- "filter values in their data based on a range of conditions."
- "load tabular data into a Pandas dataframe object."
- "describe what is meant by the data type of an array/series, and the impact this has on how the data is handled."
- "add and remove columns from a dataframe."
- "select, aggregate, and visualise data in a dataframe."
keypoints:
- "Specialised third-party libraries such as NumPy and Pandas provide powerful objects and functions that can help us analyse our data."
- "Pandas dataframe objects allow us to efficiently load and handle large tabular data."
- "Use the `pandas.read_csv` function to read tabular data and the `DataFrame.to_csv` method to write it."

plan

  • Toby currently scheduled to lead this session
  • NumPy
    • arrays
    • masking
    • aside about data types and potential hazards
    • reading data from a file (with note that more will come later on this topic)
    • link to existing image analysis material
  • Pandas
    • when an array just isn't enough
    • DataFrames - re-use material from [Software Carpentry][swc-python-gapminder]?
      • ideally with a more relevant example dataset... maybe a COVID one
      • include an aside about I/O - reading/writing files (pandas (the .to_*() methods and highlight some: csv, json, feather, hdf), numpy, open(), (?) bytes vs strings, (?) encoding)
    • Finish with example of df.plot() to set the scene for plotting section
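
The masking and data type items in the plan above could be illustrated along these lines. This is only a sketch: the array values are invented for illustration, and the variable names (`cases`, `mask`, `counts`) are assumptions rather than anything fixed by the lesson.

```python
import numpy as np

# Hypothetical daily case counts; these numbers are invented, not real data.
cases = np.array([12, 0, 45, 7, 30])

# A mask is a boolean array with the same shape as the data.
mask = cases > 10
high = cases[mask]  # keeps only the values where the mask is True

# Conditions can be combined element-wise with & (and) and | (or).
moderate = cases[(cases > 10) & (cases < 40)]

# Data type hazard: an integer array silently truncates floats assigned to it.
counts = np.array([1, 2, 3])
counts[0] = 2.7  # stored as 2, because the array's dtype is an integer type

print(high, moderate, counts, counts.dtype)
```

Checking `counts.dtype` before assigning values, or creating the array with `dtype=float`, is one way to avoid the silent truncation shown here.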

Working with Filtered Data

  • On what date were the most cases reported in Germany so far?
  • What was the mean number of cases reported per day in Germany in April 2020?
  • Is this higher or lower than the mean for March 2020?
  • On how many days in March was the number of cases in Germany higher than the mean for April?

Solution

mask_germany = c['countryterritoryCode'] == 'DEU'
# idxmax() returns an index label, so look the row up with .loc rather than .iloc
id_max = c[mask_germany]['cases'].idxmax()

print(c.loc[id_max, 'dateRep'])

mask_april = (c['year'] == 2020) & (c['month'] == 4)
mean_april = c[mask_germany & mask_april]['cases'].mean()

mask_march = (c['year'] == 2020) & (c['month'] == 3)
mean_march = c[mask_germany & mask_march]['cases'].mean()

print(mean_april)
print(mean_march)

mask_higher_mean_april = (c['cases'] > mean_april)
selection = c[mask_germany & mask_march & mask_higher_mean_april]
nbr_days = len(selection)   # assumes clean data: exactly one row per country per day

print(nbr_days)

{: .language-python } {: .solution } {: .challenge }
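
As a quick check of the solution's logic, the sketch below runs the same steps on a tiny, invented stand-in for `c`. The column names follow the solution above, but every value is hypothetical. The non-default index is chosen deliberately to show that `idxmax()` returns an index label, which is why a label-based lookup with `.loc` is the safer choice.

```python
import pandas as pd

# Hypothetical miniature stand-in for the dataframe `c`; all values are invented.
c = pd.DataFrame(
    {
        "dateRep": ["30/03/2020", "31/03/2020", "01/04/2020", "02/04/2020"],
        "year": [2020, 2020, 2020, 2020],
        "month": [3, 3, 4, 4],
        "cases": [50, 10, 20, 30],
        "countryterritoryCode": ["DEU", "DEU", "DEU", "DEU"],
    },
    index=[10, 11, 12, 13],  # labels that differ from the row positions 0..3
)

mask_germany = c["countryterritoryCode"] == "DEU"
mask_march = (c["year"] == 2020) & (c["month"] == 3)
mask_april = (c["year"] == 2020) & (c["month"] == 4)

# idxmax() gives the label of the largest value, so use .loc to fetch the row.
id_max = c[mask_germany]["cases"].idxmax()
date_of_max = c.loc[id_max, "dateRep"]

mean_march = c[mask_germany & mask_march]["cases"].mean()
mean_april = c[mask_germany & mask_april]["cases"].mean()

# Count the March days with more cases than the April mean.
nbr_days = len(c[mask_germany & mask_march & (c["cases"] > mean_april)])

print(date_of_max, mean_march, mean_april, nbr_days)
```

With this index, `c.iloc[id_max]` would raise an `IndexError`, because position 10 does not exist in a four-row dataframe.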

{% include links.md %}