Commit 4e7a42c0 authored by Toby Hodges

first draft of plotting w/pandas

parent bee1f5db
@@ -129,7 +129,7 @@ Overall, a `Figure` is composed of the following elements:
> ## Axes vs Axis
> The names `Axes` and `Axis` can be confusing as the first is the plural form of the second.
> You could think of `Axes` as a collection of `Axis`, however, it's easier
> to think of `Axes` as a *subplot* which in turn contains two or more `Axis`.
@@ -602,6 +602,226 @@
# Plotting from pandas
As we saw at the end of the [Working with Data](../02-data/) section,
it is possible to plot data in a pandas `DataFrame` or `Series`
directly from the object itself.
To demonstrate this, we will borrow a dataset
prepared by [Software Carpentry][swc-gapminder-data]
containing data on GDP and population size from [Gapminder][gapminder].
import pandas as pd
gapminder = pd.read_csv('../data/gapminder_all.csv', index_col='country')
{: .language-python }
continent gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 \
Algeria Africa 2449.008185 3013.976023 2550.816880
Angola Africa 3520.610273 3827.940465 4269.276742
Benin Africa 1062.752200 959.601080 949.499064
Botswana Africa 851.241141 918.232535 983.653976
Burkina Faso Africa 543.255241 617.183465 722.512021
gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 \
Algeria 3246.991771 4182.663766 4910.416756 5745.160213
Angola 5522.776375 5473.288005 3008.647355 2756.953672
Benin 1035.831411 1085.796879 1029.161251 1277.897616
Botswana 1214.709294 2263.611114 3214.857818 4551.142150
Burkina Faso 794.826560 854.735976 743.387037 807.198586
gdpPercap_1987 gdpPercap_1992 ... pop_1962 pop_1967 \
country ...
Algeria 5681.358539 5023.216647 ... 11000948.0 12760499.0
Angola 2430.208311 2627.845685 ... 4826015.0 5247469.0
Benin 1225.856010 1191.207681 ... 2151895.0 2427334.0
Botswana 6205.883850 7954.111645 ... 512764.0 553541.0
Burkina Faso 912.063142 931.752773 ... 4919632.0 5127935.0
pop_1972 pop_1977 pop_1982 pop_1987 pop_1992 \
Algeria 14760787.0 17152804.0 20033753.0 23254956.0 26298373.0
Angola 5894858.0 6162675.0 7016384.0 7874230.0 8735988.0
Benin 2761407.0 3168267.0 3641603.0 4243788.0 4981671.0
Botswana 619351.0 781472.0 970347.0 1151184.0 1342614.0
Burkina Faso 5433886.0 5889574.0 6634596.0 7586551.0 8878303.0
pop_1997 pop_2002 pop_2007
Algeria 29072015.0 31287142 33333216
Angola 9875024.0 10866106 12420476
Benin 6066080.0 7026113 8078314
Botswana 1536536.0 1630347 1639131
Burkina Faso 10352843.0 12251209 14326203
{: .output }
We can plot one of these columns,
e.g. the populations in 1997,
by selecting the column and then calling `.plot`:
{: .language-python }
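The call described above is not shown in this extract, so here is a minimal sketch of what it might look like. It assumes a small stand-in `DataFrame` (the hypothetical name `gapminder_head`) built from the five rows shown in the output above rather than the full CSV, and it selects a non-interactive Matplotlib backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in a notebook
import pandas as pd

# Stand-in for the full dataset: the five pop_1997 values shown in the output above
gapminder_head = pd.DataFrame(
    {"pop_1997": [29072015.0, 9875024.0, 6066080.0, 1536536.0, 10352843.0]},
    index=pd.Index(
        ["Algeria", "Angola", "Benin", "Botswana", "Burkina Faso"], name="country"
    ),
)

# Selecting one column gives a Series; .plot() draws it as a line plot by default
ax = gapminder_head["pop_1997"].plot()
```

With the real `gapminder` object, the same pattern would be `gapminder['pop_1997'].plot()`, which draws one line through every country in index order.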
A good visualisation should give the viewer a better understanding of the
underlying data.
Clearly this isn't a good visualisation!
Perhaps more meaningful than showing the population of all countries in 1997,
would be to show how the population of a single country has changed over time.
gapminder.loc['China','pop_1952':'pop_2007'].plot() # we provide a range of column names to .loc
{: .language-python }
As the examples above show,
the default is for the `plot` method to produce a line plot,
just like `pyplot.plot`.
(This is no coincidence,
as the pandas `plot` method is in fact
a wrapper for function calls to `matplotlib.pyplot`.)
We may specify our preference for another type of plot with
the `kind` parameter:
{: .language-python }
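As an illustration of the `kind` parameter, the sketch below draws a bar plot instead of the default line plot. It is not the lesson's original example: the `Series` named `pops` is a hypothetical stand-in built from five `pop_1997` values shown earlier, and the `Agg` backend is an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in a notebook
import pandas as pd

# Hypothetical stand-in Series using pop_1997 values from the output shown earlier
pops = pd.Series(
    [29072015.0, 9875024.0, 6066080.0, 1536536.0, 10352843.0],
    index=["Algeria", "Angola", "Benin", "Botswana", "Burkina Faso"],
    name="pop_1997",
)

# kind='bar' requests one vertical bar per index entry instead of a line
ax = pops.plot(kind="bar")
```

A bar plot suits this data better than a line because the countries are categories with no natural ordering between them.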
Note: you can also use `` and `.plot.<kind>` more generally,
which is useful for getting help:
{: .language-python}
Help on method hexbin in module pandas.plotting._core:
hexbin(x, y, C=None, reduce_C_function=None, gridsize=None, **kwargs) method of pandas.plotting._core.PlotAccessor instance
Generate a hexagonal binning plot.
Generate a hexagonal binning plot of `x` versus `y`. If `C` is `None`
(the default), this is a histogram of the number of occurrences
of the observations at ``(x[i], y[i])``.
If `C` is specified, specifies values at given coordinates
``(x[i], y[i])``. These values are accumulated for each hexagonal
bin and then reduced according to `reduce_C_function`,
having as default the NumPy's mean function (:meth:`numpy.mean`).
(If `C` is specified, it must also be a 1-D sequence
of the same length as `x` and `y`, or a column label.)
x : int or str
The column label or position for x points.
y : int or str
The column label or position for y points.
C : int or str, optional
The column label or position for the value of `(x, y)` point.
reduce_C_function : callable, default `np.mean`
Function of one argument that reduces all the values in a bin to
a single number (e.g. `np.mean`, `np.max`, `np.sum`, `np.std`).
gridsize : int or tuple of (int, int), default 100
The number of hexagons in the x-direction.
The corresponding number of hexagons in the y-direction is
chosen in a way that the hexagons are approximately regular.
Alternatively, gridsize can be a tuple with two elements
specifying the number of hexagons in the x-direction and the
Additional keyword arguments are documented in
The matplotlib ``Axes`` on which the hexbin is plotted.
{: .output }
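The accessor form mentioned above can be sketched as follows. This is an illustrative example rather than the lesson's original code: the small `DataFrame` named `df` is a hypothetical stand-in using `pop_1992` and `pop_1997` values from the output shown earlier, and the `Agg` backend is an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in a notebook
import pandas as pd

# Hypothetical stand-in DataFrame with two columns taken from the output shown earlier
df = pd.DataFrame(
    {
        "pop_1992": [26298373.0, 8735988.0, 4981671.0, 1342614.0, 8878303.0],
        "pop_1997": [29072015.0, 9875024.0, 6066080.0, 1536536.0, 10352843.0],
    },
    index=["Algeria", "Angola", "Benin", "Botswana", "Burkina Faso"],
)

# These two calls request the same kind of plot
ax_kind = df.plot(kind="scatter", x="pop_1992", y="pop_1997")
ax_accessor = df.plot.scatter(x="pop_1992", y="pop_1997")

# Each accessor method has its own docstring, which is what help() displays
first_line = df.plot.hexbin.__doc__.strip().splitlines()[0]
print(first_line)
```

Because each `.plot.<kind>` method carries its own docstring, `help(df.plot.hexbin)` describes only the parameters relevant to that plot type.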
So far, the plots we've made from pandas
have each existed in their own figure,
but we can use the `ax` parameter to attach a plot to a pre-made `Axes` object.
This can be useful to include the plot as part of a larger figure
(as in the example below)
or to provide a handle for further downstream customisation of
plot style, layout, etc.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2, 2, figsize=(16,8))
asia = gapminder[gapminder['continent']=='Asia'] # filter once and reuse for every subplot
asia['pop_1997'].plot(kind='bar', ax=ax[1,0])
asia['gdpPercap_1997'].plot(kind='bar', ax=ax[0,0])
asia.plot(kind='scatter', x='pop_1997', y='gdpPercap_1997', ax=ax[0,1])
ax[1,1].axis('off') # make the bottom-right subplot blank
{: .language-python }
In the example above,
we also use `x=` and `y=` to plot two columns against each other.
Notice how the column names ("pop_1997" and "gdpPercap_1997")
were referred to as strings -
it is assumed that string values like these will refer to columns inside the
`DataFrame` from which `plot` was called.
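The "handle for further downstream customisation" mentioned above can be sketched like this. The example is illustrative only: `df` is a hypothetical stand-in `DataFrame` using `pop_1992` and `pop_1997` values from the output shown earlier, the label text is made up, and the `Agg` backend is an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in DataFrame with two columns taken from the output shown earlier
df = pd.DataFrame(
    {
        "pop_1992": [26298373.0, 8735988.0, 4981671.0, 1342614.0, 8878303.0],
        "pop_1997": [29072015.0, 9875024.0, 6066080.0, 1536536.0, 10352843.0],
    },
    index=["Algeria", "Angola", "Benin", "Botswana", "Burkina Faso"],
)

fig, ax = plt.subplots()

# Column names are passed as strings; pandas draws onto the Axes we supply
returned = df.plot(kind="scatter", x="pop_1992", y="pop_1997", ax=ax)

# The same Axes can then be customised with the usual Matplotlib methods
ax.set_xlabel("Population in 1992")
ax.set_ylabel("Population in 1997")
ax.set_xscale("log")
ax.set_yscale("log")
```

Keeping a reference to `ax` means pandas handles the data selection while labels, scales, and layout stay under direct Matplotlib control.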
> ## Other plotting methods
> In addition to the `kind`s of plot we can produce with `.plot`
> (the full list is:
> `area`,
> `bar`,
> `barh`,
> `box`,
> `density`/`kde`,
> `hexbin`,
> `hist`,
> `pie`,
> and `scatter`),
> there also exist separate `.hist` and `.boxplot` methods,
> which use a separate interface.
> When searching for help and reading examples online,
> you might see these methods being used instead of
> `.plot(kind='box')` or `.plot(kind='hist')`.
{: .callout }
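To make the difference between the two interfaces concrete, here is a small sketch comparing `.plot(kind='hist')` with the separate `.hist` and `.boxplot` methods. It assumes a hypothetical stand-in `DataFrame` named `df`, built from `pop_1997` values shown earlier, and the `Agg` backend so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed in a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in DataFrame using pop_1997 values from the output shown earlier
df = pd.DataFrame(
    {"pop_1997": [29072015.0, 9875024.0, 6066080.0, 1536536.0, 10352843.0]},
    index=["Algeria", "Angola", "Benin", "Botswana", "Burkina Faso"],
)

# Via the general interface: returns a single Axes
ax_plot = df["pop_1997"].plot(kind="hist", bins=3)

# Via the separate method: DataFrame.hist lays out one subplot per column,
# so we flatten the result and take the first Axes
axes_grid = df.hist(column="pop_1997", bins=3)
hist_ax = np.ravel(axes_grid)[0]

# DataFrame.boxplot is the counterpart of .plot(kind='box')
ax_box = df.boxplot(column="pop_1997")
```

The separate methods accept some arguments of their own (for example `column=`), which is why examples found online may not translate directly to `.plot(kind=...)`.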
> ## Further Reading - plotting from pandas
> We have only superficially explored the pandas plotting interface here
> because we don't want to create further confusion by dwelling on
> _yet another_ interface to Matplotlib.
> If you'd like to learn more about this topic,
> we recommend the following resources:
> - [pandas user guide](
> - [links included in pandas "cookbook" section on plotting](
> - [a blogpost about changing the plotting backend used by Pandas](
{: .callout }
# Where to go from here
## The matplotlib gallery
## stackoverflow
@@ -801,24 +1021,24 @@ gapminder = pd.read_csv('data/gapminder_all.csv', index_col='country')
> Accidents happen and hard lessons often come in a spiky package!
> You knew that the day your laptop got stolen, with the only copy of
> your code and analysis inside, was going to come back to haunt you.
> You promised yourself that you wouldn't make the same mistake twice,
> that you'd learn how to use [git][swc-git] and have [a backup][worldbackupday].
> You were considerate enough to print your plot and add it to your lab book
> but today, this happened:
> ![plot_stains](../fig/coffee-plot.png)
> As your group works on different aspects of the same proteins, you managed to
> get the [protein sequences](../data/coffee-plot-proteins.fasta) from your colleague.
> With this information, you must now recreate the plot to look as close as
> possible to the original.
> You will have to calculate the [isoelectric point](
> and [instability index]( of the proteins.
> > ## Hint
> >
> > The `Bio` module is part of the [biopython][biopython] package.
> > The following function can help you calculate both measures.
> > ~~~
> > from Bio.SeqUtils import ProtParam