The central feature of the NumPy library is an object known as the `ndarray`
...
...
@@ -82,7 +102,7 @@ way. The `ndarray` is:
* Homogeneous = all elements must be of the same type e.g. all integers
* Vectorised = allows us to do fast operations on the whole array, without needing loops (we'll come back to this later!)
### What data to use with NumPy?
### What Data to Use with NumPy?
NumPy can be useful for working with all kinds of numeric data that fulfill the criteria above
(i.e. homogeneous and multidimensional). One of the most common applications, though, is image data.
Images are essentially arrays of numbers that represent the brightness of each pixel. For example, take the simple
...
...
@@ -98,7 +118,7 @@ The entire nuclei image may appear black when viewed in a web browser.
This is nothing to worry about.
(The code examples assume that you save these files in a folder called `data`.)
### Reading data to a NumPy array
### Reading Data to a NumPy Array
We'll use the popular image analysis package scikit-image,
to read two example images into NumPy arrays (if you want to learn more about image analysis with
tools like scikit-image - check out our existing [image analysis course] [image-analysis-course]).
...
...
@@ -119,7 +139,7 @@ plt.imshow(nuclei)
The 'raw' image is an electron microscopy image of cells from the marine worm
*Platynereis dumerilii*. The 'nuclei' image is a segmentation of the nuclei of these same cells.
### The basic features of NumPy arrays
### The Basic Features of NumPy Arrays
Once we have our data in a NumPy array, we want to explore it a bit
e.g. how many dimensions does our array have, and of what size?
This is represented by the `shape` of an array, which denotes the length of the array in each dimension.
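
As a minimal sketch of these attributes, the snippet below builds a small made-up array (the name `example` and its size are assumptions for illustration, not part of the lesson data) and inspects its dimensions and shape.

```python
import numpy as np

# A small made-up array standing in for an image: 3 rows, 4 columns
# (hypothetical data, not one of the lesson's example images)
example = np.zeros((3, 4), dtype=np.uint8)

print(example.ndim)   # number of dimensions
print(example.shape)  # length of the array in each dimension
print(example.size)   # total number of elements
```

The same attributes are available on any `ndarray`, including arrays read from image files.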
...
...
@@ -171,7 +191,7 @@ print(np.std(raw))
> {: .solution }
{: .challenge }
### Indexing arrays
### Indexing Arrays
Now that we have a general idea of what our array contains, we want to start manipulating particular regions
of it.
...
...
@@ -208,7 +228,7 @@ array([[156, 173, 156, 161],
~~~
{: .output }
> ## 2.2. Subsetting a NumPy array
> ## 2.2. Subsetting a NumPy Array
>
> Crop the 'raw' image, by removing a border of 500 pixels on all sides.
>
...
...
@@ -222,7 +242,7 @@ array([[156, 173, 156, 161],
> {: .solution }
{: .challenge }
### Boolean indexing
### Boolean Indexing
Sometimes we want to access certain parts of an array based not on position but on some criterion,
e.g. selecting values that are above some threshold.
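
The idea can be sketched on a tiny made-up array. The names `values`, `threshold` and `mask` and the numbers used are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical data: a small 2D array of pixel-like values
values = np.array([[10, 200, 35],
                   [120, 5, 250]])
threshold = 100  # arbitrary cut-off chosen for this sketch

# Comparing an array to a number gives a boolean array of the same shape
mask = values > threshold

# Indexing with the boolean array keeps only the elements where it is True,
# returned as a 1D array
selected = values[mask]

print(mask)
print(selected)
```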
...
...
@@ -271,7 +291,7 @@ print(raw[criteria])
~~~
{: .output }
> ## 2.3. Masking arrays
> ## 2.3. Masking Arrays
>
> The nuclei image contains a binary segmentation i.e.:
>
...
...
@@ -296,7 +316,7 @@ print(raw[criteria])
> {: .solution }
{: .challenge }
### The power of vectorisation
### The Power of Vectorisation
One of the big advantages of NumPy is that operations are **vectorised**. This means that operations can be
applied to the whole array very quickly, without the need for loops. Many of the operations we've used
...
...
@@ -330,7 +350,7 @@ np.exp(2)
~~~
{: .language-python }
### NumPy data types
### NumPy Data Types
As we touched on briefly earlier, each `ndarray` has a particular data type (`dtype`) assigned to it.
This defines what kind of values (and what range of values) can be placed in the array.
...
...
@@ -360,7 +380,7 @@ bitsize allows you to store a wider range of values, but will take up more space
always a trade-off between the space it takes up in your computer's memory, and the size of the numbers you want to store.
Note that the size of the values stored in the array does not affect the memory it takes up, i.e. an array of small values but with a large bitsize will still take up a lot of memory.
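
A quick way to check this is to compare the `nbytes` attribute of two arrays that hold the same small values with different bitsizes. This is a minimal sketch; the array names and values are made up for illustration.

```python
import numpy as np

# The same four small values, stored with two different bitsizes
small_values = [1, 2, 3, 4]
as_uint8 = np.array(small_values, dtype=np.uint8)   # 8 bits (1 byte) per element
as_int64 = np.array(small_values, dtype=np.int64)   # 64 bits (8 bytes) per element

print(as_uint8.nbytes)  # total bytes used by the array's elements
print(as_int64.nbytes)
```

Both arrays hold identical values, yet the 64-bit version uses eight times as much memory for its elements.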
> ## 2.4. Working with data types
> ## 2.4. Working with Data Types
>
> 1. Increase the brightness of the image by 100
> 2. Why does the result look so bizarre? What is going wrong here?
...
...
@@ -419,7 +439,7 @@ to create our first `DataFrame` object.
The code examples assume that you save these files in a folder called `data`.)
~~~
import pandas as pd # this is how pandas is traditionally installed
import pandas as pd # this is how pandas is traditionally imported
covid_cases = pd.read_csv("{{page.cases_data}}")
~~~
{: .language-python }
...
...
@@ -470,7 +490,7 @@ From this output, we can already get a feeling for the data we've loaded:
- the data in these columns include **integers** (day-of-month, number of cases, etc), **floating point numbers** (population of country in 2019), and **non-numeric data** (dates, continent and country names, etc)
- the rows seem to be **indexed numerically** starting from 0
- the dataframe includes data from **at least two countries** (Afghanistan and Zimbabwe) **and continents** (Asia and Africa).
- based on the appearance of Afghanistan at the top of the dataframe, and Zimbabwe at the bottom, we may assume that counries are ordered alphabetically though we aren't yet able to understand all the details of that ordering
- based on the appearance of Afghanistan at the top of the dataframe, and Zimbabwe at the bottom, we may assume that countries are ordered alphabetically though we aren't yet able to understand all the details of that ordering
> ## Jupyter 🧡 pandas
>
...
...
@@ -492,8 +512,7 @@ From this output, we can already get a feeling for the data we've loaded:
### Working with Dataframes
All those `...` in the output from `print` above indicate that some lines
(in this case 25507!) were skipped to display
a truncated view of the dataframe.
were skipped to display a truncated view of the dataframe.
For such a large dataset, it's unhelpful to view the entire thing at once.
For convenience, `DataFrame` objects are equipped with
`head` and `tail` methods that allow us to view only the first or last