{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Python Programming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Plotting Data with Bokeh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Opening Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we have been doing so far has required you to type the data into the programs by hand, which is a bit cruel. For this worksheet, we will be using a larger dataset (still tiny by many standards) and you can download a file containing the data from [GitHub](https://github.com/tobyhodges/ITPP/blob/v2/speciesDistribution.txt). Just right-click 'Raw' at the top of the file content and download/save the linked file into the same directory as you are keeping the Python scripts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, this requires that we know how to get data out of the file and into our Python program and that is what we are going to do in this worksheet. Specifically we are talking about reading data out of text files. Binary files face their own challenges, and I am not going to get into that in this course since handling them is very dependent on the implementation of the binary file. In any case, for a number of significant classes of binary files, such as images, BAM files or NetCDF formatted data, there are already Python modules to enable you to access the data in a simple way. But in any case, we will look at text files for now and firstly we need to know how to open them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have downloaded the file, you should make sure that it is saved into the same folder where you are going to save the python programs that you will use to analyse it. We will start simple, just by opening the file at the Python shell prompt." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "f = open('speciesDistribution.txt', 'r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file is now open, and `f` is a variable referring to a _file_ data type. Obviously, the file argument for `open` is a string containing the filename, but the `'r'` probably needs to be explained. This argument is called the file mode, and `'r'` means that you only want to read data. If you specify `'w'`, it means that you want to write data into the file, which we will talk about later. One very important point is that when you open a file that already exists for writing, the contents of the file are cleared, and can’t be recovered. If you instead want to append data to an existing file you should specify `'a'` as the mode. If you specify `'r+'` then you can read and write to the file. These are the same regardless of the operating system that you are working on, but Windows has a few specific ones of it’s own, which you shouldn’t use if you can avoid them. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you might expect by now, file objects have their own methods and you can use some of these to read data from the file. The easiest way of doing this is to use `.readlines()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "lines = f.readlines()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variable lines now refers to a list of strings containing each of the lines in the file. Try looking at one or two of them. 
If you didn’t look at the contents of the file before you opened it with your program, have a look at it now. If you compare `lines[1]` in Python with the second line in the file, you will see some differences. Most obvious is the presence of a `\\n` at the end of each line in the Python list. These are _newline characters_, and we need to remember to remove them when we process the data from the file. Although it is written as two characters, `\\n` is what is called an escape sequence: it stands for just a single character, but one with special meaning to the program and which we cannot normally see in a string. On most of the other lines there is another escape, `\\t`, which is a _tab character_. Again, we need to remember this for use later. Tabs are often used to separate data items on the lines of text files because, amongst other reasons, they are much less likely to occur within the data than spaces." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Getting Data from Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `.readlines()` to create a list containing all of the lines is nice and simple, but has a major drawback. It’s fine when your file is small enough to read all of the lines into memory, but if you are reading a 32 GB SAM file, you are likely to run into problems. Here, you want to read one line at a time and process it. Python files do have a `.readline()` method that will read only one line, but it’s best to just use a `for` loop. Python has an idea of 'iterable' data types which you can put into `for` loops. We have seen two of these so far: the list and the dictionary. For a list you get each element in turn, and for a dictionary you get each key in turn. Strings are also iterable and return each character in turn. The point of mentioning this now is that files are also iterable, and Python passes you exactly what you want: one line at a time. So we can now start to write a program to process this data file. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Note__ This will be the largest program you have written so far. What I do when I am embarking on writing a large program is to start with just the basic structure and make sure that works, then add to the program step by step, running it regularly to make sure it is doing what I expect before it gets too complicated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To begin, in an editor window:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "datafile = open('speciesDistribution.txt', 'r')\n", "for line in datafile:\n", "    print line" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so far so good: the program is basically printing the whole file out to the Python Shell window. However, I forgot about the newline characters at the end of the lines. You have probably noticed that the `print` statement automatically adds a newline to the end of everything it prints, so now we are getting two after each line, which is why the output is double-spaced. So the first thing to do is to fix that, by removing the newline characters from the lines as we read them in. Strings have a `.strip()` method which removes any newlines, spaces or tabs (these characters are collectively called 'whitespace') at the start and end of each line. 
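You can see the effect at the shell prompt; here is a quick illustration using a made-up line of the kind found in the file:

```Python
s = 'D\\t102\\n'   # a fabricated data line: taxon ID, tab, count, trailing newline
s.strip()        # returns 'D\\t102' -- the newline is gone, the tab in the middle stays
```
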
So add the line" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "line = line.strip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "to the loop before the print statement (at the correct level of indentation) and try the program again. Now the output should look single spaced." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "datafile = open('speciesDistribution.txt', 'r')\n", "for line in datafile:\n", " line = line.strip()\n", " print line" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Processing the File" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we look again at the file, we can see that it consists of two types of data. Some lines contain the names of sampling sites and some contain a letter and a number. The letters are taxon designators and the numbers represent abundance of that taxon at that particular site (in this case, as measured by high-throughput DNA sequencing of 18S rRNA). We need to process the two line types differently and store the information in a suitable data structure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a moment to think about how you think we might go about doing that, and what the best data structure type to use might be for storing the taxon codes and counts for each site. Don’t worry if you find this a little confusing and/or daunting: we are going to work through it one step at a time, starting by identifying each site described in the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lines with the site names in them all start with the substring `Site:`, so they are easy to recognise. We can use the string’s `.startswith()` method in an if statement to identify these lines so that we can process them separately. Try using this method at the Python command line so you understand how it works before putting it into the program." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _Exercise 5.1_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Change the program to only print out the lines that start with `Site:`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once that works, remove the `Site:` substring (and the space that follows it) from the string and just print the actual site name. Make sure that you store the name in a variable at this point as well - we will need it later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Starting to Build the Data Structure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have isolated the site names we can think some more about what kind of data structures we will use to store the data we read from the file. Remember what you learned in the previous worksheet, about how important it is to choose an appropriate data structure. In this case, we have some named sites and then some data corresponding to those sites. That to me sounds like a dictionary. The data we have for each site consists of several lines, which each contain a taxon code (the letter) and a count for that taxon. Again, this sounds like a dictionary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we need a dictionary keyed by each site name, for which the associated value is another dictionary, keyed by the taxon IDs with values that are the counts for that site. So we need to create a dictionary of dictionaries. 
As with the whole program, it’s probably best to start simple." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to create the top-level dictionary before we can populate it with the data from the file. We do this by defining an empty dictionary. You can do this by putting the line" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sites = {}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "just before the start of the loop that reads the file. This is often referred to as “initialising” a data structure, and is a strategy that you will use a lot when working with data read into Python from other sources. Now every time you find the name of a new site in the file, you need to create the entry in this dictionary for that site name. Again, the value associated with this site name needs to be a new, empty dictionary. The example below shows how you can extract the site name from a line and create a new dictionary for it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "datafile = open('speciesDistribution.txt', 'r')\n", "for line in datafile:\n", "    line = line.strip()\n", "    if line.startswith('Site: '): # you should have come up with something similar to\n", "        siteName = line[6:]       # this in your solution to exercise 5.1 ...\n", "        sites[siteName] = {}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _Exercise 5.2_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Change your program to create the empty dictionaries as above, then right at the end, outside the loop, get it to print out the keys of the `sites` dictionary. These should be all of the site names." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Splitting Lines and Converting Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we are creating a dictionary for each site, we just need to parse the taxon/count lines from the file and put them into the appropriate dictionary for their site. As is common, on these lines we have two items of data. (We know too that once we see one of these lines we must also have the site name, which we have kept in a variable since it was extracted from the `Site:` line.) We can split the line as we did before to get the separate fields. In this case there will be two fields and they are returned as a list, but we can unpack them directly into individual variables in the assignment statement if we want to. So, after inserting an `else:` statement to go with the `if` statement that contains the `.startswith()` test to find the site names, you could type (again with the appropriate level of indentation): " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`taxonID, count = line.split()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just a couple of words of caution. Firstly, this will split the string on all whitespace characters. This is fine in our case, but if any of your data were to contain spaces (for example if the single-letter taxon names were classic binomial species names like _Homo sapiens_ instead), they would be split too. You can limit the splitting to just tabs with:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`taxonID, count = line.split('\\t')`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That’s solved the first problem. The second issue here is the data types. 
Type the following at the Python shell prompt:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "line = 'A\\t29304'\n", "taxonID, count = line.split('\\t')\n", "count" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "count = count + 99" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "count = 29304\n", "count" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "count = count + 99" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first time that Python prints the value of count, it has quotation marks around it, and you get an error when you try to add 99 to it. The second time it doesn’t have quotation marks and you don’t receive an error when adding 99. This is because the first time, the value of count is not a number but a string representing the number. Perl programmers don’t have to worry about this kind of thing, because Perl will automatically convert things for you when it thinks it needs to. With Python we have to be a bit more careful and convert the data ourselves. This is done with" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "count = int(count)\n", "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "to convert to an integer and, if needed, you could convert it back again with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "count = str(count)\n", "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now when you add the lines to your program, you have variables containing the site name, the taxonID and the count (which you can now make sure is converted to a proper integer). You can put these into the dictionary of dictionaries like this: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sites[siteName][taxonID] = count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this statement, `sites[siteName]` refers to the dictionary we created for that site, so we can just append another subscript onto it to get a reference to the data item for this taxon in that site dictionary. Hopefully, that makes some sense. Take a look back at the discussion of nested dictionaries in Worksheet 3 if you need to. Now, finally, all of the data from the file is where we want it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _Exercise 5.3_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make the changes and make sure your program runs without errors. We will also need another change, to keep track of the names/IDs of taxa as we encounter them. At the top of the program, create a new empty list of taxon IDs e.g.,:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "taxa = []" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, when you add a count to the dictionary of dictionaries, check if the taxon ID is in this new list and add it if not (just like you did when merging the shopping lists in Worksheet 2). We will then have a non-redundant list of taxon names to play with in a minute." 
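, "\n\n", "In case it helps, one possible shape for that check is sketched below (just a sketch: slot it in at the point where you add each count, and note that the variable names match those used earlier):\n", "\n", "```Python\n", "if taxonID not in taxa:     # only record each taxon ID the first time we meet it\n", "    taxa.append(taxonID)\n", "```"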
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Filling in the Blanks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, there is a problem with this data. Some of the taxa were not detected at every one of the sampled sites, so the data for these sites do not include counts associated with those taxa. This means that if we were, say, to plot the data in bar charts, some would have fewer bars than others or the bars would be in different positions, rather than just having a gap (or zero-height bar) where the taxon wasn’t found. What you need to do to avoid this is create new entries with counts of zero for the missing taxa at each site. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _Exercise 5.4_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Put the zero values in the data structure. To do this you will need to loop through the sites, and for each site, loop through the IDs in the full, non-redundant taxon list and if a taxon ID is not in the keys of the dictionary for the site, add it with an associated count of zero. Then you will need to check your program is working correctly. A good way to do that is described in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Formatting Data Structures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you are building up data structures like this, they can get very complex and it’s difficult to keep track and be sure that you are putting everything in the right place. Fortunately, there is a Python module (part of the standard library), which lets you print out the data in a comprehensible way. Of course, you could just print the entire data structure in one statement and this works, but it can be hard to read - there is no formatting at all - and it often doesn’t really help. The `pprint` module formats the data in a hierarchical way, making it easier to understand. At the top of your program, you need to import the `pprint` module with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pprint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You then create a formatter that will do the work for you with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pp = pprint.PrettyPrinter(indent=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, when you want to check a data structure, you can just do the following and get a nice readable printout of your data structure:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "variable = studentNumbers # this is the dictionary from Worksheet 2\n", "pp.pprint(variable)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare this output to the way that the same dictionary is displayed by the default `print` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print studentNumbers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I hope you'll agree that the `pprint` version is much easier to interpret by eye." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _Exercise 5.5_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the `pprint` module to dump out the contents of you data structure and check that the data corresponds with what you thought it should look like." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plotting Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a number of options available for plotting data in Python. For many years the standard approach was to use a library called `pyplot` from the module `matplotlib`, which closely resembles the plotting interface of the mathematical programming language _MatLab_. The `matplotlib` module is very powerful and extremely flexible, and it is still widely used, but I find the interface a little hard to work with and it is often confusing for beginners. Over recent years, the range of options for plotting data in Python has expanded, with several new modules being introduced that make it easy to create many standard types of plot." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we will be using the `bokeh` plotting library, which makes it easy to create attractive, interactive plots that render in HTML. To do this, you will need to have the `bokeh` module installed. `bokeh` isn’t included in the standard installation of Python, but it is included as part of the Anaconda distribution that I recommended at the start of the course. If you're not using Anaconda, don't worry: you can easily install `bokeh` using the Python package manager `pip`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A couple of notes before we begin: \n", "\n", "- __If you are using the Anaconda Python distribution you don't need to follow the next few steps!__ \n", "- To install modules, you will need to have administrator priviledges for the computer that you're working on. \n", "- If at any point you are unsure about how to follow these instructions, you should ask for help. \n", "\n", "First of all, you should make sure that you have `pip` installed. To do this, you need to open a terminal/command prompt (_not the Python shell_) and type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```Bash\n", "pip help\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have `pip` installed, you should see some helpful output listing all the available options for running the package manager. If not, you will get an (equally helpful) error message. To install `pip` go [here](https://pip.pypa.io/en/stable/installing/) and follow the instructions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To install `bokeh` and `pandas` with `pip`, you simply have to run the commands" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```Bash\n", "pip install bokeh\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "at the command line and respond to any prompts from the package manager. That’s it. 
(If you are working on a different operating system and/or distribution, ask for help and we will find a way for you to install the packages that you need.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you have the module installed, you can return to the Python prompt and type:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from bokeh.charts import Bar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This statement imports only the function `Bar` from the `charts` library of the module `bokeh`. The `charts` library of `bokeh` provides a collection of functions that make it very easy to plot common types of figure - in this case a bar chart. In addition to this function, we will need a few more - to tell Python where the output should go (to an HTML file, or straight into a notebook), and to tell Python to display the finished figure. These functions are also available through the `bokeh.charts` library, and we can import them with the same syntax as above." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from bokeh.charts import Bar, output_file, output_notebook, show \n", "# This is the same as importing each function on a separate line, but saves us some space" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Different Approaches to Importing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An important note at this point. Here, we have imported several functions from the `charts` library. If we had already defined a variable or function with the same name as one of these, that would be overwritten by this import. This means that you always need to be careful when importing, and check that you don't clobber some important part of your program in the process. If you are feeling reckless, you could, for example, type (__don't!__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```Python\n", "from bokeh.charts import *\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "to import every function from the `charts` library. Don’t do that. _Ever._ One day it will trip you up and it will take weeks to find out exactly what you have done wrong. Not that I’m speaking from experience, or anything..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead, if you want to import everything in the `bokeh.charts` library, you should do so like this:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```Python\n", "from bokeh import charts\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can access the functions from `charts` in the rest of your program by adding the library name as a prefix. For example, to invoke the `show` function, you would type:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```Python\n", "charts.show()\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This way, any functions, objects, or variables that you have already defined will be kept separate from those imported from `bokeh.charts` and won't be overwritten. The separate collections kept under the name of their libraries are known as _namespaces_: here we have imported `charts` in its own namespace, meaning that we have to use the prefix before the name of any of its functions etc. that we want to use. 
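To make the risk of name collisions concrete, here is a small, artificial sketch (nothing like this appears in our actual program):

```Python
def show(message):               # a little helper function of our own, defined first
    print(message)

# from bokeh.charts import show  # DON'T: this would silently rebind the name 'show',
                                 # hiding our helper function above

from bokeh import charts         # the namespaced import avoids the collision:
show('still works')              # ...our helper keeps its name,
charts.show                      # ...and bokeh's version sits safely behind 'charts.'
```
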
This is a much safer approach and can also make your code easier to understand when you or someone else comes back and reads it later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One last thing on importing: `charts` is a pretty short name for a library, but if you are working with one with a much longer name and you want to save yourself some typing, you can use the `as` keyword to specify an abbreviation that Python will use to refer to the library you have imported. For example:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```Python\n", "from someModule import reallyLongLibraryName as rlln\n", "\n", "# Now you can invoke the functions from reallyLongLibraryName with the shortened prefix\n", "myVar = rlln.some_cool_function(argument1, argument2)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use this approach when importing `pandas`: shortening the name of the module to `pd` is very common amongst the Python community and you will often see this in documentation, tutorials, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Simple bokeh Plots" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, we now have `bokeh` loaded and ready to go, and we can get on with trying to plot a series of bar charts of our data. In the end, we will generate some fairly pretty plots, one for each site where data was collected. Each individual bar chart will be created with the `Bar` function that we have just imported. This function has a lot of options that control the appearance of the chart that is produced, but at its simplest it requires only a dictionary object containing the data to be plotted:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = dict(x=list(range(0,21)), y=list(range(0,21)))\n", "myFirstPlot = Bar(data, 'x', 'y')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To view this figure, we need to use the other functions that we imported. I will use `output_notebook()` here, to render the plot directly into the Jupyter Notebook that these course materials are written in." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "output_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are working within the Python/IPython shell, you should use `output_file()`, a function that specifies the name of the file to which the figure is written." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "```Python\n", "output_file('myFirstPlot.html') # call the file whatever you like, but you should use a .html extension\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And `show()` is used to tell Python that you are done creating the plot and want to output it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "show(myFirstPlot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! We've produced our first `bokeh` figure! You might have noticed that the scaling of the axes has been taken care of for you, and there is a toolbar along the top of the plot. Included in these tools are panning and scroll zooming, allowing you to zoom in and out of the plot and navigate around to better interrogate the plotted data. 
This can be really helpful, but can also get a little annoying when we want our view of the data to remain static. That's OK: it's really easy to switch off. Let's do that, at the same time as adding a title and axis names to the plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "myFirstPlot = Bar(data, 'x', 'y', title='My First Plot', tools='', xlabel='number', ylabel='height')\n", "show(myFirstPlot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hopefully, you can understand what we did there. As well as the `data` object, containing the values that we wanted to plot out, we passed several additional arguments to the `Bar` function. These arguments altered the figure that was output. The arguments all have logical names - `title`, `xlabel`, etc. - and there are other arguments that we could specify. If you want to learn about them, you can find all of the `bokeh.charts` documentation [here](http://bokeh.pydata.org/en/latest/docs/user_guide/charts.html#userguide-charts)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `charts` library also contains several other functions for common types of plot, and `bokeh` provides functions to display these plots together. For example, below we will create a scatter plot and display that alongside our bar chart in the output:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "from bokeh.charts import Scatter\n", "from bokeh.io import hplot\n", "data2 = pd.DataFrame({'X': [0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7],\n", "                      'Y': [1,2,1,2,1,2,1,2,3,1,2,3,3,3,2,1],\n", "                      'S': ['A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B']})\n", "print(data2)\n", "mySecondPlot = Scatter(data2, x='X', y='Y', color='S', legend='top_right')\n", "layout = hplot(myFirstPlot, mySecondPlot)\n", "show(layout)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These examples should have given you an understanding of the basics of creating and laying out charts with `bokeh`. Now, you will apply this knowledge to plot out the data that was read and stored from the file earlier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### _Exercise 5.6_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot all of the data that you read from the file earlier into a single file of bar plots, one for each site. This is challenging, but take it bit by bit and you should be able to do it. \n", "* You will need to start by declaring the name of the output HTML file.\n", "* Next, start with a loop over the sites. \n", "* In that loop, you need to \n", "    * gather the data for the bar heights (the counts),\n", "    * then call the `Bar` function, passing in the bar heights, a list of labels for the bars (the taxon IDs) and the site name as a title for the chart. \n", "* Once you have it working like that, try changing the program so that the sites and taxon IDs are in alphabetical order. \n", "* Then try changing the color of the bars, so that each taxon is represented in a different color. Make sure that this coloring is consistent across your plots. \n", "\n", "You might need to refer to the `help()` documentation and the online user guide linked above for the plotting functions to achieve everything listed above, but with a little time and effort, you should be able to produce a nice set of bar charts. If you are not sure where to start, there is a rough skeleton below."
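, "\n\n", "A possible skeleton (just a sketch of one approach, using only functions introduced above; it assumes the `sites` dictionary built earlier, the output filename and the 'taxa'/'counts' column names are arbitrary choices of mine, and the ordering and colouring details are left for you to add):\n", "\n", "```Python\n", "from bokeh.charts import Bar, output_file, show\n", "from bokeh.io import hplot\n", "\n", "output_file('sitePlots.html')            # all of the charts go into this one HTML file\n", "\n", "sitePlots = []                           # collect one bar chart per site\n", "for siteName in sorted(sites):           # sorted() gives the site names alphabetically\n", "    taxonIDs = sorted(sites[siteName])   # ...and likewise the taxon IDs within a site\n", "    counts = [sites[siteName][t] for t in taxonIDs]\n", "    data = dict(taxa=taxonIDs, counts=counts)\n", "    sitePlots.append(Bar(data, 'taxa', 'counts', title=siteName, tools=''))\n", "\n", "show(hplot(*sitePlots))                  # lay the charts out together and display them\n", "```"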
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are really adventurous, plot all of the data on a single set of axes, with the data interleaved and the bars for different sites in different colours." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After you have finished on this exercise, or if you get really stuck and need to look at a solution, take a look at [this notebook](http://nbviewer.ipython.org/github/tobyhodges/ITPP/blob/v2/Exercise5_6WalkthroughBokeh.ipynb), which runs through my way of producing the site plots with a different bar color for each taxon." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Files are opened with the `open()` command, and this returns a file object.\n", "* Methods of the file object, such as `.readline()` or `.readlines()` can be used to get data from the file.\n", "* Files can also be used as iterable data type in `for` statements (and other contexts).\n", "* Python doesn’t convert data types automatically, so you need to use functions like `str()` and `int()` to convert between strings and numbers.\n", "* Python modules provide additional functionality for the language, and can perform many common data analysis tasks. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }