5_PlottingDataWithBokeh.ipynb 38.1 KB
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Python Programming"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Plotting Data with Bokeh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Opening Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What we have been doing so far has required you to type the data into the programs by hand, which is a bit cruel. For this worksheet, we will be using a larger dataset (still tiny by many standards) and you can download a file containing the data from [GitHub](https://github.com/tobyhodges/ITPP/blob/v2/speciesDistribution.txt). Just right-click 'Raw' at the top of the file content and download/save the linked file into the same directory as you are keeping the Python scripts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, this requires that we know how to get data out of the file and into our Python program, and that is what we are going to do in this worksheet. Specifically, we are talking about reading data out of text files. Binary files present their own challenges, and I am not going to get into them in this course, since handling them depends heavily on the format of the binary file. In any case, for a number of significant classes of binary files, such as images, BAM files or NetCDF-formatted data, there are already Python modules that let you access the data in a simple way. We will look at text files for now, and first we need to know how to open them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you have downloaded the file, you should make sure that it is saved into the same folder where you are going to save the Python programs that you will use to analyse it. We will start simple, just by opening the file at the Python shell prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "f = open('speciesDistribution.txt', 'r')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file is now open, and `f` is a variable referring to a _file_ data type. The first argument to `open` is a string containing the filename, but the `'r'` probably needs to be explained. This argument is called the file mode, and `'r'` means that you only want to read data. If you specify `'w'`, it means that you want to write data into the file, which we will talk about later. One very important point is that when you open a file that already exists for writing, the contents of the file are cleared and can’t be recovered. If you instead want to append data to an existing file, you should specify `'a'` as the mode. If you specify `'r+'`, you can both read from and write to the file. These modes are the same regardless of the operating system that you are working on, but Windows has a few specific ones of its own, which you shouldn’t use if you can avoid them."
   ]
  },
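  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration of the file modes described above, the sketch below writes to, appends to, and then overwrites a small scratch file, reading it back in between. The file name `modes_demo.txt` and its contents are made up for this example and are not part of the course data. The `.close()` method, which has not been introduced above, is used here so that each write is saved before the file is reopened.\n",
    "\n",
    "```Python\n",
    "import os\n",
    "import tempfile\n",
    "\n",
    "# Hypothetical scratch file, created in a temporary directory\n",
    "demo_path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')\n",
    "\n",
    "f = open(demo_path, 'w')    # 'w' creates the file (or empties an existing one)\n",
    "f.write('first line\\n')\n",
    "f.close()\n",
    "\n",
    "f = open(demo_path, 'a')    # 'a' adds to the end, keeping what is already there\n",
    "f.write('second line\\n')\n",
    "f.close()\n",
    "\n",
    "f = open(demo_path, 'r')    # 'r' only allows reading\n",
    "after_append = f.read()\n",
    "f.close()\n",
    "\n",
    "f = open(demo_path, 'w')    # opening with 'w' again clears the old contents\n",
    "f.write('replacement\\n')\n",
    "f.close()\n",
    "\n",
    "f = open(demo_path, 'r')\n",
    "after_rewrite = f.read()\n",
    "f.close()\n",
    "\n",
    "print(after_append)\n",
    "print(after_rewrite)\n",
    "```\n",
    "\n",
    "After the second `'w'`, both of the earlier lines are gone, which is why it is worth double-checking the mode before opening a file that you care about."
   ]
  },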
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you might expect by now, file objects have their own methods and you can use some of these to read data from the file.  The easiest way of doing this is to use `.readlines()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "lines = f.readlines()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The variable `lines` now refers to a list of strings containing each of the lines in the file. Try looking at one or two of them. If you didn’t look at the contents of the file before you opened it with your program, have a look at it now. If you compare `lines[1]` in Python with the second line in the file, you will see some differences. Most obvious is the presence of a `\\n` at the end of each line in the Python list. These are _newline characters_ and we need to remember to remove them when we process the data from the file. Although it looks like two characters, it is what is called an escape character: just a single character, but one with special meaning to the program and which we cannot normally see in a string. On most of the other lines there is another escape, `\\t`, which is a _tab character_. Again, we need to remember this for use later. Tabs are often used to separate data items on the lines of text files because, amongst other reasons, they are much less likely to occur within the data than spaces."
   ]
  },
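  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short sketch of how these escape characters behave, using a made-up line in the same style as the data file (the taxon letter and count here are invented):\n",
    "\n",
    "```Python\n",
    "example = 'A\\t29304\\n'    # hypothetical line: letter, tab, number, newline\n",
    "\n",
    "n_chars = len(example)     # the tab and the newline each count as one character\n",
    "print(n_chars)\n",
    "print(repr(example))       # repr() shows the escapes instead of acting on them\n",
    "print(example)             # print acts on them: a tab gap, then a line break\n",
    "```\n",
    "\n",
    "The length is 8, not 10: one letter, one tab, five digits and one newline."
   ]
  },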
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Getting Data from Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using `.readlines()` to create a list containing all of the lines is nice and simple, but has a major drawback. It’s fine when your file is small enough to read all of the lines into memory, but if you are reading a 32 GB SAM file, you are likely to run into problems. Here, you want to read one line at a time, and process it. Python files do have a `.readline()` method that will read only one line, but it’s best to just use a `for` loop. Python has an idea of 'iterable' data types which you can put into `for` loops. We have seen two of these so far: the list and the dictionary. For a list you get each element in turn, and for a dictionary you get each key in turn. Strings are also iterable and return each character in turn. The point of mentioning this now is that files are also iterable, and Python gives you exactly what you want here: one line at a time. So we can now start to write a program to process this data file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Note__ This will be the largest program you have written so far. When I embark on writing a large program, I start with just the basic structure and make sure that works, then add to the program step by step, running it after each change to make sure it is doing what I expect before it gets too complicated."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To begin, type the following in an editor window:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "datafile = open('speciesDistribution.txt', 'r')\n",
    "for line in datafile:\n",
    "    print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, so far so good, the program is basically printing the whole file out to the Python Shell window. However, I forgot about the newline characters at the end of the lines. You have probably noticed that `print` automatically adds a newline to the end of everything it prints, so now we are getting two after each line, which is why the output is double-spaced. So the first thing to do is to fix that, by removing the newline characters from the lines as we read them in. Strings have a `.strip()` method which removes any newlines, spaces or tabs (we call these characters 'whitespace') at the start and end of the string. So add the line"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "line = line.strip()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "to the loop before the print statement (at the correct level of indentation) and try the program again.  Now the output should look single spaced."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "datafile = open('speciesDistribution.txt', 'r')\n",
    "for line in datafile:\n",
    "    line = line.strip()\n",
    "    print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Processing the File"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we look again at the file, we can see that it consists of two types of data. Some lines contain the names of sampling sites and some contain a letter and a number. The letters are taxon designators and the numbers represent abundance of that taxon at that particular site (in this case, as measured by high-throughput DNA sequencing of 18S rRNA). We need to process the two line types differently and store the information in a suitable data structure."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take a moment to think about how you think we might go about doing that, and what the best data structure type to use might be for storing the taxon codes and counts for each site. Don’t worry if you find this a little confusing and/or daunting: we are going to work through it one step at a time, starting by identifying each site described in the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The lines with the site names in them all start with the substring `Site:`, so they are easy to recognise.  We can use the string’s `.startswith()` method in an if statement to identify these lines so that we can process them separately.  Try using this method at the Python command line so you understand how it works before putting it into the program."
   ]
  },
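  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For instance, a minimal sketch at the prompt might look like this. Both strings are invented examples rather than lines taken from the real file:\n",
    "\n",
    "```Python\n",
    "site_line = 'Site: Example Bay'    # hypothetical site line\n",
    "count_line = 'A\\t29304'            # hypothetical taxon/count line\n",
    "\n",
    "is_site = site_line.startswith('Site:')\n",
    "is_count_a_site = count_line.startswith('Site:')\n",
    "\n",
    "print(is_site)            # True\n",
    "print(is_count_a_site)    # False\n",
    "```\n",
    "\n",
    "Because the method returns `True` or `False`, it can be used directly as the test in an `if` statement."
   ]
  },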
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### _Exercise 5.1_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change the program to only print out the lines that start with `Site:`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once that works, remove the `Site:` substring (and the space that follows it) from the string and just print the actual site name. Make sure that you store the name in a variable at this point as well - we will need it later."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Starting to Build the Data Structure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have isolated the site names we can think some more about what kind of data structures we will use to store the data we read from the file. Remember what you learned in the previous worksheet, about how important it is to choose an appropriate data structure. In this case, we have some named sites and then some data corresponding to those sites. That to me sounds like a dictionary. The data we have for each site consists of several lines, which each contain a taxon code (the letter) and a count for that taxon. Again, this sounds like a dictionary."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we need a dictionary keyed by each site name, for which the associated value is another dictionary, keyed by the taxon IDs with values that are the counts for that site. So we need to create a dictionary of dictionaries. As with the whole program, it’s probably best to start simple."
   ]
  },
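  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the target concrete, here is a sketch of what such a dictionary of dictionaries could look like once it has been filled. The site names and counts are invented for illustration and do not come from the data file.\n",
    "\n",
    "```Python\n",
    "# Hypothetical example of the finished structure\n",
    "example_sites = {\n",
    "    'Example Bay': {'A': 120, 'B': 7},\n",
    "    'Example Reef': {'A': 35, 'C': 410},\n",
    "}\n",
    "\n",
    "# The first subscript picks a site, the second picks a taxon at that site\n",
    "b_at_bay = example_sites['Example Bay']['B']\n",
    "print(b_at_bay)\n",
    "\n",
    "# Each inner dictionary can hold a different set of taxa\n",
    "reef_taxa = sorted(example_sites['Example Reef'].keys())\n",
    "print(reef_taxa)\n",
    "```"
   ]
  },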
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to create the top-level dictionary before we can populate it with the data from the file. We do this by defining an empty dictionary. You can do this by putting the line"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sites = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "just before the start of the loop that reads the file. This is often referred to as “initialising” a data structure, and is a strategy that you will use a lot when working with data read into Python from other sources. Now every time you find the name of a new site in the file, you need to create the entry in this dictionary for that site name. Again, the value associated with this site name needs to be a new, empty dictionary. The example below shows how you can extract the site name from a line and create a new dictionary for it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sites = {}  # the empty top-level dictionary, created before the loop\n",
    "datafile = open('speciesDistribution.txt', 'r')\n",
    "for line in datafile:\n",
    "    line = line.strip()\n",
    "    if line.startswith('Site: '):  # you should have come up with something similar to\n",
    "        siteName = line[6:]        # this in your solution to Exercise 5.1 ...\n",
    "        sites[siteName] = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### _Exercise 5.2_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change your program to create the empty dictionaries as above, then right at the end, outside the loop, get it to print out the keys for the sites dictionary. These should be all of the site names."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Splitting Lines and Converting Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we are creating a dictionary for each site, we just need to parse the taxon/count lines from the file and put them into the appropriate dictionary for their site. As is common, on these lines we have two items of data. (We know too that once we see one of these lines we must also have the site name, which we have kept in a variable since it was extracted from the `Site:` line.) We can split the line as we did before to get the separate fields. In this case there will be two fields and they are returned as a list, but we can unpack them directly into individual variables in the assignment statement if we want to. So, after inserting an `else:` statement to go with the `if` statement that contains the `.startswith()` test to find the site names, you could type (again with the appropriate level of indentation): "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`taxonID, count = line.split()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just a couple of words of caution.  Firstly, this will split the string on all whitespace characters.  This is fine in our case, but if any of your data were to contain spaces (for example if the single letter taxon names were classic binomial species names like _Homo sapiens_ instead), they would be split too.  You can limit to just tabs with:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`taxonID, count = line.split('\\t')`"
   ]
  },
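  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference is easy to see with a made-up line in which the taxon is a binomial name. This line is hypothetical, since the real file uses single letters:\n",
    "\n",
    "```Python\n",
    "line = 'Homo sapiens\\t29304'\n",
    "\n",
    "on_whitespace = line.split()      # splits on the space as well as the tab\n",
    "on_tabs = line.split('\\t')        # splits on the tab only\n",
    "\n",
    "print(on_whitespace)              # three items\n",
    "print(on_tabs)                    # two items\n",
    "\n",
    "# Unpacking into two variables only works for the two-item version\n",
    "taxonID, count = on_tabs\n",
    "print(taxonID)\n",
    "```\n",
    "\n",
    "Trying to unpack the three-item list into two variables would raise a `ValueError`, which is a useful early warning that the line did not have the layout you expected."
   ]
  },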
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That’s solved the first problem. The second issue here is the data types. Type the following at the Python shell prompt:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "line = 'A\\t29304'\n",
    "taxonID, count = line.split('\\t')\n",
    "count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "count = count + 99"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "count = 29304\n",
    "count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "count = count + 99"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first time that Python prints the value of count, it has quotation marks around it, and you get an error when you try to add 99 to it. The second time it doesn’t have quotation marks and you don’t receive an error when adding 99. This is because the first time, the value of count is not a number but a string representing the number. Perl programmers don’t have to worry about this kind of thing, because Perl will automatically convert things for you when it thinks it needs to. With Python we have to be a bit more careful and convert the data ourselves.  This is done with"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "count = int(count)\n",
    "count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "to convert to an integer and, if needed, you could convert it back again with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "count = str(count)\n",
    "count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now when you add the lines to your program, you have variables containing the site name, the taxonID and the count (which you can now make sure is converted to a proper integer). You can put these into the dictionary of dictionaries like this: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sites[siteName][taxonID] = count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this statement, `sites[siteName]` refers to the dictionary we created for that site, so we can just append another subscript onto it to get a reference to the data item for this taxon in that site dictionary. Hopefully, that makes some sense. Take a look back at the discussion of nested dictionaries in Worksheet 3 if you need to. Now, finally, all of the data from the file is where we want it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### _Exercise 5.3_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make the changes and make sure your program runs without errors. We will also need another change, to keep track of the names/IDs of taxa as we encounter them. At the top of the program, create a new empty list of taxon IDs, e.g.:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "taxa = []"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, when you add a count to the dictionary of dictionaries, check if the taxon ID is in this new list and add it if not (just like you did when merging the shopping lists in Worksheet 2).  We will then have a non-redundant list of taxon names to play with in a minute."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Filling in the Blanks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unfortunately, there is a problem with this data. Some of the taxa were not detected at every one of the sampled sites, so the data for these sites do not include counts associated with those taxa. This means that if we were, say, to plot the data in bar charts, some would have fewer bars than others or the bars would be in different positions, rather than just having a gap (or zero-height bar) where the taxon wasn’t found. What you need to do to avoid this is create new entries with counts of zero for the missing taxa at each site. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### _Exercise 5.4_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Put the zero values in the data structure. To do this you will need to loop through the sites, and for each site, loop through the IDs in the full, non-redundant taxon list and if a taxon ID is not in the keys of the dictionary for the site, add it with an associated count of zero. Then you will need to check your program is working correctly. A good way to do that is described in the next section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Formatting Data Structures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you are building up data structures like this, they can get very complex and it’s difficult to keep track and be sure that you are putting everything in the right place. Fortunately, there is a Python module (part of the standard library), which lets you print out the data in a comprehensible way. Of course, you could just print the entire data structure in one statement and this works, but it can be hard to read - there is no formatting at all - and it often doesn’t really help.  The `pprint` module formats the data in a hierarchical way, making it easier to understand. At the top of your program, you need to import the `pprint` module with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pprint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You then create a formatter that will do the work for you with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pp = pprint.PrettyPrinter(indent=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, when you want to check a data structure, you can just do the following and get a nice readable printout of your data structure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "variable = studentNumbers # this is the dictionary from Worksheet 2\n",
    "pp.pprint(variable)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compare this output to the way that the same dictionary is displayed by the default `print` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(studentNumbers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I hope you'll agree that the `pprint` version is much easier to interpret by eye."
   ]
  },
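  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you don't have the dictionary from Worksheet 2 to hand, the following self-contained sketch shows the same idea with an invented nested dictionary. The site names and counts are hypothetical, and the small `width` value is an assumption chosen here only to force the output onto several lines.\n",
    "\n",
    "```Python\n",
    "import pprint\n",
    "\n",
    "# Hypothetical nested data, in the same shape as the sites dictionary\n",
    "example_sites = {\n",
    "    'Example Bay': {'A': 120, 'B': 7, 'C': 0},\n",
    "    'Example Reef': {'A': 35, 'B': 0, 'C': 410},\n",
    "}\n",
    "\n",
    "pp = pprint.PrettyPrinter(indent=4, width=40)\n",
    "pp.pprint(example_sites)\n",
    "\n",
    "# .pformat() returns the formatted text as a string instead of printing it\n",
    "formatted = pp.pformat(example_sites)\n",
    "```\n",
    "\n",
    "The `.pformat()` method is handy if you would rather write the formatted structure to a log file than to the screen."
   ]
  },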
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### _Exercise 5.5_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the `pprint` module to dump out the contents of your data structure and check that the data corresponds with what you thought it should look like."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Plotting Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a number of options available for plotting data in Python. For many years the standard approach was to use a library called `pyplot` from the module `matplotlib`, which closely resembles the plotting interface of the mathematical programming language _MATLAB_. The `matplotlib` module is very powerful and extremely flexible, and it is still widely used, but I find the interface a little hard to work with and it is often confusing for beginners. Over recent years, the range of options for plotting data in Python has expanded, with several new modules being introduced that make it easy to create many standard types of plot."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we will be using the `bokeh` plotting library, which makes it easy to create attractive, interactive plots that render in HTML. To do this, you will need to have the `bokeh` module installed. `bokeh` isn’t included in the standard installation of Python, but it is included as part of the Anaconda distribution that I recommended at the start of the course. If you're not using Anaconda, don't worry: you can easily install `bokeh` using the Python package manager `pip`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A couple of notes before we begin:   \n",
    "\n",
    "- __If you are using the Anaconda Python distribution you don't need to follow the next few steps!__ \n",
    "- To install modules, you may need to have administrator privileges for the computer that you're working on. \n",
    "- If at any point you are unsure about how to follow these instructions, you should ask for help.  \n",
    "\n",
    "First of all, you should make sure that you have `pip` installed. To do this, you need to open a terminal/command prompt (_not the Python shell_) and type"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```Bash\n",
    "pip help\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you have `pip` installed, you should see some helpful output listing all the available options for running the package manager. If not, you will get an (equally helpful) error message. To install `pip` go [here](https://pip.pypa.io/en/stable/installing/) and follow the instructions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To install `bokeh` and `pandas` with `pip`, you simply have to run the commands"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```Bash\n",
    "pip install bokeh\n",
    "pip install pandas\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "at the command line and respond to any prompts from the package manager. That’s it. (If you are working on a different operating system and/or distribution, ask for help and we will find a way for you to install the packages that you need.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that you have the module installed you can return to the Python prompt and type:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from bokeh.charts import Bar"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This statement imports only the function `Bar` from the `charts` library of the module `bokeh`. The `charts` library of `bokeh` provides a collection of functions that make it very easy to plot common types of figure - in this case a bar chart. In addition to this function, we will need a few more: one to tell Python which file we want to write the HTML output to (`output_file`), one to display the output inside a notebook instead (`output_notebook`), and another to tell Python to create the figure (`show`). All of these functions are also available through the `bokeh.charts` library, and we can import them with the same syntax as above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from bokeh.charts import Bar, output_file, output_notebook, show\n",
    "# This is the same as importing each function on a separate line, but saves us some space"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Different Approaches to Importing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An important note at this point. Here, we have imported four names from the `charts` library. If we had already defined a variable or function with the same name as one of these, it would be overwritten by this import. This means that you always need to be careful when importing, and check that you don't clobber some important part of your program in the process. If you are feeling reckless, you could, for example, type (__don't!__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```Python\n",
    "from bokeh.charts import *\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "to import every function from the `charts` library. Don’t do that. _Ever._ One day it will trip you up and it will take weeks to find out exactly what you have done wrong. Not that I’m speaking from experience, or anything..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead, if you want to import everything in the `bokeh.charts` library, you should do so like this:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```Python\n",
    "from bokeh import charts\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can access the functions from `charts` in the rest of your program by adding the library name as a prefix. For example, to invoke the `show` function, you would type:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```Python\n",
    "charts.show()\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This way, any functions, objects, or variables that you have already defined will be kept separate from those imported from `bokeh.charts` and won't be overwritten. The separate collections kept under the name of their libraries are known as _namespaces_: here we have imported `charts` in its own namespace, meaning that we have to use the prefix before the name of any of its functions etc that we want to use. This is a much safer approach and can also make your code easier to understand when you or someone else comes back and reads it later."
   ]
  },
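  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see this in action without needing `bokeh`, here is a minimal sketch using the `math` module from Python's standard library (my choice for illustration; the variable `pi` defined at the top is made up for the example). It shows that importing a module under its own namespace leaves your own variable alone, while importing the name directly replaces it:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```Python\n",
    "import math\n",
    "\n",
    "# A made-up variable of our own that happens to share its name with something in math\n",
    "pi = 'my favourite dessert'\n",
    "\n",
    "# Importing the module keeps its contents behind the math. prefix,\n",
    "# so our own variable is untouched\n",
    "print(pi)       # still our string\n",
    "print(math.pi)  # the number from the math module\n",
    "\n",
    "# Importing the name directly replaces our variable without any warning\n",
    "from math import pi\n",
    "print(pi)       # now the number: our string has been overwritten\n",
    "```"
   ]
  },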
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One last thing on importing: `charts` is a pretty short name for a library, but if you are working with one with a much longer name and you want to save yourself some typing, you can use the `as` statement to specify an abbreviation that Python will use to refer to the library you have imported. For example:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```Python\n",
    "from someModule import reallyLongLibraryName as rlln\n",
    "\n",
    "# Now you can invoke the functions from reallyLongLibraryName with the shortened prefix\n",
    "myVar = rlln.some_cool_function(argument1, argument2)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use this approach when importing `pandas`: shortening the name of the module to `pd` is very common amongst the Python community and you will often see this is documentation, tutorials etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Simple bokeh Plots"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok, we now have `bokeh` loaded and ready to go, and we can get on with trying to plot a series of bar charts of our data. In the end, we will generate some fairly pretty plots, one for each site where data was collected. Each individual bar chart will be created with the `Bar` function that we have just imported. This function has a lot of options that control the appearance of the chart that is produced, but at its simplest it requires only a dictionary object containing the data to be plotted:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data = dict(x=list(range(0,21)), y=list(range(0,21)))\n",
    "myFirstPlot = Bar(data, 'x', 'y')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To view this figure, we need to use the other two functions that we imported. I will use `output_notebook()` here, to render the plot directly into the Jupyter Notebook that these course materials are written in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "output_notebook()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are working within the Python/IPython shell, you should use `output_file()`, a function that specifies the name of the file to which the figure is written."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "```Python\n",
    "output_file('myFirstPlot.html') # call the file whatever you like, but you should use a .html extension\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And `show()` is used to tell Python that you are done creating the plot and want to output it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "show(myFirstPlot)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! We've produced our first `bokeh` figure! You might have noticed that the scaling of axes has been taken care of for you, and there is a toolbar along the top of the plot. Included in these tools is panning and scroll zooming, alowing you to zoom in and out of the plot and navigate around to better interrogate plotted data. This can be really helpful, but can also get a little annoying when we want our view of the data to remain static. That's ok: it's really easy to switch off. Let's do that, at the same time as adding a title and axis names to the plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "myFirstPlot = Bar(data, 'x', 'y', title='My First Plot', tools='', xlabel='number', ylabel='height')\n",
    "show(myFirstPlot)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hopefully, you can understand what we did there. As well as the `data` object, containing the values that we wanted to plot out, we passed several additional arguments to the `Bar` function. These arguments alterred the figure that was output. The arguments all have logical names - `title`, `xlabel` etc - and there are other arguments that we could specify. If you want to learn about them, you can find all of the `bokeh.charts` documentation [here](http://bokeh.pydata.org/en/latest/docs/user_guide/charts.html#userguide-charts)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `charts` library also contains several other functions for common types of plot, and `bokeh` provides functions to display these plots together. For example, below we will create a scatter plot and display that alongside our bar chart in the output:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from bokeh.charts import Scatter\n",
    "from bokeh.io import hplot\n",
    "data2 = pd.DataFrame({'X': [0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7],\n",
    "                      'Y': [1,2,1,2,1,2,1,2,3,1,2,3,3,3,2,1],\n",
    "                      'S': ['A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B']})\n",
    "print(data2)\n",
    "mySecondPlot = Scatter(data2, x='X', y='Y', color='S', legend='top_right')\n",
    "layout = hplot(myFirstPlot, mySecondPlot)\n",
    "show(layout)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These examples should have given you an understanding of the basics of creating and laying out charts with `bokeh`. Now, you will apply this knowledge to plot out the data that was read and stored from the file earlier."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### _Exercise 5.6_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot all of the data that you read from the file earlier into a single file of bar plots, one for each site. This is challenging, but take it bit by bit and you should be able to do it.  \n",
    "* You will need to start by declaring the name of the output HTML file.\n",
    "* Next, start with a loop over the sites. \n",
    "* In that loop, you need to \n",
    "  * gather the data for the bar heights (the counts),\n",
    "  * then call the `Bar` function, passing in the row heights, a list of labels for the bars (the taxon IDs) and the site name as a title for the chart. \n",
    "* Once you have it working like that, try changing the program so that the sites and taxon IDs are in alphabetical order. \n",
    "* Then try changing the color of the bars, so that each taxon is represented in a different color. Make sure that this coloring is consistent across your plots.  \n",
    "\n",
    "You might need to refer to the `help()` documentation and the online user guide linked above for the plotting functions to achieve everything listed above, but with a little time and effort, you should be able to get your plots to look as below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are really adventurous, plot all of the data on a single set of axes, with the data interleaved and the bars for different sites in different colours."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After you have finished on this exercise, or if you get really stuck and need to look at a solution, take a look at [this notebook](http://nbviewer.ipython.org/github/tobyhodges/ITPP/blob/v2/Exercise5_6WalkthroughBokeh.ipynb), which runs through my way of producing the site plots with a different bar color for each taxon."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Files are opened with the `open()` command, and this returns a file object.\n",
    "* Methods of the file object, such as `.readline()` or `.readlines()` can be used to get data from the file.\n",
    "* Files can also be used as iterable data type in `for` statements (and other contexts).\n",
    "* Python doesn’t convert data types automatically, so you need to use functions like `str()` and `int()` to convert between strings and numbers.\n",
    "* Python modules provide additional functionality for the language, and can perform many common data analysis tasks. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}