diff --git a/Tutorial_HTM_2016.Rmd b/Tutorial_HTM_2016.Rmd index 7cfcea7c6ec1eea75912f4fa3e6ca7a4b3971727..8f92f21294ffcb1addbff42e965b306460789cb6 100755 --- a/Tutorial_HTM_2016.Rmd +++ b/Tutorial_HTM_2016.Rmd @@ -1,6 +1,6 @@ --- title: "Visual Exploration of High-Throughput-Microscopy Data" -author: "Bernd Klaus, Andrzej Oles, Mike Smith" +author: "Bernd Klaus, Andrzej Oleś, Mike Smith" date: "`r doc_date()`" bibliography: HTM_2016.bib output: @@ -59,7 +59,6 @@ each single cell. These classification results have been obtained using a machin learning algorithm based on the original image features. The data produced is similar to the one in @Neumann_2010: Each cell is classified into a mitotic phenotype class. -<!--  --> # Annotation import @@ -77,7 +76,7 @@ head(plate_map) # Importing the raw data We will now import the raw data. This data is stored in a variant of the [HDF5 format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) called -["CellH5"](http://www.cellh5.org/), +"[CellH5](http://www.cellh5.org/)", which defines a more restricted sub-format designed specifically to store data from high content screens. More information can be found in the paper by @Sommer_2013. @@ -106,7 +105,7 @@ the screen plate in the columns and the counts for the respective classes in the rows. This is a typical example of a "wide" data table, where the variables -contained in the data set spread across multiple columns. +contained in the data set spread across multiple columns (here we only show the first six ones). ```{r import_data_table, dependson="readCellH5"} raw_data <- sapply(c5_pos, @@ -115,7 +114,7 @@ raw_data <- sapply(c5_pos, table(predictions) }) -head(raw_data) +raw_data[, 1:6] ``` @@ -496,7 +495,7 @@ assembled in the R object `DNase`, which conveniently comes with base R. `conc`, the protein concentration that was used; and `density`, the measured optical density. 
-```{r figredobasicplottingwithggplot, fig.width = 3.5, fig.height = 5} +```{r figredobasicplottingwithggplot, fig.width = 6, fig.height = 9} ggplot(DNase, aes(x = conc, y = density, color = Run)) + geom_point() ``` @@ -507,7 +506,7 @@ Then we told `ggplot` via the aesthetics `aes` argument which variables we want on the $x$- and $y$-axes, respectively and mapped the run number to the color aesthetic. Finally, we stated that we want the plot to use points, by adding the result -of calling the function `geom\_point`. +of calling the function `geom_point`. ## Principal component analysis (PCA) to for data visualization diff --git a/Tutorial_HTM_2016.html b/Tutorial_HTM_2016.html index 3cd31006b938729dfcca467bd69ce473d2c1a7b6..43c8a0dc8fd153802876de8e909c5261e5c8a4b7 100644 --- a/Tutorial_HTM_2016.html +++ b/Tutorial_HTM_2016.html @@ -8,7 +8,7 @@ <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="generator" content="pandoc" /> -<meta name="author" content="Bernd Klaus, Andrzej Oles, Mike Smith" /> +<meta name="author" content="Bernd Klaus, Andrzej Oleś, Mike Smith" /> <meta name="date" content="2016-10-22" /> @@ -62,7 +62,7 @@ document.addEventListener("DOMContentLoaded", function() { <div id="header"> <h1 class="title">Visual Exploration of High-Throughput-Microscopy Data</h1> -<h4 class="author"><em>Bernd Klaus, Andrzej Oles, Mike Smith</em></h4> +<h4 class="author"><em>Bernd Klaus, Andrzej Oleś, Mike Smith</em></h4> <h4 class="date"><em>22 October 2016</em></h4> </div> @@ -111,7 +111,6 @@ rmarkdown::render('Tutorial_HTM_2016.Rmd', BiocStyle::pdf_document()) <div id="about-the-tutorial" class="section level1"> <h1><span class="header-section-number">2</span> About the tutorial</h1> <p>In this tutorial, we will import a single plate from a high content screen performed on 96 well plates. The input data that we are going to use are class labels for each single cell. 
These classification results have been obtained using a machine learning algorithm based on the original image features. The data produced is similar to the one in <span class="citation">Neumann et al. (2010)</span>: Each cell is classified into a mitotic phenotype class.</p> -<!--  --> </div> <div id="annotation-import" class="section level1"> <h1><span class="header-section-number">3</span> Annotation import</h1> @@ -128,7 +127,7 @@ rmarkdown::render('Tutorial_HTM_2016.Rmd', BiocStyle::pdf_document()) </div> <div id="importing-the-raw-data" class="section level1"> <h1><span class="header-section-number">4</span> Importing the raw data</h1> -<p>We will now import the raw data. This data is stored in a variant of the <a href="https://en.wikipedia.org/wiki/Hierarchical_Data_Format">HDF5 format</a> called <a href="http://www.cellh5.org/">“CellH5”</a>, which defines a more restricted sub-format designed specifically to store data from high content screens. More information can be found in the paper by <span class="citation">Sommer et al. (2013)</span>.</p> +<p>We will now import the raw data. This data is stored in a variant of the <a href="https://en.wikipedia.org/wiki/Hierarchical_Data_Format">HDF5 format</a> called “<a href="http://www.cellh5.org/">CellH5</a>”, which defines a more restricted sub-format designed specifically to store data from high content screens. More information can be found in the paper by <span class="citation">Sommer et al. (2013)</span>.</p> <p>In the code below, we use the <a href="https://github.com/CellH5/cellh5-R">cellh5</a> R–package to import the data. The file <code>_all_positions.ch5</code> contains links to the other <code>ch5</code> files that contain the full data of the plate. 
We are only interested in the predictions produced by the machine learning algorithm, so we only extract this part of the file.</p> <pre class="sourceCode r"><code class="sourceCode r">path <-<span class="st"> </span><span class="kw">file.path</span>(switch(.Platform$OS.type, <span class="dt">unix =</span> <span class="st">"/g"</span>, <span class="dt">windows =</span> <span class="st">"Z:"</span>), <span class="st">"embo2016/htm2016/P12_Visual_Exploration/hdf5/_all_positions.ch5"</span>) @@ -139,98 +138,25 @@ predictions <-<span class="st"> </span><span class="kw">C5Predictions</span>( <div id="tabulate-the-raw-data" class="section level1"> <h1><span class="header-section-number">5</span> Tabulate the raw data</h1> <p>We now tabulate the raw data: we compute how many cells are assigned to each class for each well. The result is a data matrix, which contains the wells of the screen plate in the columns and the counts for the respective classes in the rows.</p> -<p>This is a typical example of a “wide” data table, where the variables contained in the data set spread across multiple columns.</p> +<p>This is a typical example of a “wide” data table, where the variables contained in the data set spread across multiple columns (here we only show the first six ones).</p> <pre class="sourceCode r"><code class="sourceCode r">raw_data <-<span class="st"> </span><span class="kw">sapply</span>(c5_pos, function(pos){ predictions <-<span class="st"> </span><span class="kw">C5Predictions</span>(c5f, pos, <span class="dt">mask =</span> <span class="st">"primary__primary"</span>, <span class="dt">as =</span> <span class="st">"name"</span>) <span class="kw">table</span>(predictions) }) -<span class="kw">head</span>(raw_data)</code></pre> -<pre><code> WA01_P1 WA02_P1 WA03_P1 WA04_P1 WA05_P1 WA06_P1 WA07_P1 WA08_P1 - ana 332 914 657 551 237 483 482 198 - apo 5435 3757 2413 1951 949 1945 1584 4569 - artefact 1059 1288 333 357 645 233 159 227 - binuclear 2604 1534 680 377 3514 224 464 53 
- inter 21865 45615 29005 24143 12293 17698 22938 3404 - map 2910 2142 921 726 445 513 633 1344 - WA09_P1 WA10_P1 WA11_P1 WA12_P1 WB01_P1 WB02_P1 WB03_P1 WB04_P1 - ana 462 235 231 371 876 82 794 566 - apo 2510 2464 4912 3130 2368 7196 3504 1698 - artefact 587 266 651 1019 463 149 902 396 - binuclear 321 227 224 119 624 126 907 584 - inter 17779 12945 6258 12330 41367 5824 32848 25141 - map 752 460 712 512 1233 1624 1442 825 - WB05_P1 WB06_P1 WB07_P1 WB08_P1 WB09_P1 WB10_P1 WB11_P1 WB12_P1 - ana 549 526 574 412 419 283 232 395 - apo 1188 1066 1460 1352 1429 4461 2347 3621 - artefact 357 191 576 541 192 352 212 706 - binuclear 370 380 322 143 1331 210 220 272 - inter 26142 27145 30655 19599 19791 3907 14211 11680 - map 657 506 810 442 551 495 539 818 - WC01_P1 WC02_P1 WC03_P1 WC04_P1 WC05_P1 WC06_P1 WC07_P1 WC08_P1 - ana 597 691 422 465 651 588 314 522 - apo 1571 1893 1232 1925 5161 1213 911 1808 - artefact 572 763 393 374 277 418 233 200 - binuclear 247 254 230 342 492 264 207 215 - inter 33542 29535 21891 21153 20998 24261 15189 25146 - map 696 730 413 580 1644 460 375 460 - WC09_P1 WC10_P1 WC11_P1 WC12_P1 WD01_P1 WD02_P1 WD03_P1 WD04_P1 - ana 307 305 347 500 582 660 289 611 - apo 1549 1693 1187 6346 971 1844 3825 1248 - artefact 261 405 335 716 354 521 306 602 - binuclear 240 119 188 206 868 226 21 244 - inter 11454 12725 13813 6049 28882 28703 3381 28199 - map 566 358 409 1206 325 521 1200 404 - WD05_P1 WD06_P1 WD07_P1 WD08_P1 WD09_P1 WD10_P1 WD11_P1 WD12_P1 - ana 380 266 499 325 368 363 412 317 - apo 1237 1980 1362 6111 1387 1187 2293 2205 - artefact 203 362 295 217 185 164 165 340 - binuclear 125 1784 294 18 190 207 812 207 - inter 12969 16765 23615 1060 16208 17472 13893 10730 - map 330 1767 382 1116 412 393 467 383 - WE01_P1 WE02_P1 WE03_P1 WE04_P1 WE05_P1 WE06_P1 WE07_P1 WE08_P1 - ana 745 680 507 154 514 411 217 402 - apo 1183 2117 805 1724 1045 973 1184 1290 - artefact 320 451 287 386 325 331 413 211 - binuclear 406 340 208 3129 525 113 4043 157 - inter 
34054 30180 28076 7680 22809 22116 12057 9260 - map 594 762 354 1701 457 362 511 622 - WE09_P1 WE10_P1 WE11_P1 WE12_P1 WF01_P1 WF02_P1 WF03_P1 WF04_P1 - ana 310 490 288 419 765 428 481 480 - apo 1220 2642 1766 2462 1846 1039 917 1447 - artefact 298 382 217 819 555 237 154 152 - binuclear 233 690 252 283 432 377 1014 235 - inter 13417 22400 14652 15133 33687 20029 20915 22107 - map 466 648 575 456 643 285 478 471 - WF05_P1 WF06_P1 WF07_P1 WF08_P1 WF09_P1 WF10_P1 WF11_P1 WF12_P1 - ana 535 419 885 598 605 470 384 213 - apo 1540 1187 4333 1574 1746 2634 3053 2160 - artefact 362 193 2198 235 312 617 448 538 - binuclear 285 168 3292 256 222 329 739 206 - inter 28177 19900 31721 25656 26475 18373 13451 10599 - map 819 401 3434 612 811 606 1260 241 - WG01_P1 WG02_P1 WG03_P1 WG04_P1 WG05_P1 WG06_P1 WG07_P1 WG08_P1 - ana 595 545 281 325 402 175 520 439 - apo 1641 857 804 1020 944 4421 1741 1015 - artefact 356 383 217 125 174 301 372 323 - binuclear 359 265 150 252 200 136 259 270 - inter 27414 27078 13163 16167 18310 8163 24188 20097 - map 485 353 259 352 495 1590 564 379 - WG09_P1 WG10_P1 WG11_P1 WG12_P1 WH01_P1 WH02_P1 WH03_P1 WH04_P1 - ana 111 286 300 547 467 501 376 161 - apo 4381 2936 2617 4427 3493 1415 659 4606 - artefact 451 275 378 593 430 379 245 153 - binuclear 72 203 316 273 265 215 143 196 - inter 1950 8388 11799 9959 21343 21073 17503 3143 - map 1770 353 455 607 465 444 188 2457 - WH05_P1 WH06_P1 WH07_P1 WH08_P1 WH09_P1 WH10_P1 WH11_P1 WH12_P1 - ana 464 470 378 541 397 407 366 457 - apo 1470 1073 854 752 2079 3193 2891 5644 - artefact 242 157 298 485 561 529 500 884 - binuclear 185 192 317 365 234 105 319 697 - inter 21745 18543 16349 27280 18633 11491 12876 8544 - map 511 502 410 435 573 397 600 912</code></pre> +raw_data[, <span class="dv">1</span>:<span class="dv">6</span>]</code></pre> +<pre><code> WA01_P1 WA02_P1 WA03_P1 WA04_P1 WA05_P1 WA06_P1 + ana 332 914 657 551 237 483 + apo 5435 3757 2413 1951 949 1945 + artefact 1059 1288 333 357 645 233 + binuclear 
2604 1534 680 377 3514 224 + inter 21865 45615 29005 24143 12293 17698 + map 2910 2142 921 726 445 513 + meta 238 495 334 363 166 280 + polylobed 2726 712 368 226 553 46 + prometa 2489 1319 989 656 293 756 + telo 1480 4045 1860 1088 232 808</code></pre> </div> <div id="reshaping-the-screen-data-and-joining-the-plate-annotation" class="section level1"> <h1><span class="header-section-number">6</span> Reshaping the screen data and joining the plate annotation</h1> @@ -452,8 +378,8 @@ heatmap</code></pre> <p>Bar charts however, often represent the mean or a count statistic, while histograms are bar charts where the bars represent the binned count or density statistics and so on.</p> <p>Let’s start by creating a simple plot: Data from an enzyme-linked immunosorbent assay (ELISA) assay. The assay was used to quantify the activity of the enzyme deoxyribonuclease (DNase), which degrades DNA. The data are assembled in the R object <code>DNase</code>, which conveniently comes with base R. <code>DNase</code> is a dataframe whose columns are <code>Run</code>, the assay run; <code>conc</code>, the protein concentration that was used; and <code>density</code>, the measured optical density.</p> <pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(DNase, <span class="kw">aes</span>(<span class="dt">x =</span> conc, <span class="dt">y =</span> density, <span class="dt">color =</span> Run)) +<span class="st"> </span><span class="kw">geom_point</span>() </code></pre> -<p><img src="" /></p> -<p>We just wrote our first <code>sentence</code> using the grammar of graphics. Let us deconstruct this sentence. First, we specified the dataframe that contains the data, <code>DNase</code>. Then we told <code>ggplot</code> via the aesthetics <code>aes</code> argument which variables we want on the <span class="math">\(x\)</span>- and <span class="math">\(y\)</span>-axes, respectively and mapped the run number to the color aesthetic. 
Finally, we stated that we want the plot to use points, by adding the result of calling the function <code>geom\_point</code>.</p> +<p><img src="" /></p> +<p>We just wrote our first <code>sentence</code> using the grammar of graphics. Let us deconstruct this sentence. First, we specified the dataframe that contains the data, <code>DNase</code>. Then we told <code>ggplot</code> via the aesthetics <code>aes</code> argument which variables we want on the <span class="math">\(x\)</span>- and <span class="math">\(y\)</span>-axes, respectively and mapped the run number to the color aesthetic. Finally, we stated that we want the plot to use points, by adding the result of calling the function <code>geom_point</code>.</p> </div> <div id="principal-component-analysis-pca-to-for-data-visualization" class="section level2"> <h2><span class="header-section-number">9.4</span> Principal component analysis (PCA) to for data visualization</h2> @@ -496,13 +422,16 @@ V=2\times \mbox{ Beets }+ 1\times \mbox{Carrots } +\frac{1}{2} \mbox{ Gala}+ \fr [17] BiocStyle_2.0.3 loaded via a namespace (and not attached): - [1] Rcpp_0.12.5 formatR_1.4 RColorBrewer_1.1-2 plyr_1.8.4 - [5] tools_3.3.1 zlibbioc_1.18.0 digest_0.6.10 evaluate_0.10 - [9] gtable_0.2.0 DBI_0.5-1 ggrepel_0.6.3 yaml_2.1.13 - [13] parallel_3.3.1 grid_3.3.1 R6_2.2.0 foreign_0.8-67 - [17] magrittr_1.5 scales_0.4.0 htmltools_0.3.5 assertthat_0.1 - [21] mnormt_1.5-5 colorspace_1.2-7 labeling_0.3 stringi_1.1.2 - [25] lazyeval_0.2.0 munsell_0.4.3</code></pre> + [1] Rcpp_0.12.5 RColorBrewer_1.1-2 BiocInstaller_1.22.3 + [4] formatR_1.4 plyr_1.8.4 tools_3.3.1 + [7] zlibbioc_1.18.0 digest_0.6.10 evaluate_0.9 + [10] gtable_0.2.0 DBI_0.5-1 ggrepel_0.6.3 + [13] yaml_2.1.13 parallel_3.3.1 grid_3.3.1 + [16] R6_2.2.0 foreign_0.8-67 magrittr_1.5 + [19] scales_0.4.0 codetools_0.2-15 htmltools_0.3.5 + [22] assertthat_0.1 mnormt_1.5-5 colorspace_1.2-7 + [25] labeling_0.3 stringi_1.1.1 lazyeval_0.2.0 + [28] munsell_0.4.3</code></pre> <div 
class="references"> <h1>References</h1> <p>Neumann, Beate, Thomas Walter, Jean-Karim H<span>’e</span>ric <span>’e</span>, Jutta Bulkescher, Holger Erfle, Christian Conrad, Phill Rogers, et al. 2010. “Phenotypic Profiling of the Human Genome by Time-Lapse Microscopy Reveals Cell Division Genes.” <em>Nature</em> 464 (7289). Springer Nature: 721–27. doi:<a href="http://dx.doi.org/10.1038/nature08869">10.1038/nature08869</a>.</p>