diff --git a/Tutorial_HTM_2016.Rmd b/Tutorial_HTM_2016.Rmd index fb9874730c59f503f7de806d705ea07fe8b3b8ad..2c77fcd27dd969df08cc5cb43b241485f08bbda3 100755 --- a/Tutorial_HTM_2016.Rmd +++ b/Tutorial_HTM_2016.Rmd @@ -40,7 +40,7 @@ cache = TRUE) devtools::install_github("CellH5/cellh5-R") ``` -```{r required packages and data, echo = TRUE, message=FALSE} +```{r required_packages_and_data, echo = TRUE, message=FALSE} library(rmarkdown) library(tidyverse) library(openxlsx) @@ -304,23 +304,104 @@ is sensible. # Using ggplot to create a PCA plot for the data -```{r} + +Before we can compute a PCA, we have to make sure that the variables that +we have obtained for every single well are normalized. As our data consists +of the number of cells each phenotypic category a straigforward normalization +consits of transforming the counts into percentages by dividing the data +for each well by its total number of cells. + +## The chaining operator + +For this, we use the chaining (or piping) operator `%>%`. +$x \% \textgreater \% f(y) $ is simply $f(x, y)$, so one can use +it to rewrite multiple operations in such a way that they can be read +from you can read from left--to--right, +top--to--bottom. A simple example will make that clear: We create +two vectors and calculate Euclidian distance between them. Instead of +the usual way: + +```{r chainingSimpleExample_1} +x1 <- 1:5; x2 <- 2:6 +sqrt(sum((x1-x2)^2)) +``` + +We can use the piping operatror + +```{r chainingSimpleExample_2} +(x1-x2)^2 %>% +sum() %>% +sqrt() +``` + +Which makes the set of operations much easier to digest and understand. + +## Grouping, summarizing and data transformation + +Having tidy data frames, we can employ a __split--apply--combine__ +strategy for data analysis. What this means is that wer first group our +individual observations by one or more factors (_split_ operation), then +_apply_ computations to each group and then combine the results into new +data frame that contains one line per group + +In the code chunk below, we use the`group\_by()` function +to splot the dataset into groups according to the well ID. +We then apply the function `sum()` to the counts of each well and +use`summarize()` to obtain a table of counts per well. + + +We can now join the table containing the sums to the original +data and compute compute percentage using the sums. + +As PCA works best on data that is on normal distribution (z--score) scale, +we perform at [logit](https://en.wikipedia.org/wiki/Logit) transformation +to turn the percentages +into z--scores. This is similar in spirit to a log transformation on +intensity values. + + +```{r grouping_and_summarizing, dependson="reshape"} no_cells_per_well <- input_data %>% group_by(well) %>% summarize(no_cells = sum(count)) -data_with_sums <- left_join(input_data, no_cells_per_well) +head(no_cells_per_well) -# size_factors <- no_cells_per_well$no_cells / geometric.mean(no_cells_per_well$no_cells) +data_with_sums <- left_join(input_data, no_cells_per_well) data_for_PCA <- mutate(data_with_sums, perc = count / no_cells, z_score = logit(perc)) +head(data_for_PCA) +``` + +## creating the PCA plot using ggplot2 + +We are now ready to create the PCA plot. For this, we first need to turn +the input data into a wide data frame again by spreading the z--scores across +columns. + +We can then use the function `prcomp()` to compute the actual principal components. +We also create a vector genes, giving us the gene each of our siRNAs is targeting. + +We then create a ggplot object by mapping the first principal component to the +x-- and the second one to the y--axis. We use the gene names as plotting symbols +and color the names according to to the gene name (As we have multiple empty wells +as well as multiple siRNAs targeting the same gene). + +Furthemore, we specify that the aspect ratio of x and y axis should be 1, so that +the units are same on both axes. This is important for a correct interpreation +of the PCA plot. + + +```{r PCA, dependson="grouping_and_summarizing"} data_for_PCA <- data_for_PCA %>% select(class, well, z_score) %>% spread(key = class, value = z_score) + + PCA <- prcomp(data_for_PCA[, -1], center = TRUE, scale. = TRUE) @@ -341,8 +422,12 @@ pl <- (ggplot(dataGG, aes(x = PC1, + ggtitle("Principal component plot") ) +pl ``` +We can see that the control wells cluster together. Note that it is easy to turn +this plot into an interactive version using `ggplotly` from the `r CRANpkg("plotly")`. + # Heatmap of apoptosis proportions diff --git a/Tutorial_HTM_2016.html b/Tutorial_HTM_2016.html index 66ce6224cdde78bb7d6be1beb80930a7a5ac3b00..03ef1fb52087c667c0b923d8c3e6eecc818f9709 100644 --- a/Tutorial_HTM_2016.html +++ b/Tutorial_HTM_2016.html @@ -79,7 +79,7 @@ document.addEventListener("DOMContentLoaded", function() { <div id="header"> <h1 class="title">Visual Exploration of High–Throughput–Microscopy Data</h1> <h4 class="author"><em>Bernd Klaus, Andrzej Oles, Mike Smith</em></h4> -<h4 class="date"><em>10 Oktober 2016</em></h4> +<h4 class="date"><em>13 Oktober 2016</em></h4> </div> <h1>Contents</h1> @@ -94,7 +94,11 @@ document.addEventListener("DOMContentLoaded", function() { <li><a href="#reshaping-the-screen-data-and-joining-the-plate-annotation"><span class="toc-section-number">7</span> Reshaping the screen data and joining the plate annotation</a></li> <li><a href="#plotting-in-r-ggplot2"><span class="toc-section-number">8</span> Plotting in R: ggplot2</a></li> <li><a href="#principal-component-analysis-pca-to-for-data-visualization"><span class="toc-section-number">9</span> Principal component analysis (PCA) to for data visualization</a></li> -<li><a href="#using-ggplot-to-create-a-pca-plot-for-the-data"><span class="toc-section-number">10</span> Using ggplot to create a PCA plot for the data</a></li> +<li><a href="#using-ggplot-to-create-a-pca-plot-for-the-data"><span class="toc-section-number">10</span> Using ggplot to create a PCA plot for the data</a><ul> +<li><a href="#the-chaining-operator"><span class="toc-section-number">10.1</span> The chaining operator</a></li> +<li><a href="#grouping-summarizing-and-data-transformation"><span class="toc-section-number">10.2</span> Grouping, summarizing and data transformation</a></li> +<li><a href="#creating-the-pca-plot-using-ggplot2"><span class="toc-section-number">10.3</span> creating the PCA plot using ggplot2</a></li> +</ul></li> <li><a href="#heatmap-of-apoptosis-proportions"><span class="toc-section-number">11</span> Heatmap of apoptosis proportions</a></li> <li><a href="#session-information"><span class="toc-section-number">12</span> Session information</a></li> <li><a href="#references">References</a></li> @@ -246,21 +250,73 @@ V=2\times \mbox{ Beets }+ 1\times \mbox{Carrots } +\frac{1}{2} \mbox{ Gala}+ \fr </div> <div id="using-ggplot-to-create-a-pca-plot-for-the-data" class="section level1"> <h1><span class="header-section-number">10</span> Using ggplot to create a PCA plot for the data</h1> +<p>Before we can compute a PCA, we have to make sure that the variables that we have obtained for every single well are normalized. As our data consists of the number of cells each phenotypic category a straigforward normalization consits of transforming the counts into percentages by dividing the data for each well by its total number of cells.</p> +<div id="the-chaining-operator" class="section level2"> +<h2><span class="header-section-number">10.1</span> The chaining operator</h2> +<p>For this, we use the chaining (or piping) operator <code>%>%</code>. $x % % f(y) $ is simply <span class="math inline">\(f(x, y)\)</span>, so one can use it to rewrite multiple operations in such a way that they can be read from you can read from left–to–right, top–to–bottom. A simple example will make that clear: We create two vectors and calculate Euclidian distance between them. Instead of the usual way:</p> +<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x1 <-<span class="st"> </span><span class="dv">1</span>:<span class="dv">5</span>; x2 <-<span class="st"> </span><span class="dv">2</span>:<span class="dv">6</span> +<span class="kw">sqrt</span>(<span class="kw">sum</span>((x1-x2)^<span class="dv">2</span>))</code></pre></div> +<pre><code> [1] 2.24</code></pre> +<p>We can use the piping operatror</p> +<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">(x1-x2)^<span class="dv">2</span> %>% +<span class="kw">sum</span>() %>% +<span class="kw">sqrt</span>()</code></pre></div> +<pre><code> [1] 2.24</code></pre> +<p>Which makes the set of operations much easier to digest and understand.</p> +</div> +<div id="grouping-summarizing-and-data-transformation" class="section level2"> +<h2><span class="header-section-number">10.2</span> Grouping, summarizing and data transformation</h2> +<p>Having tidy data frames, we can employ a <strong>split–apply–combine</strong> strategy for data analysis. What this means is that wer first group our individual observations by one or more factors (<em>split</em> operation), then <em>apply</em> computations to each group and then combine the results into new data frame that contains one line per group</p> +<p>In the code chunk below, we use the<code>group\_by()</code> function to splot the dataset into groups according to the well ID. We then apply the function <code>sum()</code> to the counts of each well and use<code>summarize()</code> to obtain a table of counts per well.</p> +<p>We can now join the table containing the sums to the original data and compute compute percentage using the sums.</p> +<p>As PCA works best on data that is on normal distribution (z–score) scale, we perform at <a href="https://en.wikipedia.org/wiki/Logit">logit</a> transformation to turn the percentages into z–scores. This is similar in spirit to a log transformation on intensity values.</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">no_cells_per_well <-<span class="st"> </span>input_data %>% <span class="st"> </span><span class="kw">group_by</span>(well) %>% <span class="st"> </span><span class="kw">summarize</span>(<span class="dt">no_cells =</span> <span class="kw">sum</span>(count)) -data_with_sums <-<span class="st"> </span><span class="kw">left_join</span>(input_data, no_cells_per_well)</code></pre></div> +<span class="kw">head</span>(no_cells_per_well)</code></pre></div> +<pre><code> # A tibble: 6 × 2 + well no_cells + <chr> <int> + 1 A01_01 13089 + 2 A02_01 22185 + 3 A03_01 25407 + 4 A04_01 20361 + 5 A05_01 18708 + 6 A06_01 24389</code></pre> +<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">data_with_sums <-<span class="st"> </span><span class="kw">left_join</span>(input_data, no_cells_per_well)</code></pre></div> <pre><code> Joining, by = "well"</code></pre> -<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># size_factors <- no_cells_per_well$no_cells / geometric.mean(no_cells_per_well$no_cells) </span> - -data_for_PCA <-<span class="st"> </span><span class="kw">mutate</span>(data_with_sums, <span class="dt">perc =</span> count /<span class="st"> </span>no_cells, +<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">data_for_PCA <-<span class="st"> </span><span class="kw">mutate</span>(data_with_sums, <span class="dt">perc =</span> count /<span class="st"> </span>no_cells, <span class="dt">z_score =</span> <span class="kw">logit</span>(perc)) -data_for_PCA <-<span class="st"> </span>data_for_PCA %>%<span class="st"> </span> +<span class="kw">head</span>(data_for_PCA)</code></pre></div> +<pre><code> class well count Well Site Row Column siRNA.ID Gene.Symbol Group + 1 Anaphase A01_01 810 A01 1 A 1 s7424 INCENP target + 2 Apoptosis A01_01 1480 A01 1 A 1 s7424 INCENP target + 3 Elongated A01_01 242 A01 1 A 1 s7424 INCENP target + 4 Interphase A01_01 6529 A01 1 A 1 s7424 INCENP target + 5 MAP A01_01 585 A01 1 A 1 s7424 INCENP target + 6 Metaphase A01_01 193 A01 1 A 1 s7424 INCENP target + no_cells perc z_score + 1 13089 0.0619 -2.71861 + 2 13089 0.1131 -2.05974 + 3 13089 0.0185 -3.97193 + 4 13089 0.4988 -0.00474 + 5 13089 0.0447 -3.06219 + 6 13089 0.0147 -4.20198</code></pre> +</div> +<div id="creating-the-pca-plot-using-ggplot2" class="section level2"> +<h2><span class="header-section-number">10.3</span> creating the PCA plot using ggplot2</h2> +<p>We are now ready to create the PCA plot. For this, we first need to turn the input data into a wide data frame again by spreading the z–scores across columns.</p> +<p>We can then use the function <code>prcomp()</code> to compute the actual principal components. We also create a vector genes, giving us the gene each of our siRNAs is targeting.</p> +<p>We then create a ggplot object by mapping the first principal component to the x– and the second one to the y–axis. We use the gene names as plotting symbols and color the names according to to the gene name (As we have multiple empty wells as well as multiple siRNAs targeting the same gene).</p> +<p>Furthemore, we specify that the aspect ratio of x and y axis should be 1, so that the units are same on both axes. This is important for a correct interpreation of the PCA plot.</p> +<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">data_for_PCA <-<span class="st"> </span>data_for_PCA %>%<span class="st"> </span> <span class="st"> </span><span class="kw">select</span>(class, well, z_score) %>% <span class="st"> </span><span class="kw">spread</span>(<span class="dt">key =</span> class, <span class="dt">value =</span> z_score) + + PCA <-<span class="st"> </span><span class="kw">prcomp</span>(data_for_PCA[, -<span class="dv">1</span>], <span class="dt">center =</span> <span class="ot">TRUE</span>, <span class="dt">scale. =</span> <span class="ot">TRUE</span>) @@ -279,7 +335,12 @@ pl <-<span class="st"> </span>(<span class="kw">ggplot</span>(dataGG, <span c +<span class="st"> </span><span class="kw">geom_text</span>(<span class="kw">aes</span>(<span class="dt">label =</span> genes), <span class="dt">size =</span> <span class="kw">I</span>(<span class="dv">4</span>)) +<span class="st"> </span><span class="kw">coord_equal</span>() +<span class="st"> </span><span class="kw">ggtitle</span>(<span class="st">"Principal component plot"</span>) - )</code></pre></div> + ) + +pl</code></pre></div> +<p><img src="" /></p> +<p>We can see that the control wells cluster together. Note that it is easy to turn this plot into an interactive version using <code>ggplotly</code> from the <em><a href="http://cran.fhcrc.org/web/packages/plotly/index.html">plotly</a></em>.</p> +</div> </div> <div id="heatmap-of-apoptosis-proportions" class="section level1"> <h1><span class="header-section-number">11</span> Heatmap of apoptosis proportions</h1>