Skip to content
GitLab
Explore
Sign in
Primary navigation
Search or go to…
Project
H
htm_course_2016
Manage
Activity
Members
Labels
Plan
Issues
Issue boards
Milestones
Wiki
Code
Merge requests
Repository
Branches
Commits
Tags
Repository graph
Compare revisions
Build
Pipelines
Jobs
Pipeline schedules
Artifacts
Deploy
Releases
Container Registry
Model registry
Operate
Environments
Monitor
Incidents
Service Desk
Analyze
Value stream analytics
Contributor analytics
CI/CD analytics
Repository analytics
Model experiments
Help
Help
Support
GitLab documentation
Compare GitLab plans
Community forum
Contribute to GitLab
Provide feedback
Keyboard shortcuts
?
Snippets
Groups
Projects
Show more breadcrumbs
Bernd Klaus
htm_course_2016
Commits
2dc0136a
Commit
2dc0136a
authored
8 years ago
by
Bernd Klaus
Browse files
Options
Downloads
Patches
Plain Diff
added section on PCA analysis, group-apply-combine and data transformations
parent
69a79794
No related branches found
Branches containing commit
No related tags found
Tags containing commit
No related merge requests found
Changes
2
Expand all
Hide whitespace changes
Inline
Side-by-side
Showing
2 changed files
Tutorial_HTM_2016.Rmd
+89
-4
89 additions, 4 deletions
Tutorial_HTM_2016.Rmd
Tutorial_HTM_2016.html
+69
-8
69 additions, 8 deletions
Tutorial_HTM_2016.html
with
158 additions
and
12 deletions
Tutorial_HTM_2016.Rmd
+
89
−
4
View file @
2dc0136a
...
...
@@ -40,7 +40,7 @@ cache = TRUE)
devtools::install_github("CellH5/cellh5-R")
```
```{r required
packages
and
data, echo = TRUE, message=FALSE}
```{r required
_
packages
_
and
_
data, echo = TRUE, message=FALSE}
library(rmarkdown)
library(tidyverse)
library(openxlsx)
...
...
@@ -304,23 +304,104 @@ is sensible.
# Using ggplot to create a PCA plot for the data
```{r}
Before we can compute a PCA, we have to make sure that the variables that
we have obtained for every single well are normalized. As our data consists
of the number of cells each phenotypic category a straigforward normalization
consits of transforming the counts into percentages by dividing the data
for each well by its total number of cells.
## The chaining operator
For this, we use the chaining (or piping) operator `%>%`.
$x \% \textgreater \% f(y) $ is simply $f(x, y)$, so one can use
it to rewrite multiple operations in such a way that they can be read
from you can read from left--to--right,
top--to--bottom. A simple example will make that clear: We create
two vectors and calculate Euclidian distance between them. Instead of
the usual way:
```{r chainingSimpleExample_1}
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))
```
We can use the piping operatror
```{r chainingSimpleExample_2}
(x1-x2)^2 %>%
sum() %>%
sqrt()
```
Which makes the set of operations much easier to digest and understand.
## Grouping, summarizing and data transformation
Having tidy data frames, we can employ a __split--apply--combine__
strategy for data analysis. What this means is that wer first group our
individual observations by one or more factors (_split_ operation), then
_apply_ computations to each group and then combine the results into new
data frame that contains one line per group
In the code chunk below, we use the`group\_by()` function
to splot the dataset into groups according to the well ID.
We then apply the function `sum()` to the counts of each well and
use`summarize()` to obtain a table of counts per well.
We can now join the table containing the sums to the original
data and compute compute percentage using the sums.
As PCA works best on data that is on normal distribution (z--score) scale,
we perform at [logit](https://en.wikipedia.org/wiki/Logit) transformation
to turn the percentages
into z--scores. This is similar in spirit to a log transformation on
intensity values.
```{r grouping_and_summarizing, dependson="reshape"}
no_cells_per_well <- input_data %>%
group_by(well) %>%
summarize(no_cells = sum(count))
data_with_sums <- left_join(input_data,
no_cells_per_well)
head(
no_cells_per_well)
# size_factors <- no_cells_per_well$no_cells / geometric.mean(no_cells_per_well$no_cells)
data_with_sums <- left_join(input_data, no_cells_per_well)
data_for_PCA <- mutate(data_with_sums, perc = count / no_cells,
z_score = logit(perc))
head(data_for_PCA)
```
## creating the PCA plot using ggplot2
We are now ready to create the PCA plot. For this, we first need to turn
the input data into a wide data frame again by spreading the z--scores across
columns.
We can then use the function `prcomp()` to compute the actual principal components.
We also create a vector genes, giving us the gene each of our siRNAs is targeting.
We then create a ggplot object by mapping the first principal component to the
x-- and the second one to the y--axis. We use the gene names as plotting symbols
and color the names according to to the gene name (As we have multiple empty wells
as well as multiple siRNAs targeting the same gene).
Furthemore, we specify that the aspect ratio of x and y axis should be 1, so that
the units are same on both axes. This is important for a correct interpreation
of the PCA plot.
```{r PCA, dependson="grouping_and_summarizing"}
data_for_PCA <- data_for_PCA %>%
select(class, well, z_score) %>%
spread(key = class, value = z_score)
PCA <- prcomp(data_for_PCA[, -1], center = TRUE, scale. = TRUE)
...
...
@@ -341,8 +422,12 @@ pl <- (ggplot(dataGG, aes(x = PC1,
+ ggtitle("Principal component plot")
)
pl
```
We can see that the control wells cluster together. Note that it is easy to turn
this plot into an interactive version using `ggplotly` from the `r CRANpkg("plotly")`.
# Heatmap of apoptosis proportions
...
...
This diff is collapsed.
Click to expand it.
Tutorial_HTM_2016.html
+
69
−
8
View file @
2dc0136a
This diff is collapsed.
Click to expand it.
Preview
0%
Loading
Try again
or
attach a new file
.
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Save comment
Cancel
Please
register
or
sign in
to comment