Commit 4bd3e6ac authored by Bernd Klaus's avatar Bernd Klaus

refined LOESS section, added background on normalization

parent a5082e43
......@@ -36,7 +36,7 @@ library("vsn")
library("matrixStats")
library("pheatmap")
library("tidyverse")
#library("printr")
theme_set(theme_gray(base_size = 18))
......@@ -94,13 +94,21 @@ load(file.path(data_dir, "mtec_cell_anno.RData"))
load(file.path(data_dir, "mtec_gene_anno.RData"))
load(file.path(data_dir, "tras.RData"))
## ----mtec_count_table----------------------------------------------------
## ----mtec_count_table, results="asis"------------------------------------
mtec_counts
## ----mtec_cell_anno, results="asis"--------------------------------------
mtec_cell_anno
## ----trasENSEMBL, results="asis"-----------------------------------------
tras
mtec_gene_anno
## ----tidy_count----------------------------------------------------------
mtec_counts_tidy <- gather(mtec_counts, key = "cell_id", value = "count",
-ensembl_id) %>%
dplyr::mutate(is_tra = ensembl_id %in% tras$gene.ids,
dplyr::mutate(
is_tra = ensembl_id %in% tras$gene.ids,
is_detected = count > 0) %>%
left_join(mtec_cell_anno,
by = c("cell_id" = "cellID"))
......@@ -128,7 +136,7 @@ scatter_tra
scatter_tra +
geom_rug(alpha = I(0.2))
## ----ex_geom-------------------------------------------------------------
## ----ex_geom, echo=FALSE-------------------------------------------------
ggplot(tra_detected, aes(x = total_detected, y = tra))
scatter_tra +
......@@ -140,33 +148,46 @@ lm_tra <- lm(tra ~ total_detected, data = tra_detected)
lm_tra
## ----loessExampleLinFit, dependson="fit_model"---------------------------
sim_data <- tibble(x = seq(from=1, to=10, length.out=100),
y = x^3 +x^2 + rnorm(100,mean=0, sd=60))
# create_data
y <- seq(from=1, to=10, length.out=100)
a <- y^3 +y^2 + rnorm(100,mean=0, sd=30)
dataL <- data.frame(a=a, y=y)
qplot(y, a, data = dataL)
ggplot(aes(x, y), data = sim_data) +
geom_point() +
geom_smooth(method = "lm")
# linear fit
linreg <- lm(a~y, data = dataL)
## ----loessExampleFit, dependson="loessExampleLinFit"---------------------
(qplot(y, a, data = dataL) +
geom_abline(slope = tidy(linreg, quick = TRUE)[2,2],
intercept = tidy(linreg, quick = TRUE)[1,2]))
tidy(linreg)
sim_data$locFit <- predict(locfit(y~lp(x, nn=0.5, deg=1), data=sim_data),
newdata = sim_data$x)
dataL$LinReg <- predict(linreg)
ggplot(aes(x, y), data = sim_data) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle("Linear vs. local regression") +
geom_line(aes(x = x, y = locFit), color = "coral3")
## ----loessExampleFit, dependson="loessExampleLinFit"---------------------
dataL$locFit <- predict(locfit(y~lp(a, nn=0.5, deg=1), data=dataL),
newdata = dataL$a)
## ----loessExercise, dependson="loessExampleLinFit", echo=FALSE-----------
sim_data$locFit <- predict(locfit(y~lp(x, nn=0.1, deg=1), data=sim_data),
newdata = sim_data$x)
ggplot(aes(x, y), data = sim_data) +
geom_point() +
geom_smooth() +
ggtitle("Linear vs. local regression") +
geom_line(aes(x = x, y = locFit), color = "coral3")
## ----compCells-----------------------------------------------------------
(qplot(a, y, data = dataL, main = "Linear vs. local regression")
+ geom_line(aes(x = a, y = locFit), color = "dodgerblue3")
+ geom_line(aes(x = a, y = LinReg), color = "coral3"))
ggplot(aes(x = cell_id, y = log(count)),
data = filter(mtec_counts_tidy,
cell_id %in% c("cell77", "cell6S35"),
is_detected == TRUE)) +
geom_boxplot()
## ----import_crc----------------------------------------------------------
......
......@@ -9,12 +9,19 @@ output:
self_contained: true
toc_float: false
code_download: true
df_print: paged
BiocStyle::pdf_document2:
toc: true
highlight: tango
df_print: paged
---
<!--<
To compile this document
graphics.off();rm(list=ls());rmarkdown::render('graphics_bioinf.Rmd');purl('graphics_bioinf.Rmd')
graphics.off();rm(list=ls());rmarkdown::render('graphics_bioinf.Rmd',
output_format = "all");purl('graphics_bioinf.Rmd')
rmarkdown::render('graphics_bioinf.Rmd', output_format = BiocStyle::pdf_document2())
-->
```{r options, include=FALSE}
......@@ -64,7 +71,7 @@ library("vsn")
library("matrixStats")
library("pheatmap")
library("tidyverse")
#library("printr")
theme_set(theme_gray(base_size = 18))
......@@ -124,36 +131,47 @@ save(tras, file = file.path(data_dir, "tras.RData"),
# mTEC single cell-RNA-Seq data
```{r import_data}
load(file.path(data_dir, "mtec_counts.RData"))
load(file.path(data_dir, "mtec_cell_anno.RData"))
load(file.path(data_dir, "mtec_gene_anno.RData"))
load(file.path(data_dir, "tras.RData"))
```
We summarize single-cell transcriptomes in the form of count matrices in which each row
represents a gene and each column represents one cell. The matrix is filled
with the number of sequenced fragments whose genomic alignment overlaps with the
genomic coordinates of the genes. For this, we only use cells for which more
than 40% of the sequenced fragments could be uniquely assigned to the Mouse
reference genome. We also discarded 28 cells from the batch 4 (since they were
from another cell type that was sequenced together with some of the mTECs).
It is sensible to only use highly variable genes in the analysis of
single cell data (HVG), the original publication identified more than
9,000 as highly variable using a method published in Brennecke et. al. 2013.
We retain those genes for further analysis.
genomic coordinates of the genes. It looks like this:
We also import a table with genes annotated as tissue restricted antigens (tras)
from Shinto et. al. 2013 and annotation for mouse genes from ENSEMBL.
```{r mtec_count_table, results="asis"}
mtec_counts
```
The annotation of the cells is stored in a table called `mtec_cell_anno`
```{r import_data}
load(file.path(data_dir, "mtec_counts.RData"))
load(file.path(data_dir, "mtec_cell_anno.RData"))
load(file.path(data_dir, "mtec_gene_anno.RData"))
load(file.path(data_dir, "tras.RData"))
```{r mtec_cell_anno, results="asis"}
mtec_cell_anno
```
# Graphics in R
It is sensible to only use highly variable genes (HVG) in the analysis of
single cell data. The original publication identified more than
9,000 as highly variable using a method published in Brennecke et. al. 2013
and we only retain those genes for further analysis.
# ggplot2{#gg}
We also import a table with genes annotated as tissue restricted antigens (`tras`)
from Shinto et. al. 2013 and annotation for mouse genes from ENSEMBL (`mtec_gene_anno`).
```{r trasENSEMBL, results="asis"}
tras
mtec_gene_anno
```
We will use this data to explore plotting in R.
# Graphics in R -- ggplot2{#gg} to assess gene detection rates
`r CRANpkg("ggplot2")` is a package by Hadley Wickham that implements the idea of
*grammar of graphics* -- a concept created by Leland Wilkinson in his book of the same name. Comprehensive documentation for the package can
......@@ -161,42 +179,40 @@ be found on <http://ggplot2.tidyverse.org/>. The online documentation includes
example use cases for each of the graphic types that are introduced in this lab (and
many more) and is an invaluable resource when creating figures.
Single cell RNA-Seq data has many zero due to a limited capture efficiency
and biases in the reverse transcription. It is thus interesting to see how
Single cell RNA-Seq data commonly has many zero counts due to a limited capture efficiency
and biases in the reverse transcription step. It is thus interesting to see how
the total number of genes expressed (at least one count) relates to
the number of TRA encoding genes expressed.
The single cell data also contains mTEC cells select based on surface markers,
we exclude those and make sure to inspect only protein coding genes.
In order to do this, we first turn the count table:
```{r mtec_count_table}
mtec_counts
```
In order to do this, we first turn the count table
which contains the counts for each cell in separate columns into a tidy
data frame. We `gather` all columns except for the first one and then
add a column telling us whether a specific genes is coding for a TRA
and whether it was detected or not.
use the `tras` table to add columns telling us whether a specific
genes is coding for a TRA and whether it was detected or not.
Then, we join the annotation to the tidy table
Then, we join the annotation of the cells `mtec_cell_anno` to the tidy table:
```{r tidy_count}
mtec_counts_tidy <- gather(mtec_counts, key = "cell_id", value = "count",
-ensembl_id) %>%
dplyr::mutate(is_tra = ensembl_id %in% tras$gene.ids,
dplyr::mutate(
is_tra = ensembl_id %in% tras$gene.ids,
is_detected = count > 0) %>%
left_join(mtec_cell_anno,
by = c("cell_id" = "cellID"))
```
Now we can compute the number of TRA genes detected per cell and filter
After joining we can filter
the non--protein coding genes as well as the cells that have been processed
using FACS based on a surface marker.
using FACS based on a surface marker. The column `is_tra` contains a logical
vector, we replace `TRUE` and `FALSE` by actual category names.
Finally, we obtain a tidy table with two columns per cell: the number of
tras and non-tras expressed. There is a subtle change in observations:
we go from genes within a single cell as our unit to single cells.
We then compute the number of TRA / non--TRA genes detected per cell using
the function `tally` that performs groupwise counting on grouped data frames.
Following this, we spread the `is_tra` into to columns and a column with the total
count of the detected genes.
```{r tra_per_cell}
......@@ -212,7 +228,11 @@ tra_detected
```
We visualize this relationship in a scatterplot with ggplot2.
Finally, after spreading the `is_tra` we obtain a tidy table with two columns
per cell: the number of tras and non-tras expressed genes. We add a column
holding the total number of detected genes by computing their sum.
We now visualize this relationship in a scatterplot with ggplot2.
```{r travsall, fig.cap="Total number of genes vs TRA"}
scatter_tra <- ggplot(tra_detected, aes(x = total_detected, y = tra))+
......@@ -225,7 +245,7 @@ scatter_tra
We just wrote our first "sentence" using the grammar of graphics.
Let us deconstruct this sentence.
First, we specified the data frame that contains the data, `mtec_counts_tidy`.
First, we specified the data frame that contains the data, `tra_detected`.
Then we told `ggplot` via the `aes`. This stands for `aesthetic` argument
(Aesthetics are visual property of the objects in your plot.)
and tells `r CRANpkg("ggplot2")` which variables
......@@ -252,7 +272,7 @@ We set (instead of mapped) the alpha value to 0.2, this increases the
transparency of the rugs and alleviates overplotting.
## Exercise: Geometries and aspect ratios
### Exercise: Geometries and aspect ratios
1. What happens if you omit the calls to `geom_point()` and
`coord_equal()` ?
......@@ -262,7 +282,7 @@ out how to add a smoothing curve to the plot. Can you change the smoother
to a regression line?
```{r ex_geom}
```{r ex_geom, echo=FALSE}
ggplot(tra_detected, aes(x = total_detected, y = tra))
scatter_tra +
......@@ -311,7 +331,8 @@ y_i = a + b x_{i} + \varepsilon_i; \quad
We can of course always add more predictors to the linear function. The coefficient
\(b\) is called the __slope__ and \(a\) is called the __intercept__ .
We can fit a linear regression via a call to the function `lm()`. The regression
We can fit a linear regression via a call to the function `lm()`,
which stands for "linear model". The regression
model is specified using R's formula notation.
```{r regresssion_tra}
......@@ -322,7 +343,7 @@ lm_tra
As we can see, the estimated
slope is ~ `r tidy(lm_tra, quick = TRUE)[2, "estimate"]`, indicating
that we have a proportion of `r tidy(lm_tra, quick = TRUE)[2, "estimate"]`
tra expressing genes on average per cell for the highly variable genes.
tra expressing genes on average per cell.
This is in line with the supplementary figure 1 of the original publication,
which uses the full set of expressed genes, not just the highly variable
ones.
......@@ -330,7 +351,6 @@ ones.
# Local regression (LOESS)
Local regression is a commonly used approach for fitting flexible non--linear
functions, which involves computing many local linear regression fits and combining
them. Local regression is a very useful technique both for data visualization and
......@@ -339,27 +359,17 @@ but it usually feasible with today's hardware. We illustrate the local regressio
using the `r CRANpkg("locfit") ` package on simulated data.
We first fit a linear regression line to simulated
data that follows a polynomial trend and see that it does not really
data that follows a polynomial trend \(y\) and see that it does not really
fit well.
```{r loessExampleLinFit, dependson="fit_model"}
sim_data <- tibble(x = seq(from=1, to=10, length.out=100),
y = x^3 +x^2 + rnorm(100,mean=0, sd=60))
# create_data
y <- seq(from=1, to=10, length.out=100)
a <- y^3 +y^2 + rnorm(100,mean=0, sd=30)
dataL <- data.frame(a=a, y=y)
qplot(y, a, data = dataL)
# linear fit
linreg <- lm(a~y, data = dataL)
(qplot(y, a, data = dataL) +
geom_abline(slope = tidy(linreg, quick = TRUE)[2,2],
intercept = tidy(linreg, quick = TRUE)[1,2]))
tidy(linreg)
ggplot(aes(x, y), data = sim_data) +
geom_point() +
geom_smooth(method = "lm")
dataL$LinReg <- predict(linreg)
```
We now use the function `locfit` to perform a local regression on the
......@@ -371,29 +381,75 @@ The lower this percentage, the more closely the line will follow the data points
```{r loessExampleFit, dependson="loessExampleLinFit"}
dataL$locFit <- predict(locfit(y~lp(a, nn=0.5, deg=1), data=dataL),
newdata = dataL$a)
sim_data$locFit <- predict(locfit(y~lp(x, nn=0.5, deg=1), data=sim_data),
newdata = sim_data$x)
ggplot(aes(x, y), data = sim_data) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle("Linear vs. local regression") +
geom_line(aes(x = x, y = locFit), color = "coral3")
```
### Exercise: Local fit experiments
1. Check the help for `geom_smooth` to find out what `r CRANpkg("ggplot2")`
uses as default smoothing methods.
2. Play with the `nn` parameter of the locfit in the plot, how does affect the fit?
```{r loessExercise, dependson="loessExampleLinFit", echo=FALSE}
(qplot(a, y, data = dataL, main = "Linear vs. local regression")
+ geom_line(aes(x = a, y = locFit), color = "dodgerblue3")
+ geom_line(aes(x = a, y = LinReg), color = "coral3"))
sim_data$locFit <- predict(locfit(y~lp(x, nn=0.1, deg=1), data=sim_data),
newdata = sim_data$x)
ggplot(aes(x, y), data = sim_data) +
geom_point() +
geom_smooth() +
ggtitle("Linear vs. local regression") +
geom_line(aes(x = x, y = locFit), color = "coral3")
```
# Normalization of bulk and single cell RNA--Seq data
Before we can discuss further analysis of our data set, we need to
"normalize" or "calibrate" the data: Transforming it so that the data
from multiple samples / single cells is comparable to each other. You
can roughly think of this process as transforming the raw count data
such that common units are used. To illustrate this, let's look
at "cell77" and "cell6S35" by creating a boxplot of their log counts
of detected genes:
```{r compCells}
ggplot(aes(x = cell_id, y = log(count)),
data = filter(mtec_counts_tidy,
cell_id %in% c("cell77", "cell6S35"),
is_detected == TRUE)) +
geom_boxplot()
# Normalization of bulk and single cell data
```
Notice how `r CRANpkg("ggplot2")` automatically groups the data
by cell, computes the neccessary values and
then displays the boxplot
We can see that their medians are quite different: "cell77" has much
higher counts on average for the detected genes than "cell6S35" so that the counts
for the two cells are not comparable without proper rescaling of their
counts. The neccessary rescaling factors are called "size factors" and we
discuss them next.
## Size factors for RNA--Seq data
Before exploring normalization of single cell data, we look at this issue
in bulk RNA--Seq: Next generation sequencing data is often analyzed in the
form of read counts assigned to genomic bins. Such bins typically correspond to
all the exons of a gene, or an isoform for RNA--Seq data.
Before exploring size factors for single cell data, we look at this issue
in a bulk RNA--Seq data set.
The aim of normalization is to remove between--sample differences
As we know, the aim of normalization is to remove between--sample differences
related to library sizes and other systematic technical effects.
A popular method for the normalization in sequencing data is global scaling:
all counts for a given sample are scaled by factor such that the scaled
......@@ -405,7 +461,9 @@ __normalized sample counts__ = __raw sample counts__ / __ sample size factor__
## Normalization of a Colectoral Cancer example data set
We will illustrate this using a dataset from the [recount2 resource](http://dx.doi.org/10.1038/nbt.3838). [Löhr et. al., 2013](https://doi.org/10.1371/journal.pone.0067461) have
We will illustrate this using a dataset from the [recount2 resource](http://dx.doi.org/10.1038/nbt.3838).
[Löhr et. al., 2013](https://doi.org/10.1371/journal.pone.0067461) have
applied RNA-Seq on colorectal normal,
tumor and metastasis tissues from 8 patients to generate gene
expression patterns. They were particularly interested
......
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment