Commit c2e9e1ac authored by Bernd Klaus

started section on var stabilization

parent 9386b0c8
......@@ -48,27 +48,20 @@ library("stringr")
library("pheatmap")
library("matrixStats")
library("purrr")
library("fdrtool")
library("readr")
library("gtools")
library("factoextra")
library("magrittr")
library("entropy")
library("forcats")
library("plotly")
library("corrplot")
library("car")
library("openxlsx")
library("readxl")
library("limma")
library("Single.mTEC.Transcriptomes")
library("DESeq2")
library("tibble")
library("broom")
library("locfit")
library("recount")
library("psych")
library("vsn")
theme_set(theme_gray(base_size = 18))
......@@ -466,6 +459,12 @@ counts_crc <- assay(crc_data)
counts_crc[1:5, 1:5]
```
We filter out all features that have fewer than 5 counts on average:
```{r crc_filtering}
counts_crc <- counts_crc[rowMeans(counts_crc) >= 5, ]
dim(counts_crc)
```
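To make the filtering rule concrete, here is a minimal sketch on a made-up count matrix. The object names `toy_counts` and `toy_filtered` and all numbers are illustrative assumptions, not part of the CRC data set.

```{r toy_filtering}
# Toy count matrix (made-up numbers): 4 genes x 3 samples
toy_counts <- matrix(
  c(  0,  1,  2,
     10, 12,  8,
      4,  5,  6,
    100,  0, 50),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Keep genes whose average count across samples is at least 5
keep <- rowMeans(toy_counts) >= 5
toy_filtered <- toy_counts[keep, , drop = FALSE]

rowMeans(toy_counts)
rownames(toy_filtered)
```

Note that a gene with an average of exactly 5 is kept, since the comparison is `>=`. The argument `drop = FALSE` makes sure the result stays a matrix even if only one gene passes the filter.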
The annotation for the genes looks like this:
......@@ -554,7 +553,7 @@ will be sufficient for this data set.
## Exercise: Adapt the boxplot
Experiment with the color variable and try to create faceted plots using
`facet_wrap()` or `facet_grid()`: Do you see a pattern if you color / wrap the
boxplots by tissue and/or patient?
......@@ -616,7 +615,22 @@ We can see that the size factors computed are the scaled medians
on the raw count scale; however, they are not identical.
margin^[Note that it is not recommended to scale sequencing libraries by
the total library size: it is not robust and artificially limits the range
of the data to 0--1]. We can now normalize the data:
```{r norm_data}
counts_crc_tidy <- mutate(counts_crc_tidy, count_norm = count / sf )
```
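The following is a minimal base R sketch of how size factors can be computed and applied, assuming a median-of-ratios approach in the spirit of `r Biocpkg("DESeq2")`. The toy matrix and the names `toy_counts`, `toy_sf` and `toy_norm` are made up for illustration; this is not necessarily how the `sf` column above was obtained.

```{r toy_size_factors}
# Toy counts (made-up): sample2 and sample3 are sequenced 2x and 4x as deep
toy_counts <- matrix(
  c( 10,  20,  40,
      5,  10,  20,
    100, 200, 400,
      8,  16,  32),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Log geometric mean of each gene across samples (a pseudo-reference sample)
log_geo_means <- rowMeans(log(toy_counts))

# Size factor per sample: median ratio of its counts to the reference
toy_sf <- apply(toy_counts, 2, function(cnts) {
  exp(median(log(cnts) - log_geo_means))
})

# Divide each column (sample) by its size factor
toy_norm <- sweep(toy_counts, 2, toy_sf, "/")

toy_sf
toy_norm
```

After normalization all three toy samples carry identical values, since they differ only in sequencing depth. The sketch assumes there are no zero counts; genes with zeros would need to be excluded before taking logarithms.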
## Exercise: Data normalization
Normalize the data using the computed size factors and create boxplots
of the normalized data.
```{r data_norm_plot}
```
## Size factors for single cell data
......@@ -662,7 +676,54 @@ ggplot(mtec_cell_anno, aes(x = sizeFactor, y = scran_sf)) +
# Variance stabilization
## Confounding factors
The variance of count data depends on its mean, i.e. the higher
the counts, the higher their variance. This is readily seen in a
plot of variance against the mean. We use the function `meanSdPlot`
from the `r Biocpkg("vsn") ` package to create such a plot for the
normalized RNA--Seq data.
This function accepts an expression matrix as an input, so we need
to turn our tidy data table into an expression matrix first. The
complicated code in the middle is needed to assign the column `ensembl_id`
as row names. Here, the dot `.` refers to the current data.
```{r createNormCounts}
counts_norm_crc_mat <- select(counts_crc_tidy, sample_id, ensembl_id, count_norm) %>%
  # one row per gene, one column per sample
  spread(key = sample_id, value = count_norm) %>%
  {
    # move the ensembl_id column into the row names
    tmp <- as.data.frame(.)
    rownames(tmp) <- tmp$ensembl_id
    tmp
  } %>%
  select(-ensembl_id) %>%
  as.matrix()

meanSdPlot(counts_norm_crc_mat)
```
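As a cross-check of the reshaping step, the same gene-by-sample layout can be sketched in base R with `tapply`. The toy table `toy_tidy` and its identifiers are hypothetical and only mimic the column names used above.

```{r toy_tidy_to_matrix}
# Hypothetical tidy table: one row per (sample, gene) pair
toy_tidy <- data.frame(
  sample_id  = rep(c("s1", "s2"), each = 3),
  ensembl_id = rep(c("ENSG_a", "ENSG_b", "ENSG_c"), times = 2),
  count_norm = c(1.5, 20, 300, 2.5, 10, 150),
  stringsAsFactors = FALSE
)

# Rows = genes, columns = samples; each cell holds exactly one value,
# so summing simply returns that value
toy_mat <- with(
  toy_tidy,
  tapply(count_norm, list(ensembl_id, sample_id), sum)
)

toy_mat
```

The result is a numeric matrix whose row names are the gene identifiers, which is the shape that a function expecting an expression matrix requires.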
The data exhibit overdispersion, i.e. the variance is greater than the mean.
In fact, it is known that a standard Poisson model can only account for the technical
noise in RNA--Seq data. In the Poisson model the variance is equal to the
mean, while in RNA--Seq data the variance is greater than the mean.
A popular way to model this is to use a negative binomial (NB) distribution,
which includes an additional dispersion parameter $\alpha$ such that
$E(NB) = \mu$ and
\begin{gather*}
\text{Var}[\text{NB}(\mu, \alpha)] = \mu + \alpha \cdot \mu^2
\end{gather*}
Hence, the variance is greater than the mean whenever $\alpha > 0$.
`r Biocpkg("DESeq2")` uses the NB model and fits dispersion values (see below).
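The mean-variance relationship of the NB model can be checked with a small simulation. This is a sketch under assumed parameter values (`mu = 100`, `alpha = 0.2`), chosen only for illustration. It relies on the fact that the `size` argument of `rnbinom` is the inverse of the dispersion $\alpha$.

```{r toy_nb_variance}
set.seed(1)
mu    <- 100   # assumed mean
alpha <- 0.2   # assumed dispersion
n     <- 1e5   # number of simulated counts

# NB counts: rnbinom's `size` parameter equals 1 / alpha
x_nb <- rnbinom(n, mu = mu, size = 1 / alpha)

# Poisson counts with the same mean, for comparison
x_pois <- rpois(n, lambda = mu)

# Theoretical NB variance: mu + alpha * mu^2
var_theory <- mu + alpha * mu^2

c(mean_nb = mean(x_nb), var_nb = var(x_nb),
  var_theory = var_theory, var_pois = var(x_pois))
```

The empirical variance of the NB sample should be close to `var_theory`, while the variance of the Poisson sample should stay close to its mean.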
# Confounding factors, Testing, Machine Learning, move to new doc!
* ZINBA Wave
......