Commit efda91cc authored by Bernd Klaus's avatar Bernd Klaus

introduced NB, mention DESeq2 var stab capabilities

parent c2e9e1ac
......@@ -303,7 +303,8 @@ counts_norm_crc_mat <- select(counts_crc_tidy, sample_id, ensembl_id, count_norm
select(-ensembl_id) %>%
as.matrix()
meanSdPlot(counts_norm_crc_mat)
meanSdPlot(counts_norm_crc_mat)$gg +
ylim(c(0, 10))
## ----session_info, cache = FALSE-----------------------------------------
sessionInfo()
......
......@@ -699,31 +699,64 @@ counts_norm_crc_mat <- select(counts_crc_tidy, sample_id, ensembl_id, count_norm
select(-ensembl_id) %>%
as.matrix()
meanSdPlot(counts_norm_crc_mat)
```
meanSdPlot(counts_norm_crc_mat)$gg +
ylim(c(0, 2000))
```
## The Negative binomial distribution for RNA--Seq data
overdispersion, i.e. the variance is greater than than the mean.
In fact, it is known that a standard Poisson model can only account for the technical
noise in RNA--Seq data. In the Poisson model the variance is equal to the
mean, while in RNA--Seq data the variance is greater than the mean.
A popular model for RNA--Seq data is the Negative--Binomial-distribution (NB),
with mean \(E(NB) = \mu\) and variance:
A popular way to model this is to use a Negative--Binomial-distribution (NB),
which includes an
additional parameter dispersion parameter $\alpha$ such that $E(NB) = \mu$ and
\begin{gather*}
\[
\text{Var[NB($\mu$, $\alpha$)]} = \mu + \alpha \cdot \mu^2
\end{gather*}
\]
Hence, the variance depends on the mean, and the dispersion
parameter \(\alpha\) controls the relationship between mean and variance.
It has been shown (Marioni 2008) that in RNA--Seq data alpha is greater than
zero, unless there is only technical noise This is
called overdispersion, i.e. the variance is greater than than the mean.
(For \(\alpha = 0\), the NB distribution becomes the Poisson distribution)
Single cell RNA-Seq data contains many zeros, due to limited capture
efficiency, thus zero inflated models (ref MAST, Risso 2016) have been
proposed to account for this.
## Variance stabilization by log + pseudocount
A simple variance stabilization method for count data is a log + a pseudocount.
Note that for negative binomially distributed data:
Hence, the variance is greater than the mean. \Biocpkg{DESeq2} uses the NB
model and fits dispersion values (see below).
\[
\text{Var(log(counts + pseudocount))} \approx = \frac{1}{counts + pseudocount} + \alpha
\]
where $\alpha$ is the dispersion. The formula shows that for genes with
a large mean, the variance will no longer depend on the mean and essentially
be equal to alpha since \(\frac{1}{counts + pseudocount}\) is very close to zero.
However, adding pseudocounts will artificially
lower the the variance for lowly expressed genes, possibly leading to
a higher rate of false positive calls in a differential expression analysis.
In general, are carefull modelling of the variance mean realtionship is
preferable to ad--hoc solutions.
The package `r Biocpkg("DESeq2")` provides variance stabilizations that
are tailored to RNA--Seq data. We use its function
`varianceStabilizingTransformation` to obtain normalize and variance--stabilized
count data.
```{r varStabCountData}
```
# Confounding factors, Testing, Machine Learning, move to new doc!
# move to new doc: Confounding factors, Statistical Testing, Machine Learning
* ZINBA Wave
......
This diff is collapsed.
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment