Commit 9e73bbed authored by Bernd Klaus's avatar Bernd Klaus

added first section on factor analysis

parent a71e4242
......@@ -110,7 +110,7 @@ res_dds_batch <- results(dds_batch)
res_dds_batch
## ----rnaSeqPvalHist------------------------------------------------------
res_dds_batch <- as_data_frame(res_dds_batch)
res_dds_batch <- as.data.frame(res_dds_batch)
ggplot(res_dds_batch, aes(x = pvalue)) +
geom_histogram(binwidth = 0.05 )
......@@ -161,6 +161,27 @@ ggplot(tibble(pv), aes(x = pv)) +
subtitle = "variance of null distribution too low")
## ----holm----------------------------------------------------------------
pv <- 2 - 2 * pnorm(abs(z), sd = sd.true)
pv_holm <- p.adjust(pv, method = "holm")
head(pv_holm[pv_holm < 0.1])
z[pv_holm < 0.1]
## ----holmrnaseq----------------------------------------------------------
res_dds_batch$padj_holm <- p.adjust(res_dds_batch$pvalue,
method = "holm")
table(res_dds_batch$padj_holm < 0.1)
res_dds_batch[res_dds_batch$padj_holm < 0.1,]
## ----bh------------------------------------------------------------------
pv <- 2 - 2 * pnorm(abs(z), sd = sd.true)
pv_fdr <- p.adjust(pv, method = "fdr")
table(pv_fdr < 0.1)
z[pv_fdr < 0.1]
## ----rna_fdr-------------------------------------------------------------
summary(results(dds_batch))
## ----remove_batch_sim----------------------------------------------------
cleaned_data <- removeBatchEffect(rld_batch,
......@@ -187,15 +208,24 @@ clean_pca_plot <- ggplot(cleaned_data_pca$pca, aes(x = PC1, y = PC2,
clean_pca_plot
## ----chgdesign-----------------------------------------------------------
design(dds_batch) <- ~ sex + condition
dds_batch_sex <- DESeq(dds_batch)
res_sex <- results(dds_batch_sex)
resultsNames(dds_batch_sex)
## ----pwrblocking---------------------------------------------------------
summary(res_sex)
## ----session_info, cache = FALSE-----------------------------------------
sessionInfo()
## ----unloaAll, echo=FALSE, message=FALSE, eval = FALSE-------------------
##
## pkgs <- loaded_packages() %>%
## filter(package != "devtools") %>%
## {.$path}
##
## walk(pkgs, unload)
pkgs <- loaded_packages() %>%
filter(package != "devtools") %>%
{.$path}
walk(pkgs, unload)
......@@ -22,7 +22,7 @@ bibliography: stat_methods_bioinf.bib
<!--
To compile this document
graphics.off();rm(list=ls());rmarkdown::render('factor_ana_testing_ml.Rmd',
output_format = "all");purl('factor_ana_testing_ml.Rmd')
output_format = "BiocStyle::html_document");purl('factor_ana_testing_ml.Rmd')
rmarkdown::render('factor_ana_testing_ml.Rmd', output_format =
BiocStyle::pdf_document(toc = TRUE, highlight = "tango",
......@@ -479,10 +479,7 @@ of batch effects in one form or another have been reported among most,
if not all, high-throughput technologies for a review see the paper
by @Leek_2010.
## Tackeling known batches
## Removing known batches
A simple way to remove known
batches is to fit a regression model that includes the batches to the data, and
......@@ -522,18 +519,69 @@ clean_pca_plot
As we can see, the sex effect has been removed. The function `ComBat`
from the `r Biocpkg("sva")` implements a more sophistaced method for the removal
of known batches that adjusts the variances in a addition to removal of batch
specific means. For additional details see e.g.
specific means.^For additional details see e.g.
[this page](http://genomicsclass.github.io/book/pages/adjusting_with_linear_models.html).
## Including known batches in a linear model
Rather than removing know batches, which has many complications and should only
be done with extreme caution.
While the batch effect removal methods often seem to be magic procedure,
they have to be applied carefully. In particular, using the cleaned data
directly can lead to false positive results and exagarrated effects.
The best thing is avoid the use of the cleaned data directly (except for
visualisation purposes) and rather include the estimated latent variables
or batch effects into a linear model [@Nygaard_2015, @Jaffe_2015].
We change the design of our `DESeqDataSet` in order to achieve this and
include sex as a "blocking factor" rather than regressing it out:
```{r chgdesign}
design(dds_batch) <- ~ sex + condition
dds_batch_sex <- DESeq(dds_batch)
res_sex <- results(dds_batch_sex)
resultsNames(dds_batch_sex)
```
As we can see, the sex has been included in the linear model. This way,
we partition the mean for every gene into a fold--change and a sex component,
disecting these two sources of variability.
In fact, it can be shown that running a model like this means that we only
compute fold changes within one sex, rather than using all the data at once.
In experimental design terminology, this is called "blocking" and sex would
be a "blocking factor".
Let's see whether we actually profit from an increased power:
```{r pwrblocking}
summary(res_sex)
```
Indeed, we do: the number of differentially expressed genes detected increases
to `r sum(res_sex$padj < 0.1) ` from `r sum(results(dds_batch)$padj < 0.1)`
previously.
# Tackling "unknown" batches and other unwanted variation
## Key ingredient: Factor analysis
[Factor analysis](https://en.wikipedia.org/wiki/Factor_analysis#Statistical%20model), ^read about it in [Cosma Shalizi's book](http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/)
Factor analysis^read about it in [Cosma Shalizi's great book](http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/)
is the modelling of data via a small number of latent random variables called
factors. We can use ideas from Factor analysis to tackle batch effects / unwanted
variation.
variation. The basic factor analysis takes a \(n \times p\) data matrix (n samples and p
features) and models it via a sum of \(q\) probabilistic factors \(f\) plus an
error term \(E\).
\[
X^{[n \times p]} = \sum_{i = 1}^q f_i^{[n \times 1]} w_i^{[1 \times p]} + E^{[n \times p]}
\]
So just like in PCA, the data is modelled as a weighted sum of new variables.
However, the main and essential difference to PCA is now that the
factors just like the error term are modelled as __random__ rather than deterministic.
## Extending factor analysis to include known covariates
Essentially we want to fit the following model to our data:
......@@ -575,14 +623,6 @@ Z = (1,1,-1,-1,1,1,-1,-1)^T
## Best practices and caveats in batch effect removal
While the batch effect removal methods often seem to be magic procedure,
they have to be applied carefully. In particular, using the cleaned data
directly can lead to false positive results and exagarrated effects.
The best thing is avoid the use of the cleaned data directly (except for
visualisation purposes) and rather include the estimated latent variables
or batch effects into a linear model. Two recent papers disucss these caveats
for known and unknown batches respectively.
......
This diff is collapsed.
......@@ -44,6 +44,11 @@ citep("10.12688/f1000research.12122.1")
write.bibtex(file = "stat_methods_bioinf.bib")
knitcitations::cleanbib()
# Nygaard et. al, 2015
citep("10.1093/biostatistics/kxv027")
# Jaffe et. al., 2015
citep("10.1186/s12859-015-0808-5")
# van der Maaten and Hinton, 2008
......
@Misc{1,
doi = {10.1163/1872-9037_afco_asc_558},
url = {https://doi.org/10.1163/1872-9037_afco_asc_558},
publisher = {Brill Academic Publishers},
title = {{JSTOR}},
year = {2017},
}
@Article{Brennecke_2013,
doi = {10.1038/nmeth.2645},
url = {https://doi.org/10.1038/nmeth.2645},
......@@ -68,6 +76,19 @@
journal = {Nature Biotechnology},
}
@Article{Jaffe_2015,
doi = {10.1186/s12859-015-0808-5},
url = {https://doi.org/10.1186/s12859-015-0808-5},
year = {2015},
month = {nov},
publisher = {Springer Nature},
volume = {16},
number = {1},
author = {Andrew E. Jaffe and Thomas Hyde and Joel Kleinman and Daniel R. Weinbergern and Joshua G. Chenoweth and Ronald D. McKay and Jeffrey T. Leek and Carlo Colantuoni},
title = {Practical impacts of genomic data {\textquotedblleft}cleaning{\textquotedblright} on biological discovery using surrogate variable analysis},
journal = {{BMC} Bioinformatics},
}
@Article{Kruskal_1964,
doi = {10.1007/bf02289565},
url = {https://doi.org/10.1007/bf02289565},
......@@ -123,6 +144,18 @@
journal = {Genome Biology},
}
@Article{Nygaard_2015,
doi = {10.1093/biostatistics/kxv027},
url = {https://doi.org/10.1093/biostatistics/kxv027},
year = {2015},
month = {aug},
publisher = {Oxford University Press ({OUP})},
pages = {kxv027},
author = {Vegard Nygaard and Einar Andreas R{\o}dland and Eivind Hovig},
title = {Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses},
journal = {Biostatistics},
}
@Article{Perraudeau_2017,
doi = {10.12688/f1000research.12122.1},
url = {https://doi.org/10.12688/f1000research.12122.1},
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment