Removing known batches has many complications and should only
be done with extreme caution.
While batch effect removal methods often seem to be magic procedures,
they have to be applied carefully. In particular, using the cleaned data
directly can lead to false positive results and exaggerated effects.
It is best to avoid using the cleaned data directly (except for
visualisation purposes) and instead include the estimated latent variables
or batch effects in a linear model [@Nygaard_2015; @Jaffe_2015].
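
To see why the cleaned data can mislead, here is a minimal simulation sketch in base R. It is not part of the analysis above: the Gaussian toy data, the unbalanced two-batch layout and all variable names are assumptions made for illustration, and ordinary least squares stands in for the negative binomial model used by DESeq2. There is no true condition effect in the simulation, so every significant call is a false positive.

```r
set.seed(42)

# Hypothetical unbalanced design: condition is partly confounded with batch
batch <- factor(rep(c("b1", "b2"), each = 10))
condition <- factor(c(rep(c("control", "treated"), times = c(8, 2)),
                      rep(c("control", "treated"), times = c(2, 8))))
n_genes <- 1000

p_values <- t(replicate(n_genes, {
  # batch effect only; the true condition effect is zero
  y <- 2 * (batch == "b2") + rnorm(length(batch))

  # (a) batch included as a covariate in the linear model
  full <- lm(y ~ batch + condition)

  # (b) "clean" the data by subtracting the estimated batch term,
  #     then test condition on the cleaned values
  y_clean <- y - coef(full)["batchb2"] * (batch == "b2")
  cleaned <- lm(y_clean ~ condition)

  c(model   = summary(full)$coefficients["conditiontreated", "Pr(>|t|)"],
    cleaned = summary(cleaned)$coefficients["conditiontreated", "Pr(>|t|)"])
}))

# fraction of genes called significant at the 5% level
false_positive_rate <- colMeans(p_values < 0.05)
print(false_positive_rate)
```

The test on the cleaned data ignores both the degree of freedom spent on the batch term and the correlation between batch and condition. In this setup its p-values are therefore never larger than those of the full model, and it calls more false positives.
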
We change the design of our `DESeqDataSet` in order to achieve this and
include sex as a "blocking factor" rather than regressing it out:
```{r chgdesign}
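# add sex to the design; the variable of interest (condition) goes last
design(dds_batch) <- ~ sex + condition
# refit the model with the new design
dds_batch_sex <- DESeq(dds_batch)
res_sex <- results(dds_batch_sex)
# list the fitted coefficients
resultsNames(dds_batch_sex)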
```
As we can see, sex has been included in the linear model. This way,
we partition the mean for every gene into a fold-change and a sex component,
dissecting these two sources of variability.
In fact, it can be shown that fitting a model like this means that we only
compute fold changes within each sex, rather than pooling all the samples at once.
In experimental design terminology, this is called "blocking" and sex would
be a "blocking factor".
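
The blocking idea can be illustrated with a small base R sketch. The balanced toy data set, the effect sizes and the use of `lm()` on Gaussian data (instead of the DESeq2 negative binomial GLM) are assumptions for illustration only. In a balanced design, the condition coefficient of the additive model equals the average of the within-sex differences, and its standard error is typically smaller than in a model that ignores sex.

```r
set.seed(1)

# Hypothetical balanced design: 5 replicates per sex/condition combination
n_per_cell <- 5
toy <- expand.grid(rep = seq_len(n_per_cell),
                   sex = c("female", "male"),
                   condition = c("control", "treated"))

# simulated expression: large sex effect (3), smaller condition effect (1)
toy$y <- 5 + 3 * (toy$sex == "male") + 1 * (toy$condition == "treated") +
  rnorm(nrow(toy), sd = 0.5)

fit_blocked <- lm(y ~ sex + condition, data = toy)  # sex as blocking factor
fit_naive   <- lm(y ~ condition, data = toy)        # sex ignored

# treated-minus-control difference computed separately within each sex
cell_means <- tapply(toy$y, list(toy$sex, toy$condition), mean)
within_sex_diff <- cell_means[, "treated"] - cell_means[, "control"]

# the blocked estimate is the average of the within-sex differences
print(coef(fit_blocked)["conditiontreated"])
print(mean(within_sex_diff))

# standard errors of the condition effect with and without blocking
se_blocked <- summary(fit_blocked)$coefficients["conditiontreated", "Std. Error"]
se_naive   <- summary(fit_naive)$coefficients["conditiontreated", "Std. Error"]
print(c(blocked = se_blocked, naive = se_naive))
```

In this balanced toy example both models give the same point estimate; the gain from blocking shows up as a smaller standard error, because the variability due to sex no longer ends up in the residuals.
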
Let's see whether we actually profit from increased power:
```{r pwrblocking}
summary(res_sex)
```
Indeed, we do: the number of differentially expressed genes detected increases
to `r sum(res_sex$padj < 0.1, na.rm = TRUE)` from `r sum(results(dds_batch)$padj < 0.1, na.rm = TRUE)`
previously.
# Tackling "unknown" batches and other unwanted variation
## Key ingredient: Factor analysis
[Factor analysis](https://en.wikipedia.org/wiki/Factor_analysis#Statistical%20model)^[Read about it in [Cosma Shalizi's great book](http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/).]
is the modelling of data via a small number of latent random variables called
factors. We can use ideas from factor analysis to tackle batch effects and unwanted
variation. Basic factor analysis takes an \(n \times p\) data matrix (\(n\) samples and \(p\)
features) and models it via a sum of \(q\) probabilistic factors \(f\) plus an
author={Andrew E. Jaffe and Thomas Hyde and Joel Kleinman and Daniel R. Weinberger and Joshua G. Chenoweth and Ronald D. McKay and Jeffrey T. Leek and Carlo Colantuoni},
title={Practical impacts of genomic data {\textquotedblleft}cleaning{\textquotedblright} on biological discovery using surrogate variable analysis},