- 1 Required packages and other preparations
- 2 Statistical testing for high–throughput data
- 3 Tackling unwanted variation and batch effects in high–throughput data
- 4 Factor analysis to tackle unwanted variation
- 5 Latent factors for sc–RNA Seq: The ZINB-WaVE model
- 6 Clustering of the normalized and cleaned sc–data
- 7 Checking the clustering using machine learning
- 8 Neural networks
- 9 Session Info
- References

**LAST UPDATE AT**

` [1] "Mon Apr 16 10:10:11 2018"`

```
library("readxl")
library("BiocStyle")
library("knitr")
library("MASS")
library("RColorBrewer")
library("stringr")
library("pheatmap")
library("matrixStats")
library("purrr")
library("readr")
library("magrittr")
library("entropy")
library("forcats")
library("DESeq2")
library("broom")
library("tidyverse")
library("limma")
library("ggthemes")
library("corpcor")
library("sva")
library("zinbwave")
library("clusterExperiment")
library("clue")
library("sda")
library("crossval")
library("randomForest")
library("keras")
theme_set(theme_solarized(base_size = 18))
data_dir <- file.path("data")
```

We first import the mTEC data again.

```
load(file.path(data_dir, "mtec_counts.RData"))
load(file.path(data_dir, "mtec_cell_anno.RData"))
load(file.path(data_dir, "mtec_gene_anno.RData"))
load(file.path(data_dir, "tras.RData"))
```

```
compute_pca <- function(data_mat, ntop = 500, ...){
pvars <- rowVars(data_mat)
select <- order(pvars, decreasing = TRUE)[seq_len(min(ntop,
length(pvars)))]
PCA <- prcomp(t(data_mat)[, select], center = TRUE, scale. = FALSE)
percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)
return(list(pca = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],
PC3 = PCA$x[,3], PC4 = PCA$x[,4], ...),
perc_var = percentVar,
selected_vars = select))
}
```

In high–throughput data we often measure many features (e.g. genes) on which we then perform statistical tests (e.g. for differential expression). With a p–value significance cutoff \(\alpha\) of e.g. 10%, classical testing theory tells us that a true null hypothesis is falsely rejected 10% of the time. This is no longer true if \(m\) tests are performed simultaneously: at any \(\alpha\), the probability of making no false rejection goes to zero quite rapidly:

\[ \underbrace{(1-\alpha)\cdot(1-\alpha)\cdot \dotso \cdot (1-\alpha)}_{m \text{ times}} = (1-\alpha)^m \rightarrow 0 \]

For an \(\alpha\) of 10% and \(m = 50\) tests, the probability of making no false rejection is only \(0.9^{50} \approx 0.005\). It is thus clear that we need different error measures for the multiple testing situation. Before we can define them, we need to look at the different error types.
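As a quick sanity check, we can compute this probability for a few values of \(m\) directly in R:

```
# probability of making no false rejection among m independent tests
# at a per-test significance cutoff of alpha = 0.1
alpha <- 0.1
m <- c(1, 5, 10, 50, 100)
round((1 - alpha)^m, 4)
## [1] 0.9000 0.5905 0.3487 0.0052 0.0000
```

Already at 50 tests the probability is essentially zero, which is why controlling the per–test error rate alone is not enough.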

Suppose you are testing the hypothesis that a fold change \(\beta\) equals zero versus the alternative that it does not. Let us assume that there are \(m_0\) non–differentially expressed genes out of \(m\) genes and that we declare \(R\) genes as differentially expressed (i.e. reject the null hypothesis for them). There are four possible outcomes: true positives, true negatives, and the two error types:

Type I error or false positive — declaring the fold change non–zero when it is in fact zero

Type II error or false negative — declaring the fold change zero when it is in fact non–zero

Just as ordinary significance testing controls the false positive rate, two error rates are commonly controlled in multiple testing procedures:

Family wise error rate (FWER) – The probability of at least one false positive \(\text{Pr}(FP \geq 1)\)

False discovery rate (FDR) – The expected rate of false positives among all rejected hypotheses, \(E\left[\frac{FP}{FP + TP}\right]\) (defined as 0 when nothing is rejected)

The FWER is conceptually closely related to classical significance testing, as it controls the probability of making any false rejection. However, as we will also see below, using a procedure to control it is generally too conservative in large scale problems.

The FDR allows for a certain **rate of false positives** and is the “typical” error rate controlled in large scale multiple testing problems.
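To illustrate the difference between the two error rates, here is a minimal sketch using base R's `p.adjust` on simulated p–values (a toy setup, unrelated to the data introduced below; the mixture proportions are arbitrary):

```
# toy example: 900 true nulls (uniform p-values) and
# 100 true effects (p-values concentrated near zero)
set.seed(123)
p <- c(runif(900), rbeta(100, 1, 50))

bonf <- p.adjust(p, method = "bonferroni")  # controls the FWER
bh   <- p.adjust(p, method = "BH")          # controls the FDR

# number of rejections at a 10% cutoff for each procedure
c(bonferroni = sum(bonf < 0.1), BH = sum(bh < 0.1))
```

The Bonferroni correction typically rejects far fewer hypotheses than Benjamini–Hochberg here, illustrating how conservative FWER control is in large–scale problems.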

Consider the following simulated RNA–Seq data, where we have simulated two groups with 8 samples each and 1000 genes. Each group contains two batches: males and females. On top of this, there is genetic information on the subjects, which we will use later in the discussion of batch effects and unwanted variation.

The data is stored in a `DESeqDataSet`, and we apply a variance–stabilizing transformation before computing the PCA plot.

```
load(file.path(data_dir, "dds_batch.RData"))
dds_batch <- DESeq(dds_batch)
rld_batch <- assay(rlogTransformation(dds_batch, blind = TRUE))
batch_pca <- compute_pca(rld_batch, ntop = 500,
                         condition = colData(dds_batch)$condition,
                         sex = colData(dds_batch)$sex,
                         gen = colData(dds_batch)$gen_back)
sd_ratio <- sqrt(batch_pca$perc_var[2] / batch_pca$perc_var[1])
batch_pca_plot <- ggplot(batch_pca$pca, aes(x = PC1, y = PC2,
                                            color = condition,
                                            shape = sex)) +
  geom_point(size = 4) +
  ggtitle("PC1 vs PC2, top variable genes") +
  labs(x = paste0("PC1, VarExp: ", round(batch_pca$perc_var[1], 4)),
       y = paste0("PC2, VarExp: ", round(batch_pca$perc_var[2], 4))) +
  coord_fixed(ratio = sd_ratio) +
  scale_colour_brewer(palette = "Dark2")
batch_pca_plot
```

```
ggplot(batch_pca$pca, aes(sex, PC2, color = sex)) +
  geom_jitter(height = 0, width = 0.1) +
  ggtitle("PC2 by sex") +
  geom_boxplot(alpha = 0.1) +
  scale_color_fivethirtyeight()
```