factor_ana_testing_ml.Rmd 62.7 KB
---
title: "Factor analysis, testing and machine learning for bioinformatics"
author: "Bernd Klaus"
date: "`r doc_date()`"
output: 
    BiocStyle::html_document:
        toc: true
        highlight: tango
        self_contained: true
        toc_float: false
        code_download: true
        df_print: paged
        toc_depth: 2
    BiocStyle::pdf_document:
        toc: true
        toc_depth: 2
bibliography: stat_methods_bioinf.bib
---

```{r options, include=FALSE}
library(knitr)
options(digits=3, width=80)
golden_ratio <- (1 + sqrt(5)) / 2
opts_chunk$set(echo=TRUE, tidy=FALSE, include=TRUE,
               dev=c('png', 'pdf', 'svg'), fig.height = 5,
               fig.width = 4 * golden_ratio, comment = ' ', dpi = 300,
               cache = TRUE, message = FALSE)
```

**LAST UPDATE AT**

```{r, echo=FALSE, cache=FALSE}
print(date())
```

# Required packages and other preparations

```{r required_packages_and_data, echo = TRUE, cache=FALSE, message=FALSE}
library("readxl")
library("BiocStyle")
library("knitr")
library("MASS")
library("RColorBrewer")
library("stringr")
library("pheatmap")
library("matrixStats")
library("purrr")
library("readr")
library("magrittr")
library("entropy")
library("forcats")
library("DESeq2")
library("broom")
library("tidyverse")
library("limma")
library("ggthemes")
library("corpcor")
library("sva")
library("zinbwave")
library("clusterExperiment")
library("clue")
library("sda")
library("crossval")
library("randomForest")
library("keras")

theme_set(theme_solarized(base_size = 18))

data_dir <- file.path("data")
```

## mTEC single cell RNA-Seq data import

We first import the mTEC data again.

```{r import_data}
load(file.path(data_dir, "mtec_counts.RData"))
load(file.path(data_dir, "mtec_cell_anno.RData"))
load(file.path(data_dir, "mtec_gene_anno.RData"))
load(file.path(data_dir, "tras.RData"))
```

## Function to compute a PCA

```{r compute_pca}
compute_pca <- function(data_mat, ntop = 500, ...){

  pvars <- rowVars(data_mat)
  select <- order(pvars, decreasing = TRUE)[seq_len(min(ntop, length(pvars)))]

  PCA <- prcomp(t(data_mat)[, select], center = TRUE, scale. = FALSE)
  percentVar <- round(100 * PCA$sdev^2 / sum(PCA$sdev^2), 1)

  return(list(pca = data.frame(PC1 = PCA$x[, 1], PC2 = PCA$x[, 2],
                               PC3 = PCA$x[, 3], PC4 = PCA$x[, 4], ...),
              perc_var = percentVar,
              selected_vars = select))
}
```

# Statistical testing for high--throughput data

In high--throughput data we often measure multiple features (e.g. genes) for
which we then perform statistical tests (e.g. for differential expression).
Choosing a p--value significance cutoff $$\alpha$$ of e.g. 10%, classical
testing theory tells us that we will get a false positive result 10% of the
time. This is no longer true if multiple tests are performed simultaneously.
At any $$\alpha$$, the probability of making no false rejection goes to zero
quite rapidly:

$$\underbrace{(1-\alpha)\cdot(1-\alpha)\cdot \dotso \cdot (1-\alpha)}_\text{many times} \rightarrow 0$$

For an $$\alpha$$ of 10% and 50 tests, we get a probability of
$$0.9^{50} =$$ `r 0.9^50` of not making a false rejection. So it is clear that
we need different error measures for the multiple testing situation.
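The claim that the no-false-rejection probability vanishes is easy to verify numerically. A minimal base-R sketch (the helper name `fwer_at_least_one` is ours, not part of the original analysis):

```r
# FWER for m independent tests, each carried out at level alpha:
# Pr(at least one false positive) = 1 - (1 - alpha)^m
fwer_at_least_one <- function(alpha, m) {
  1 - (1 - alpha)^m
}

# 50 tests at alpha = 10%:
0.9^50                      # probability of no false rejection, ~0.005
fwer_at_least_one(0.1, 50)  # ~0.995

# growth with the number of tests
sapply(c(1, 10, 50, 100), fwer_at_least_one, alpha = 0.1)
```

Already at 100 tests a false rejection is a near certainty, which is why the per-test level cannot simply be reused.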
Before we can define them, we need to look at the different error types.

## Error types and error rates

Suppose you are testing the hypothesis that a fold change $$\beta$$ equals zero
versus the alternative that it does not equal zero. Let us assume that there
are $$m_0$$ non--differentially expressed genes out of $$m$$ genes and that we
declare $$R$$ genes as differentially expressed (i.e. reject the null
hypothesis). There are four possible outcomes:

![](error_rates.png)

* Type I error or false positive --- saying that the fold change does not
equal zero when it does
* Type II error or false negative --- saying that the fold change equals
zero when it doesn't

Just like ordinary significance testing tries to control the false positive
rate, there are two types of rates commonly used in multiple testing
procedures:

* Family--wise error rate (FWER) --- the probability of at least one false
positive, $$\text{Pr}(FP \geq 1)$$
* False discovery rate (FDR) --- the rate of false positives among all
rejected hypotheses, $$E\left[\frac{FP}{FP + TP}\right]$$

The FWER is conceptually closely related to classical significance testing,
as it controls the probability of making any false rejection. However, as we
will also see below, using a procedure to control it is generally too
conservative in large scale problems.

The FDR allows for a certain **rate of false positives** and is the
"typical" error rate controlled in large scale multiple testing problems.
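To make the two error rates concrete, here is a small simulation of our own (not part of the original analysis): $$m = 1000$$ tests, 900 of them true nulls, and a naive unadjusted cutoff of $$\alpha = 0.1$$.

```r
set.seed(1)
m  <- 1000
m0 <- 900                                  # true null hypotheses
z  <- c(rnorm(m0), rnorm(m - m0, mean = 3))
p  <- 2 * pnorm(-abs(z))                   # two-sided p-values

rejected <- p < 0.1                        # naive per-test cutoff
is_null  <- seq_len(m) <= m0

FP <- sum(rejected & is_null)              # type I errors
TP <- sum(rejected & !is_null)

FP              # roughly 10% of the 900 nulls
FP / (FP + TP)  # realized false discovery proportion
```

With dozens of false positives, the FWER event "at least one false positive" occurs with near certainty, and here roughly half of all rejections are false -- the quantity that FDR control keeps in check.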
## Statistical testing with RNA--Seq data

Consider the following simulated RNA--Seq data, where we have simulated two
groups with 8 samples in each group and 1000 genes. Each group contains 2
batches: males and females. On top of this, there is genetic information on
the subjects which we will use later on in the discussion of batch effects
and unwanted variation.

The data is stored in a DESeqDataSet, and we apply a variance
stabilization before we obtain the PCA plot.

```{r sim_rna_seq}
load(file.path(data_dir, "dds_batch.RData"))
dds_batch <- DESeq(dds_batch)

rld_batch <- assay(rlogTransformation(dds_batch, blind=TRUE))

batch_pca <- compute_pca(rld_batch, ntop = 500,
                         condition = colData(dds_batch)$condition,
                         sex = colData(dds_batch)$sex,
                         gen = colData(dds_batch)$gen_back)

sd_ratio <- sqrt(batch_pca$perc_var[2] / batch_pca$perc_var[1])

batch_pca_plot <- ggplot(batch_pca$pca, aes(x = PC1, y = PC2,
                                            color = condition,
                                            shape = sex)) +
  geom_point(size = 4) +
  ggtitle("PC1 vs PC2, top variable genes") +
  labs(x = paste0("PC1, VarExp:", round(batch_pca$perc_var[1], 4)),
       y = paste0("PC2, VarExp:", round(batch_pca$perc_var[2], 4))) +
  coord_fixed(ratio = sd_ratio) +
  scale_colour_brewer(palette = "Dark2")

batch_pca_plot

ggplot(batch_pca$pca, aes(sex, PC2, color = sex)) +
  geom_jitter(height = 0, width = 0.1) +
  ggtitle("PC2 by sex") +
  geom_boxplot(alpha = 0.1) +
  scale_color_fivethirtyeight()
```

PC2 captures the sex--related batch effect in our data.
However, PC1 quite clearly separates the two conditions, so we do expect
some differentially expressed genes.

### Exercise: sex related batch effect

Use a t--test to compare PC2 between the sex batches.

```{r pc2ttest, results='hide', echo=FALSE}
t.test(PC2 ~ sex, data = (batch_pca$pca))
```

`r Biocpkg("DESeq2")` performs a test based on a generalized linear model
(GLM). A generalized linear model is conceptually identical to an ordinary
linear model; however, it allows for non--normally distributed responses.
`r Biocpkg("DESeq2")` uses the negative binomial distribution to model count
data. We can conveniently obtain the results of the test like so:

```{r rna_seq_pval}
load(file.path(data_dir, "dds_batch.RData"))
dds_batch <- DESeq(dds_batch)
res_dds_batch <- results(dds_batch)
res_dds_batch
```

An important diagnostic plot is the histogram of p--values.

```{r rnaSeqPvalHist}
res_dds_batch <- as.data.frame(res_dds_batch)

ggplot(res_dds_batch, aes(x = pvalue)) +
  geom_histogram(binwidth = 0.025, boundary = 0) +
  ggtitle("p-value histogram for simulated RNA-Seq data") +
  geom_hline(yintercept = 40, color = "chartreuse3")
```

We see a more or less uniform background, corresponding to genes that are
not differentially expressed, and a peak near zero, which represents
the differentially expressed genes. Separating the background from the
peak at zero by eye--balling, we would expect around 40 differentially
expressed genes.
Note, however, the strong dependencies between the values, visible as
little "hills" in the histogram.

We will now explore typical p--value histograms^[See also David Robinson's
[detailed blog post](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/)
on that matter.]

## Typical p--value histograms

In order to simulate typical p--value histograms for high--throughput data,
we use the mixture model:

$$0.75\cdot N(0,1) + 0.25 \cdot N(2,1)$$

The code below simulates $$m = 200$$ $$z$$--scores from this model,
i.e. $$m_0 = 150$$ here and the null distribution is the standard normal
distribution.

```{r simzscores}
sd.true <- 1
eta0.true <- 0.75

get.random.zscore <- function(m = 200) {
  m0 <- m * eta0.true
  m1 <- m - m0
  z <- c(rnorm(m0, mean = 0, sd = sd.true),
         rnorm(m1, mean = 2, sd = 1))
  return(z)
}

set.seed(555)
z <- get.random.zscore(200)
```

We compute two--sided p--values using the correct null distribution.

```{r pvaluehistogram}
pv <- 2 - 2 * pnorm(abs(z), sd = sd.true)

ggplot(tibble(pv), aes(x = pv)) +
  geom_histogram(aes(fill = I(tableau_color_pal('tableau10light')(3)[1])),
                 boundary = 0, binwidth = 0.025) +
  xlab("p-values") +
  ggtitle("Histogram of p-values, correct null distribution")
```

We see that our $$p$$--values are uniformly distributed under the null
hypothesis. Computing the $$p$$--values assuming a $$N(0,2)$$ null
distribution changes the picture.
```{r pvaluehistogramwrongnull}
pv <- 2 - 2 * pnorm(abs(z), sd = 2)

ggplot(tibble(pv), aes(x = pv)) +
  geom_histogram(aes(fill = I(tableau_color_pal('tableau10light')(3)[2])),
                 boundary = 0, binwidth = 0.025) +
  xlab("p-values") +
  ggtitle("Histogram of p-values",
          subtitle = "variance of null distribution too high")
```

If the assumed variance of the null distribution is too high, we often see
a hill--shaped $$p$$--value histogram. The p--values are too high on average
and we cannot identify the true hits. If the variance is too low, we get an
extreme enrichment in small p--values, while we potentially lose the
uniform background.

### Exercise: p--value histograms

Create a p--value histogram with a standard deviation of less than 1.

```{r pvaluehistogramwrongnull2, results="hide", fig.show="hide", echo=FALSE}
pv <- 2 - 2 * pnorm(abs(z), sd = 0.5)

ggplot(tibble(pv), aes(x = pv)) +
  geom_histogram(aes(fill = I(tableau_color_pal('tableau10light')(3)[3])),
                 boundary = 0, binwidth = 0.025) +
  xlab("p-values") +
  ggtitle("Histogram of p-values",
          subtitle = "variance of null distribution too low")
```

## Causes of unusual p--value histograms

In practice, it can be very hard to tell what exactly causes an unusual
histogram shape. For example, it can happen if experimental groups show
different variability (e.g. homogeneous controls vs. heterogeneous
knockouts).
As tools like `r Biocpkg("DESeq2")` estimate only one variance parameter per
gene, pooling samples together can lead to variance estimates that are
too small / too large and hence a wrong null model. On the other hand,
unusual p--value histograms also arise due to unmodelled batch effects.
If batches are known, it can be instructive to compare them to each other
(rather than the experimental groups).

## Controlling the FWER

Traditionally, the FWER, i.e. the probability of making at least one false
rejection, is controlled. This directly translates the usual significance
level $$\alpha$$ to situations with multiple tests.

The term "family" may seem a bit unusual, but it simply means a collection
of tests that we consider jointly^[See @Shaffer_1995 for a great review].
A general method to control the FWER is Holm's method, where the p--value
cutoff $$p_k$$ is chosen such that for the ordered p--values
$$p_1 \leq p_2 \leq \dotsc \leq p_m$$

$$p_i \leq \frac{\alpha}{m-i+1} \; \text{ for all } \; i \leq k$$

where $$\alpha$$ is the desired FWER level. This procedure is valid under
arbitrary assumptions and more powerful than the traditionally used
Bonferroni method, which simply multiplies all the p--values by $$m$$.

## Adjusted p--values using Holm's method

Rather than using formal multiple testing procedures directly, in practice
people commonly look at so--called adjusted p--values.
Given any test procedure, the adjusted p--value corresponding to the test
of a single hypothesis $$H_i$$ can be defined as the level of the entire
test procedure at which $$H_i$$ would just be rejected, given the values of
all test statistics involved. So basically, the adjusted p--value tells us
the $$\alpha$$ level we would achieve if this p--value was chosen as a
threshold.

For the Holm method, we can compute adjusted p--values like so:

$$p_i^\text{Holm} = \min\{1, \text{cummax}\left((m - i + 1) \cdot p_i\right)\}$$

So we essentially multiply the ordered p--values by $$(m - i + 1)$$ and
carry forward the current maximum to make sure that the adjusted
p--values do not decrease --- as required by Holm's rule.

We can conveniently compute Holm--adjusted p--values using the function
p.adjust:

```{r holm}
pv <- 2 - 2 * pnorm(abs(z), sd = sd.true)
pv_holm <- p.adjust(pv, method = "holm")
head(pv_holm[pv_holm < 0.1])
z[pv_holm < 0.1]
```

As we can see, the Holm correction is quite conservative: although 25% of
the z--scores (50 of 200) come from the alternative, we only find 3 of them
(those with very large z--scores).

We can now also apply Holm's method to our RNA--Seq data:

```{r holmrnaseq}
res_dds_batch$padj_holm <- p.adjust(res_dds_batch$pvalue, method = "holm")
table(res_dds_batch$padj_holm < 0.1)
res_dds_batch[res_dds_batch$padj_holm < 0.1, ]
```

Here, we only identify two genes as differentially expressed.

## Controlling the FDR

As we have seen, control of the FWER can be quite strict.
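Both the Holm adjustment used above and the FDR adjustment discussed in this section are only a few lines of base R each. A sketch of our own (the function names are ours) that reproduces `p.adjust`:

```r
holm_adjust <- function(p) {
  m <- length(p)
  o <- order(p)                                        # ascending p-values
  adj <- pmin(1, cummax((m - seq_len(m) + 1) * p[o]))  # step-down multipliers
  adj[order(o)]                                        # back to input order
}

bh_adjust <- function(p) {
  m <- length(p)
  o <- order(p)
  # running minimum taken from the largest p-value downwards
  adj <- pmin(1, rev(cummin(rev(m / seq_len(m) * p[o]))))
  adj[order(o)]
}

p <- c(0.001, 0.02, 0.04, 0.3, 0.7)
all.equal(holm_adjust(p), p.adjust(p, method = "holm"))  # TRUE
all.equal(bh_adjust(p),   p.adjust(p, method = "BH"))    # TRUE
```

The `cummax` / `cummin` steps are what enforce the monotonicity of the adjusted values.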
In a very influential paper, @Benjamini_1995 proposed to control the false
discovery rate (FDR) instead of the FWER. The FDR is the rate of false
discoveries among all rejected null hypotheses. The adjusted p--values are:

$$p_i^\text{FDR} = \min\{1, \text{cummin}\left(\frac{m}{i} \cdot p_i\right)\}$$

These FDR--adjusted p--values can also be computed via p.adjust:

```{r bh}
pv <- 2 - 2 * pnorm(abs(z), sd = sd.true)
pv_fdr <- p.adjust(pv, method = "fdr")
table(pv_fdr < 0.1)
z[pv_fdr < 0.1]
```

Using an FDR adjustment, we actually find `r sum(pv_fdr < 0.1)` significant
z--scores at an FDR of 10%. This means that on average a maximum of
`r round(sum(pv_fdr < 0.1) * 0.1)` of the hits in our list are false
positives.

`r Biocpkg("DESeq2")` conveniently computes FDR--adjusted p--values for us;
we can get an overview by using the function summary:

```{r rna_fdr}
summary(results(dds_batch))
```

Here, we actually only identify a handful of differentially expressed genes,
which can be traced back to batch effects and unwanted variation, as
discussed next.

# Tackling unwanted variation and batch effects in high--throughput data

High--throughput data is often affected by "unwanted variation", i.e.
variation that is not due to the experimental design but to technical and
other factors ("batch effects", @Leek_2010). Classically, batch effects
occur because measurements are affected by laboratory conditions, reagent
lots, and personnel differences. This becomes a major problem when batch
effects are confounded with an outcome of interest and lead to incorrect
conclusions.
Here we look at some relevant examples of batch effects and discuss how to
detect, interpret, model, and adjust for batch effects.

Batch effects are the biggest challenge faced by omics research, especially
in the context of precision medicine. The presence of batch effects in one
form or another has been reported among most, if not all, high--throughput
technologies [@Leek_2010].

## Removing known batches via regression

A simple way to remove known batches is to fit a regression model that
includes the batches to the data, and then subtract the coefficients that
belong to the batch effects. The function removeBatchEffect in limma
implements this strategy. We specify a batch effect and then a design matrix
that incorporates the effects that we do not want to remove (the
experimental condition in this case).
```{r remove_batch_sim}
cleaned_data <- removeBatchEffect(rld_batch,
                                  batch = batch_pca$pca$sex,
                                  design = model.matrix( ~ batch_pca$pca$condition))

cleaned_data_pca <- compute_pca(cleaned_data,
                                condition = batch_pca$pca$condition,
                                sex = batch_pca$pca$sex)

sd_ratio <- sqrt(cleaned_data_pca$perc_var[2] / cleaned_data_pca$perc_var[1])

clean_pca_plot <- ggplot(cleaned_data_pca$pca, aes(x = PC1, y = PC2,
                                                   color = condition,
                                                   shape = sex)) +
  geom_point(size = 4) +
  ggtitle("PC1 vs PC2, cleaned data, top variable genes") +
  labs(x = paste0("PC1, VarExp:", round(cleaned_data_pca$perc_var[1], 4)),
       y = paste0("PC2, VarExp:", round(cleaned_data_pca$perc_var[2], 4))) +
  coord_fixed(ratio = sd_ratio) +
  scale_colour_brewer(palette = "Set2")

clean_pca_plot

ggplot(cleaned_data_pca$pca, aes(sex, PC2, color = sex)) +
  geom_jitter(height = 0, width = 0.1) +
  ggtitle("PC2 by sex, cleaned data") +
  geom_boxplot(alpha = 0.1) +
  scale_color_fivethirtyeight()
```

As we can see, the sex effect has been alleviated. The function ComBat from
the `r Biocpkg("sva")` package implements a more sophisticated method for
the removal of known batches that adjusts the variances in addition to
removing batch--specific means.^[For additional details see e.g.
[this page](http://genomicsclass.github.io/book/pages/adjusting_with_linear_models.html).]
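The core of what removeBatchEffect does can be mimicked on a toy example: fit a per-gene linear model containing both batch and condition, then subtract only the fitted batch term. This is a deliberately simplified sketch of the idea (all variable names are ours), not limma's exact implementation:

```r
set.seed(7)
batch     <- factor(rep(c("m", "f"), each = 4))
batch     <- factor(rep(c("m", "f"), each = 4))[order(rep(1:8))]  # 4 "m", then 4 "f"
condition <- factor(rep(c("ctrl", "trt"), times = 4))

# one "gene": condition effect +2, batch (sex) effect +3, a little noise
y <- 2 * (condition == "trt") + 3 * (batch == "m") + rnorm(8, sd = 0.1)

fit <- lm(y ~ batch + condition)

# subtract only the fitted batch component, keeping the condition effect
y_clean <- y - coef(fit)["batchm"] * (batch == "m")

tapply(y_clean, batch, mean)   # batch means are now (nearly) equal
```

Because batch and condition are balanced here, the condition effect survives the cleaning untouched; with confounded designs this is no longer guaranteed, which is why the cleaned data should be handled with care.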
## Including known batches in a linear model

While batch effect removal methods often seem like magic procedures, they
have to be applied carefully. In particular, using the cleaned data directly
can lead to false positive results and exaggerated effects.

The best thing is to avoid the direct use of the cleaned data (except for
visualization purposes) and rather include the estimated latent variables or
batch effects in a linear model, as the removal might introduce new biases
[@Nygaard_2015; @Jaffe_2015].

We change the design of our DESeqDataSet in order to achieve this and
include sex as a "blocking factor" rather than regressing it out:

```{r chgdesign}
design(dds_batch) <- ~ sex + condition
dds_batch_sex <- DESeq(dds_batch)
res_sex <- results(dds_batch_sex)
resultsNames(dds_batch_sex)
```

As we can see, sex has been included in the linear model. This way, we
partition the mean for every gene into a fold--change and a sex component,
dissecting these two sources of variability. In fact, it can be shown that
running a model like this means that we only compute fold changes within one
sex, rather than using all the data at once. In experimental design
terminology, this is called "blocking" and sex would be a "blocking factor".
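The power gain from blocking can be illustrated with two ordinary linear models on simulated data (a toy sketch of our own, unrelated to the RNA-Seq objects):

```r
set.seed(42)
sex       <- factor(rep(c("m", "f"), each = 10))
condition <- factor(rep(c("ctrl", "trt"), times = 10))

# a small condition effect hidden under a large sex effect
y <- 1 * (condition == "trt") + 4 * (sex == "m") + rnorm(20)

# ignoring sex: its variability inflates the residual variance
p_unblocked <- summary(lm(y ~ condition))$coefficients["conditiontrt", 4]

# blocking on sex: the sex variability is absorbed by the model
p_blocked <- summary(lm(y ~ sex + condition))$coefficients["conditiontrt", 4]

c(unblocked = p_unblocked, blocked = p_blocked)
```

The blocked model yields a smaller p-value for the same condition effect, which is exactly the mechanism at work when sex enters the DESeq2 design.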
Let's see whether we actually profit from an increased power:

```{r pwrblocking}
summary(res_sex)
```

Indeed, we do: the number of differentially expressed genes detected
increases to `r sum(res_sex$padj < 0.1)` from
`r sum(results(dds_batch)$padj < 0.1)` previously.

# Factor analysis to tackle unwanted variation

## Factor analysis

Factor analysis^[read about it in
[Cosma Shalizi's great book](http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/)]
is the modelling of data via a small number of latent random variables
called factors. We can use ideas from factor analysis to tackle batch
effects / unwanted variation. The basic factor analysis model takes an
$$n \times p$$ data matrix ($$n$$ samples and $$p$$ features) and models it
via a sum of $$q$$ probabilistic factors $$f$$ plus an error term $$E$$:

$$
X^{[n \times p]} = \sum_{i = 1}^q f_i^{[n \times 1]} w_i^{[1 \times p]} + E^{[n \times p]} = Fw + E
$$

So just like in PCA, the data is modeled as a weighted sum of new variables.
However, the main and essential difference to PCA is that the factors, just
like the error term, are modeled as __random__ rather than deterministic
(the error terms being independent from the factors and from each other).
Thus, you can think of factors in factor analysis as random versions of
principal components.
If we assume that

$$f_{ij} \sim N(0,1) \text{ for } j = 1, \dotsc, n$$

and that the factors are independent across samples as well as variables,
we get that marginally every observation follows a multivariate normal
distribution with mean zero and covariance $$\varphi_i + w^Tw$$:

$$
X_i \sim \mathbf N(0, \varphi_i + w^Tw)
$$

where the $$\varphi_{i}$$ are called the "specific variances" and represent
the variances of the error terms; $$w$$ is the factor loading matrix.
This equation forms the basis of maximum likelihood estimation in factor
analysis. Factor analysis is often motivated as "preserving correlations"
between the input variables. However, the term factor analysis is generally
used for models which represent the data as sums of latent random variables.

For advanced factor analysis methods in 'omics research see @Buettner_2017
and @Argelaguet_2017.

## Factor analysis to tackle batch effects

We can now try factor analysis on our data set: we will use the variance
stabilized data to estimate latent factors. The function factanal will
compute those for us; we will try to estimate 4 factors. The function
requires a centered and scaled data matrix or a correlation / covariance
matrix as input. Since the function uses maximum likelihood estimation to
obtain the factor loadings, it requires the inverse of the correlation
matrix.
However, as we have 1000 genes and only 16 samples, the usual correlation
estimate cannot be inverted and we need to use a shrinkage estimator of the
correlation matrix [@Sch_fer_2005] as implemented in `r CRANpkg("corpcor")`.

The variables need to be in the columns, so we need to transpose the rld
matrix. Traditionally, the loadings are of primary interest in factor
analysis; if a correlation matrix is given, the function will not even
compute the factor scores. Thus, we need to compute them explicitly using a
regression method.

```{r runfactor}
factor_ana <- function(X, ntop = 500, ...){

  cen_scaled_input <- scale(t(X))

  pvars <- colVars(cen_scaled_input)
  select <- order(pvars, decreasing = TRUE)[seq_len(min(ntop, length(pvars)))]

  cen_scaled_input <- cen_scaled_input[, select]

  cor_matrix <- cor.shrink(cen_scaled_input)

  rld_batch_fac <- factanal(covmat = cor_matrix, factors = 4,
                            n.obs = ncol(cen_scaled_input),
                            rotation = "none")

  loadings <- rld_batch_fac$loadings[, 1:4]

  factors <- as_data_frame(cen_scaled_input %*%
                             solve(cor_matrix, loadings)) %>%
    add_column(...)

  return(factors)
}
```

We can now estimate the factors by maximum likelihood and create a 2d
scatterplot of them.
```{r runfactanal}
factors_rld_batch <- factor_ana(rld_batch, ntop = 1000,
                                sex = batch_pca$pca$sex,
                                condition = batch_pca$pca$condition,
                                gen = batch_pca$pca$gen)

batch_fac_plot <- ggplot(factors_rld_batch, aes(x = Factor1, y = Factor2,
                                                color = condition,
                                                shape = sex)) +
  geom_point(size = 4) +
  ggtitle("Factor 1 vs Factor 2, all genes") +
  coord_fixed(ratio = sd_ratio) +
  scale_colour_tableau(palette = "tableau10")

batch_fac_plot
```

As we can see, it is similar to the PCA plot for the data, although factor 2
seems to separate the different sexes less than PC2. This does not have to
be the case in general. In fact, the term factor analysis is sometimes
employed if PCA--like procedures are used, and some factor analysis
estimation algorithms use a PCA--like algorithm as an initial
solution.^[For a detailed description of both methods, I recommend
Chap. 16 / 17 of [Shalizi's book](http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/).]

### Exercise: factor analysis

Try to make sense of the factors: check the relationship between
factors 1/2 and condition / sex graphically.
Is there a factor which is related to  Bernd Klaus committed Nov 29, 2017 706 the genetic background?  Bernd Klaus committed Nov 22, 2017 707   Bernd Klaus committed Dec 11, 2017 708 {r facvspca, results="hide", fig.show="hide", echo=FALSE}  Bernd Klaus committed Nov 22, 2017 709   Bernd Klaus committed Dec 11, 2017 710 ggplot(factors_rld_batch, aes(condition, Factor1, color = condition)) +  Bernd Klaus committed Nov 29, 2017 711  geom_jitter(height = 0, width = 0.1) +  Bernd Klaus committed Dec 11, 2017 712  ggtitle("Factor 1 by condition") +  Bernd Klaus committed Nov 29, 2017 713 714  geom_boxplot(alpha = 0.1) + scale_color_fivethirtyeight()  Bernd Klaus committed Nov 22, 2017 715   Bernd Klaus committed Dec 11, 2017 716 t.test(Factor1 ~ condition, data = factors_rld_batch)  Bernd Klaus committed Nov 22, 2017 717 718   Bernd Klaus committed Dec 11, 2017 719 ggplot(factors_rld_batch, aes(sex, Factor2, color = sex)) +  Bernd Klaus committed Nov 29, 2017 720  geom_jitter(height = 0, width = 0.1) +  Bernd Klaus committed Dec 11, 2017 721  ggtitle("Factor 2 by sex") +  Bernd Klaus committed Nov 29, 2017 722 723  geom_boxplot(alpha = 0.1) + scale_color_fivethirtyeight()  Bernd Klaus committed Nov 22, 2017 724   Bernd Klaus committed Dec 11, 2017 725 t.test(Factor2 ~ sex, data = factors_rld_batch)  Bernd Klaus committed Nov 22, 2017 726 727   Bernd Klaus committed Nov 29, 2017 728 729 cor(as.numeric(batch_pca$pca$sex), factors_rld_batch$Factor2) cor(as.numeric(batch_pca$pca$condition), factors_rld_batch$Factor2)  Bernd Klaus committed Nov 22, 2017 730 731   Bernd Klaus committed Dec 11, 2017 732 cor(as.numeric(factors_rld_batch$gen), factors_rld_batch$Factor3)  Bernd Klaus committed Nov 29, 2017 733   Bernd Klaus committed Nov 22, 2017 734   Bernd Klaus committed Nov 29, 2017 735 ## Including estimated factors in a linear model  Bernd Klaus committed Nov 22, 2017 736   Bernd Klaus committed Dec 11, 2017 737 738 739 We have seen in the plot that factor 1 seems to 
capture the differences between groups, while factor 2 captures the sex
differences. Factors 3 and 4 seem to be unrelated to the design. Just as
we included sex as a known batch before, we can now additionally include
the computed factors 3 and/or 4 in the model to see whether this
actually improves power.

```{r factordesign}
colData(dds_batch) <- cbind(colData(dds_batch),
                            select(factors_rld_batch, 1:4))
design(dds_batch) <- ~ sex + Factor3 + Factor4 + condition
dds_batch_f <- DESeq(dds_batch)
res_f <- results(dds_batch_f)
summary(res_f)
summary(res_sex)
```

The number of differentially expressed genes detected is now
`r sum(na.exclude(res_f$padj) < 0.1)` compared to
`r sum(res_sex$padj < 0.1)` when only blocking for sex.

## Extending factor analysis to include known covariates

```{r secondfactor}
ggplot(factors_rld_batch, aes(condition, Factor1, color = condition)) +
  geom_jitter(height = 0, width = 0.1) +
  ggtitle("Factor 1 by condition") +
  geom_boxplot(alpha = 0.1) +
  scale_color_fivethirtyeight()

t.test(Factor1 ~ condition, data = factors_rld_batch)
```

In our factor analysis, we have seen that the first factor
captures the experimental design, and the other factors seemed to
capture unwanted variation we wanted to control for.
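One way to make this impression more quantitative is to regress each
estimated factor on the known design variables and inspect the ANOVA
tables: factors with no evidence of association with `condition` are
safe candidates for inclusion as nuisance covariates. The following
sketch (not evaluated here) assumes `factors_rld_batch` contains the
columns `Factor1`--`Factor4`, `condition`, `sex` and `gen` as above; the
helper is illustrative and not part of the original analysis.

```{r factor_design_assoc, eval = FALSE}
# for each factor, fit factor ~ condition + sex + gen and tidy the ANOVA
assoc_table <- map_dfr(paste0("Factor", 1:4), function(f) {
  fit <- lm(reformulate(c("condition", "sex", "gen"), response = f),
            data = factors_rld_batch)
  mutate(tidy(anova(fit)), factor = f)
})

# factors with a large p-value for `condition` are unrelated to the
# design and can be added to the DESeq2 model without absorbing the
# condition effect
filter(assoc_table, term != "Residuals") %>%
  select(factor, term, statistic, p.value)
```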
Can we somehow formalize this? That is, can we fit a model like:

data = experimental design + latent factors + noise

that contains both our experimental setup as well as latent factors?
The answer is "yes", but there is an important difficulty: the latent
factors can be **confounded** with our experimental
design.^[See [Wang et al.](https://arxiv.org/abs/1508.04178) for an
elaboration on that point.] As our simulated RNA--Seq data contains sex
and genetic background as known factors, we can run a multiple
regression of the known factors on the experimental condition vector to
see how much confounding we have:

```{r mult_reg_confounding, warning=FALSE}
confounders <- cbind(as.numeric(colData(dds_batch)$sex),
                     as.numeric(colData(dds_batch)$gen_back),
                     as.numeric(colData(dds_batch)$condition))

colnames(confounders) <- c("sex", "gen_back", "condition")

tab_conf <- tidy(lm(confounders[, 1:2] ~ condition,
                    data = as.data.frame(confounders))) %>%
  filter(term != "(Intercept)")

mutate_if(tab_conf, is_numeric, round, digits = 3)
```

As we can see, there is no confounding between sex and condition, but
there is between the genetic background and the condition.
They cannot be easily disentangled.^[See also
[Gerard and Stephens, 2017](http://dcgerard.github.io/research/2017/06/29/ruv.html).]

Popular methods like surrogate variable analysis (`r Biocpkg("sva")`)
and RUV (`r Biocpkg("RUVnormalize")`, `r Biocpkg("RUVseq")`,
`r Biocpkg("RUVcorr")`) try to solve this problem by tricks like
down--weighting genes influenced by the experimental design (so with
$$\mathbf{\beta}$$ near 0), robust regression methods
(`r CRANpkg("cate")`), or using control genes to estimate the latent
factors.

We will now run sva on our data to see whether it can disentangle the
latent factors from the experimental conditions. It requires the
definition of a null model, which contains a priori variables we want to
adjust for (like sex), and a full model, which contains the experimental
design plus the adjustment variables. We use the rlog data as input to
sva. First, the total number of SVs is guessed, then they are computed.
```{r sva_sim}
mod <- model.matrix(~ sex + condition, data = colData(dds_batch))
mod0 <- model.matrix(~ sex, data = colData(dds_batch))

n.sv <- num.sv(rld_batch, mod, method = "be", B = 100, vfilter = 500)
sva_res <- sva(rld_batch, mod, mod0, n.sv = n.sv, vfilter = 500, B = 20)

ggplot(data_frame(genetic_back = factors_rld_batch$gen,
                  sv = as.vector(sva_res$sv)),
       aes(genetic_back, sv, color = genetic_back)) +
  geom_jitter(height = 0, width = 0.1) +
  ggtitle("Surrogate variable by genetic background") +
  geom_boxplot(alpha = 0.1) +
  scale_color_fivethirtyeight()

anova(lm(as.vector(sva_res$sv) ~ factors_rld_batch$gen))
```

Interestingly, the computed surrogate variable seems to capture the
genetic background of our samples, something we could not achieve with
PCA or factor analysis alone. Let's look at the power we have:

```{r svdesign}
colData(dds_batch)$sv <- as.vector(sva_res$sv)
design(dds_batch) <- ~ sv + sex + condition
dds_batch_sva <- DESeq(dds_batch)
res_sva <- results(dds_batch_sva)
summary(res_sva)
```

Unfortunately, although sva is able to capture the genetic background,
it is too confounded with condition, leaving us with zero detected
genes. Ultimately, genetic background and experimental condition are too
confounded with each other, so that we cannot really disentangle the two.
It is remarkable, though, that sva can actually identify the genetic
background from the data.

# Latent factors for sc--RNA--Seq: The ZINB-WaVE model

@Brennecke_2015 used the method of @Buettner_2015 to regress out the
influence of the cell cycle from the gene expression data. Briefly,
@Buettner_2015 perform a factor analysis--like algorithm to estimate a
latent factor on genes annotated to the cell cycle. Here, we want to use
the method of @Risso_2018 instead. There is also a more recent package,
`r Biocpkg("scone")` [@Cole_2018], available from some of the authors,
which deals directly with normalization of sc--RNA--Seq data.

They propose a zero--inflated negative binomial model for (sc--)RNA--Seq
data, implemented in the package `r Biocpkg("zinbwave")`. In contrast to
many other methods proposed for single cell RNA--Seq data, it uses the
NB distribution, a distribution directly tailored to count data.

![](zinba.png)

The model is quite comprehensive.

1. First, it models "additional zeros", as single cell RNA--Seq data is
commonly thought to contain more zeros than predicted by the NB
distribution.
As the additional zeros are believed to be due to technical factors,
such as the efficiency of reverse transcription, these additional zeros
are also known as "dropouts".^[On the other hand, the NB model does not
seem to be too bad for sc data, see e.g. Valentine Svensson's blog posts
on this matter
[here](http://www.nxn.se/valent/2017/11/16/droplet-scrna-seq-is-not-zero-inflated)
and
[here](http://www.nxn.se/valent/2018/1/30/count-depth-variation-makes-poisson-scrna-seq-data-negative-binomial).]
The proportion of additional zeros is modeled by $$\pi$$, while $$\mu$$
models the counts.

2. The **green component** includes known sample (cell) level
covariates, such as the experimental group a sample belongs to. This
part is used to reflect the experimental design and known batches. It
includes a column of ones to account for gene specific baseline
expression levels.

3. The **dark blue component** contains gene level covariates, such as
GC--bias or gene--length corrections.

4. The **red component** contains a factor model which models unknown
cell--level "factors" (random variables) that explain everything that is
not captured by known sample or gene level covariates.

The model also allows for sample--matrix sized offsets, which can be
used to include pre--computed cell specific size factors.
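Putting the colored components together, the two parts of the model can
be written schematically (following the notation of @Risso_2018; the
symbols match the figure above) as:

$$\ln(\mu) = X\beta_\mu + (V\gamma_\mu)^T + W\alpha_\mu + O_\mu$$

$$\operatorname{logit}(\pi) = X\beta_\pi + (V\gamma_\pi)^T + W\alpha_\pi + O_\pi$$

Here, $$X$$ holds the sample--level covariates (green), $$V$$ the
gene--level covariates (dark blue), $$W$$ the unknown latent factors
(red), and $$O_\mu$$, $$O_\pi$$ are the offset matrices. Note that both
the mean counts and the dropout proportions get their own regression
coefficients, so a covariate can influence the two parts of the model
differently.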
The estimation of the model works by penalized maximum likelihood, where
sets of parameters are optimized in turn: after initialization, the
dispersion is optimized, then the cell specific components, and then the
gene specific components.

Finally, it is made sure that the latent factors $$W$$ themselves, as
well as their loading vectors $$\alpha$$, are independent of each other,
just as is assumed in ordinary factor analysis.

## Running zinbwave

We will now run `r Biocpkg("zinbwave")` on the single cell data. We
select only those cells that were not selected based on a surface marker
(the other cells have been used for validation in the original paper).
As `r Biocpkg("zinbwave")` estimates intercept column parameters as well
(by default), we can use the raw counts as input and don't need to
provide our own normalization.

```{r runzinbawave}
no_marker_cells <- filter(mtec_cell_anno, SurfaceMarker == "None")
count_matrix_nomarker <- as.matrix(mtec_counts[, no_marker_cells$cellID])
rownames(count_matrix_nomarker) <- mtec_counts$ensembl_id

# the fit takes a long time, so we load a precomputed result below
# zinb <- zinbFit(count_matrix_nomarker,
#                 K=1, epsilon=1000)
# save(zinb, file = file.path(data_dir, "zinb.RData"))
```

## Obtaining normalized and corrected data

We now obtain the normalized and corrected data.
Note that, in contrast to @Brennecke_2015, we do not estimate the latent
factor based on genes annotated to the cell cycle, but simply let
`r Biocpkg("zinbwave")` estimate one latent factor. We then extract the
residuals from the model, i.e. the original data with the green, blue
and red parts subtracted, and store them in a SingleCellExperiment
object, to which we can attach the cell annotation as colData.

```{r zinbanormcorrected}
load(file.path(data_dir, "zinb.RData"))

mtec_norm <- zinbwave(SummarizedExperiment(count_matrix_nomarker),
                      fitted_model = zinb,
                      normalizedValues = TRUE,
                      residuals = TRUE)

# keep only the residuals in the summarized experiment
assay(mtec_norm, 1) <- NULL
assay(mtec_norm, 1) <- NULL

colData(mtec_norm) <- DataFrame(no_marker_cells)
rowData(mtec_norm) <- DataFrame(filter(mtec_gene_anno,
                                       ensembl_id %in% rownames(mtec_norm)))
mtec_norm
```

## Checking the normalization

We check the normalization by creating boxplots for a random set of 25
cells. The boxplots are ordered by median expression. We can appreciate
that the normalization has worked successfully.

```{r checkscnorm}
mtec_norm_tidy <- as.data.frame(assay(mtec_norm)) %>%
  {
    colnames(.) <- colData(mtec_norm)$cellID
    .[, sample(ncol(mtec_norm), 25)]
  } %>%
  gather(key = "cell", value = "expression")

medians <- mtec_norm_tidy %>%
  group_by(cell) %>%
  summarize(med = median(expression)) %>%
  arrange(med)

mtec_norm_tidy$cell %<>% factor(levels = medians$cell)

pl <- ggplot(mtec_norm_tidy, aes(cell, expression, color = cell)) +
  geom_boxplot() +
  guides(color = FALSE) +
  ggtitle("zinbwave normalized/cleaned expression data [log]") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

pl
# ggsave("norm.pdf", pl, width = 28, height = 14)
```

# Clustering of the normalized and cleaned sc--data

We can now move on to the clustering of the single--cell data and
compare our clusters to the ones from @Brennecke_2015.

We will perform the clustering using the `r Biocpkg("clusterExperiment")`
package, which implements a comprehensive resampling based clustering
strategy called RSEC (Resampling-based Sequential Ensemble Clustering).
It first creates many clusterings using subsampling and different
parameter values for the clustering algorithms. It then creates a
consensus clustering from those, and finally merges clusters in this
consensus clustering which show a low proportion of DE--genes between
them. This avoids erroneous choices of the total number of
clusters.^[A detailed description of the individual steps can be found
in the [package vignette](http://bioconductor.org/packages/release/bioc/vignettes/clusterExperiment/inst/doc/clusterExperimentTutorial.html#workflow).]

The package is extremely feature rich, and we will only use a subset of
the available features here to mimic what was done in @Brennecke_2015.
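For orientation, the three RSEC stages just described are also exposed
as a single wrapper function. The sketch below is left commented out, in
the same spirit as the long-running `zinbFit` call above: the argument
names follow the `r Biocpkg("clusterExperiment")` vignette and may
differ between package versions, and we will instead use the lower-level
functions to mimic the published analysis.

```{r rsec_sketch}
# hedged sketch of the all-in-one RSEC() interface (not run):
# rsec_cl <- RSEC(data_for_cl,
#                 k0s = 4:15,            # candidate cluster numbers
#                 alphas = c(0.1, 0.2),  # tightness of the base clusterings
#                 subsample = TRUE,      # resample before each clustering
#                 mergeMethod = "adjP")  # merge clusters with few DE genes
```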
We first import the clustering and subset our data to include only
those genes that were used for clustering in @Brennecke_2015. We also
remove all the genes that were not assigned to a cluster in the original
publication.^[This "leftover" cluster has the number 12.]

We then add the published clustering to the rowData so that we can
compare it to our own clustering.

```{r clusteringprep}
load(file.path(data_dir, "genes_for_clustering.RData"))
load(file.path(data_dir, "nomarkerCellsClustering.RData"))

mtec_norm_sub <- mtec_norm[genes_for_clustering, ]
rowData(mtec_norm_sub)$pub_cls <- cl_class_ids(
  nomarkerCellsClustering[["consensus"]])

leftover_genes <- rowData(mtec_norm_sub)$pub_cls == 12

data_for_cl <- assay(mtec_norm_sub[!leftover_genes, ])
colnames(data_for_cl) <- colData(mtec_norm_sub)$cellID
```

`r Biocpkg("clusterExperiment")` will cluster the cells by default,
while we want to cluster the genes. Thus, we need to feed the transposed
data to the clusterSingle command. We run pam, a variant of k--means
clustering, with a cluster number of 11, and follow a subsampling
strategy: the algorithm is run on subsamples of genes (70% by default)
and then the genes which co--occur often within a cluster across
sub--sampling steps ar