Commit 80c7cb31 authored by Bernd Klaus's avatar Bernd Klaus

finished zinbawave part, started to look a sc clustering

parent 4d3fb36e
......@@ -65,7 +65,6 @@ library("factoextra")
library("magrittr")
library("entropy")
library("forcats")
library("forcats")
library("readxl")
library("DESeq2")
library("broom")
......@@ -81,7 +80,10 @@ library("limma")
library("ggthemes")
library("corpcor")
library("broom")
library("sva")
library("zinbwave")
library("clusterExperiment")
library("clue")
theme_set(theme_gray(base_size = 18))
......@@ -887,7 +889,7 @@ summary(res_sva)
Unfortunately, although sva is able to capture the genetic backround, it
is too confounded with condition leaving us with zero detected genes.
## Latent factors for sc--RNA Seq: The ZINB-WaVE model
# Latent factors for sc--RNA Seq: The ZINB-WaVE model
@Brennecke_2015 used the method of @Buettner_2015 to regress out
the influence of the cell cycle from the gene expression data. Briefly,
......@@ -919,22 +921,148 @@ the experimental group a sample belongs to. This part is used to reflect
the experimental design and known batches. It includes a column of ones to
account for gene specific baseline expression levels.
3. The **dark blue component** are gene level covariates such as gc--bias corrections
Here we can also include our pre--computed cell specific size factors.
3. The **dark blue component** are gene level covariates such as gc--bias or
gene--length corrections.
4. The **red component** contains a factor model which models unknown cell--level
"factors" (random variables) that explain everything that is not captured by
known sample or gene level covariates.
The model also allows for sample--matrix sized offsets, which can be used to
include our pre--computed cell specific size factors.
The estimation of the model works by penalized maximum likelihood, where
sets of parameters are optimized at a time: After initialization, the
dispersion is optimized, then the cell specific componenents and then
the gene specific components.
Finally, it is made sure that the latent factors \(W\) itself as well as
their loading vectors \(\alpha\) are approximately independent of each other,
their loading vectors \(\alpha\) are independent of each other,
just as it is assumed in ordinary factor analysis.
## Running zinbwave
We will now run
`r Biocpkg("zinbwave")` on the single cell data, we select only those cells
that were not select based on a surface marker. We also include the
scran--based size factors for normalization as offsets (on the negative
log scale as the counts are modeled on the log scale and we need to
devide by the size factors in order to normalize the data). However,
we do let `r Biocpkg("zinbwave")` estimate intercept column parameters
as well (the default) and this way estimate both cell and gene--specific
normalization factors.
```{r runzinbawave}
no_marker_cells <- filter(mtec_cell_anno, SurfaceMarker == "None")
count_matrix_nomarker <- as.matrix(mtec_counts[, no_marker_cells$cellID])
rownames(count_matrix_nomarker) <- mtec_counts$ensembl_id
O_mu <- t(matrix(rep(-log(no_marker_cells$scran_sf),
nrow(count_matrix_nomarker)),
nrow = nrow(count_matrix_nomarker),
ncol = ncol(count_matrix_nomarker),
byrow = TRUE))
# zinb <- zinbFit(count_matrix_nomarker,
# O_mu = O_mu,
# K=1, epsilon=1000)
# save(zinb,file = "zinb.RData")
```
## Obtaining normalized and corrected data
We now obtain the normalized and corrected data. Note that in contrast
to @Brennecke_2015 we do not estimate the latent factor based on genes
annotated to cell cycle but simply let `r Biocpkg("zinbwave")` estimate
one latent factor. We then extract the residuals from the model, i.e.
the green blue and red parts substracted from the original data and
store them in a `SingleCellExperiment` object to which we can attach
the cell annotation as `colData`.
```{r zinbanormcorrected}
load("zinb.RData")
mtec_norm <- zinbwave(SummarizedExperiment(count_matrix_nomarker),
fitted_model=zinb,
normalizedValues=TRUE,
residuals = TRUE)
# keep only the residuals in the summarized experiment
assay(mtec_norm, 1) <- NULL
assay(mtec_norm, 1) <- NULL
colData(mtec_norm) <- DataFrame(no_marker_cells)
rowData(mtec_norm) <- DataFrame(filter(mtec_gene_anno,
ensembl_id %in% rownames(mtec_norm)))
mtec_norm
```
## Clustering of the normalized and cleaned sc--data
We can now move on to the clustering of the single--cell data and compare
our clusters to the ones from @Brennecke_2015. We will perform the clustering
using the `r Biocpkg("clusterExperiment") ` package, which implements
a comprehensive resampling based clustering strategy called
RSEC (Resampling-based Sequential Ensemble Clustering). It first creates
many clusterings using subsampling and different parameter values for
the clustering algorithms.
It then creates a consensus clustering from those, and finally merges
clusters in this consensus clustering,
which show a low proportion of DE--genes between them.
This avoid errenous choices of the total number of clusters.
^A detailled description of the individual steps can be found in the [package vignette](http://bioconductor.org/packages/release/bioc/vignettes/clusterExperiment/inst/doc/clusterExperimentTutorial.html#workflow)
We first import the clustering and subset our data to include only
those genes that were used for clustering in the original publication.
We then add the published clustering to the `rowData` so that we can
compare it to our own clustering. RSEC is run using `pam` a variant of
k--means clustering.
sad
```{r clusteringprep}
load(file.path(data_dir, "genes_for_clustering.RData"))
load(file.path(data_dir,"nomarkerCellsClustering.RData"))
mtec_norm_sub <- mtec_norm[genes_for_clustering, ]
rowData(mtec_norm_sub)$pub_cls <- cl_class_ids(
nomarkerCellsClustering[["consensus"]])
leftover_genes <- rowData(mtec_norm_sub)$pub_cls == 12
data_for_rsec <- assay(mtec_norm_sub[!leftover_genes,])
colnames(data_for_rsec) <- colData(mtec_norm_sub)$cellID
cl_mtec <- clusterSingle(t(data_for_rsec),
subsample = TRUE,
isCount = FALSE,
clusterLabel = "mtec_sub",
dimReduce = "PCA",
ndims = 10,
mainClusterArgs = list(minSize = 10,
clusterFunction = "pam",
clusterArgs=list(k=11)),
subsampleArgs = list(clusterFunction = "pam",
clusterArgs=list(k=11),
largeDataset = TRUE,
random.seed = 1234,
ncores = 1,
resamp.num = 10))
#cl_mtec <- cluster::pam(data_for_rsec,k=11,cluster.only=TRUE)
#save(cl_mtec, file = "cl_mtec.RData")
```
```{r runscclustering}
```
<!-- Single cell data is commonly used to infer (developmental) hierarchies of single cells. -->
<!-- For a nice bioconductor package wrapping a lot of dimension reduction -->
<!-- techniques for single cell data, see `r Biocpkg("sincell") ` and the -->
......@@ -946,7 +1074,6 @@ just as it is assumed in ordinary factor analysis.
```{r session_info, cache = FALSE}
sessionInfo()
```
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment