Commit 249a86bd authored by Bernd Klaus

finished var transform section, now on to dim reduction and clustering

parent efda91cc
......@@ -304,7 +304,13 @@ counts_norm_crc_mat <- select(counts_crc_tidy, sample_id, ensembl_id, count_norm
as.matrix()
meanSdPlot(counts_norm_crc_mat)$gg +
ylim(c(0, 10))
ylim(c(0, 2000))
## ----varStabCountData----------------------------------------------------
counts_norm_trans_crc <- varianceStabilizingTransformation(counts_crc)
meanSdPlot(counts_norm_trans_crc)$gg +
ylim(c(0, 5))
## ----session_info, cache = FALSE-----------------------------------------
sessionInfo()
......
......@@ -613,7 +613,7 @@ counts_crc_tidy <- left_join(counts_crc_tidy,
We can see that the computed size factors are essentially the scaled medians
on the raw count scale; however, they are not identical.
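To make the "scaled medians" idea concrete, here is a minimal, purely
illustrative sketch of the median--of--ratios computation that underlies such
size factors. It is not the exact course code; it assumes the tidy table
`counts_crc_tidy` with the columns `ensembl_id`, `sample_id` and `count`
used above, and the dplyr verbs already loaded for this document.

```{r size_factor_sketch, eval = FALSE}
# illustrative median-of-ratios size factors on the tidy count table
sf_by_ratio <- counts_crc_tidy %>%
  group_by(ensembl_id) %>%
  # pseudo-reference sample: geometric mean of each gene across samples
  mutate(geo_mean = exp(mean(log(count)))) %>%
  ungroup() %>%
  # genes with a zero count in any sample drop out of the reference
  filter(geo_mean > 0) %>%
  group_by(sample_id) %>%
  # size factor: median ratio of a sample's counts to the reference
  summarize(sf_ratio = median(count / geo_mean))

sf_by_ratio
```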
margin^[Note that it is nor recommended to scale sequencing libraries by
margin^[Note that it is not recommended to scale sequencing libraries by
the total library size: it is not robust and artificially limits the range
of the data to 0--1]. We can now normalize the data:
......@@ -624,8 +624,7 @@ counts_crc_tidy <- mutate(counts_crc_tidy, count_norm = count / sf )
## Exercise: Data normalization
Normalize the data using the computed size factors and create boxplots
of the normalized data
Create boxplots of the normalized data and compare them to the raw data.
```{r data_norm_plot}
......@@ -680,7 +679,7 @@ The variance of count data depends on its mean, i.e. the higher
the counts, the higher their variance. This is readily seen in a
plot of variance against the mean. We use the function `meanSdPlot`
from the `r Biocpkg("vsn") ` package to create such a plot for the
normalized RNA--Seq data.
normalized bulk RNA--Seq data.
This function accepts an expression matrix as an input, so we need
to turn our tidy data table into an expression matrix first. The
......@@ -733,7 +732,7 @@ A simple variance stabilization method for count data is a log + a pseudocount.
Note that for negative binomially distributed data:
\[
\text{Var(log(counts + pseudocount))} \approx = \frac{1}{counts + pseudocount} + \alpha
\text{Var(log(counts + pseudocount))} \approx \frac{1}{counts + pseudocount} + \alpha
\]
where $\alpha$ is the dispersion. The formula shows that for genes with
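One way to see where this approximation comes from is a standard first--order
Taylor (delta--method) argument; the intermediate notation below is ours. For
negative binomially distributed counts with mean $\mu$ and dispersion $\alpha$
we have $\text{Var}(counts) = \mu + \alpha \mu^2$, and hence

\[
\text{Var}(\log(counts + pseudocount)) \approx
\frac{\mu + \alpha \mu^2}{(\mu + pseudocount)^2},
\]

which behaves like $1/\mu + \alpha$ once the counts are large compared to the
pseudocount: for weakly expressed genes the $1/\mu$ term dominates, while for
strongly expressed genes the variance approaches the constant $\alpha$.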
......@@ -747,14 +746,291 @@ In general, a careful modelling of the variance--mean relationship is
preferable to ad--hoc solutions.
The package `r Biocpkg("DESeq2")` provides variance stabilizations that
are tailored to RNA--Seq data. We use its function
are based on refined approximations using a fitted dispersion--mean
relationship. We use its function
`varianceStabilizingTransformation` to obtain normalized and variance--stabilized
count data.
count data. margin^[Note that the size factors for the bulk RNA--Seq data set differ
widely. For this scenario, `r Biocpkg("DESeq2")` recommends a different
transform called `rlog`.]
```{r varStabCountData}
counts_norm_trans_crc <- varianceStabilizingTransformation(counts_crc)
meanSdPlot(counts_norm_trans_crc)$gg +
ylim(c(0, 5))
```
We can see from the plot that the transformation has worked: the data
are now normalized and the variance--mean dependency has been alleviated.
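As a point of comparison, here is a minimal sketch (assuming the objects
`counts_norm_crc_mat` and `counts_crc` from the chunks above): the simple
log + pseudocount transform discussed earlier, and the `rlog` transform that
the margin note recommends when size factors differ widely, can be assessed
with the same `meanSdPlot` diagnostic.

```{r log_vs_rlog_sketch, eval = FALSE}
# naive log2(x + 1) on the normalized counts, for comparison with the VST
meanSdPlot(log2(counts_norm_crc_mat + 1))$gg +
  ylim(c(0, 5))

# regularized log transform from DESeq2, suggested when size factors differ widely
counts_rlog_crc <- rlog(counts_crc)
meanSdPlot(counts_rlog_crc)$gg +
  ylim(c(0, 5))
```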
<!-- \subsection{PCA for bulk RNA--Seq data} -->
<!-- A PCA represents the individual samples in our RNA--Seq data -->
<!-- set by two or more ``artificial'' variables, -->
<!-- which are called Principal Components (PCs). The components -->
<!-- are linear combinations of the genes and the linear -->
<!-- combination is chosen in such a way that the euclidean distance between the -->
<!-- individuals using centered and standardized variables is minimized. -->
<!-- If $X = (x_1, \dotsc, x_p)$ denotes the original data matrix -->
<!-- (i.e. our rlog--transformed data), then the columns of -->
<!-- the transformed data matrix, also called the ``principal components'' -->
<!-- or ``(PC) scores'', is simply given by: -->
<!-- $$ -->
<!-- \text{PC}_i= Xa_i, \quad i = 1, \dotsc, p -->
<!-- $$ -->
<!-- Where $a_i$ is a vector of length $p$. The $a_i$ vectors are commonly -->
<!-- called ``loadings" or ``principal axes". These loadings are usually -->
<!-- the eigenvectors of the correlation matrix of the data. -->
<!-- It can be shown that choosing the loadings in this way leads to a -->
<!-- transformation that generates principal components that have maximal -->
<!-- variance and preserve as well as possible the distances between the original -->
<!-- observations. -->
<!-- If we have many variables (genes) as in our case ($p = 1000$), it is advisable -->
<!-- to perform a selection of variables since this leads to more stable PCA results. -->
<!-- Here, we simply use the 500 most variable genes. -->
<!-- << PCA_simData, dependson="meanSdrlog" >>= -->
<!-- ntop = 500 -->
<!-- pvars <- rowVars(rldSim) -->
<!-- select <- order(pvars, decreasing = TRUE)[seq_len(min(ntop, -->
<!-- length(pvars)))] -->
<!-- PCA <- prcomp(t(rldSim)[, select], scale. = FALSE) -->
<!-- percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1) -->
<!-- dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2], -->
<!-- PC3 = PCA$x[,3], PC4 = PCA$x[,4], -->
<!-- sex = colData(dds)$sex, -->
<!-- condition = colData(dds)$condition) -->
<!-- (qplot(PC1, PC2, data = dataGG, color = condition, shape = sex, -->
<!-- main = "PC1 vs PC2, top variable genes", size = I(6)) -->
<!-- + labs(x = paste0("PC1, VarExp:", round(percentVar[1],4)), -->
<!-- y = paste0("PC2, VarExp:", round(percentVar[2],4))) -->
<!-- + scale_colour_brewer(type="qual", palette=2) -->
<!-- ) -->
<!-- @ -->
<!-- We can see that the first PC separates the two experimental conditions, -->
<!-- while the second one splits the samples by sex. This shows that the PCA can -->
<!-- be used to visualize and detect batch effects. Note that the principal -->
<!-- components are commonly ordered according to ``informativeness'', i.e. -->
<!-- the percentage of the total variance in the data set they explain. So -->
<!-- a separation according to the first PC is stronger than a separation -->
<!-- according to the second one. In the simulated data, the experimental -->
<!-- group is more important than the sample batch (= sex). -->
<!-- \subsection{Heatmaps and hierarchical clustering} -->
<!-- Another very common visualization technique is a heatmap. Analogous to a PCA -->
<!-- we can visualize the pairwise distances between the samples using a heatmap, -->
<!-- i.e. a false color representation of the distance between two samples. Here -->
<!-- we use the euclidean distance of all the genes per sample. This can be conveniently -->
<!-- computed using the function \Rfunction{dist}. -->
<!-- This heatmap is then ordered via hierarchical clustering. Hierarchical clustering -->
<!-- starts with as many clusters as there are samples and successively merges samples -->
<!-- that are close to each other. This merging process is commonly visualized as -->
<!-- a tree--like graphic called a dendrogram. -->
<!-- <<SimData_heatmap_and_clustering, fig.height = 5, dependson="meanSdrlog">>= -->
<!-- dists <- as.matrix(dist(t(rldSim))) -->
<!-- rownames(dists) <- paste0(colData(dds)$sex, -->
<!-- "_", str_sub(rownames(colData(dds)), 7)) -->
<!-- colnames(dists) <- paste0(colData(dds)$condition, -->
<!-- "_", str_sub(rownames(colData(dds)), 7)) -->
<!-- hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) -->
<!-- heatmap.2(dists, trace="none", col = rev(hmcol)) -->
<!-- @ -->
<!-- Again we see that samples cluster mainly by experimental condition, but also -->
<!-- by batch. -->
<!-- \section{Multidimensional scaling (MDS)} -->
<!-- Multidimensional scaling is a technique alternative to PCA. -->
<!-- MDS takes a set of dissimilarities -->
<!-- and returns a set of points such that the distances between -->
<!-- the points are approximately equal to the dissimilarities. -->
<!-- We can think of ``squeezing'' a high-dimensional point cloud into -->
<!-- a small number of dimensions (2, perhaps 3) while -->
<!-- preserving as well as possible the inter--point distances. -->
<!-- Thus, it can be seen as a generalization of PCA in some sense, -->
<!-- since it allows representing any distance, not only -->
<!-- the euclidean one. -->
<!-- \subsection{Classical MDS for single cell RNA--Seq data} -->
<!-- Here we get the cell--cycle corrected single cell RNA--seq data from -->
<!-- \href{http://dx.doi.org/10.1038/nbt.3102}{Buettner et al.}. The authors -->
<!-- use a dataset studying the differentiation of T-cells. They found -->
<!-- that the cell cycle had a profound impact on the gene expression for this -->
<!-- data set and developed a factor analysis--type method (a regression--type model -->
<!-- where the coefficients are random and follow a distribution) -->
<!-- to remove the cell cycle effects. The data are given as log--transformed -->
<!-- counts. -->
<!-- We first import the data as found in the supplement of the article -->
<!-- and then compute the euclidean distances between the single cells. -->
<!-- Then, a classic MDS (with 2 dimensions in our case) -->
<!-- can be produced from this distance matrix. We compute the distance -->
<!-- matrix on the centered and scaled gene expression measurements per -->
<!-- cell so as to make them comparable across cells. -->
<!-- This gives us a set of points that we can then plot. -->
<!-- << scalingSingleCell >>= -->
<!-- scData <- t(read.csv("http://www-huber.embl.de/users/klaus/nbt.3102-S7.csv", -->
<!-- row.names = 1)) -->
<!-- scData[1:10,1:10] -->
<!-- distEucSC <- dist(t(scale(scData))) -->
<!-- scalingEuSC <- as.data.frame(cmdscale(distEucSC, k = 2)) -->
<!-- names(scalingEuSC) <- c("MDS_Dimension_1", "MDS_Dimension_2") -->
<!-- head(scalingEuSC) -->
<!-- @ -->
<!-- In the non--linear PCA method used in the original paper, -->
<!-- a clear two--group separation appears with regard to gata3, -->
<!-- a factor that is important in T\textunderscore H2 differentiation. We color the -->
<!-- cells in the MDS plot according to gata3 expression and observe -->
<!-- that the first MDS dimension separates two cell populations with -->
<!-- high and low gata3 expression. This is consistent with the results -->
<!-- reported in the original publication, although a different method, -->
<!-- namely classic MDS, was used to produce a low--dimensional -->
<!-- representation of the data. -->
<!-- << plotScalingSingleCell, dependson="scalingSingleCell" >>= -->
<!-- rownames(scData)[727] -->
<!-- gata3 <- cut(scale(scData[727,]), 5, -->
<!-- labels = c("very low", "low", "medium", "high", 'very high')) -->
<!-- # reverse label ordering -->
<!-- gata3 <- factor(gata3, rev(levels(gata3))) -->
<!-- scPlot <- qplot(MDS_Dimension_1, MDS_Dimension_2, data = scalingEuSC, -->
<!-- main = "Metric MDS", -->
<!-- color = gata3, size = I(3)) + scale_color_brewer(palette = "RdBu") -->
<!-- scPlot -->
<!-- # + scale_color_gradient2(mid = "#ffffb3") -->
<!-- # + scale_colour_manual(values = rev(brewer.pal(5, "RdBu")))) -->
<!-- gata3Colours <- invisible(print(scPlot))$data[[1]]$colour -->
<!-- @ -->
<!-- \subsection{Assessing the quality of the scaling and non--metric MDS} -->
<!-- The quality of the scaling is given by a comparison of the original -->
<!-- distances to the approximated distances. This comparison is commonly done using a -->
<!-- ``stress--function" that summarizes the difference between the low dimensional distances and -->
<!-- the original ones. Stress functions can also be used to compute an MDS. This is for example done in -->
<!-- so--called non--metric MDS, where the stress function is defined in such a way that -->
<!-- small distances (Sammon scaling) or a general monotonic function of the observed distances -->
<!-- is used as a target (Kruskal's non--metric scaling). -->
<!-- Here we check the classical stress function, which sums up the squared differences -->
<!-- and scales them to the squared sum of the original ones. -->
<!-- Additionally, we produce a ``Shepard'' plot, which plots the original distances against -->
<!-- the ones inferred from the scaling solution. This allows us to assess how accurately -->
<!-- the scaling represents the actual distances. -->
<!-- << checkScaling, dependson="scalingSingleCell" >>= -->
<!-- distMDS <- dist(scalingEuSC) -->
<!-- ## get stress -->
<!-- sum((distEucSC - distMDS)^2)/sum(distEucSC^2) -->
<!-- ord <- order(as.vector(distEucSC)) -->
<!-- dataGG <- data.frame(dOrg = as.vector(distEucSC)[ord], -->
<!-- dMDS = as.vector(distMDS)[ord]) -->
<!-- (qplot(dOrg, dMDS, data=dataGG, -->
<!-- main = "Shepard plot: original vs MDS distances") + geom_smooth()) -->
<!-- @ -->
<!-- We see that the scaling is not extremely good. -->
<!-- \subsection{Kruskal's Scaling} -->
<!-- Kruskal's scaling method is implemented in the function \Rfunction{isoMDS} -->
<!-- from the \CRANpkg{MASS} package. We use the metric MDS solution as -->
<!-- the initial configuration. -->
<!-- << Kruskal, dependson="scalingSingleCell" >>= -->
<!-- kruskalMDS <- isoMDS(distEucSC, y = as.matrix(scalingEuSC)) -->
<!-- # stressplot(kruskalMDS, distEucSC) -->
<!-- kruskalMDS$stress -->
<!-- ord <- order(as.vector(distEucSC)) -->
<!-- dataGGK <- data.frame(dOrg = as.vector(distEucSC)[ord], -->
<!-- dMDS = as.vector(dist(scores(kruskalMDS)))[ord]) -->
<!-- (qplot(dOrg, dMDS, data=dataGGK, -->
<!-- main = "Shepard plot for Kruskal: original vs MDS distances") + geom_smooth()) -->
<!-- @ -->
<!-- Now the Shepard plot looks much better; the scaling should reflect the original -->
<!-- distances more faithfully. Let's repeat the plot of the single cells: -->
<!-- << kruskalMDSPlot, dependson=c("Kruskal", "plotScalingSingleCell") >>= -->
<!-- scalingK <- as.data.frame(scores(kruskalMDS)) -->
<!-- names(scalingK) <- c("MDS_Dimension_1", "MDS_Dimension_2") -->
<!-- qplot(MDS_Dimension_1, MDS_Dimension_2, data = scalingK, -->
<!-- main = "Non--metric MDS", -->
<!-- color = gata3, size = I(3)) + scale_color_brewer(palette = "PuOr") -->
<!-- @ -->
<!-- Although the representation is better than the first, metric MDS, the overall -->
<!-- visual impression is similar. -->
<!-- Single cell data is commonly used to infer (developmental) hierarchies of single cells. -->
<!-- For a nice bioconductor package wrapping a lot of dimension reduction -->
<!-- techniques for single cell data, see \Biocpkg{sincell} and the -->
<!-- \href{http://dx.doi.org/10.1093/bioinformatics/btv368}{associated article}, -->
<!-- especially, supplementary table 1. -->
# move to new doc: Confounding factors, Statistical Testing, Machine Learning
......