Commit 99e13b18 authored by Bernd Klaus

started t-sne section, modified scaling section to focus on Kruskal's scaling

parent fbf37dea
@@ -42,7 +42,7 @@ library("readxl")
library("scran")
library("BiocStyle")
library("knitr")
library("tidyverse")
library("MASS")
library("RColorBrewer")
library("stringr")
library("pheatmap")
@@ -63,7 +63,8 @@ library("psych")
library("vsn")
library("matrixStats")
library("pheatmap")
library("MASS")
library("tidyverse")
theme_set(theme_gray(base_size = 18))
@@ -1006,9 +1007,9 @@ the euclidean distance.
# Dimensionality reduction 2: Classical MDS for single cell RNA--Seq data
# Dimensionality reduction 2: Non--Metric MDS for single cell RNA--Seq data
To demonstrate MDS we use the cell--cycle corrected single cell RNA--seq data from
Here, we use the cell--cycle corrected single cell RNA--seq data from
[Buettner et al.](http://dx.doi.org/10.1038/nbt.3102). The authors
use a dataset studying the differentiation of T--cells. The authors found
that the cell cycle had a profound impact on the gene expression for this
@@ -1018,13 +1019,35 @@ to remove the cell cycle effects. This is similar to our cleaning method used
for the heatmap above. The data are given as normalized, log--transformed counts.
We first import the data as found in the supplement of the article
and then compute the Euclidean distances between the single cells.
and then compute a distance based on the Spearman correlation between the single cells.
Then, a classic MDS (with 2 dimensions in our case)
can be produced from this distance matrix. Here we use the Manhattan distance
to compute the distance matrix.
The goal of a non--metric multidimensional scaling is to represent
these cell--to--cell distances in the high--dimensional gene--space in a
lower--dimensional, ordinary Euclidean space, where the proximity of two cells
reflects the similarity of their gene expression values.
This gives us a set of points that we can then plot.
In other words, given distances \(d_{i,j}\) between two cells, we want to find
(for example) two dimensional vectors \(x\) and \(y\) such that:
\[
\sqrt{(x_i-x_j)^2+(y_i-y_j)^2} \approx \theta( d_{i,j})
\]
margin^[Metric MDS uses specific distances, such as the Euclidean distance
(= classical MDS), whereas non--metric MDS can approximate any distance. That is
why we use it here.]
where \(\theta\) is a monotone transformation of the input distances. The
goodness of fit can then be measured by a cost function known as __stress__:
\[
\text{stress}(\hat d_{i,j}) = \sqrt{\frac{\sum_{i \neq j}
(\theta( d_{i,j}) - \hat d_{i,j})^2}{\sum_{i \neq j} \hat d_{i,j}^2}}
\]
where \(\hat d_{i,j}\) are the distances fitted by the non--metric scaling
algorithm. The stress function we use here goes back to [Kruskal, 1964](https://pdfs.semanticscholar.org/e041/08dc293c9cd7cabf32ee1524eaab0d4641b3.pdf).
Its optimization is implemented in the function `isoMDS` from the `r CRANpkg("MASS")` package.
```{r scalingSingleCell}
@@ -1036,15 +1059,18 @@ tcell_log_counts <- as.matrix(tcell_log_counts)
tcell_log_counts[1:5, 1:5]
dist_tcells <- get_dist(tcell_log_counts, method = "manhattan")
dist_tcells <- get_dist(tcell_log_counts, method = "spearman")
scaling_tcells <- as_tibble(cmdscale(dist_tcells, k = 2))
scaling_tcells <- as_tibble(isoMDS(dist_tcells, k = 2)$points)
colnames(scaling_tcells) <- c("MDS_dimension_1", "MDS_dimension_2")
scaling_tcells <- add_column(scaling_tcells, cell_id = labels(dist_tcells),
.before = "MDS_dimension_1")
```
The function prints the stress values as it optimizes the stress
iteratively. Kruskal's stress ranges from 0 to 1 and `isoMDS` reports it
in percent; a stress of 25 \% indicates a decent fit.
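
As a rough illustration of what the reported stress measures, the following sketch recomputes it by hand. It uses simulated data rather than the T--cell data, and all object names (`toy_counts`, `toy_dist`, `toy_mds`) are made up for this example. It relies on `Shepard` from `MASS`, which returns the fitted distances and their monotone regression values; the squared residuals are summed and divided by the summed squared fitted distances, as in Kruskal's definition.

```r
library("MASS")

set.seed(42)
# Toy stand-in for the expression matrix: 30 "cells" x 50 "genes" (simulated)
toy_counts <- matrix(rnorm(30 * 50), nrow = 30,
                     dimnames = list(paste0("cell_", 1:30), NULL))

# Spearman correlation based distance between cells (rows)
toy_dist <- as.dist(1 - cor(t(toy_counts), method = "spearman"))

# Non-metric MDS; trace = FALSE suppresses the per-iteration stress output
toy_mds <- isoMDS(toy_dist, k = 2, trace = FALSE)

# Shepard() returns the input dissimilarities (x), the fitted distances (y)
# and the monotone (isotonic) regression values of y on x (yf)
shep <- Shepard(toy_dist, toy_mds$points)

# Stress recomputed by hand, in percent
stress_by_hand <- 100 * sqrt(sum((shep$y - shep$yf)^2) / sum(shep$y^2))
c(isoMDS = toy_mds$stress, by_hand = stress_by_hand)
```

The two numbers should agree up to numerical precision. Plotting `shep$x` against `shep$y` gives the usual Shepard diagram, which shows how well the rank order of the input distances is preserved.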
@@ -1227,6 +1253,40 @@ from the `MASS` package.
# t-SNE for single cell data
t--SNE (t--distributed Stochastic Neighbor Embedding)
[van der Maaten and Hinton, 2008](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)
is a visualization technique similar in principle to MDS,
as its starting point is the pairwise similarities between data points.
t--SNE starts by converting the pairwise Euclidean distances into probability
distributions and from these computes, for each pair \(i\) and \(j\) of
samples, the probability that they are neighbours.
These probabilities in the original, high--dimensional space margin^[based
on a Gaussian distribution] are then fitted to probabilities
margin^[based on a t--distribution]
in a lower (e.g. two--dimensional) space. t--SNE will
then choose the representation in the lower--dimensional space in
such a way that the two sets of probabilities match closely.
The cost function that is optimized here is the Kullback--Leibler divergence
between the two probability distributions, which strongly penalizes placing points
that are close together far apart, but only weakly penalizes placing points that are
far apart close to each other. margin^[Note that this is different from the "stress" function
used in MDS, which treats all distances equally.]
A t--distribution is chosen in the low--dimensional space to alleviate
the "crowding problem": if there are many samples at a moderate distance from
a sample, they are likely to be pushed towards this sample,
destroying any local structure. margin^[They have a minimal influence on
the cost function individually, but there are many of them.]
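
The cost function described above can be sketched in a few lines of base R. This is a simplified illustration on simulated data, not the actual t--SNE algorithm: it assumes a single fixed Gaussian bandwidth `sigma`, whereas t--SNE tunes the bandwidth per point to match a chosen perplexity, and it only evaluates the Kullback--Leibler divergence for two hand--made embeddings instead of optimizing it. All function and object names are made up for this example.

```r
set.seed(1)
# Toy data: two well separated groups of 10 "cells" in 20 dimensions
x <- rbind(matrix(rnorm(10 * 20), 10),
           matrix(rnorm(10 * 20, mean = 4), 10))

# High-dimensional neighbour probabilities from a Gaussian kernel.
# Assumption: one fixed bandwidth sigma for all points.
gaussian_p <- function(x, sigma = 3) {
  d2 <- as.matrix(dist(x))^2
  w <- exp(-d2 / (2 * sigma^2))
  diag(w) <- 0
  cond <- w / rowSums(w)            # conditional probabilities p_{j|i}
  (cond + t(cond)) / (2 * nrow(x))  # symmetrised joint probabilities p_{ij}
}

# Low-dimensional neighbour probabilities from a t-distribution with 1 df
student_q <- function(y) {
  w <- 1 / (1 + as.matrix(dist(y))^2)
  diag(w) <- 0
  w / sum(w)
}

# Kullback-Leibler divergence between the two sets of probabilities
kl_divergence <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

p <- gaussian_p(x)
y_good <- cbind(rep(c(-5, 5), each = 10), rnorm(20))  # keeps the groups apart
y_bad <- y_good[sample(20), ]                         # scrambles the neighbours

kl_divergence(p, student_q(y_good))
kl_divergence(p, student_q(y_bad))
```

The embedding that keeps neighbouring cells together gives the smaller divergence; t--SNE searches for low--dimensional coordinates that minimize this quantity.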
## t--
which then represents the embedding of the data.
popular for single cell data [viSNE](http://dx.doi.org/10.1038/nbt.2594),
but challenging to choose the parameters <https://distill.pub/2016/misread-tsne/>
## Inferring cell hierarchies