Commit 4d15acfd authored by Bernd Klaus

finished theoretical tSNE section, cleaned up MDS section

parent 4615adb1
@@ -1008,11 +1008,11 @@ the Euclidean distance.
# Dimensionality reduction 2: Kruskal scaling for single cell RNA--Seq data
Here, we use the cell--cycle corrected single cell RNA--seq data from
[Buettner et al.](http://dx.doi.org/10.1038/nbt.3102). The authors
use a dataset studying the differentiation of T--cells. They found
that the cell cycle had a profound impact on the gene expression for this
data set and developed a factor analysis--type method (a regression--type model
where the coefficients are random and follow a distribution)
@@ -1037,13 +1037,15 @@ In other words, given distances \(d_{i,j}\) between two cells, we want to find
margin^[There are many variants of MDS. For a review see Buja et al., 2007]
where \(\theta\) is a monotone transformation of the input distances, allowing
us to represent the "typical range" of the distances more faithfully. The
goodness of fit can then be measured by the cost function known as __stress__.
\[
\text{STRESS}_\text{Kruskal} = \sqrt{
\frac{ \sum_{i \neq j} [\theta( d_{i,j}) - \hat d_{i,j}]^2}{ \sum_{i \neq j} \hat d_{i,j}^2}}
\] margin^[The denominator is only used for standardization and makes
sure that the stress is between 0 and 1.]
where \(\hat d_{i,j}\) are the distances fitted by the non--metric distance scaling
algorithm. Our procedure is "non--metric" as we approximate
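As a minimal sketch of how such a scaling can be run (assuming
`dist_tcell` as a stand--in name for the distance matrix of the T cell
data computed earlier):

```{r kruskal-sketch, eval = FALSE}
library(MASS)

# non-metric (Kruskal) scaling into two dimensions
kruskal_fit <- isoMDS(dist_tcell, k = 2)
kruskal_fit$stress        # final stress (isoMDS reports it in percent)
head(kruskal_fit$points)  # the fitted two-dimensional configuration
```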
@@ -1205,7 +1207,7 @@ ggplot(data_dist, aes(x = org_distance, y = mds_distance)) +
### Exercise: Sammon scaling
[Sammon (1969)](https://en.wikipedia.org/wiki/Sammon_mapping)
developed an alternative __metric__ (distances are fitted directly)
@@ -1213,12 +1215,16 @@ scaling algorithm which optimizes a different
stress function:
\[
\text{STRESS}_\text{Sammon}
= \frac{1}{\sum\limits_{i \neq j} d_{ij}}\sum_{i \neq j}
\frac{( d_{ij} - \hat d_{ij})^2}{ d_{ij}}
\]
which optimizes __weighted__ differences, where the weight
is the distance to be fitted. This puts
a higher weight on small distances and therefore encourages local
structure to be faithfully represented, which is typically what
we want. Sammon scaling is implemented in the
function `sammon` from the `MASS` package.
Perform Sammon scaling on the T cell data. What does
the Shepard plot look like?
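A minimal sketch of a possible solution, again assuming `dist_tcell` as a
stand--in name for the distance matrix of the T cell data:

```{r sammon-sketch, eval = FALSE}
library(MASS)
library(ggplot2)

# Sammon scaling of the T cell distances into two dimensions
sam <- sammon(dist_tcell, k = 2)

# Shepard plot: original distances against the fitted 2D distances
data_dist_sam <- data.frame(
  org_distance = as.vector(dist_tcell),
  mds_distance = as.vector(dist(sam$points))
)
ggplot(data_dist_sam, aes(x = org_distance, y = mds_distance)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, colour = "red")
```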
@@ -1261,15 +1267,16 @@ ggplot(data_dist_sam, aes(x = org_distance, y = mds_distance)) +
# t--SNE (t--Stochastic Neighbor Embedding)
t--SNE (t--Stochastic Neighbor Embedding)
[van der Maaten and Hinton, 2008](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)
is a visualization technique similar in principle to MDS
as its starting point is a set of pairwise similarities between data points.
t--SNE starts by putting the pairwise Euclidean distances between the samples
into distributions and from these then computes, for each pair \(i\) and \(j\) of
samples, the probability that they are neighbours.
These probabilities in the original, high--dimensional space margin^[based
@@ -1277,34 +1284,103 @@ on a Gaussian distribution] are then fitted to probabilities
margin^[based on a t--distribution]
in a lower (e.g. two dimensional) space. t--SNE will
then choose the representation in the lower dimensional space in
such a way that the two sets of probabilities match closely.
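For reference, the neighbourhood probabilities defined by
van der Maaten and Hinton (2008) are, in the notation used above
(\(d_{i,j}\) for the original and \(\hat d_{i,j}\) for the low dimensional
distances):

\[
p_{j|i} = \frac{\exp(-d_{i,j}^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-d_{i,k}^2 / 2\sigma_i^2)},
\qquad
q_{i,j} = \frac{(1 + \hat d_{i,j}^2)^{-1}}{\sum_{k \neq l} (1 + \hat d_{k,l}^2)^{-1}}
\]

The \(p_{j|i}\) are symmetrized to \(p_{i,j} = (p_{j|i} + p_{i|j})/(2n)\)
for \(n\) samples before the fit.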
t--SNE has a tuning parameter called perplexity, which determines the
variance of the Gaussian distribution used in the original, high--dimensional
space. This allows us to assign (essentially) zero neighbourhood probability
to large distances in the original data, treating them as unimportant.
margin^[This is conceptually similar to Sammon scaling, which also upweights
small distances. However, Sammon scaling does not downweight large
distances.]
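In the original paper, the bandwidths \(\sigma_i\) are found by a binary
search such that each sample's neighbour distribution has a fixed,
user--supplied perplexity:

\[
\text{Perp}(P_i) = 2^{H(P_i)}, \qquad
H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}
\]

where \(H(P_i)\) is the Shannon entropy of the neighbour distribution of
sample \(i\).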
The cost function margin^[a Kullback--Leibler divergence]
used to determine the goodness of fit between the two probability
distributions contributes to faithfully representing local structure by
discouraging maps that push points that are close together away from each other.
On the other hand, it encourages points that are far apart to be pushed towards
each other. margin^[Note that this is different from the "stress" function
used in Kruskal MDS, which treats all distances equally.]
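In symbols, the quantity that is minimized is the divergence between the
two distributions defined above:

\[
\text{KL}(P \| Q) = \sum_{i \neq j} p_{i,j} \log \frac{p_{i,j}}{q_{i,j}}
\]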
Choosing a t--distribution in the low dimensional space helps
to alleviate the "crowding problem": if there are
many samples at a moderate distance from
a specific sample, they are likely to be pushed towards this sample,
destroying any local structure. margin^[Individually, they have a minimal
influence on the cost function, but there are many of them.]
In t--SNE, dissimilar pairs of samples with a high distance in the low
dimensional representation won't have much influence on the
cost function optimization. This avoids "crowding" and has the
potential to reveal a more fine--grained clustering.
## t--SNE for real data: choice of tuning parameters
t--SNE looks like a very promising extension of MDS and is very
popular in single cell RNA--Seq analysis
([viSNE](http://dx.doi.org/10.1038/nbt.2594)). However, it is
very hard to choose its various tuning parameters.
### Choosing a perplexity value is very tricky
The perplexity value can be interpreted as the effective number of
neighbours of a sample (e.g. a cell in single cell RNA--Seq);
__choosing it greater than the total number of samples will lead
to strange results__.
Unfortunately, this is not the end of the story: __the perplexity
value is best chosen close to the number of samples a "typical"
cluster in the data contains__. However, this is of course unknown
beforehand. Additionally, if the data contain clusters of different
sizes, the perplexity will always be too low for some clusters and too
high for others.
If the perplexity value is too low, there will often be "clumps"
of data points, as t--SNE heavily exaggerates small distances, leading
to apparent "clusters" even in random data.
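A small sketch makes this concrete; here we assume the CRAN package
`Rtsne` as the t--SNE implementation and feed it pure noise:

```{r tsne-random-sketch, eval = FALSE}
library(Rtsne)

# pure Gaussian noise: there is no real cluster structure at all
set.seed(42)
random_data <- matrix(rnorm(300 * 10), nrow = 300)

# a very low perplexity heavily exaggerates small distances ...
tsne_rand <- Rtsne(random_data, perplexity = 2)

# ... and the map nevertheless shows apparent "clusters"
plot(tsne_rand$Y, pch = 19, cex = 0.5,
     xlab = "t-SNE 1", ylab = "t-SNE 2")
```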
### Additional tuning parameters
The t--SNE map is initialized randomly and then optimized iteratively.
The optimization has many tuning parameters, such as the number of
iterations, a learning rate which determines how much the map
changes from one iteration to the next, and a "momentum" parameter
which determines how much information from previous iterations
is used.
However, it seems that choosing a number of iterations that
is high enough (approximately 5000) will typically make t--SNE converge.
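As a hedged sketch of what these knobs look like in practice (again
assuming the `Rtsne` package; `expr_mat` is a hypothetical cells x genes
matrix):

```{r tsne-tuning-sketch, eval = FALSE}
library(Rtsne)

set.seed(777)  # the map is initialized randomly, so fix the seed
tsne_fit <- Rtsne(expr_mat,
                  dims           = 2,
                  perplexity     = 30,   # effective number of neighbours
                  max_iter       = 5000, # high enough to typically converge
                  eta            = 200,  # learning rate
                  momentum       = 0.5,  # weight of previous iterations
                  final_momentum = 0.8)  # momentum after the early phase
head(tsne_fit$Y)  # the fitted two-dimensional map
```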
### t--SNE tuning take--home message: good luck!
We can summarize the observations above as follows:
* Run at least 5000 iterations
* t--SNE exaggerates small distances and often shrinks large
ones, so cluster areas and the distances between them might not mean anything
* Don't choose a perplexity greater than the number of samples
* Too small perplexity values lead to a "clumping" of points and spurious
clusters even in random data
* Too high perplexity values lead to a distorted representation of the
differences between clusters
* Problem: there is no principled way to choose the perplexity, as the
optimal value would be different for each (unknown) cluster in the data
More details about this can be found at <https://distill.pub/2016/misread-tsne/>.
In summary, while t--SNE has been used to reveal structure, the
perplexity parameter in particular is hard to set. Any results of a t--SNE
analysis should thus be verified by an independent analysis (e.g. MDS or PCA).
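A minimal sketch of such a cross--check, reusing the hypothetical
`expr_mat` and `tsne_fit` from above:

```{r tsne-pca-check, eval = FALSE}
# PCA on the same data as an independent, linear look at the structure
pca_fit <- prcomp(expr_mat)
plot(pca_fit$x[, 1:2], pch = 19, cex = 0.5)  # compare to tsne_fit$Y
```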
<!-- ## Inferring cell hierarchies -->
<!-- Single cell data is commonly used to infer (developmental) hierarchies of single cells. -->
<!-- For a nice Bioconductor package wrapping a lot of dimension reduction -->
<!-- techniques for single cell data, see `r Biocpkg("sincell") ` and the -->
<!-- [associated article](http://dx.doi.org/10.1093/bioinformatics/btv368). -->