Commit a5082e43 authored by Bernd Klaus's avatar Bernd Klaus

modified tSNE and MDS parts again a little bit, added more details on perplexity

parent 4d15acfd
......@@ -50,43 +50,43 @@ data_dir <- file.path("data/")
# Here we import the gene expression data using only the subset of highly
# variable genes, resave it
load(file.path(data_dir, "deGenesNone.RData"))
load(file.path(data_dir, "mTECdxd.RData"))
mtec_counts <- counts(dxd)[deGenesNone, ]
mtec_counts <- as_tibble(rownames_to_column(as.data.frame(mtec_counts),
var = "ensembl_id"))
mtec_cell_anno <- as_tibble(as.data.frame(colData(dxd))) %>%
modify_if(.p = is.factor, as.character)
data("biotypes")
data("geneNames")
mtec_gene_anno <- tibble(biotype) %>%
add_column(ensembl_id = names(biotype),
gene_name = geneNames,
.before = "biotype")
save(mtec_counts, file = file.path(data_dir, "mtec_counts.RData"),
compress = "xz")
save(mtec_cell_anno, file = file.path(data_dir, "mtec_cell_anno.RData"),
compress = "xz")
save(mtec_gene_anno, file = file.path(data_dir, "mtec_gene_anno.RData"),
compress = "xz")
tras <- as_tibble(tras)
save(tras, file = file.path(data_dir, "tras.RData"),
compress = "xz")
## ----import_data---------------------------------------------------------
load(file.path(data_dir, "mtec_counts.RData"))
......
......@@ -1028,16 +1028,18 @@ these cell--to--cell distances in the high--dimensional gene--space in a lower
dimensional, ordinary Euclidean space where the proximity of two cells
reflects the similarity of their gene expression values.
In other words, given distances \(d_{i,j}\) between two cells computed from
the original data, we want to find two-- (or higher--)dimensional
vectors \(y_i\) such that:
\[
\|y_i - y_j\|_E \approx \theta(d_{i,j})
\]
margin^[There are many variants of MDS; for a review see Buja et al., 2007.]
where \(\| \cdot \|_E\) is the Euclidean distance and
\(\theta\) is a monotone transformation of the input distances, allowing
us to represent the "typical range" of distances more faithfully. The
goodness of fit can then be measured by a cost function known as __stress__.
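This idea can be sketched in a few lines of R on hypothetical toy data; `isoMDS()` from the MASS package performs Kruskal's non--metric MDS, which fits exactly such a monotone transformation and reports the resulting stress (the data and dimensions here are made up for illustration):

```r
## Minimal sketch of non-metric MDS on hypothetical toy data
library(MASS)  # for isoMDS()

set.seed(1)
toy_data <- matrix(rnorm(20 * 50), nrow = 20)  # 20 "cells" x 50 "genes"
d <- dist(toy_data)                            # distances in gene space

fit <- isoMDS(d, k = 2)  # two-dimensional embedding y
fit$stress               # goodness of fit (Kruskal's stress)
plot(fit$points, xlab = "MDS 1", ylab = "MDS 2")
```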
......@@ -1271,25 +1273,27 @@ ggplot(data_dist_sam, aes(x = org_distance, y = mds_distance)) +
t--SNE (t--distributed Stochastic Neighbor Embedding)
[van der Maaten and Hinton, 2008](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf),
[formulas on Wikipedia](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding)
is a visualization technique similar in spirit to MDS,
since its starting point is the set of pairwise similarities between data points.
t--SNE starts by putting the pairwise Euclidean distances between the samples \(x_i\)
into Gaussian kernels margin^[\(\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)\)]
and from this then computes for each pair \(i\) and \(j\) of
samples the probability \(p_{ij}\) that they would pick each other as neighbours.
These probabilities in the original, high--dimensional space margin^[based
on a Gaussian kernel] are then fitted to probabilities
margin^[based on a t--distribution with one degree of freedom, i.e. a Cauchy distribution]
in a lower (e.g. two--dimensional) space. t--SNE then
chooses the representation in the lower dimensional space in
such a way that the two sets of probabilities match closely.
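As a small sketch (not the full t--SNE algorithm, which additionally symmetrizes these probabilities and calibrates \(\sigma_i\) per sample), the conditional neighbour probabilities in the high--dimensional space can be computed like this; the function name and toy data are made up for illustration:

```r
## Conditional probability that sample j is a neighbour of sample i,
## based on a Gaussian kernel with bandwidth sigma:
## p_{j|i} = exp(-||x_i - x_j||^2 / (2 sigma^2)) / normalization
gauss_probs <- function(X, i, sigma) {
  d2 <- colSums((t(X) - X[i, ])^2)  # squared distances to sample i
  p <- exp(-d2 / (2 * sigma^2))
  p[i] <- 0                         # a sample is not its own neighbour
  p / sum(p)
}

set.seed(1)
X <- matrix(rnorm(10 * 5), nrow = 10)
sum(gauss_probs(X, 1, sigma = 1))  # probabilities sum to one
```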
t--SNE has a tuning parameter called perplexity, which determines the
variance of the Gaussian distribution used in the original, high dimensional
space. This allows us to assign zero neighbourhood-probability to large
distances in the original data, essentially treating them as unimportant.
margin^[This is conceptually similar to Sammon scaling, which also upweights
small distances. However, Sammon scaling does not downweight large
distances.]
......@@ -1311,7 +1315,8 @@ destroying any local structure. margin^[They have a minimal influence on
the cost function individually, but there are many of them].
In t-SNE, dissimilar pairs of samples with a high distance in the low
dimensional representation won't have much influence on the
cost function optimization margin^[for them, the gradient of the cost
function vanishes]. This avoids "crowding" and has the
potential to reveal a more fine--grained clustering.
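The heavy tails of the t--distribution with one degree of freedom (the Cauchy distribution) relative to the Gaussian can be seen directly with the base R density functions:

```r
## Density at increasing distances, relative to the density at zero
x <- c(2, 4, 6)
dnorm(x) / dnorm(0)      # Gaussian: drops off very quickly
dcauchy(x) / dcauchy(0)  # Cauchy: keeps appreciable mass at large distances
```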
......@@ -1322,11 +1327,30 @@ popular in single cell RNA--Seq analysis
([visNE](http://dx.doi.org/10.1038/nbt.2594)), however it is
very hard to choose the various tuning parameters.
### Perplexed by perplexity
What does the perplexity value actually mean? If a sample
has \(k\) neighbours that all have the exact same
probability of being a neighbour margin^[a uniform distribution],
the perplexity is \(k\). Thus, [van der Maaten and Hinton, 2008](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)
interpret the perplexity as "__the effective number of neighbours
of a sample__".
In t--SNE, due to the Gaussian kernels, the neighbouring samples
have differing probabilities of being a neighbour.
The perplexity depends on the bandwidth \(\sigma_i\) of
the kernel. Both neighbouring samples
that are very close (\(p_{ij} \approx 1\)) and neighbours that are
very far away from a sample (\(p_{ij} \approx 0\)) don't increase
the perplexity of the sample very much. Thus, the perplexity determines
how many "nearby" points are considered as neighbours, leading to smaller values
of \(\sigma_i\) in dense regions.
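Under this definition, the perplexity of a vector of neighbour probabilities is \(2^{H(p)}\), with \(H\) the Shannon entropy; a small hypothetical check in R:

```r
## Perplexity 2^H(p) of a vector of neighbour probabilities
perplexity <- function(p) {
  p <- p[p > 0]  # by convention 0 * log(0) = 0
  2^(-sum(p * log2(p)))
}

perplexity(rep(1 / 5, 5))              # uniform over 5 neighbours: 5
perplexity(c(0.97, 0.01, 0.01, 0.01))  # one dominant neighbour: close to 1
```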
### Choosing the perplexity value is very tricky
As the perplexity value can be interpreted as the effective number of
neighbours (e.g. other cells in single cell RNA-Seq),
__choosing it greater than the total number of samples will lead
to strange results__.
......@@ -1341,6 +1365,10 @@ If the perplexity value is too low, there will often be "clumps"
of data points, as t-SNE heavily exaggerates small distances, leading
to "clusters" even in random data.
On the other hand, if the value is too high, samples that are far
apart will be considered as neighbours, leading to a distorted view
of the underlying geometry.
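The effect of perplexity is easy to explore with the Rtsne package (assumed installed; the random data here is purely illustrative). Note that `Rtsne()` itself stops with an error when the perplexity is too large relative to the number of samples:

```r
## Sketch: the effect of perplexity on a t-SNE map of random data
library(Rtsne)

set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100)  # 100 samples of pure noise

for (perp in c(2, 10, 30)) {
  fit <- Rtsne(X, perplexity = perp, max_iter = 5000)
  plot(fit$Y, main = paste("perplexity =", perp))
}
```

Even on pure noise, small perplexities tend to produce apparent "clusters", which is exactly the failure mode described above.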
### Additional tuning parameters
The t-SNE map is initialized randomly and then optimized iteratively.
......@@ -1359,7 +1387,7 @@ We can summarize the observations above as follows:
* Run at least 5000 iterations
* t-SNE exaggerates small distances and often shrinks large
ones, so cluster areas and the distances between them might not mean anything
* Don't choose a perplexity greater than the number of samples
* Too small perplexity values lead to a "clumping" of points and wrong
clusters in random data
......@@ -1369,7 +1397,7 @@ differences between clusters
optimal value would be different for each unknown cluster in the data
More details about this can be found at <https://distill.pub/2016/misread-tsne/>.
In summary, while t-SNE has been used successfully to reveal structure, its
perplexity parameter in particular is hard to set. Any results of a t-SNE analysis should
......