Commit 4615adb1 authored by Bernd Klaus's avatar Bernd Klaus

cleaned up terminology confusion in MDS section, added ref to Buja 2007

parent 99e13b18
......@@ -14,7 +14,7 @@ library("readxl")
library("scran")
library("BiocStyle")
library("knitr")
library("tidyverse")
library("MASS")
library("RColorBrewer")
library("stringr")
library("pheatmap")
......@@ -35,7 +35,8 @@ library("psych")
library("vsn")
library("matrixStats")
library("pheatmap")
library("MASS")
library("tidyverse")
theme_set(theme_gray(base_size = 18))
......@@ -431,9 +432,9 @@ tcell_log_counts <- as.matrix(tcell_log_counts)
tcell_log_counts[1:5, 1:5]
dist_tcells <- get_dist(tcell_log_counts, method = "manhattan")
dist_tcells <- get_dist(tcell_log_counts, method = "spearman")
scaling_tcells <- as_tibble(cmdscale(dist_tcells, k = 2))
scaling_tcells <- as_tibble(isoMDS(dist_tcells, k = 2)$points)
colnames(scaling_tcells) <- c("MDS_dimension_1", "MDS_dimension_2")
scaling_tcells <- add_column(scaling_tcells, cell_id = labels(dist_tcells),
.before = "MDS_dimension_1")
......@@ -473,7 +474,7 @@ data_dist <- tibble(org_distance = as.vector(dist_tcells),
ggplot(data_dist, aes(x = org_distance, y = mds_distance)) +
geom_hex(binwidth = 100) +
geom_hex(binwidth = .05) +
geom_smooth(color = "grey10") +
scale_fill_distiller(palette = "YlOrRd", direction = 1) +
ggtitle(label = "Shepard plot",
......@@ -484,7 +485,7 @@ ggplot(data_dist, aes(x = org_distance, y = mds_distance)) +
## ----ex_sheppard_plot----------------------------------------------------
ggplot(data_dist, aes(x = org_distance, y = mds_distance)) +
stat_density2d(aes(fill = ..level..), geom = "polygon",
n = 100, h = c(200, 200)) +
n = 100, h = c(0.1, 0.1)) +
geom_smooth(color = "grey10") +
scale_fill_distiller(palette = "YlGn", direction = 1) +
ggtitle(label = "Shepard plot",
......@@ -492,6 +493,38 @@ ggplot(data_dist, aes(x = org_distance, y = mds_distance)) +
coord_equal()
## ----sammonScaling-------------------------------------------------------
scaling_tcells_sam <- as_tibble(sammon(dist_tcells)$points)
colnames(scaling_tcells_sam) <- c("MDS_dimension_1", "MDS_dimension_2")
scaling_tcells_sam <- add_column(scaling_tcells_sam, cell_id = labels(dist_tcells),
.before = "MDS_dimension_1")
mds_plot_tcells_sam <- ggplot(scaling_tcells_sam, aes(x = MDS_dimension_1,
MDS_dimension_2,
color = fct_rev(gata3_group))) +
geom_point(size = 3) +
ggtitle("Sammon Scaling of the T-cell single cell data") +
scale_color_brewer(palette = "RdBu", direction = 1) +
coord_equal()
mds_plot_tcells_sam
dist_tcells_mds_sam <- get_dist(select(scaling_tcells_sam,
MDS_dimension_1, MDS_dimension_2),
method = "euclidean")
data_dist_sam <- tibble(org_distance = as.vector(dist_tcells),
mds_distance = as.vector(dist_tcells_mds_sam))
ggplot(data_dist_sam, aes(x = org_distance, y = mds_distance)) +
geom_hex(binwidth = 0.1) +
geom_smooth(color = "grey10") +
scale_fill_distiller(palette = "RdPu", direction = 1) +
ggtitle(label = "Shepard plot Sammon",
subtitle = "Original vs MDS distances") +
coord_equal()
## ----session_info, cache = FALSE-----------------------------------------
sessionInfo()
......@@ -995,19 +995,20 @@ Multidimensional scaling is a dimensionality reduction
technique alternative to PCA.
MDS takes a matrix of dissimilarities
and returns a set of points such that the distances between
the points are approximately equal to the dissimilarities.
the points are approximately equal to the disimilarities
(or some function thereof).
We can think of "squeezing" a high--dimensional point cloud into
a small number of dimensions (2, perhaps 3) while
preserving as well as possible the inter--point distances.
Thus, it can be seen as a generalization of PCA in some sense,
since it allows to represent any dissimilarity, not only
the euclidean distance.
since it allows to represent any dissimilarity, not just
the Euclidean distance.
# Dimensionality reduction 2: Non--Metric MDS for single cell RNA--Seq data
# Dimensionality reduction 2: Non--metric Kruskal scaling for single cell RNA--Seq data
Here, we use the cell--cycle corrected single cell RNA--seq data from
[Buettner et. al.](http://dx.doi.org/10.1038/nbt.3102). The authors
......@@ -1020,10 +1021,11 @@ for the heatmap above. The data are given as normalized, log--transformed counts
We first import the data as found in the supplement of the article
and the compute a spearman correlation based distance between the single cells.
This distance is between 0 and 1.
The goal of a non--metric mutlidimensional scaling is to represent
The goal of a mutlidimensional scaling analysis is to represent
these cell--to--cell distances in the high--dimensional gene--space in a lower
dimensional, ordinary Euclidean where the proximity of two cells
dimensional, ordinary Euclidean space where the proximity of two cells
reflects the similarity of their gene expression values.
In other words, given distances \(d_{i,j}\) between two cells, we want to find
......@@ -1033,20 +1035,21 @@ In other words, given distances \(d_{i,j}\) between two cells, we want to find
\sqrt{(x_i-x_j)^2+(y_i-y_j)^2} \approx \theta( d_{i,j})
\]
margin^[Metric MDS uses specific distances, such as the Eucledian (= Classic MDS)
distance, non--metric MDS can approximate any distance. That's why we use it
here]
margin^[There are many variants of MDS, For a review see Buja et. al., 2007]
Where theta is a monotone transformation of the input distances. This
goodness of fit can then be measured by the cost function known as __stress__
goodness of fit can then be measured by the cost function known as __stress__.
\[
\text{stress}(\hat d_{i,j}}) = \sqrt{\sum_{i \neq j}
\frac{ (\theta( d_{i,j}) - \hat d_{i,j})^2}{\hat d_{i,j}^2}}
\]
Where \(\hat d_{i,j}\) are the distances fitted by the non--metric scaling
algorithm. The stress function we use here goes back to [Kruskal, 1964](https://pdfs.semanticscholar.org/e041/08dc293c9cd7cabf32ee1524eaab0d4641b3.pdf).
Where \(\hat d_{i,j}\) are the distances fitted by the non--metric distance scaling
algorithm. Our procedure is "non--metric" as we approximate
a monotone function of the distances margin^[Effectively using only their rank]
and not the distances itself.
The stress function we use here goes back to [Kruskal, 1964](https://pdfs.semanticscholar.org/e041/08dc293c9cd7cabf32ee1524eaab0d4641b3.pdf).
Its optimization is implemented in the function `isoMDS` from the `r CRANpkg("MASS")`.
......@@ -1109,7 +1112,7 @@ mds_plot_tcells
```
## Assessing the quality of the scaling and non--metric MDS
## Assessing the quality of the scaling
A measure of how good the scaling is given by a comparison of the original
distances to the approximated distances. This comparison is commonly done using a
......@@ -1144,7 +1147,7 @@ data_dist <- tibble(org_distance = as.vector(dist_tcells),
ggplot(data_dist, aes(x = org_distance, y = mds_distance)) +
geom_hex(binwidth = 100) +
geom_hex(binwidth = .05) +
geom_smooth(color = "grey10") +
scale_fill_distiller(palette = "YlOrRd", direction = 1) +
ggtitle(label = "Shepard plot",
......@@ -1176,7 +1179,7 @@ to the original distances.
<!-- ``` -->
## Exercise: shepard plot
## Exercise: Shepard plot
The code below creates the Shepard plot with a smoothing
density (`geom_density2d`). What happens
......@@ -1189,7 +1192,7 @@ to avoid overplotting?
```{r ex_sheppard_plot}
ggplot(data_dist, aes(x = org_distance, y = mds_distance)) +
stat_density2d(aes(fill = ..level..), geom = "polygon",
n = 100, h = c(200, 200)) +
n = 100, h = c(0.1, 0.1)) +
geom_smooth(color = "grey10") +
scale_fill_distiller(palette = "YlGn", direction = 1) +
ggtitle(label = "Shepard plot",
......@@ -1202,51 +1205,58 @@ ggplot(data_dist, aes(x = org_distance, y = mds_distance)) +
## Non--metric scaling
### Exercise: Metric--Sammon scaling
Instead of computing a lower dimensional representation of a distance matrix,
we could also try to optimize the fit on the Shepard plot, using the
metric MDS solution as a starting point.
Non--metric scaling is implemented in the function `isoMDS`
from the `MASS` package.
[Sammon (1969)](https://en.wikipedia.org/wiki/Sammon_mapping)
developed an alternaive __metric__ (distances fit directly)
scaling algorithm which optimizes a different
stress function:
<!-- ### Exercise -->
\[
E = \frac{1}{\sum\limits_{i<j} d_{ij}}\sum_{i<j}
\frac{( d_{ij} - \hat d_{ij})^2}{\hat d_{ij}}
\]
which puts a higher weight on small distances, i.e. encourages local
structure to be conserved.
Sammon scaling is implemented in the function `sammon` from the `MASS` package.
<!-- 1. Perfom non--metric scaling on the T cell data. Does the -->
<!-- the fit in the shepard plot improve? -->
Perfom Sammon scaling on the T cell data. What does the
the Shepard plot look like?
<!-- ```{r nonMetricMDS} -->
<!-- scaling_tcells_nm <- as_tibble(isoMDS(dist_tcells)$points) -->
<!-- colnames(scaling_tcells_nm) <- c("MDS_dimension_1", "MDS_dimension_2") -->
<!-- scaling_tcells_nm <- add_column(scaling_tcells_nm, cell_id = labels(dist_tcells), -->
<!-- .before = "MDS_dimension_1") -->
<!-- mds_plot_tcells_nm <- ggplot(scaling_tcells_nm, aes(x = MDS_dimension_1, -->
<!-- MDS_dimension_2, -->
<!-- color = fct_rev(gata3_group))) + -->
<!-- geom_point(size = 3) + -->
<!-- ggtitle("Non-metric MDS of the T-cell single cell data") + -->
<!-- scale_color_brewer(palette = "RdBu", direction = 1) + -->
<!-- coord_equal() -->
<!-- mds_plot_tcells_nm -->
<!-- dist_tcells_mds_nm <- get_dist(select(scaling_tcells_nm, -->
<!-- MDS_dimension_1, MDS_dimension_2), -->
<!-- method = "euclidean") -->
```{r sammonScaling}
scaling_tcells_sam <- as_tibble(sammon(dist_tcells)$points)
colnames(scaling_tcells_sam) <- c("MDS_dimension_1", "MDS_dimension_2")
scaling_tcells_sam <- add_column(scaling_tcells_sam, cell_id = labels(dist_tcells),
.before = "MDS_dimension_1")
<!-- data_dist_nm <- tibble(org_distance = as.vector(dist_tcells), -->
<!-- mds_distance = as.vector(dist_tcells_mds_nm)) -->
mds_plot_tcells_sam <- ggplot(scaling_tcells_sam, aes(x = MDS_dimension_1,
MDS_dimension_2,
color = fct_rev(gata3_group))) +
geom_point(size = 3) +
ggtitle("Sammon Scaling of the T-cell single cell data") +
scale_color_brewer(palette = "RdBu", direction = 1) +
coord_equal()
mds_plot_tcells_sam
dist_tcells_mds_sam <- get_dist(select(scaling_tcells_sam,
MDS_dimension_1, MDS_dimension_2),
method = "euclidean")
<!-- ggplot(data_dist_nm, aes(x = org_distance, y = mds_distance)) + -->
<!-- geom_hex(binwidth = 100) + -->
<!-- geom_smooth(color = "grey10") + -->
<!-- scale_fill_distiller(palette = "YlOrRd", direction = 1) + -->
<!-- ggtitle(label = "Shepard plot", -->
<!-- subtitle = "Original vs MDS distances") + -->
<!-- coord_equal() -->
data_dist_sam <- tibble(org_distance = as.vector(dist_tcells),
mds_distance = as.vector(dist_tcells_mds_sam))
<!-- ``` -->
ggplot(data_dist_sam, aes(x = org_distance, y = mds_distance)) +
geom_hex(binwidth = 0.1) +
geom_smooth(color = "grey10") +
scale_fill_distiller(palette = "RdPu", direction = 1) +
ggtitle(label = "Shepard plot Sammon",
subtitle = "Original vs MDS distances") +
coord_equal()
```
......@@ -1273,7 +1283,8 @@ The cost function that is optimized here is a Kullback--Leibler--Divergence
between the two probability distributions, which penalizes pushing points that are
close by apart but encourages points that are far apart to be pushed towards
each other. margin^[Note that this is different from the "stress" function
used in MDS, which treats all distances equally].
used in classic non--metric MDS, which treats all distances equally but similar
to Sammon scaling].
A t--distribution is chosen in the low dimensional space to alleviate
the "crowding problem": if there are many samples at a moderate distance from
......@@ -1281,7 +1292,7 @@ a sample, they are likely to be pushed towards this sample,
destroying any local structure. margin^[They have a minimal influence on
the cost function individually, but there are many of them]
## t--
## t--sne link
which then represents the embedding of the data.
......@@ -1298,7 +1309,7 @@ techniques for single cell data, see `r Biocpkg("sincell") ` and the
# move to new doc: Confounding factors, Statistical Testing, Machine Learning
# move to new doc: Confounding factors, Statistical Testing
* ZINBA Wave
......
This diff is collapsed.
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment