Bernd Klaus / stat_methods_bioinf
Commit a5082e43, authored Sep 22, 2017 by Bernd Klaus

modified tSNE and MDS parts again a little bit, added more details on perplexity

parent 4d15acfd
Showing 3 changed files with 309 additions and 340 deletions

graphics_bioinf.R     +37  -37
graphics_bioinf.Rmd   +48  -20
graphics_bioinf.html  +224 -283
graphics_bioinf.R
...
...
@@ -50,43 +50,43 @@ data_dir <- file.path("data/")
 ## ----import_gene_expression, echo=FALSE, eval=FALSE----------------------
-## # Here we import the gene expression data using only the subset of highly
-## # variable genes, resave it
-## load(file.path(data_dir, "deGenesNone.RData"))
-## load(file.path(data_dir, "mTECdxd.RData"))
-##
-## mtec_counts <- counts(dxd)[deGenesNone, ]
-##
-## mtec_counts <- as_tibble(rownames_to_column(as.data.frame(mtec_counts),
-##                                             var = "ensembl_id"))
-##
-## mtec_cell_anno <- as_tibble(as.data.frame(colData(dxd))) %>%
-##   modify_if(.p = is.factor, as.character)
-##
-## data("biotypes")
-## data("geneNames")
-##
-## mtec_gene_anno <- tibble(biotype) %>%
-##   add_column(ensembl_id = names(biotype),
-##              gene_name = geneNames,
-##              .before = "biotype")
-##
-## mtec_cell_anno <- colData(dxd)
-##
-## save(mtec_counts, file = file.path(data_dir, "mtec_counts.RData"),
-##      compress = "xz")
-##
-## save(mtec_cell_anno, file = file.path(data_dir, "mtec_cell_anno.RData"),
-##      compress = "xz")
-##
-##
-## save(mtec_gene_anno, file = file.path(data_dir, "mtec_gene_anno.RData"),
-##      compress = "xz")
-##
-## tras <- as_tibble(tras)
-##
-## save(tras, file = file.path(data_dir, "tras.RData"),
-##      compress = "xz")
+# Here we import the gene expression data using only the subset of highly
+# variable genes, resave it
+load(file.path(data_dir, "deGenesNone.RData"))
+load(file.path(data_dir, "mTECdxd.RData"))
+
+mtec_counts <- counts(dxd)[deGenesNone, ]
+
+mtec_counts <- as_tibble(rownames_to_column(as.data.frame(mtec_counts),
+                                            var = "ensembl_id"))
+
+mtec_cell_anno <- as_tibble(as.data.frame(colData(dxd))) %>%
+  modify_if(.p = is.factor, as.character)
+
+data("biotypes")
+data("geneNames")
+
+mtec_gene_anno <- tibble(biotype) %>%
+  add_column(ensembl_id = names(biotype),
+             gene_name = geneNames,
+             .before = "biotype")
+
+mtec_cell_anno <- colData(dxd)
+
+save(mtec_counts, file = file.path(data_dir, "mtec_counts.RData"),
+     compress = "xz")
+
+save(mtec_cell_anno, file = file.path(data_dir, "mtec_cell_anno.RData"),
+     compress = "xz")
+
+
+save(mtec_gene_anno, file = file.path(data_dir, "mtec_gene_anno.RData"),
+     compress = "xz")
+
+tras <- as_tibble(tras)
+
+save(tras, file = file.path(data_dir, "tras.RData"),
+     compress = "xz")
 ## ----import_data---------------------------------------------------------
 load(file.path(data_dir, "mtec_counts.RData"))
...
...
graphics_bioinf.Rmd
...
...
@@ -1028,16 +1028,18 @@ these cell--to--cell distances in the high--dimensional gene--space in a lower
 dimensional, ordinary Euclidean space where the proximity of two cells reflects
 the similarity of their gene expression values.
-In other words, given distances \(d_{i,j}\) between two cells, we want to find
-(for example) two dimensional vectors \(x\) and \(y\) such that:
+In other words, given distances \(d_{i,j}\) between two cells computed from the
+original data, we want to find two (or higher) dimensional vectors \(y_i\) such that:
 \[
-\sqrt{(x_i - x_j)^2+(y_i - y_j)^2} \approx \theta(d_{i,j})
+\| y_i - y_j \|_E \approx \theta(d_{i,j})
 \]
 margin^[There are many variants of MDS, for a review see Buja et al., 2007]
-where \(\theta\) is a monotone transformation of the input distances.
+where \(\| \cdot \|_E\) is the Euclidean distance and \(\theta\) is a monotone
+transformation of the input distances.
 This allows us to represent the "typical range" of distances more faithfully.
 The goodness of fit can then be measured by the cost function known as __stress__.
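As an illustration of the fit criterion in this hunk, here is a minimal sketch in base R. It assumes simulated toy data (not the mTEC data), classical MDS via `stats::cmdscale()` with \(\theta\) taken as the identity, and one common normalisation of the stress (Kruskal's stress-1). The MDS variant and stress definition used in the course material may differ.

```r
# Toy data: 20 "cells" x 50 "genes" (simulated, for illustration only)
set.seed(1)
x <- matrix(rnorm(20 * 50), nrow = 20)

d_orig <- dist(x)              # distances d_ij in the original space
y <- cmdscale(d_orig, k = 2)   # classical MDS, i.e. theta = identity
d_mds <- dist(y)               # Euclidean distances ||y_i - y_j||_E

# Kruskal's stress-1: squared misfit relative to the embedded distances
stress <- sqrt(sum((d_orig - d_mds)^2) / sum(d_mds^2))
stress
```

Smaller values of `stress` indicate that the two dimensional distances reproduce the original ones more closely.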
...
...
@@ -1271,25 +1273,27 @@ ggplot(data_dist_sam, aes(x = org_distance, y = mds_distance)) +
 t--SNE (t--Stochastic Neighbor Embedding)
-[van der Maaten and Hinton, 2008](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)
-is a visualization technique similar in principle to MDS
+[van der Maaten and Hinton, 2008](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf),
+[formulas on Wikipedia](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding)
+is a visualization technique similar in spirit to MDS
 as its starting point are pairwise similarities between data points.
-t--SNE starts by putting the pairwise Euclidean distances between the samples
-into distributions and from this then computes for each pair \(i\) and \(j\) of
-samples the probability that they are neighbours.
+t--SNE starts by putting the pairwise Euclidean distances between the samples \(x_i\)
+into Gaussian kernels margin^[\(\exp(-\| x_i - x_j \|^2 / 2\sigma_i^2)\)]
+and from this then computes for each pair \(i\) and \(j\) of
+samples the probability \(p_{ij}\) that they would pick each other as neighbours.
 These probabilities in the original, high--dimensional space margin^[based
-on a Gaussian distribution] are then fitted to probabilities
-margin^[based on a t--distribution]
+on a Gaussian kernel] are then fitted to probabilities
+margin^[based on a t--distribution with one degree of freedom = Cauchy distribution]
 in a lower (e.g. two dimensional) space. t--SNE will
-then choose the representation in the lower dimensional space in
+choose the representation in the lower dimensional space in
 such a way that the two probabilities match closely.
 t--SNE has a tuning parameter called perplexity, which determines the
 variance of the Gaussian distribution used in the original, high dimensional
 space. This allows us to assign near-zero neighbourhood-probability to large
-distances in the original data. Essentially treating them as unimportant.
+distances in the original data, essentially treating them as unimportant.
 margin^[This is conceptually similar to Sammon scaling, which also upweights
 small distances. However, Sammon scaling does not downweight large
 distances.]
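For reference, the quantities described in this hunk can be written out as follows. This is a sketch following the definitions in van der Maaten and Hinton (2008), not text from the commit. The symbols \(n\) (number of samples), \(q_{ij}\) (low dimensional probabilities) and \(C\) (cost) are introduced here for illustration.

\[
p_{j|i} = \frac{\exp(-\| x_i - x_j \|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\| x_i - x_k \|^2 / 2\sigma_i^2)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
\]

\[
q_{ij} = \frac{(1 + \| y_i - y_j \|^2)^{-1}}{\sum_{k \neq l} (1 + \| y_k - y_l \|^2)^{-1}}
\]

\[
C = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\]

Minimising \(C\), the Kullback--Leibler divergence between the two sets of probabilities, over the \(y_i\) is what "matching the two probabilities closely" amounts to.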
...
...
@@ -1311,7 +1315,8 @@ destroying any local structure. margin^[They have a minimal influence on
 the cost function individually, but there are many of them].
 In t-SNE dissimilar pairs of samples with a high distance in the low
 dimensional representation won't have much influence on the
-cost function optimization. This avoids "crowding" and has the
+cost function optimization margin^[for them, the gradient of the cost function vanishes].
+This avoids "crowding" and has the
 potential to reveal a more fine--grained clustering.
...
...
@@ -1322,11 +1327,30 @@ popular in single cell RNA--Seq analysis
 ([viSNE](http://dx.doi.org/10.1038/nbt.2594)), however it is very hard to choose
 the various tuning parameters.
-### Choosing a perplexity value is very tricky
+### Perplexed by perplexity
-The perplexity value can be interpreted as the effective number of
-neighbours of a sample (e.g. a cell in single cell RNA-Seq)
+What does the perplexity value actually mean? If a sample has \(k\) neighbours
+that all have exactly the same probability of being a neighbour
+margin^[a uniform distribution], the perplexity is \(k\).
+Thus, [van der Maaten and Hinton, 2008](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)
+interpret the perplexity as "__the effective number of neighbours
+of a sample__".
+
+In t--SNE, due to the Gaussian kernels, the neighbouring samples have differing
+probabilities of being a neighbour. The perplexity depends on the bandwidth
+\(\sigma_i\) of the kernel. Both neighbouring samples that are very
+close (\(p_{ij} \approx 1\)) and neighbours that are very far away from a
+sample (\(p_{ij} \approx 0\)) don't increase
+the perplexity of the sample very much. Thus, the perplexity determines
+how many "nearby" points are considered as neighbours, leading to smaller values
+of \(\sigma_i\) in dense regions.
+
+### Choosing the perplexity value is very tricky
+
+As the perplexity value can be interpreted as the effective number of
+neighbours (e.g. other cells in single cell RNA-Seq),
 __choosing it greater than the total number of samples will lead
 to strange results__.
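The "effective number of neighbours" reading can be checked numerically. The following base R sketch assumes the usual definition of perplexity as \(2^{H}\), with \(H\) the Shannon entropy in bits of the neighbour probabilities. The helper names `perplexity()` and `neighbour_probs()` and the toy distances are made up for illustration and are not part of the commit.

```r
# Perplexity of a discrete distribution p: 2^H(p), H = Shannon entropy in bits
perplexity <- function(p) {
  p <- p[p > 0]                      # terms with p = 0 contribute nothing
  2^(-sum(p * log2(p)))
}

# k equally likely neighbours give perplexity k
perplexity(rep(1 / 10, 10))          # 10 (up to floating point error)

# Neighbour probabilities of one sample from a Gaussian kernel
# (d = distances from that sample to all other samples)
neighbour_probs <- function(d, sigma) {
  w <- exp(-d^2 / (2 * sigma^2))
  w / sum(w)
}

d <- seq(0.5, 5, length.out = 10)    # toy distances to 10 other samples
perp_narrow <- perplexity(neighbour_probs(d, sigma = 0.3))  # few effective neighbours
perp_wide   <- perplexity(neighbour_probs(d, sigma = 5))    # nearly all 10 count

# Find the bandwidth that gives a target perplexity of 5 for this sample
sigma_5 <- uniroot(function(s) perplexity(neighbour_probs(d, s)) - 5,
                   interval = c(0.3, 5), tol = 1e-8)$root
```

A narrow kernel concentrates the probability on the closest sample, while a wide kernel spreads it almost uniformly. t--SNE itself searches for each \(\sigma_i\) so that every sample reaches the user-specified perplexity, which the `uniroot()` call mimics for a single sample.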
...
...
@@ -1341,6 +1365,10 @@ If the perplexity value is too low, there will often be "clumps"
 of data points, as t-SNE heavily exaggerates small distances leading
 to "clusters" even in random data.
+
+On the other hand, if the value is too high, samples that are far
+apart will be considered as neighbours, leading to a distorted view
+of the underlying geometry.
 
 ### Additional tuning parameters
 
 The t-SNE map is initialized randomly and then optimized iteratively.
...
...
@@ -1359,7 +1387,7 @@ We can summarize the observations above as follows:
 * Run at least 5000 iterations
 * t-SNE exaggerates small distances and often shrinks large
-ones so cluster areas and the distance between them might not mean anything
+ones, so cluster areas and the distance between them might not mean anything
 * Don't choose a perplexity greater than the number of samples
 * Too small perplexity values lead to a "clumping" of points and
 wrong clusters in random data
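The recommendations in this list could be applied along the following lines. This is a hedged sketch that assumes the CRAN package `Rtsne` is installed and uses simulated toy data. `run_tsne()` is a hypothetical wrapper, and the perplexity values are arbitrary examples rather than recommendations from the commit.

```r
# Sketch only: assumes the CRAN package Rtsne is available
set.seed(42)
x <- matrix(rnorm(100 * 20), nrow = 100)   # toy data: 100 samples x 20 features

run_tsne <- function(x, perplexity, max_iter = 5000) {
  # Rtsne stops with an error when 3 * perplexity exceeds nrow(x) - 1,
  # so we check a slightly stricter condition up front
  stopifnot(3 * perplexity < nrow(x) - 1)
  Rtsne::Rtsne(x, dims = 2, perplexity = perplexity,
               max_iter = max_iter, check_duplicates = FALSE)$Y
}

if (requireNamespace("Rtsne", quietly = TRUE)) {
  # Compare several perplexities instead of trusting a single map
  maps <- lapply(c(5, 15, 30), function(p) run_tsne(x, p))
}
```

Plotting the resulting maps side by side makes it easier to see which structures are stable across perplexity values and which appear only for a particular setting.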
...
...
@@ -1369,7 +1397,7 @@ differences between clusters
 optimal value would be different for each unknown cluster in the data
-More details about this can be found at: <https://distill.pub/2016/misread-tsne/>.
+More details about this can be found at this webpage: <https://distill.pub/2016/misread-tsne/>.
 In summary, while t-SNE has been used to reveal structure, the perplexity
 parameter in particular is hard to set. Any results of a t-SNE analysis should
...
...
graphics_bioinf.html
This diff is collapsed.