Skip to content
GitLab
Projects
Groups
Snippets
Help
Loading...
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in
Toggle navigation
S
stat_methods_bioinf
Project overview
Project overview
Details
Activity
Releases
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Issues
0
Issues
0
List
Boards
Labels
Service Desk
Milestones
Merge Requests
0
Merge Requests
0
CI / CD
CI / CD
Pipelines
Jobs
Schedules
Operations
Operations
Incidents
Environments
Packages & Registries
Packages & Registries
Container Registry
Analytics
Analytics
CI / CD
Repository
Value Stream
Wiki
Wiki
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Create a new issue
Jobs
Commits
Issue Boards
Open sidebar
Bernd Klaus
stat_methods_bioinf
Commits
249a86bd
Commit
249a86bd
authored
Aug 23, 2017
by
Bernd Klaus
Browse files
Options
Browse Files
Download
Email Patches
Plain Diff
finished var transform section, now on to dim reduction and clustering
parent
efda91cc
Changes
3
Expand all
Hide whitespace changes
Inline
Side-by-side
Showing
3 changed files
with
593 additions
and
112 deletions
+593
-112
graphics_bioinf.R
graphics_bioinf.R
+7
-1
graphics_bioinf.Rmd
graphics_bioinf.Rmd
+284
-8
graphics_bioinf.html
graphics_bioinf.html
+302
-103
No files found.
graphics_bioinf.R
View file @
249a86bd
...
...
@@ -304,7 +304,13 @@ counts_norm_crc_mat <- select(counts_crc_tidy, sample_id, ensembl_id, count_norm
as.matrix
()
meanSdPlot
(
counts_norm_crc_mat
)
$
gg
+
ylim
(
c
(
0
,
10
))
ylim
(
c
(
0
,
2000
))
## ----varStabCountData----------------------------------------------------
counts_norm_trans_crc
<-
varianceStabilizingTransformation
(
counts_crc
)
meanSdPlot
(
counts_norm_trans_crc
)
$
gg
+
ylim
(
c
(
0
,
5
))
## ----session_info, cache = FALSE-----------------------------------------
sessionInfo
()
...
...
graphics_bioinf.Rmd
View file @
249a86bd
...
...
@@ -613,7 +613,7 @@ counts_crc_tidy <- left_join(counts_crc_tidy,
We
can
see
tha
the
size
factors
computed
are
the
scaled
medians
on
the
raw
count
scale
,
however
they
are
not
identical
.
margin
^[
Note
that
it
is
no
r
recommended
to
scale
sequencing
libraries
by
margin
^[
Note
that
it
is
no
t
recommended
to
scale
sequencing
libraries
by
the
total
library
size
:
it
is
not
robust
and
artificially
limits
the
range
of
the
data
to
0
--
1
].
We
can
now
normalize
the
data
:
...
...
@@ -624,8 +624,7 @@ counts_crc_tidy <- mutate(counts_crc_tidy, count_norm = count / sf )
##
Exercise
:
Data
normalization
Normalize
the
data
using
the
computed
size
factors
and
create
boxplots
of
the
normalized
data
Create
boxplots
of
the
normalized
data
and
compare
it
to
the
raw
one
.
```{
r
data_norm_plot
}
...
...
@@ -680,7 +679,7 @@ The variance of count data depends on the its mean, i.e. the higher
the
counts
,
the
higher
their
variance
.
This
is
readily
seen
in
a
plot
of
variance
against
the
mean
.
We
use
the
function
`
meanSdPlot
`
from
the
`
r
Biocpkg
(
"vsn"
)
`
package
to
create
such
a
plot
for
the
normalized
RNA
--
Seq
data
.
normalized
bulk
RNA
--
Seq
data
.
This
function
accepts
an
expression
matrix
as
an
input
,
so
we
need
to
turn
our
tidy
data
table
into
an
expression
matrix
first
.
The
...
...
@@ -733,7 +732,7 @@ A simple variance stabilization method for count data is a log + a pseudocount.
Note
that
for
negative
binomially
distributed
data
:
\[
\
text
{
Var
(
log
(
counts
+
pseudocount
))}
\
approx
=
\
frac
{
1
}{
counts
+
pseudocount
}
+
\
alpha
\
text
{
Var
(
log
(
counts
+
pseudocount
))}
\
approx
\
frac
{
1
}{
counts
+
pseudocount
}
+
\
alpha
\]
where
$\
alpha
$
is
the
dispersion
.
The
formula
shows
that
for
genes
with
...
...
@@ -747,14 +746,291 @@ In general, are carefull modelling of the variance mean realtionship is
preferable
to
ad
--
hoc
solutions
.
The
package
`
r
Biocpkg
(
"DESeq2"
)`
provides
variance
stabilizations
that
are
tailored
to
RNA
--
Seq
data
.
We
use
its
function
are
based
on
refined
approximations
using
a
fitted
dispersion
--
mean
relationship
.
We
use
its
function
`
varianceStabilizingTransformation
`
to
obtain
normalize
and
variance
--
stabilized
count
data
.
count
data
.
margin
^[
Note
the
size
factors
for
the
bulk
RNA
--
Seq
data
set
differ
widely
.
For
this
scenarion
,`
r
Biocpkg
(
"DESeq2"
)`
recommends
a
different
transform
called
`
rlog
`.]
```{
r
varStabCountData
}
counts_norm_trans_crc
<-
varianceStabilizingTransformation
(
counts_crc
)
meanSdPlot
(
counts_norm_trans_crc
)$
gg
+
ylim
(
c
(
0
,
5
))
```
We
can
see
from
the
plot
that
the
transformation
has
worked
,
the
data
is
now
normalized
and
the
variance
--
mean
dependency
has
be
aleviated
.
<
!-- \subsection{PCA for bulk RNA--Seq data} -->
<
!-- A PCA represents the individual samples in our RNA--Seq data -->
<
!-- data set by two or more ``artificial" variables, -->
<
!-- which are called Principal Components (PCs). The components -->
<
!-- are linear combinations of the genes and the linear -->
<
!-- combination is chosen in such a way that the euclidean distance between the -->
<
!-- individuals using centered and standardized variables is minimized. -->
<
!-- If $X = (x_1, \dotsc, x_p)$ denotes the original data matrix -->
<
!-- (i.e. our rlog--transformed data), then the columns of -->
<
!-- the transformed data matrix, also called the ``principal components" -->
<
!-- or "(PC) scores'' is simply given by: -->
<
!-- $$ -->
<
!-- \text{PC}_i= Xa_i, \quad i = 1, \dotsc, p -->
<
!-- $$ -->
<
!-- Where $a_i$ is a vector of length $p$. The $a_i$ vectors are commonly -->
<
!-- called ``loadings" or ``principal axes". These loadings are usually -->
<
!-- the eigenvectors of the correlation matrix of the data. -->
<
!-- It can be shown that choosing the loadings in this way leads to a -->
<
!-- transformation that generates principal components that have maximal -->
<
!-- variance and preserve as good as possible the distances between the original -->
<
!-- observations. -->
<
!-- If we have many variables (genes) as in our case ($p = 1000$), it is advisable -->
<
!-- to perform a selection of variables since this leads to more stable PCA results. -->
<
!-- Here, we simply use the 500 most variable genes. -->
<
!-- << PCA_simData, dependson="meanSdrlog" >>= -->
<
!-- ntop = 500 -->
<
!-- pvars <- rowVars(rldSim) -->
<
!-- select <- order(pvars, decreasing = TRUE)[seq_len(min(ntop, -->
<
!-- length(pvars)))] -->
<
!-- PCA <- prcomp(t(rldSim)[, select], scale = F) -->
<
!-- percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1) -->
<
!-- dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2], -->
<
!-- PC3 = PCA$x[,3], PC4 = PCA$x[,4], -->
<
!-- sex = colData(dds)$sex, -->
<
!-- condition = colData(dds)$condition) -->
<
!-- (qplot(PC1, PC2, data = dataGG, color = condition, shape = sex, -->
<
!-- main = "PC1 vs PC2, top variable genes", size = I(6)) -->
<
!-- + labs(x = paste0("PC1, VarExp:", round(percentVar[1],4)), -->
<
!-- y = paste0("PC2, VarExp:", round(percentVar[2],4))) -->
<
!-- + scale_colour_brewer(type="qual", palette=2) -->
<
!-- ) -->
<
!-- @ -->
<
!-- We can see that the first PC separates the two experimental conditions, -->
<
!-- while the second one splits the samples by sex. This shows that the PCA can -->
<
!-- be used to visualize and detect batch effects. Note that the principal -->
<
!-- components are commonly ordered according to ``informativeness'', i.e. -->
<
!-- the percentage of the total variance in the data set they explain. So -->
<
!-- a separation according to the first PC is stronger than a separation -->
<
!-- according to the second one. In the simulated data, the experimental -->
<
!-- group is more important the sample batch (= sex). -->
<
!-- \subsection{Heatmaps and hierarchical clustering} -->
<
!-- Another very common visualization technique is a heatmap. Analogous to a PCA -->
<
!-- we can visualize the pairwise distances between the samples using a heatmap, -->
<
!-- i.e. a false color representation of the distance between two samples. Here -->
<
!-- we use the euclidean distance of all the genes per sample. This can be conveniently -->
<
!-- computed using the function \Rfunction{dist}. -->
<
!-- This heatmap is then ordered via hierarchical clustering. Hierarchical clustering -->
<
!-- starts with as many clusters as there are samples and successively merges samples -->
<
!-- that are close to each other. This merging process is commonly visualized as -->
<
!-- a tree like graphic called a dendogramm. -->
<
!-- <<SimData_heatmap_and_clustering, fig.height = 5, dependson="meanSdrlog">>= -->
<
!-- dists <- as.matrix(dist(t(rldSim))) -->
<
!-- rownames(dists) <- paste0(colData(dds)$sex, -->
<
!-- "_", str_sub(rownames(colData(dds)), 7)) -->
<
!-- colnames(dists) <- paste0(colData(dds)$condition, -->
<
!-- "_", str_sub(rownames(colData(dds)), 7)) -->
<
!-- hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) -->
<
!-- heatmap.2(dists, trace="none", col = rev(hmcol)) -->
<
!-- @ -->
<
!-- Again we see that samples cluster mainly by experimental condition, but also -->
<
!-- by batch. -->
<
!-- \section{Multidimensional scaling (MDS)} -->
<
!-- Multidimensional scaling is a technique alternative to PCA. -->
<
!-- MDS takes a set of dissimilarities -->
<
!-- and returns a set of points such that the distances between -->
<
!-- the points are approximately equal to the dissimilarities. -->
<
!-- We can think of ``squeezing'' a high-dimensional point cloud into -->
<
!-- a small number of dimensions (2, perhaps 3) while -->
<
!-- preserving as well as possible the inter--point distances. -->
<
!-- Thus, it can be seen as a generalization of PCA in some sense, -->
<
!-- since it allows to represent any distance, not only -->
<
!-- the euclidean one. -->
<
!-- \subsection{Classical MDS for single cell RNA--Seq data} -->
<
!-- Here we get the cell--cycle corrected single cell RNA--seq data from -->
<
!-- \href{http://dx.doi.org/10.1038/nbt.3102}{Buettner et. al.}. The authors -->
<
!-- use a dataset studying the differentiation of T-cells. The authors found -->
<
!-- that the cell cycle had a profound impact on the gene expression for this -->
<
!-- data set and developed a factor analysis--type method (a regression--type model -->
<
!-- where the coefficients are random and follow a distribution) -->
<
!-- to remove the cell cycle effects. The data are given as log--transformed -->
<
!-- counts. -->
<
!-- We first import the data as found in the supplement of the article -->
<
!-- and the compute the euclidean distances between the single cells. -->
<
!-- Then, a classic MDS (with 2 dimensions in our case) -->
<
!-- can be produced from this distance matrix. We compute the distance -->
<
!-- matrix on the centered and scaled gene expression measurements per -->
<
!-- cell as to make them comparable across cells. -->
<
!-- This gives us a set of points that we can then plot. -->
<
!-- <<r scalingSingleCell >>= -->
<
!-- scData <- t(read.csv("http://www-huber.embl.de/users/klaus/nbt.3102-S7.csv", -->
<
!-- row.names = 1)) -->
<
!-- scData[1:10,1:10] -->
<
!-- distEucSC <- dist(t(scale(scData))) -->
<
!-- scalingEuSC <- as.data.frame(cmdscale(distEucSC, k = 2)) -->
<
!-- names(scalingEuSC) <- c("MDS_Dimension_1", "MDS_Dimension_2") -->
<
!-- head(scalingEuSC) -->
<
!-- @ -->
<
!-- In the non--linear PCA method used in the original paper -->
<
!-- a clear two groups separation appears with regard to gata3, -->
<
!-- a factor that is important in T\textunderscore H2 differentiation. We color the -->
<
!-- cells in the MDS plot according to gata3 expression and observe -->
<
!-- that the first MDS dimension separates two cell populations with -->
<
!-- high and low gata3 expression. This is consistent with the results -->
<
!-- reported in the original publication, although a different method -->
<
!-- , namely classic MDS, to produce a low--dimensional representation -->
<
!-- of the data was used. -->
<
!-- << plotScalingSingleCell, dependson="scalingSingleCell" >>= -->
<
!-- rownames(scData)[727] -->
<
!-- gata3 <- cut(scale(scData[727,]), 5, -->
<
!-- labels = c("very low", "low", "medium", "high", 'very high')) -->
<
!-- # reverse label ordering -->
<
!-- gata3 <- factor(gata3, rev(levels(gata3))) -->
<
!-- scPlot <- qplot(MDS_Dimension_1, MDS_Dimension_2, data = scalingEuSC, -->
<
!-- main = "Metric MDS", -->
<
!-- color = gata3, size = I(3)) + scale_color_brewer(palette = "RdBu") -->
<
!-- scPlot -->
<
!-- # + scale_color_gradient2(mid = "#ffffb3") -->
<
!-- # + scale_colour_manual(values = rev(brewer.pal(5, "RdBu")))) -->
<
!-- gata3Colours <- invisible(print(scPlot))$data[[1]]$colour -->
<
!-- @ -->
<
!-- \subsection{Assessing the quality of the scaling and non--metric MDS} -->
<
!-- A measure of how good the scaling is given by a comparison of the original -->
<
!-- distances to the approximated distances. This comparison is commonly done using a -->
<
!-- ``stress--function" that summarizes the difference between the low dimensional distances and -->
<
!-- the original ones. Stress functions can also be used to compute a MDS. This is for example done in -->
<
!-- so called non--metric MDS, where for example the stress function is defined in such a way that -->
<
!-- small distances (Sammon scaling) or a general monotonic function of the observed distances -->
<
!-- is used as a target (Kruskal's non--metric scaling). -->
<
!-- Here we check the classical stress functions, which sums up the squared differences -->
<
!-- and scales them to the squared sum of the original ones. -->
<
!-- Additionally, we produce a ``Shepard'' plot, which plots the original distances against -->
<
!-- the ones inferred from using the scaling solution. This allows to assess how accurately -->
<
!-- the scaling represents the actual distances. -->
<
!-- << checkScaling, dependson="scalingSingleCell" >>= -->
<
!-- distMDS <- dist(scalingEuSC) -->
<
!-- ## get stress -->
<
!-- sum((distEucSC - distMDS)^2)/sum(distEucSC^2) -->
<
!-- ord <- order(as.vector(distEucSC)) -->
<
!-- dataGG <- data.frame(dOrg = as.vector(distEucSC)[ord], -->
<
!-- dMDS = as.vector(distMDS)[ord]) -->
<
!-- (qplot(dOrg, dMDS, data=dataGG, -->
<
!-- main = "Shepard plot: original vs MDS distances") + geom_smooth()) -->
<
!-- @ -->
<
!-- We see that the the scaling is not extremely good. -->
<
!-- \subsection{Kruskal's Scaling} -->
<
!-- Kruskal's scaling method is implemented in the function \Rfunction{isoMDS} -->
<
!-- from the \CRANpkg{MASS} package. We used the metric MDS solution as -->
<
!-- a starting solution. -->
<
!-- << Kruskal, dependson="scalingSingleCell" >>= -->
<
!-- kruskalMDS <- isoMDS(distEucSC, y = as.matrix(scalingEuSC)) -->
<
!-- #tressplot(kruskalMDS, distEucSC) -->
<
!-- kruskalMDS$stress -->
<
!-- ord <- order(as.vector(distEucSC)) -->
<
!-- dataGGK <- data.frame(dOrg = as.vector(distEucSC)[ord], -->
<
!-- dMDS = as.vector(dist(scores(kruskalMDS)))[ord]) -->
<
!-- (qplot(dOrg, dMDS, data=dataGGK, -->
<
!-- main = "Shepard plot for Kruskal: original vs MDS distances") + geom_smooth()) -->
<
!-- @ -->
<
!-- Now the Shepard plot looks much better, the scaling should reflect the original -->
<
!-- distances in a better way. Let's repeat the plot of the single cells -->
<
!-- << kruskalMDSPlot, dependson=c("Kruskal", "plotScalingSingleCell") >>= -->
<
!-- scalingK <- as.data.frame(scores(kruskalMDS)) -->
<
!-- names(scalingK) <- c("MDS_Dimension_1", "MDS_Dimension_2") -->
<
!-- qplot(MDS_Dimension_1, MDS_Dimension_2, data = scalingK, -->
<
!-- main = "Non--metric MDS", -->
<
!-- color = gata3, size = I(3)) + scale_color_brewer(palette = "PuOr") -->
<
!-- @ -->
<
!-- Although the representation is better than the first, metric, MDS the overall -->
<
!-- visual impression is similar. -->
<
!-- Single cell data is commonly used to infer (developmental) hierarchies of single cells. -->
<
!-- For a nice bioconductor package wrapping a lot of dimension reduction -->
<
!-- techniques for single cell data, see \Biocpkg{sincell} and the -->
<
!-- \href{http://dx.doi.org/10.1093/bioinformatics/btv368}{associated article}, -->
<
!-- especially, supplementary table 1. -->
#
move
to
new
doc
:
Confounding
factors
,
Statistical
Testing
,
Machine
Learning
...
...
graphics_bioinf.html
View file @
249a86bd
This diff is collapsed.
Click to expand it.
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment