Commit 649c766f authored by Bernd Klaus

added map of Europe to MDS section, small fixes for April 18

parent 5640d9ca
---
title: "Visual exploration for bioinformatics"
author: "Bernd Klaus"
date: "December 11, 2017"
date: "April 17, 2018"
output:
slidy_presentation:
df_print: paged
......@@ -437,6 +437,24 @@ pca_plot
* MDS: matrix of dissimilarities => map
* distances in the map = distances between samples
## Map of Europe via MDS
```{r mdsEurope, echo=FALSE}
loc <- cmdscale(eurodist)
x <- loc[, 1]
y <- -loc[, 2] # reflect so North is at the top
coord <- tibble(x, y, cities = rownames(loc))
ggplot(coord, aes(x = x, y = y)) +
geom_point() +
geom_label(aes(label = cities)) +
ggtitle("Map of Europe computed by MDS")
```
## MDS for sc--RNA--Seq from Buettner et al., 2015
```{r scalingSingleCell}
......
......@@ -10,6 +10,7 @@ output:
code_download: true
df_print: paged
toc_depth: 2
highlight: tango
BiocStyle::pdf_document:
toc: true
toc_depth: 2
......
......@@ -13,7 +13,6 @@ output:
toc_depth: 2
BiocStyle::pdf_document:
toc: true
highlight: tango
toc_depth: 2
bibliography: stat_methods_bioinf.bib
---
......@@ -24,7 +23,7 @@ graphics.off();rm(list=ls());rmarkdown::render('factor_ana_testing_ml.Rmd',
output_format = "BiocStyle::html_document");purl('factor_ana_testing_ml.Rmd')
rmarkdown::render('factor_ana_testing_ml.Rmd', output_format =
BiocStyle::pdf_document(toc = TRUE, highlight = "tango",
BiocStyle::pdf_document(toc = TRUE,
df_print = "paged", toc_depth = 2))
-->
......@@ -809,7 +808,7 @@ mutate_if(tab_conf, is_numeric, round, digits = 3)
As we can see, there is no confounding between sex and condition, but between
the genetic background and the condition. They cannot be easily
disentangled.^[see
also [Gerard and Stephens, 2017](http://dcgerard.github.io/research/2017/09/28/mouthwash.html)]
also [Gerard and Stephens, 2017](http://dcgerard.github.io/research/2017/06/29/ruv.html)
Popular methods like surrogate variable analysis (`r Biocpkg("sva")`)
and RUV (`r Biocpkg("RUVnormalize")`, `r Biocpkg("RUVSeq")`, `r Biocpkg("RUVcorr")`)
......@@ -869,7 +868,9 @@ though that sva can actually identify the genetic background from the data.
the influence of the cell cycle from the gene expression data. Briefly,
@Buettner_2015 use a factor analysis--like algorithm to estimate
a latent factor from genes annotated to the cell cycle. Here, we want to use
the method of @Risso_2018 instead.
the method of @Risso_2018 instead. There is also a more recent
package, `r Biocpkg("scone")` [@Cole_2018], available from some of the authors, which
deals directly with the normalization of sc--RNA--Seq data.
They propose a zero--inflated Negative Binomial Model for (sc--) RNA-Seq
data implemented in the package `r Biocpkg("zinbwave")`. In contrast to
......@@ -1645,6 +1646,13 @@ tune them towards certain objectives. However, in our data set, it
is intrinsically hard to predict cluster 10, so lowering the error
rate for this class will strongly increase the error rate for the other class.
In practice, over- or undersampling of the data is used to overcome class
imbalance. Note that simply repeating data is not useful, as this usually
decreases the variance of the features in the minority class and can lead
to overly optimistic results. Sampling with replacement or techniques like
SMOTE (Synthetic Minority Oversampling TEchnique, @Chawla_2005) can be used instead.
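
As a minimal sketch of oversampling with replacement, assuming a toy data
frame (the names `toy`, `minority`, `extra` and `balanced` are hypothetical and not
part of the analysis above), the minority class can be resampled until both
classes have the same size. This is plain random oversampling, not SMOTE.

```r
set.seed(1)

# toy data: 90 samples in the majority class, 10 in the minority class
toy <- data.frame(
  x = rnorm(100),
  class = factor(c(rep("major", 90), rep("minor", 10)))
)

minority <- toy[toy$class == "minor", ]

# number of additional minority samples needed to match the majority class
n_extra <- sum(toy$class == "major") - nrow(minority)

# draw minority rows with replacement and append them to the data
extra <- minority[sample(nrow(minority), n_extra, replace = TRUE), ]
balanced <- rbind(toy, extra)

table(balanced$class)
```

Any resampling of this kind should be applied to the training data only, so that
the test data keep the original class proportions.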
# Session Info
```{r session_info, cache = FALSE}
......
......@@ -98,6 +98,8 @@ citep("10.1186/s13059-017-1334-8")
citep("10.15252/msb.20156651")
### export citations
write.bibtex(file = "stat_methods_bioinf.bib")
......@@ -226,3 +228,40 @@ add_manually("
eprint = {https://www.biorxiv.org/content/early/2018/01/19/142760.full.pdf},
journal = {bioRxiv}
}")
add_manually("
@article {Chawla_2005,
author = {Chawla, Nitesh V.},
biburl = {https://www.bibsonomy.org/bibtex/282896d5075d60317023dde6a91287ace/dblp},
booktitle = {The Data Mining and Knowledge Discovery Handbook},
crossref = {books/sp/DM2005},
date = {2005-09-05},
description = {dblp},
editor = {Maimon, Oded and Rokach, Lior},
interhash = {efcb5e9d1ce4afc3525d08b6dd3ac2d4},
intrahash = {82896d5075d60317023dde6a91287ace},
isbn = {0-387-24435-2},
keywords = {dblp},
pages = {853-867},
publisher = {Springer},
timestamp = {2005-09-05T00:00:00.000+0200},
title = {Data Mining for Imbalanced Datasets: An Overview.},
year = 2005
}")
# Cole et al., 2018
add_manually("
@article {Cole_2018,
author = {Cole, Michael B and Risso, Davide and Wagner, Allon and DeTomaso, David and Ngai, John and Purdom, Elizabeth and Dudoit, Sandrine and Yosef, Nir},
title = {Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq},
year = {2017},
doi = {10.1101/235382},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Due to the presence of systematic measurement biases, data normalization is an essential preprocessing step in the analysis of single-cell RNA sequencing (scRNA-seq) data. While a variety of normalization procedures are available for bulk RNA-seq, their suitability with respect to single-cell data is still largely unexplored. Furthermore, there may be multiple, competing considerations behind the assessment of normalization performance, some of them study-specific. The choice of normalization method can have a large impact on the results of downstream analyses (e.g., clustering, inference of cell lineages, differential expression analysis), and thus it is critically important to assess the performance of competing methods in order to select a suitable procedure for the study at hand. We have developed scone - a framework that implements a wide range of normalization procedures in the context of scRNA-seq, and enables the assessment of their performance based on a comprehensive set of data-driven performance metrics. The accompanying open-source Bioconductor R software package scone (available at https://bioconductor.org/packages/scone) also provides numerical and graphical summaries of expression measures, data quality assessment, and data-adaptive gene and sample filtering criteria. We demonstrate the effectiveness of scone on a selection of scRNA-seq datasets across a variety of protocols, ranging from plate- to droplet-based methods. We show that scone is able to correctly rank normalization methods according to their performance in a given dataset and that selecting the best performing normalization leads to higher agreement with independent validation data than lowly-ranked methods.},
URL = {https://www.biorxiv.org/content/early/2017/12/16/235382},
eprint = {https://www.biorxiv.org/content/early/2017/12/16/235382.full.pdf},
journal = {bioRxiv}
}")
......@@ -93,7 +93,7 @@ scatter_tra
scatter_tra +
geom_rug(alpha = I(0.2))
## ----ex_geom, echo=FALSE, results="hide", fig.show="hide"----------------
## ----ex_geom, echo=FALSE, results="hide", fig.show="hide", message=FALSE----
ggplot(tra_detected, aes(x = total_detected, y = tra))
scatter_tra +
......@@ -479,7 +479,24 @@ hmcol <- colorRampPalette(brewer.pal(9, "RdPu"))(255)
pheatmap(create_dist_mat(cleaned_data,
anno = col_data_crc$cl_anno,
method = "euclidean"),
trace="none", col = rev(hmcol))
trace="none", col = hmcol)
## ----eurodist------------------------------------------------------------
as.matrix(eurodist)[1:4, 1:4]
## ----mdsEurope, echo=TRUE------------------------------------------------
loc <- cmdscale(eurodist)
x <- loc[, 1]
y <- -loc[, 2] # reflect so North is at the top
coord <- tibble(x, y, cities = rownames(loc))
ggplot(coord, aes(x = x, y = y)) +
geom_point() +
geom_label(aes(label = cities)) +
ggtitle("Map of Europe computed by MDS")
## ----scalingSingleCell---------------------------------------------------
......
......@@ -10,6 +10,7 @@ output:
code_download: true
df_print: paged
toc_depth: 2
highlight: tango
BiocStyle::pdf_document:
toc: true
toc_depth: 2
......@@ -242,7 +243,7 @@ out how to add a smoothing curve to the plot. Can you change the smoother
to a regression line?
```{r ex_geom, echo=FALSE, results="hide", fig.show="hide"}
```{r ex_geom, echo=FALSE, results="hide", fig.show="hide", message=FALSE}
ggplot(tra_detected, aes(x = total_detected, y = tra))
scatter_tra +
......@@ -1163,7 +1164,7 @@ hmcol <- colorRampPalette(brewer.pal(9, "RdPu"))(255)
pheatmap(create_dist_mat(cleaned_data,
anno = col_data_crc$cl_anno,
method = "euclidean"),
trace="none", col = rev(hmcol))
trace="none", col = hmcol)
```
......@@ -1190,7 +1191,34 @@ Thus, it can be seen as a generalization of PCA in some sense,
since it allows to represent any dissimilarity, not just
the Euclidean distance.
A typical example for the application of MDS is the creation
of a map from pairwise distances between cities:
```{r eurodist}
as.matrix(eurodist)[1:4, 1:4]
```
Given these distances, MDS produces a map in which the Euclidean distance
between two cities represents the actual distance as faithfully as
possible.
```{r mdsEurope, echo=TRUE}
loc <- cmdscale(eurodist)
x <- loc[, 1]
y <- -loc[, 2] # reflect so North is at the top
coord <- tibble(x, y, cities = rownames(loc))
ggplot(coord, aes(x = x, y = y)) +
geom_point() +
geom_label(aes(label = cities)) +
ggtitle("Map of Europe computed by MDS")
```
Of course, the Earth is not flat but approximately a sphere; nevertheless, the MDS
result gives a reasonable approximation to a map of Europe.
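
One way to check how faithful the map is, sketched here under the assumption
that a simple correlation is an adequate summary, is to compare the distances
in the MDS map with the original `eurodist` distances (the names `d_map`,
`fit_cor` and `rel_err` are introduced only for this illustration):

```r
# classical MDS on the built-in distances between European cities
loc <- cmdscale(eurodist, k = 2)

# Euclidean distances between the cities in the two-dimensional map;
# the cities are in the same order as in eurodist
d_map <- dist(loc)

# correlation between original and map distances
fit_cor <- cor(as.vector(eurodist), as.vector(d_map))
fit_cor

# relative error of the map distance for each pair of cities
rel_err <- abs(as.vector(d_map) - as.vector(eurodist)) / as.vector(eurodist)
summary(rel_err)
```

A correlation close to one indicates that two dimensions capture most of the
structure in the distance matrix.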
# Dimensionality reduction: Kruskal scaling for single cell RNA--Seq data
......@@ -1220,7 +1248,7 @@ dimensional vectors \(y \) such that:
\|y_i-y_j \|_E \approx \theta( d_{i,j})
\]
^[There are many variants of MDS, For a review see [@Buja_2008]
^[There are many variants of MDS; for a review see [@Buja_2008].]
Where \(\| . \|_E \) is the Euclidean distance and
\(\theta \) is a monotone transformation of the input distances. Allowing
......
......@@ -417,3 +417,42 @@
}
@article {Chawla_2005,
author = {Chawla, Nitesh V.},
biburl = {https://www.bibsonomy.org/bibtex/282896d5075d60317023dde6a91287ace/dblp},
booktitle = {The Data Mining and Knowledge Discovery Handbook},
crossref = {books/sp/DM2005},
date = {2005-09-05},
description = {dblp},
editor = {Maimon, Oded and Rokach, Lior},
interhash = {efcb5e9d1ce4afc3525d08b6dd3ac2d4},
intrahash = {82896d5075d60317023dde6a91287ace},
isbn = {0-387-24435-2},
keywords = {dblp},
pages = {853-867},
publisher = {Springer},
timestamp = {2005-09-05T00:00:00.000+0200},
title = {Data Mining for Imbalanced Datasets: An Overview.},
year = 2005
}
@article {Cole_2018,
author = {Cole, Michael B and Risso, Davide and Wagner, Allon and DeTomaso, David and Ngai, John and Purdom, Elizabeth and Dudoit, Sandrine and Yosef, Nir},
title = {Performance Assessment and Selection of Normalization Procedures for Single-Cell RNA-Seq},
year = {2017},
doi = {10.1101/235382},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Due to the presence of systematic measurement biases, data normalization is an essential preprocessing step in the analysis of single-cell RNA sequencing (scRNA-seq) data. While a variety of normalization procedures are available for bulk RNA-seq, their suitability with respect to single-cell data is still largely unexplored. Furthermore, there may be multiple, competing considerations behind the assessment of normalization performance, some of them study-specific. The choice of normalization method can have a large impact on the results of downstream analyses (e.g., clustering, inference of cell lineages, differential expression analysis), and thus it is critically important to assess the performance of competing methods in order to select a suitable procedure for the study at hand. We have developed scone - a framework that implements a wide range of normalization procedures in the context of scRNA-seq, and enables the assessment of their performance based on a comprehensive set of data-driven performance metrics. The accompanying open-source Bioconductor R software package scone (available at https://bioconductor.org/packages/scone) also provides numerical and graphical summaries of expression measures, data quality assessment, and data-adaptive gene and sample filtering criteria. We demonstrate the effectiveness of scone on a selection of scRNA-seq datasets across a variety of protocols, ranging from plate- to droplet-based methods. We show that scone is able to correctly rank normalization methods according to their performance in a given dataset and that selecting the best performing normalization leads to higher agreement with independent validation data than lowly-ranked methods.},
URL = {https://www.biorxiv.org/content/early/2017/12/16/235382},
eprint = {https://www.biorxiv.org/content/early/2017/12/16/235382.full.pdf},
journal = {bioRxiv}
}