Commit 9386b0c8 authored by Bernd Klaus's avatar Bernd Klaus

finished size factor section

parent 0fad9a56
......@@ -39,6 +39,7 @@ print(date())
```{r required_packages_and_data, echo = TRUE, cache=FALSE}
library("readxl")
library("scran")
library("BiocStyle")
library("knitr")
library("tidyverse")
......@@ -65,9 +66,9 @@ library("Single.mTEC.Transcriptomes")
library("DESeq2")
library("tibble")
library("broom")
library("scran")
library("locfit")
library("recount")
library("psych")
theme_set(theme_gray(base_size = 18))
......@@ -387,7 +388,7 @@ dataL$locFit <- predict(locfit(y~lp(a, nn=0.5, deg=1), data=dataL),
# Normalization and variance stabilization of bulk and single cell data
# Normalization of bulk and single cell data
## Size factors for RNA--Seq data
......@@ -418,7 +419,10 @@ is part of Bioconductor and we use it to conveniently download and import [the
data](http://www.ebi.ac.uk/ena/data/view/PRJNA201245) into R:
```{r import_crc}
download_study("SRP022054", type = "rse-gene")
if(!file.exists("SRP022054/rse_gene.Rdata")){
download_study("SRP022054", type = "rse-gene")
}
load(file.path("SRP022054", "rse_gene.Rdata"))
crc_data <- rse_gene
crc_data
......@@ -542,6 +546,10 @@ samples. Here, limited the y--axis in order to ignore the outliers and tilted
the x--Axis labels, so that we can actually read them. `r CRANpkg("ggplot2") `
allows you to directly use palettes from the [Colorbrewer project](http://colorbrewer2.org).
Here, we use a qualitative paletter called "Paired" instead of the default one.
We can that the variance / spread of the data points is quite similar, only
their sample--specific medians are different. There, a scaling normalization
will be sufficient for this data set.
## Exercise: Adapt the boxplot
......@@ -565,23 +573,94 @@ count_boxplot_tissue + facet_grid( patient ~ .)
```
## Computing size factors
Computing size factors is easy: first we create a reference sample by computing
the geometric mean of the gene expressions across all samples. Then, the sizefactor
for a specific sample is given by the formula:
__sizefactor = median(sample / reference sample) __
the median gene expression per sample relative to a reference
sample formed by the geometric mean of each single gene across samples.
The size factor normalization will only work if most of the
genes are not differentially expressed between the samples (or when
it is computed on control genes). It is customary to normalize the size
factors by deviding them by their geometric mean. This enables us to
directly interprate them: a sample with a size factor less than (more than) one
will have a smaller (larger) library size than the reference sample.
We use the function `estimateSizeFactorsForMatrix` from the `r Biocpkg("DESeq2")`
in order to calculate the size factors for each sample.
```{r calcsf}
crc_sf <- estimateSizeFactorsForMatrix(counts_crc[, col_data_crc$sample_id])
sample_medians_sf <- left_join(sample_medians,
tibble(sample_id = names(crc_sf),
sf = crc_sf)) %>%
mutate(s_med_count_scaled = 2^sample_median /geometric.mean(2^sample_median)) %>%
dplyr::arrange(sf)
ggplot(sample_medians_sf, aes(x = s_med_count_scaled,
y = sf)) +
ggtitle("Size factors versus scaled median") +
geom_point()
counts_crc_tidy <- left_join(counts_crc_tidy,
dplyr::select(sample_medians_sf, sample_id, sf))
```
We can see tha the size factors computed are the scaled medians
on the raw count scale, however they are not identical.
margin^[Note that it is nor recommended to scale sequencing libraries by
the total library size: it is not robust and artificially limits the range
of the data to 0--1]
## Size factors for single cell data
Normalizing single cell data is not straightforward since many counts are zero.
[Lun et. al.](http://dx.doi.org/10.1186/s13059-016-0947-7) present an approach
whre expression values are summed across pools of cells, the summed values
are used then deconvolved to yield cell--based size factors. By pooling and
deconvoluting, [Lun et. al.](http://dx.doi.org/10.1186/s13059-016-0947-7)
achieve a more accurate estimation of size factors.
We use the function `computeSumFactors` from the `r Biocpkg("scran")`
that implements the approach to normalize the
mTEC single cell data. We have to remove the ENSEMBL id column
and turn the data into a matrix in order to make it work with the
function. We then joint the scran size factors to the mtec annotation
table.
```{r scran_norm}
mtec_scran_sf <- tibble(scran_sf = computeSumFactors(as.matrix(mtec_counts[, -1])),
cell_id = colnames(mtec_counts)[-1])
mtec_cell_anno <- left_join(mtec_cell_anno, mtec_scran_sf,
by = c("cellID" = "cell_id"))
```
# Exercise: Comparing size factors
The mTEC data set contains pre--computed size factors in the column
`sizeFactor` that were computed
in the classical way.
<!-- the median gene expression per sample relative to a reference -->
<!-- sample formed by the geometric mean of each single gene across samples. -->
1. Create a plot similar to \@ref(fig:calcsf) where you compare the classical
against the scran size factors. What do you notice?
<!-- Thus it, gives a scaling factor which makes the gene expression measurements -->
<!-- (counts) comparable across samples. Dividing the -->
<!-- counts by the sizefactors will normalize them. -->
<!-- Note that the size factor normalization will only work if most of the -->
<!-- genes are not differentially expressed between the samples. We use the function -->
<!-- \Rfunction{estimateSizeFactorsForMatrix} from the \Rpackage{DESeq2} in order -->
<!-- to calculate the size factors for each sample. -->
```{r compar_sf}
ggplot(mtec_cell_anno, aes(x = sizeFactor, y = scran_sf)) +
ggtitle("SCRAN vs. classical size factors") +
geom_point() +
geom_smooth()
```
## sc data
* use scran size factors
# Variance stabilization
## Confounding factors
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment