Commit a3f4143f authored by Bernd Klaus's avatar Bernd Klaus

added bulk RNA-Seq example data

parent 92831134
......@@ -3,3 +3,5 @@ material_stat_methods/
data/
.Rhistory
*_files/
SRP022054/
rse_gene.Rdata
......@@ -39,6 +39,7 @@ library("tibble")
library("broom")
library("scran")
library("locfit")
library("recount")
theme_set(theme_gray(base_size = 18))
......@@ -171,6 +172,39 @@ dataL$locFit <- predict(locfit(y~lp(a, nn=0.5, deg=1), data=dataL),
+ geom_line(aes(x = a, y = LinReg), color = "coral3"))
## ----import_crc----------------------------------------------------------
download_study("SRP022054", type = "rse-gene")
load(file.path("SRP022054", "rse_gene.Rdata"))
crc_data <- rse_gene
crc_data
## ----sumexp, echo=FALSE--------------------------------------------------
par(mar=c(0,0,0,0))
plot(1,1,xlim=c(0,100),ylim=c(0,100),bty="n",
type="n",xlab="",ylab="",xaxt="n",yaxt="n")
polygon(c(45,80,80,45),c(10,10,70,70),col=rgb(1,0,0,.5),border=NA)
polygon(c(45,80,80,45),c(68,68,70,70),col=rgb(1,0,0,.5),border=NA)
text(62.5,40,"assay(s)")
text(62.5,30,"e.g. 'counts'")
polygon(c(20,40,40,20),c(10,10,70,70),col=rgb(0,0,1,.5),border=NA)
polygon(c(20,40,40,20),c(68,68,70,70),col=rgb(0,0,1,.5),border=NA)
text(30,40,"rowData")
polygon(c(45,80,80,45),c(75,75,90,90),col=rgb(.5,0,.5,.5),border=NA)
polygon(c(45,47,47,45),c(75,75,90,90),col=rgb(.5,0,.5,.5),border=NA)
text(62.5,82.5,"colData")
## ----pre_crc-------------------------------------------------------------
counts_crc <- assay(crc_data)
counts_crc[1:5, 1:5]
counts_p4_meta <- counts_crc[, c("SRR837829", "SRR837847")]
View(counts_p4_meta)
## ----pre_crc_genes-------------------------------------------------------
colnames(colData(crc_data))
colData(crc_data)[1:5, c("title", "characteristics", "mapped_read_count")]
nrow(colData(crc_data))
## ----session_info, cache = FALSE-----------------------------------------
sessionInfo()
......@@ -67,6 +67,7 @@ library("tibble")
library("broom")
library("scran")
library("locfit")
library("recount")
theme_set(theme_gray(base_size = 18))
......@@ -385,10 +386,115 @@ dataL$locFit <- predict(locfit(y~lp(a, nn=0.5, deg=1), data=dataL),
# Normalization and confounding factors.
# Normalization and variance stabilization of bulk and single cell data
## Normalization of single cell data
## Size factors for bulk RNA--Seq
Before exploring normalization of single cell data, we look at this issue
in bulk RNA--Seq: Next generation sequencing data is often analyzed in the
form of read counts assigned to genomic bins. Such bins typically correspond to
all the exons of a gene, or an isoform for RNA--Seq data.
The aim of normalization is to remove between--sample differences
realted to library sizes and other systematic technical effects.
A popular method for the normalization in sequencing data is global scaling:
all counts for a given sample are scaled by factor such that the scaled
counts are comparable across samples. Such a factor is commonly called a
"size factor" and the normalized data then computed as:
__normalized sample counts__ = __raw sample counts__ / __ sample size factor__
## A Colectoral Cancer exmaple data set
We will illustrate this using a dataset from the [recount2 resource](http://dx.doi.org/10.1038/nbt.3838). [Löhr et. al., 2013](https://doi.org/10.1371/journal.pone.0067461) have
applied RNA-Seq on colorectal normal,
tumor and metastasis tissues from 8 patients to generate gene
expression patterns. They were particularly interested
in differential expression of miRNAs between tumor
and metastasis and tumor / normal tissues. The package `r Biocpkg("recount") `
is part of Bioconductor and we use it to conveniently download and import [the
data](http://www.ebi.ac.uk/ena/data/view/PRJNA201245) into R:
```{r import_crc}
download_study("SRP022054", type = "rse-gene")
load(file.path("SRP022054", "rse_gene.Rdata"))
crc_data <- rse_gene
crc_data
```
The data come in the form of a _RangedSummarizedExperiment_.
`SummarizedExperiment` is a matrix-like structure where rows represent features
of interest (e.g. genes, transcripts, exons, etc.) and columns represent
samples.
One or more `assay(s)`, contain a matrix of the
actual experimental data: The rows of an assay represent features of interest.
Annotation of these features is retrieved by using the
function `rowData()` while metadata on the samples can be obtained by a call
to `colData()`. The assay(s) and the row and column data tables are interconnected,
removing a feature or sample from the assay will also remove it from the rowData
and colData.
```{r sumexp, echo=FALSE}
par(mar=c(0,0,0,0))
plot(1,1,xlim=c(0,100),ylim=c(0,100),bty="n",
type="n",xlab="",ylab="",xaxt="n",yaxt="n")
polygon(c(45,80,80,45),c(10,10,70,70),col=rgb(1,0,0,.5),border=NA)
polygon(c(45,80,80,45),c(68,68,70,70),col=rgb(1,0,0,.5),border=NA)
text(62.5,40,"assay(s)")
text(62.5,30,"e.g. 'counts'")
polygon(c(20,40,40,20),c(10,10,70,70),col=rgb(0,0,1,.5),border=NA)
polygon(c(20,40,40,20),c(68,68,70,70),col=rgb(0,0,1,.5),border=NA)
text(30,40,"rowData")
polygon(c(45,80,80,45),c(75,75,90,90),col=rgb(.5,0,.5,.5),border=NA)
polygon(c(45,47,47,45),c(75,75,90,90),col=rgb(.5,0,.5,.5),border=NA)
text(62.5,82.5,"colData")
```
Therefore, we can preview the actual count data matrix like so:
```{r pre_crc}
counts_crc <- assay(crc_data)
counts_crc[1:5, 1:5]
counts_p4_meta <- counts_crc[, c("SRR837829", "SRR837847")]
View(counts_p4_meta)
```
The annotation for the genes looks like this:
```{r pre_crc_genes}
colnames(colData(crc_data))
colData(crc_data)[1:5, c("title", "characteristics", "mapped_read_count")]
nrow(colData(crc_data))
```
Where we subseted on the interesting columns giving us information on the samples.
We have various normal, tumor and metastasis tissues from 8 donors. The authors
had one pipelinr for the quantification of miRNA and one for the quantification
of "ordinary" genes. But as the recount2 ressource uses a unified pipeline with
annotation data from the [genecode project](https://www.gencodegenes.org/stats/archive.html), which contains non coding features as well, we can focus on the samples annotated
as "mRNA".
<!-- the median gene expression per sample relative to a reference -->
<!-- sample formed by the geometric mean of each single gene across samples. -->
<!-- Thus it, gives a scaling factor which makes the gene expression measurements -->
<!-- (counts) comparable across samples. Dividing the -->
<!-- counts by the sizefactors will normalize them. -->
<!-- Note that the size factor normalization will only work if most of the -->
<!-- genes are not differentially expressed between the samples. We use the function -->
<!-- \Rfunction{estimateSizeFactorsForMatrix} from the \Rpackage{DESeq2} in order -->
<!-- to calculate the size factors for each sample. -->
## sc data
* use scran size factors
......
This source diff could not be displayed because it is too large. You can view the blob instead.
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment