Commit 18746b2a authored by Bernd Klaus

added boxplot for raw counts

parent a3f4143f
@@ -389,7 +389,7 @@ dataL$locFit <- predict(locfit(y~lp(a, nn=0.5, deg=1), data=dataL),
# Normalization and variance stabilization of bulk and single cell data
## Size factors for bulk RNA--Seq
## Size factors for RNA--Seq data
Before exploring normalization of single cell data, we look at this issue
in bulk RNA--Seq: Next generation sequencing data is often analyzed in the
@@ -406,7 +406,7 @@ counts are comparable across samples. Such a factor is commonly called a
__normalized sample counts__ = __raw sample counts__ / __sample size factor__
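
As a minimal sketch of this formula, the chunk below normalizes a made-up count matrix. The size factor used here (each sample's total count relative to the mean total) is an assumption chosen for brevity, not necessarily the estimator used later in this document; `raw_counts` and its values are hypothetical.

```r
# Toy count matrix: 4 genes (rows) x 3 samples (columns); values are made up
raw_counts <- matrix(c(10,  20,  40,
                        5,  10,  20,
                       50, 100, 200,
                        8,  16,  32),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("gene", 1:4),
                                     paste0("sample", 1:3)))

# Illustrative size factor: total count per sample relative to the mean total
# (an assumption for this sketch; other estimators exist)
size_factors <- colSums(raw_counts) / mean(colSums(raw_counts))

# normalized sample counts = raw sample counts / sample size factor,
# applied column by column
normalized_counts <- sweep(raw_counts, 2, size_factors, FUN = "/")
normalized_counts
```

In this toy example the samples differ only in sequencing depth, so after division the three columns coincide.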
## A Colorectal Cancer example data set
## Normalization of a Colorectal Cancer example data set
We will illustrate this using a dataset from the recount2 resource. Löhr et al. (2013) have
applied RNA-Seq on colorectal normal,
@@ -460,17 +460,15 @@ Therefore, we can preview the actual count data matrix like so:
```{r pre_crc}
counts_crc <- assay(crc_data)
counts_crc[1:5, 1:5]
counts_p4_meta <- counts_crc[, c("SRR837829", "SRR837847")]
```
The annotation for the genes looks like this:
```{r pre_crc_genes}
colData(crc_data)[1:5, c("title", "characteristics", "mapped_read_count")]
colData(crc_data)[1:5, c("title", "mapped_read_count")]
```
@@ -479,7 +477,42 @@ We have various normal, tumor and metastasis tissues from 8 donors. The authors
had one pipeline for the quantification of miRNA and one for the quantification
of "ordinary" genes. But as the recount2 resource uses a unified pipeline with
annotation data from the GENCODE project, which contains non-coding features as well, we can focus on the samples annotated
as "mRNA".
as "mRNA". We will now create a column data table that contains the appropriate
sample annotation.
Specifically, we have to extract the sample annotations from
the title column and subset on the samples processed using mRNA
based quantification.
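
The title parsing described above can be tried on toy data first. This is a sketch under assumptions: the sample IDs and titles below are hypothetical and only mimic a `<quantification>_<patient>_<tissue>` pattern; the real titles in the dataset may look different.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical sample annotation; IDs and titles are made up for illustration
toy_col_data <- tibble(
  sample_id = c("S1", "S2", "S3"),
  title     = c("mRNA_P1_normal", "miRNA_P1_normal", "mRNA_P2_tumor")
)

# Split the title into three columns using one capture group per column,
# then keep only the samples quantified with the mRNA pipeline
toy_mrna <- toy_col_data %>%
  tidyr::extract(title, into = c("quantification", "patient", "tissue"),
                 regex = "([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)") %>%
  dplyr::filter(quantification == "mRNA")

toy_mrna
```

Each parenthesized group in the regular expression fills one of the columns named in `into`, so the number of groups has to match the number of column names.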
```{r create_crc_col_data}
col_data_crc <- select(,
title, characteristics, mapped_read_count) %>%
rownames_to_column(var = "sample_id") %>%
as_tibble() %>%
tidyr::extract(title, into = c("quantification", "patient", "tissue"),
regex = "([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)") %>%
dplyr::filter(quantification == "mRNA")
```
We now plot the log counts of the mRNA samples to see whether their distributions
are comparable. In order to avoid taking the log of zero, we only retain
a gene within a sample that has at least one count.
```{r mrna_counts}
counts_crc_tidy <- rownames_to_column(data.frame(counts_crc), var = "ensembl_id") %>%
as_tibble() %>%
gather(key = "sample_id", value = "count", -ensembl_id) %>%
dplyr::filter(sample_id %in% col_data_crc$sample_id) %>%
dplyr::filter(count > 1)
ggplot(counts_crc_tidy, aes(x = sample_id, y = log2(count), fill = sample_id) ) +
<!-- the median gene expression per sample relative to a reference -->
<!-- sample formed by the geometric mean of each single gene across samples. -->