Commit c8c57fd8 authored by Bernd Klaus's avatar Bernd Klaus

added MA plots section

parent 4bd3e6ac
......@@ -449,8 +449,8 @@ discuss them next.
Before exploring size factors for single cell data, we look at this issue
in a bulk RNA--Seq data set.
As we know, the aim of normalization is to remove between--sample differences
related to library sizes and other systematic technical effects.
As we have just learned, the aim of normalization is to remove between--sample
differences related to library sizes and other systematic technical effects.
A popular method for the normalization in sequencing data is global scaling:
all counts for a given sample are scaled by factor such that the scaled
counts are comparable across samples. Such a factor is commonly called a
......@@ -483,9 +483,9 @@ crc_data
```
The data come in the form of a _RangedSummarizedExperiment_.
`SummarizedExperiment` is a matrix-like structure where rows represent features
of interest (e.g. genes, transcripts, exons, etc.) and columns represent
samples.
`SummarizedExperiment` is a matrix-like data structure in Bioconductor,
where rows represent features of interest (e.g. genes, transcripts, exons, etc.)
and columns represent samples.
One or more `assay(s)`, contain a matrix of the
actual experimental data: The rows of an assay represent features of interest.
......@@ -493,7 +493,8 @@ Annotation of these features is retrieved by using the
function `rowData()` while metadata on the samples can be obtained by a call
to `colData()`. The assay(s) and the row and column data tables are interconnected,
removing a feature or sample from the assay will also remove it from the rowData
and colData.
and colData. The figure gives an overview of the structure of a
`RangedSummarizedExperiment` object.
```{r sumexp, echo=FALSE}
......@@ -513,17 +514,19 @@ text(62.5,82.5,"colData")
```
Therefore, we can preview the actual count data matrix like so:
We can preview the actual count data matrix like so:
```{r pre_crc}
counts_crc <- assay(crc_data)
counts_crc[1:5, 1:5]
```
We filter all the features that have less than 5 counts on average:
We stringently filter all the features that have less than
20 counts on average to speed up the computates and focus on the highly
expressed features:
```{r crc_filtering}
counts_crc <- counts_crc[rowMeans(counts_crc) >= 5, ]
counts_crc <- counts_crc[rowMeans(counts_crc) >= 20, ]
dim(counts_crc)
```
......@@ -538,17 +541,22 @@ nrow(colData(crc_data))
Where we subseted on the interesting columns giving us information on the samples.
We have various normal, tumor and metastasis tissues from 8 donors. The authors
had one pipeline for the quantification of miRNA and one for the quantification
of "ordinary" genes. But as the recount2 resource uses a unified pipeline with
had one pipeline for the quantification of miRNA only, where only reads overlapping
miRNAs were counted, and a second one were all mRNAs were quanitifed.
As the recount2 resource uses a unified pipeline with
annotation data from the [genecode project](https://www.gencodegenes.org/stats/archive.html), which contains non coding features as well, we can focus on the samples annotated
as "mRNA". We will now create a column data table that contains the appropriate
sample annotation.
Specifically, we have to extract the sample annotations from
the `title` column and subset on the samples processed using mRNA
the `title` column of the `colData` and subset on the samples processed using mRNA
based quantification. This can be done using [regular expressions](http://www.zytrax.com/tech/web/regex.htm) and
the function `extract` from `r CRANpkg("tidyr")`. We subset the
count table accordingly.
the function `extract` from `r CRANpkg("tidyr")`. The regualr exptession
looks a bit scary at first sight, however it is simply tells R to extract
3 alphanumeric (so numbers or letters) characters separated by an underscore.
Finally, we subset the count table to retain only the samples annotated
as mRNA.
```{r create_crc_col_data}
......@@ -635,6 +643,9 @@ count_boxplot_tissue + facet_grid( patient ~ .)
```
## Computing size factors
Computing size factors is easy: first we create a reference sample by computing
......@@ -685,11 +696,117 @@ counts_crc_tidy <- mutate(counts_crc_tidy, count_norm = count / sf )
```
## Exercise: Data normalization
## Mean--Difference plots for normalization checking
Another useful function for checking the result of our normalization efforts
are so called mean--difference plots (also known as MA plots), which takes
two samples and plots their mean against their difference on the log2 scale.
A LOESS fit is then added. If the samples are properly normalized, the values
should scatter around a vertical lien at the origin. You can also think of
an MA--plot as a scattterplot rotated by 45°. It shows deviations across the whole
range of expression values and thus very helpful for seeing how similar values
for two different samples actually are.
In order to avoid a large number of pairwise plots,
one can also create a reference sample and compare every sample to this
reference. Here, we create a reference sample by taking the mean on the log2
scale across a features. (Similar to the goemetric mean). Instead of the ordinary
log2 function we us the __generlized logarithm__, which behaves linearly for small
values (less than 5 say) and like log2 for large ones. It avoids the strong
negative slope of the logarithm near zero and thus shrinks fold changes
for small counts to zero.
```{r MAplot_function}
glog2 <- function(x) ((asinh(x)-log(2))/log(2))
ref_sample <- group_by(counts_crc_tidy, ensembl_id) %>%
summarize(ref_sample = mean(glog2(count_norm )))
counts_crc_tidy <- left_join(counts_crc_tidy, ref_sample) %>%
mutate(M = ref_sample - glog2(count_norm),
A = (ref_sample + glog2(count_norm))/ 2)
```
We have first computed the expression values for the ref sample and then
joined it to our tidy count table. Then, we create the M and A values for
all samples.
For the creation of the MA plot, we only use a random fraction of 10% of the
data in order to avoid long computation times. We use a hexbin representation
of the data points to avoid overplotting.
In a hexbin plot, the data points are binned in 2D and the number of counts
per bin is then color-coded. So it is essentially like a 2D histogram.
We create out own colorscale in order to increase the contrast: the default
corresponds to equally sized bins, while we want to have more color range
for the larger bins.
We also set the aspect ratio to 2, since the range of the mean is half the range of
the difference.
```{r createMAs}
colorscale <- scale_fill_gradientn(
colors = rev(brewer.pal(9, "YlGnBu")),
values = c(0, exp(seq(-5, 0, length.out = 200))^0.5))
ma_pl <- ggplot(aes(A, M), data = sample_frac(counts_crc_tidy,
size = 0.1)) +
facet_wrap(~ sample_id) +
geom_hex(binwidth = c(0.5, 0.5)) +
geom_smooth(method = "loess", se = FALSE, col = "#5D84C5", span = .4) +
geom_hline(aes(yintercept = 0), col = "#9850C3") +
ylim(c(-5, 5)) +
coord_fixed(ratio = 2) +
colorscale +
ggtitle("MA plot (vs reference sample)")
ma_pl
```
### Exercise: Data normalization
1. Create a plot where you compare log2 and glog2, When is log2 roughly
equal to glog2? Hint: use the function `seq` to create value to evaluate
a the function on.
2. Create MA plots of the normalized crc data using transparent points (alpha
parameter) instead of a hexbin plot. Which one is better in your opinion?
```{r ex_data_norm_plot}
comp_data <- tibble(x = seq(0, 10, by = 0.1),
log2 = log2(x),
glog2 = glog2(x)) %>%
gather(key = "func",
value = "value", log2, glog2)
ggplot(data = comp_data,
aes(x = x, y = value, color = func)) +
geom_line()
Create boxplots of the normalized data and compare it to the raw one.
ma_pl_points <- ggplot(aes(A, M), data = sample_frac(counts_crc_tidy,
size = 0.1)) +
facet_wrap(~ sample_id) +
geom_point(alpha = 0.1) +
geom_smooth(method = "loess", se = FALSE, col = "#5D84C5", span = .4) +
geom_hline(aes(yintercept = 0), col = "#9850C3") +
ylim(c(-5, 5)) +
coord_fixed(ratio = 2) +
ggtitle("MA plot (vs ref sample)")
```{r data_norm_plot}
ma_pl_points
```
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment