Commit 249a86bd by Bernd Klaus

### finished var transform section, now on to dim reduction and clustering

parent efda91cc
@@ -304,7 +304,13 @@ counts_norm_crc_mat <- select(counts_crc_tidy, sample_id, ensembl_id, count_norm as.matrix() meanSdPlot(counts_norm_crc_mat)$gg + ylim(c(0, 10)) ylim(c(0, 2000)) ## ----varStabCountData---------------------------------------------------- counts_norm_trans_crc <- varianceStabilizingTransformation(counts_crc) meanSdPlot(counts_norm_trans_crc)$gg + ylim(c(0, 5)) ## ----session_info, cache = FALSE----------------------------------------- sessionInfo()

@@ -613,7 +613,7 @@ counts_crc_tidy <- left_join(counts_crc_tidy, We can see tha the size factors computed are the scaled medians on the raw count scale, however they are not identical. margin^[Note that it is nor recommended to scale sequencing libraries by margin^[Note that it is not recommended to scale sequencing libraries by the total library size: it is not robust and artificially limits the range of the data to 0--1]. We can now normalize the data:

@@ -624,8 +624,7 @@ counts_crc_tidy <- mutate(counts_crc_tidy, count_norm = count / sf ) ## Exercise: Data normalization Normalize the data using the computed size factors and create boxplots of the normalized data Create boxplots of the normalized data and compare it to the raw one. {r data_norm_plot}

@@ -680,7 +679,7 @@ The variance of count data depends on the its mean, i.e. the higher the counts, the higher their variance. This is readily seen in a plot of variance against the mean. We use the function meanSdPlot from the r Biocpkg("vsn") package to create such a plot for the normalized RNA--Seq data. normalized bulk RNA--Seq data. This function accepts an expression matrix as an input, so we need to turn our tidy data table into an expression matrix first. The

@@ -733,7 +732,7 @@ A simple variance stabilization method for count data is a log + a pseudocount. Note that for negative binomially distributed data: $\text{Var(log(counts + pseudocount))} \approx = \frac{1}{counts + pseudocount} + \alpha \text{Var(log(counts + pseudocount))} \approx \frac{1}{counts + pseudocount} + \alpha$ where $\alpha$ is the dispersion. The formula shows that for genes with

@@ -747,14 +746,291 @@ In general, are carefull modelling of the variance mean realtionship is preferable to ad--hoc solutions. The package r Biocpkg("DESeq2") provides variance stabilizations that are tailored to RNA--Seq data. We use its function are based on refined approximations using a fitted dispersion--mean relationship. We use its function varianceStabilizingTransformation to obtain normalize and variance--stabilized count data. count data. margin^[Note the size factors for the bulk RNA--Seq data set differ widely. For this scenarion,r Biocpkg("DESeq2") recommends a different transform called rlog.] {r varStabCountData} counts_norm_trans_crc <- varianceStabilizingTransformation(counts_crc) meanSdPlot(counts_norm_trans_crc)$gg + ylim(c(0, 5)) We can see from the plot that the transformation has worked, the data is now normalized and the variance--mean dependency has be aleviated. # move to new doc: Confounding factors, Statistical Testing, Machine Learning
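
Note on the hunk at line 733: the corrected approximation can be motivated by a first-order delta-method argument. The sketch below is not part of the commit. It assumes the usual negative binomial parametrization with mean $\mu$ and dispersion $\alpha$, writes $c$ for the pseudocount, and introduces the symbols $X$, $\mu$ and $c$ purely for illustration.

```latex
% Assumption: X ~ NB with mean \mu and dispersion \alpha, so that
%   Var(X) = \mu + \alpha \mu^2 .
% First-order delta method with g(x) = \log(x + c), g'(\mu) = 1 / (\mu + c):
\begin{align*}
\operatorname{Var}\bigl(\log(X + c)\bigr)
  &\approx g'(\mu)^2 \, \operatorname{Var}(X)
   = \frac{\mu + \alpha \mu^2}{(\mu + c)^2} \\
  &\approx \frac{1}{\mu} + \alpha
   \qquad \text{for } \mu \gg c .
\end{align*}
```

For counts much larger than the pseudocount, $1/\mu$ and $1/(\mu + c)$ are nearly equal, so this agrees with the form used in the hunk: the first term shrinks as the mean grows, and the dispersion $\alpha$ remains as a floor on the variance of the log-transformed counts.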