Commit 1ba0a426 authored by Bernd Klaus

first final version with slides, corrected severe fac ana bug (variance based selection on samples not genes)
parent bc325de9
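
The commit message attributes the factor-analysis bug to variance-based selection being applied to samples rather than genes. The corrected code is not visible here because that diff is collapsed, so the following is only a minimal sketch of the general pitfall, assuming a genes-by-samples matrix. The names `expr`, `gene_var`, `sample_var`, and the toy dimensions are hypothetical and not taken from the repository.

```r
# Hypothetical toy data: 5 genes (rows) x 4 samples (columns)
set.seed(1)
expr <- matrix(rnorm(20), nrow = 5,
               dimnames = list(paste0("gene", 1:5),
                               paste0("sample", 1:4)))

# Variance per gene: apply over rows (MARGIN = 1)
gene_var <- apply(expr, 1, var)

# Keep the 2 most variable genes; all samples are retained
top_genes <- names(sort(gene_var, decreasing = TRUE))[1:2]
expr_sel <- expr[top_genes, , drop = FALSE]

# The mistake described in the commit message would correspond to
# MARGIN = 2, which yields one variance per sample instead of per gene
sample_var <- apply(expr, 2, var)
```

Comparing `length(gene_var)` with `nrow(expr)` is a cheap way to confirm that the variances were computed along the intended dimension.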
......@@ -17,3 +17,9 @@ data/nomarkerCellsClustering.RData
data/zinb.RData
get_clustering_aljeandro.R
norm.pdf
import_mtec_gene_expression_data.R
Slides_stat_methods_bioinf/slides_data_handling_bioinf_cache
Slides_stat_methods_bioinf/slides_factor_ana_testing_ml_cache
Slides_stat_methods_bioinf/slides_graphics_bioinf_cache
Slides_stat_methods_bioinf/SRP022054
This diff is collapsed.
This source diff could not be displayed because it is too large. You can view the blob instead.
......@@ -33,8 +33,10 @@ library("forcats")
library("openxlsx")
library("readxl")
library("limma")
library("Single.mTEC.Transcriptomes")
theme_set(theme_gray(base_size = 18))
library("ggthemes")
theme_set(theme_solarized(base_size = 18))
data_dir <- file.path("data/")
......@@ -228,7 +230,8 @@ head(sc_qc_stats_wide[[idx]], 5)
## ----mutate-example------------------------------------------------------
sc_qc_stats_wide <- mutate(sc_qc_stats_wide,
sum_perc = concordantMult + concordantUniq + nomap + others)
sum_perc = concordantMult + concordantUniq + nomap
+ others)
all(round(sc_qc_stats_wide$sum_perc) == 100)
......@@ -269,7 +272,7 @@ worst_cells
## ----data_q_pcr----------------------------------------------------------
q_pcr <- read_csv(file.path(data_dir, "qPCR.csv"))
sample_n(q_pcr, 5)
head(q_pcr, 5)
## ----add_rep_info--------------------------------------------------------
q_pcr <- q_pcr %>%
......@@ -287,11 +290,11 @@ summary_across_reps <- group_by(q_pcr,
spread = abs(diff(C_T))) %>%
ungroup()
sample_n(summary_across_reps, 5)
head(summary_across_reps, 5)
## ----left_join_summ, dependson="sum_across_reps"-------------------------
augmented_table <- left_join(q_pcr, summary_across_reps )
sample_n(augmented_table, 5)
head(augmented_table, 5)
## ----ex_high_qual, echo=FALSE, results="hide"----------------------------
......
......@@ -3,7 +3,7 @@ title: "Data handling for bioinformatics"
author: "Bernd Klaus"
date: "`r doc_date()`"
output:
BiocStyle::html_document2:
BiocStyle::html_document:
toc: true
highlight: tango
self_contained: true
......@@ -20,7 +20,7 @@ bibliography: stat_methods_bioinf.bib
<!--
graphics.off();rm(list=ls());rmarkdown::render('data_handling_bioinf.Rmd',
output_format = "all");purl('data_handling_bioinf.Rmd')
output_format = "BiocStyle::html_document");purl('data_handling_bioinf.Rmd')
rmarkdown::render('data_handling_bioinf.Rmd', output_format =
BiocStyle::pdf_document(toc = TRUE, highlight = "tango",
......@@ -71,8 +71,10 @@ library("forcats")
library("openxlsx")
library("readxl")
library("limma")
library("Single.mTEC.Transcriptomes")
theme_set(theme_gray(base_size = 18))
library("ggthemes")
theme_set(theme_solarized(base_size = 18))
data_dir <- file.path("data/")
......@@ -141,7 +143,8 @@ started by simply typing:
```
What does R do? It sums up the two numbers and returns the scalar value 10.
In fact, R returns a vector of length 1 - hence the [1] denoting first element of the vector.
In fact, R returns a vector of length 1 - hence the [1] denoting first element
of the vector.
We can assign objects values for subsequent use. For example:
......@@ -153,7 +156,8 @@ z
```
does the same calculation as above, storing the result in an object called `z`.
We can look at the contents of the object by simply typing its name. At any time we can
We can look at the contents of the object by simply typing its name. At any
time we can
list the objects which we have created:
```{r simple_ex_3, echo = TRUE}
......@@ -162,17 +166,21 @@ ls()
Notice that `ls` is actually an object itself. Typing `ls`
would result in a display of the contents of
this object, in this case, the commands of the function. The use of parentheses, `ls()`, ensures that
the function is executed and its result --- in this case, a list of the objects in the current environment --- displayed.
this object, in this case, the commands of the function. The use of parentheses,
`ls()`, ensures that
the function is executed and its result --- in this case, a list of the objects
in the current environment --- displayed.
More commonly, a function will operate on an object, for example
```{r simple_ex_4, echo = TRUE}
sqrt(16)
```
calculates the square root of 16. Objects can be removed from the current workspace with the
calculates the square root of 16. Objects can be removed from the current
workspace with the
function `rm()`. There are many standard functions available in R,
and it is also possible to create new ones. Vectors can be created in R in a number of ways.
and it is also possible to create new ones. Vectors can be created in R in
a number of ways.
For example, we can list all of the elements:
```{r simple_ex_5, echo = TRUE}
......@@ -201,9 +209,11 @@ seq(1, 9, by = 2)
seq(8, 20, length = 6)
```
These examples illustrate that many functions in R have optional arguments, in this case, either
the step length or the total length of the sequence (it doesn't make sense to use both). If you leave
out both of these options, R will make its own default choice, in this case assuming a step length
These examples illustrate that many functions in R have optional arguments,
in this case, either the step length or the total length of the sequence
(it doesn't make sense to use both). If you leave
out both of these options, R will make its own default choice, in this case
assuming a step length
of 1. So, for example,
......@@ -220,7 +230,8 @@ or simply `?functionname` where
`functionname` is the name of the function you are interested in.
This will usually help and will often include
examples to make things even clearer.
Another useful function for building vectors is the `rep` command for repeating things:
Another useful function for building vectors is the `rep` command for
repeating things:
the first command will repeat the vector `r 1:3` six times, while the second
one will repeat each element six times.
......@@ -281,7 +292,8 @@ summary(x)
It may be, however, that we subsequently learn that the first
6 data points correspond to measurements made in one experiment,
and the second six on another experiment.
This might suggest summarizing the two sets of data separately, so we would need to extract from
This might suggest summarizing the two sets of data separately,
so we would need to extract from
`x` the two relevant subvectors. This is achieved by subscripting:
```{r subscr, echo = TRUE}
......@@ -293,7 +305,8 @@ summary(x[7:12])
You simply put the indexes of the element you want to access in square brackets.
Note that R starts counting from 1 onwards.
Other subsets can be created in the obvious way. Putting a minus in front, excludes
Other subsets can be created in the obvious way. Putting a minus in
front, excludes
the elements:
```{r subscr_2, echo = TRUE}
......@@ -327,7 +340,8 @@ rank(x)
## Matrices
Matrices are two--dimensional vectors and
can be created in R in a variety of ways. Perhaps the simplest is to create the columns
can be created in R in a variety of ways. Perhaps the simplest is to
create the columns
and then glue them together with the command `cbind`. For example:
......@@ -352,9 +366,11 @@ The functions `cbind` and `rbind` can also be applied to matrices themselves
Notice that the dimension of the matrix is determined
by the size of the vector and the requirement that the number of
rows is 3 in the example above, as specified by the
argument `nrow = 3`. As an alternative we could have specified the number of columns with the
argument `ncol = 2` (obviously, it is unnecessary to give both). Notice that the matrix is "filled up"
column-wise. If instead you wish to fill up row-wise, add the option `byrow=TRUE`.
argument `nrow = 3`. As an alternative we could have specified
the number of columns with the argument `ncol = 2` (obviously, it is
unnecessary to give both). Notice that the matrix is "filled up"
column-wise. If instead you wish to fill up row-wise, add the option
`byrow=TRUE`.
```{r Matrix-ex, echo = TRUE}
......@@ -553,7 +569,8 @@ sample_vector[c(TRUE, FALSE, TRUE)]
This can also be used in conjunction with logical tests
which generate a boolean result. Boolean vectors can be combined with logical operators
which generate a boolean result. Boolean vectors can be combined
with logical operators
to create more complex filters.
```{r access_boolean2, dependson="acces_recap"}
......@@ -638,7 +655,8 @@ Now, a tidy data frame has:
3. one cell per value
Tidy data will result in a "long" data table and is the most appropriate format for
Tidy data will result in a "long" data table and is the most appropriate
format for
data handling. It allows straightforward subsetting and group--wise
computations. However, a "wide" data table is better for viewing.
......@@ -667,7 +685,8 @@ sc_qc_stats_wide
We now have separate columns for every mapping type, which makes it easier to
view the data. This data is not "tidy", as the columns we created contain both
the percent as well as the type variable --- We have multiple variables per column.
the percent as well as the type variable --- We have multiple variables
per column.
The function `gather` can be used to obtain the original
format again:
......@@ -690,7 +709,8 @@ We will also use them extensively later on when discussing the
Since the first thing you do in a data manipulation task is to subset/transform
your data, it includes "verbs" that provide basic functionality. We will
introduce these in the following. The command structure for all `r CRANpkg("dplyr")`
introduce these in the following. The command structure for all
`r CRANpkg("dplyr")`
verbs is :
* first argument is a data frame
......@@ -762,7 +782,7 @@ the column giving the percentage of uniquely mapping reads.
select(sc_qc_stats_wide, concordantUniq)
```
The `select` does is similar to the basic R access options for data
The `select` is similar to the basic R access options for data
frame: the square bracket for index/name based access.
```{r base_r_select}
......@@ -788,7 +808,8 @@ We implement a column holding a sum of all percentages as sanity check:
```{r mutate-example}
sc_qc_stats_wide <- mutate(sc_qc_stats_wide,
sum_perc = concordantMult + concordantUniq + nomap + others)
sum_perc = concordantMult + concordantUniq + nomap
+ others)
all(round(sc_qc_stats_wide$sum_perc) == 100)
```
......@@ -906,7 +927,7 @@ to as *C_T* value.
```{r data_q_pcr}
q_pcr <- read_csv(file.path(data_dir, "qPCR.csv"))
sample_n(q_pcr, 5)
head(q_pcr, 5)
```
......@@ -946,7 +967,7 @@ summary_across_reps <- group_by(q_pcr,
spread = abs(diff(C_T))) %>%
ungroup()
sample_n(summary_across_reps, 5)
head(summary_across_reps, 5)
```
We now might want to combine this information to the original data table, this
......@@ -959,7 +980,7 @@ join types can be found here: <https://cran.r-project.org/web/packages/dplyr/vig
```{r left_join_summ, dependson="sum_across_reps"}
augmented_table <- left_join(q_pcr, summary_across_reps )
sample_n(augmented_table, 5)
head(augmented_table, 5)
```
`r CRANpkg("dplyr") ` will automatically try to identify matching columns, and
......
No preview for this file type
......@@ -69,6 +69,14 @@ citep("10.2202/1544-6115.1175")
citep("10.1214/09-AOAS277")
# Finak et al., 2015
citep("10.1186/s13059-015-0844-5")
### export citations
write.bibtex(file = "stat_methods_bioinf.bib")
knitcitations::cleanbib()
......
......@@ -4,7 +4,7 @@ options(digits=3, width=80)
golden_ratio <- (1 + sqrt(5)) / 2
opts_chunk$set(echo=TRUE,tidy=FALSE,include=TRUE,
dev=c('png', 'pdf', 'svg'), fig.height = 5, fig.width = 4 * golden_ratio, comment = ' ', dpi = 300,
cache = TRUE)
cache = TRUE, warning = FALSE)
## ---- echo=FALSE, cache=FALSE--------------------------------------------
print(date())
......@@ -38,52 +38,13 @@ library("pheatmap")
library("tidyverse")
library("Rtsne")
library("devtools")
library("ggthemes")
theme_set(theme_gray(base_size = 18))
theme_set(theme_solarized(base_size = 18))
data_dir <- file.path("data/")
## ----import_gene_expression, echo=FALSE, eval=FALSE----------------------
## # Here we import the gene expression data using only the subset of highly
## # variable genes, resave it
## load(file.path(data_dir, "deGenesNone.RData"))
## load(file.path(data_dir, "mTECdxd.RData"))
##
## mtec_counts <- counts(dxd)[deGenesNone, ]
##
## mtec_counts <- as_tibble(rownames_to_column(as.data.frame(mtec_counts),
## var = "ensembl_id"))
##
## mtec_cell_anno <- as_tibble(as.data.frame(colData(dxd))) %>%
## modify_if(.p = is.factor, as.character)
##
## data("biotypes")
## data("geneNames")
##
## mtec_gene_anno <- tibble(biotype) %>%
## add_column(ensembl_id = names(biotype),
## gene_name = geneNames,
## .before = "biotype")
##
## mtec_cell_anno <- colData(dxd)
##
## save(mtec_counts, file = file.path(data_dir, "mtec_counts.RData"),
## compress = "xz")
##
## save(mtec_cell_anno, file = file.path(data_dir, "mtec_cell_anno.RData"),
## compress = "xz")
##
##
## save(mtec_gene_anno, file = file.path(data_dir, "mtec_gene_anno.RData"),
## compress = "xz")
##
## tras <- as_tibble(tras)
##
## save(tras, file = file.path(data_dir, "tras.RData"),
## compress = "xz")
## ----import_data---------------------------------------------------------
load(file.path(data_dir, "mtec_counts.RData"))
load(file.path(data_dir, "mtec_cell_anno.RData"))
......@@ -236,7 +197,8 @@ col_data_crc <- select(as.data.frame(colData(crc_data)),
counts_crc <- counts_crc[, col_data_crc$sample_id]
## ----mrna_counts---------------------------------------------------------
counts_crc_tidy <- rownames_to_column(data.frame(counts_crc), var = "ensembl_id") %>%
counts_crc_tidy <- rownames_to_column(data.frame(counts_crc),
var = "ensembl_id") %>%
as_tibble() %>%
gather(key = "sample_id", value = "count", -ensembl_id) %>%
dplyr::filter(sample_id %in% col_data_crc$sample_id)
......@@ -248,7 +210,8 @@ sample_medians <- group_by(counts_crc_tidy, sample_id) %>%
counts_crc_tidy <- left_join(counts_crc_tidy, sample_medians,
by = c("sample_id" = "sample_id")) %>%
dplyr::arrange(sample_median) %>%
dplyr::mutate(sample_id_by_median = as_factor(sample_id)) %>%
dplyr::mutate(sample_id_by_median =
as_factor(sample_id)) %>%
left_join(col_data_crc)
count_boxplot <- ggplot(counts_crc_tidy,
......@@ -258,6 +221,7 @@ count_boxplot <- ggplot(counts_crc_tidy,
geom_boxplot() +
ylim(c(0, 10)) +
scale_fill_brewer(palette = "Paired") +
ggtitle("Boxplots of raw counts") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
count_boxplot
......@@ -269,7 +233,9 @@ count_boxplot_tissue <- ggplot(counts_crc_tidy,
fill = tissue)) +
geom_boxplot() +
ylim(c(0, 10)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Boxplots of raw counts") +
scale_fill_brewer(palette = "Dark2")
count_boxplot_tissue
......@@ -296,6 +262,19 @@ counts_crc_tidy <- left_join(counts_crc_tidy,
## ----norm_data-----------------------------------------------------------
counts_crc_tidy <- mutate(counts_crc_tidy, count_norm = count / sf )
## ----boxplot_norm, eval=TRUE, fig.show="hide", echo=FALSE, warning = FALSE----
count_boxplot_norm <- ggplot(counts_crc_tidy,
aes(x = sample_id_by_median,
y = log2(count_norm),
fill = sample_id) ) +
geom_boxplot() +
ylim(c(0, 10)) +
scale_fill_tableau(palette = "cyclic") +
ggtitle("Boxplots of normalized counts") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
count_boxplot_norm
## ----MAplot_function-----------------------------------------------------
glog2 <- function(x) ((asinh(x)-log(2))/log(2))
......@@ -357,7 +336,8 @@ ma_pl_points
## ----scran_norm----------------------------------------------------------
mtec_scran_sf <- tibble(scran_sf = computeSumFactors(as.matrix(mtec_counts[, -1])),
mtec_scran_sf <- tibble(scran_sf = computeSumFactors(
as.matrix(mtec_counts[, -1])),
cell_id = colnames(mtec_counts)[-1])
mtec_cell_anno <- left_join(mtec_cell_anno, mtec_scran_sf,
......@@ -371,7 +351,8 @@ mtec_cell_anno <- left_join(mtec_cell_anno, mtec_scran_sf,
## geom_smooth()
## ----createNormCounts----------------------------------------------------
counts_norm_crc_mat <- select(counts_crc_tidy, sample_id, ensembl_id, count_norm) %>%
counts_norm_crc_mat <- select(counts_crc_tidy, sample_id, ensembl_id,
count_norm) %>%
spread(key = sample_id, value = count_norm) %>%
{
tmp <- as.data.frame(.)
......
......@@ -82,6 +82,19 @@
journal = {Nature Biotechnology},
}
@Article{Finak_2015,
doi = {10.1186/s13059-015-0844-5},
url = {https://doi.org/10.1186/s13059-015-0844-5},
year = {2015},
month = {dec},
publisher = {Springer Nature},
volume = {16},
number = {1},
author = {Greg Finak and Andrew McDavid and Masanao Yajima and Jingyuan Deng and Vivian Gersuk and Alex K. Shalek and Chloe K. Slichter and Hannah W. Miller and M. Juliana McElrath and Martin Prlic and Peter S. Linsley and Raphael Gottardo},
title = {{MAST}: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell {RNA} sequencing data},
journal = {Genome Biology},
}
@Article{Jaffe_2015,
doi = {10.1186/s12859-015-0808-5},
url = {https://doi.org/10.1186/s12859-015-0808-5},
......