Commit 1ba0a426 authored by Bernd Klaus's avatar Bernd Klaus

first final version with slides, corrected severe fac ana bug (variance based...

first final version with slides, corrected severe fac ana bug (variance based selection on samples not genes)
parent bc325de9
......@@ -17,3 +17,9 @@ data/nomarkerCellsClustering.RData
data/zinb.RData
get_clustering_aljeandro.R
norm.pdf
import_mtec_gene_expression_data.R
Slides_stat_methods_bioinf/slides_data_handling_bioinf_cache
Slides_stat_methods_bioinf/slides_factor_ana_testing_ml_cache
Slides_stat_methods_bioinf/slides_graphics_bioinf_cache
Slides_stat_methods_bioinf/SRP022054
---
title: "Data handling for bioinformatics"
author: "Bernd Klaus"
date: "December 7, 2017"
output:
slidy_presentation:
df_print: paged
font_adjustment: +1
fig_width: 6
fig_height: 3.71
---
```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
options(digits = 3, width = 80)
opts_chunk$set(echo = TRUE, tidy = FALSE, include = TRUE,
dev = 'png', fig.width = 6, fig.asp = 0.618,
comment = ' ', dpi = 300,
cache = TRUE)
data_dir <- file.path("../data")
library("readxl")
library("BiocStyle")
library("knitr")
library("tidyverse")
library("RColorBrewer")
library("stringr")
library("pheatmap")
library("matrixStats")
library("purrr")
library("fdrtool")
library("readr")
library("gtools")
library("factoextra")
library("magrittr")
library("entropy")
library("forcats")
library("plotly")
library("corrplot")
library("car")
library("forcats")
library("openxlsx")
library("readxl")
library("limma")
library("ggthemes")
theme_set(theme_solarized(base_size = 18))
```
## Basics of arithmetics and data handling in R
## Simple arithmetics and vectors
```{r simple_arithmetics}
x <- 6
y <- 4
z <- x + y
z
```
* The most basic elementary data structure in R are vectors
* can be created by concatenating individual elements via the **c** function
```{r vectors}
x <- c(7.5, 8.2, 3.1, 5.6, 8.2)
```
--------
* Subsets are created by the bracket operator (1-based counting!)
* The indices of the elements you want to access are given inside the square brackets
* __head__ gives the first 6 elements of a data structure
```{r vector access}
head(x)
x[c(1, 2, 4)]
x[-(1:3)]
```
## Matrices in R
* Matrices are two--dimensional vectors
* the simplest is to create the columns and then glue them together with the command __cbind__
```{r cbind-ex, echo = TRUE}
x <- c(5, 7, 9)
y <- c(6, 3, 4)
z <- cbind(x, y)
z
dim(z)
```
-----
* Access is now works by specifying indices for rows and columns
```{r cbind-ex_acces, echo = TRUE}
z[c(1,2), ]
z[, -1]
z[2, ]
```
## Data frames (tibbles) and lists
* A data frame is a matrix where the __columns can have different data types__
* Rows represent the samples and columns the variables
* the tidyverse equivalent of a **data frame** is called a **tibble**
---
* import a small csv table with __read_csv__ from the __readr__ package
```{r load-Patients, echo = TRUE}
pat <- read_csv("http://www-huber.embl.de/users/klaus/BasicR/Patients.csv")
pat
```
## Accessing data in data frames
* **filter** (for rows) and **select** (for columns / variables) from the tidyverse package **dplyr**
```{r subset_data}
pat_tiny <- filter(pat, Height < 1.7)
select(pat_tiny, PatientId, Height, Gender)
```
There are a couple of operators useful for comparisons:
* `Variable == value`: equal
* `Variable != value`: unequal
* `Variable < value`: less
* `Variable > value`: greater
* `&: and`
* `|` or
* `!`: negation
* `%in%`: is element?
----
* You can also access data frames via indices or row / column names
```{r df_access_old}
pat[2, c("PatientId", "Height")]
pat[2, c(1, 2)]
```
## Vectors with arbitrary contents: Lists
```{r list_example, echo = TRUE}
L <- list(one = 1, two = c(1, 2), five = seq(1, 4, length = 5),
list(string = "Hello World"))
L
```
------------------
* access works via the double bracket operator
* recursively accesses the list contents
<http://r4ds.had.co.nz/vectors.html#visualising-lists>
```{r list_access, echo = TRUE}
names(L)
L$five + 10
L[[3]] + 10
L[["two"]]
```
---
* data frames = special lists
=> can be accessed in the same way
```{r list-example_df, echo = TRUE}
pat$Height
pat[[2]]
pat[["Gender"]]
```
## Summary: data access in R
We prape a simple vector to illustrate the access options again:
```{r acces_recap}
sample_vector <- c("Alice" = 5.4, "Bob" = 3.7, "Claire" = 8.8)
sample_vector
```
## Access by index
* provide a vector of indices to indicate the elements to retrieve
* a minus sign excludes the respective positions
```{r access_index, dependson="accesRecap"}
sample_vector[1:2]
sample_vector[-(1:2)]
```
## Access by boolean
* Using a boolean (TRUE / FALSE) vector that has the same length as
your data will pull out the elements for which the boolean vector is TRUE.
```{r access_boolean, dependson="acces_recap"}
sample_vector[c(TRUE, FALSE, TRUE)]
```
* can also be used in conjunction with logical tests which generate a boolean result
* Boolean vectors can be combined with logical operators to create more complex filters
```{r access_boolean2, dependson="acces_recap"}
sample_vector[sample_vector < 6]
```
## Access by name
if there are names such as
column names present (note that rowname are not preserved in the tidyverse),
you can access by name as well:
```{r access_name}
sample_vector[c("Alice", "Claire")]
```
## (single cell) RNA--Seq data
![Figure by Thomas Shafee](Summary_of_RNA-Seq.png)
* The capture effeciency of scRNA--Seq is quite
only around 10%
* data obtained often quite sparse
* Many genes are simply not captured.
## The Brennecke et. al. sc--RNA--Seq mTEC data
* self--antigen--tolerance is largely established in the thymus,
where T cells develop.
* T--cells encounter most tissue--specific constituents already there
* ... imposing a broad scope of tolerance before T cells circulate through the body
* This "training" of the T--cells is neccessary to prevent autoimunity
* __It happens by the "promiscuous" expression of tissue-specific antigens
(TRAs) in medullary thymic epithelial cells (mTECs)__
**How is this process is regulated** in single mTECs?
<img src="Illu_thymus.jpg" style="width: 25%; height: 25%" >
## mTEC quality control data
For each sample (single cell) the reads were classified as either:
* mapping uniquely to the reference genome,
* mapping multiple times to the reference genome
* reads which could not be assigned to any position of the reference genome
* others (e.g read pairs were only one read pair mate could be mapped).
```{r load_inspect_qc_data}
load(file.path(data_dir, "sc_qc_stats.RData"))
sc_qc_stats
```
## The concept of tidy data
* a dataset is a collection of values; Values are organized in two ways:
Every value belongs to a variable and an observation.
* A variable contains all values that measure an underlying attribute
across units.
* An observation contains all values measured on the same unit across attributes.
Now, a tidy data frame has:
1. one row per observation
2. one column per variable
3. on cell per value
The QC data is tidy, but **long**, __for viewing__ it will be nicer to have
a __wide__ table with different types in one row.
## The tidyr package
The package `r CRANpkg("tidyr")` has two main functions that allows
us to go back and forth:
* `gather()` takes multiple columns, and gathers them
into key--value pairs: it makes "wide" data longer.
* `spread()` takes two columns (key & value) and
spreads into multiple columns, it makes "long" data wider.
## Widening the QC data
Widening our data frame is simple: Our key column is `type` and our value column is `percent`
```{r spread_qc_stats}
sc_qc_stats_wide <- spread(sc_qc_stats, key = type, value = percent)
sc_qc_stats_wide
```
* We now have separate columns for every mapping type
## Going back again
The function `gather` can be used to obtain the original
format again:
```{r gather_qc_stats}
sc_qc_stats_org <- gather(sc_qc_stats_wide, key = "type",
value = "percent", concordantMult:others)
sc_qc_stats_org
```
## Basic data manipulation with dplyr verbs
* The package `r CRANpkg("dplyr")` provides a "grammar" of data manipulation.
* It includes "verbs" that provide basic functionality.
* structure for all `r CRANpkg("dplyr")` verbs is :
- first argument is a data frame
- return value is a data frame
- nothing is modified in place
Note that `r CRANpkg("dplyr")` generally does not preserve row names.
## Selecting rows (observations) by their values with `filter()`
The function `filter()`
```{r mapped_reads}
filter(sc_qc_stats_wide, concordantUniq > 80 )
```
## Arrange rows (samples) with `arrange()`
`arrange()` reorders rows according to columns
```{r arange_asc}
arrange(sc_qc_stats_wide, concordantUniq, nomap)
```
## Select columns (variables) by their names with `select()`
```{r select}
select(sc_qc_stats_wide, concordantUniq)
```
The `select` does is similar to the basic R acccess options for data
frame: the square bracket for index/name based access.
## Create new variables using existing ones with `mutate()`
```{r mutate-example}
sc_qc_stats_wide <- mutate(sc_qc_stats_wide,
sum_perc = concordantMult + concordantUniq + nomap + others)
all(round(sc_qc_stats_wide$sum_perc) == 100)
```
* Indeed, the different categories add up to one
## Grouped summaries: `group_by()` and `summarize()`
* `group_by()` creates groups in a data frame
* to which you can then apply any set of functions
*and finally obtain a new data frame via `summarize()`.
Let's look at the mean alignment rates per batch:
```{r mean_per_batch}
sc_qc_stats_per_batch <- group_by(sc_qc_stats, type, batch)
sc_qc_mean_align_per_batch <- filter(summarize(sc_qc_stats_per_batch,
mean_align_rate = mean(percent)),
type == "concordantUniq")
sc_qc_mean_align_per_batch
```
## The piping / chaining operator
* ` x %>% f(y) == f(x, y)` :
```{r chainingSimpleExample_1}
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1 - x2)^2))
```
```{r chainingSimpleExample_2}
(x1 - x2)^2 %>%
sum() %>%
sqrt()
```
We can simplify our code like this:
```{r mean_per_batch_pipe}
sc_qc_mean_align_per_batch <- group_by(sc_qc_stats, batch, type ) %>%
summarize(mean_align_rate = mean(percent)) %>%
filter(type == "concordantUniq")
```
It is now clear that you group first,
then you summarize and then you filter.
## Filtering per group
Filtering also works on grouped data frames, we can identify the cell with
the worst alignment rate in each batch like this:
```{r worst_cell}
worst_cells <- group_by(sc_qc_stats, batch, type ) %>%
filter(percent == min(percent),
type == "concordantUniq") %>%
arrange(percent)
worst_cells
```
## Combining tables with related information
* qPCR data for various genes in two replicates; *C_T* values
```{r data_q_pcr}
q_pcr <- read_csv(file.path(data_dir, "qPCR.csv"))
head(q_pcr, 5)
```
## Adding columns group--wise
* we want to add replicate information to the table
* the dot . refers to the current data
```{r add_rep_info}
q_pcr <- q_pcr %>%
group_by(sample_name) %>%
{.$replicate = c("r_1", "r_2")
.} %>% ungroup
head(q_pcr)
```
## Computing summaries
* the mean and the variability of the *C_T* values
```{r sum_across_reps, dependson="add_rep_info"}
summary_across_reps <- group_by(q_pcr,
sample_name,
target_Name) %>%
dplyr::summarize(mean = mean(C_T, na.rm = TRUE),
spread = abs(diff(C_T))) %>%
ungroup()
sample_n(summary_across_reps, 5)
```
## Performing a join operation
* left join will keep all rows in the first table
```{r left_join_summ, dependson="sum_across_reps"}
augmented_table <- left_join(q_pcr, summary_across_reps )
head(augmented_table, 5)
```
This source diff could not be displayed because it is too large. You can view the blob instead.
---
title: "Factor analysis, testing and machine learning for bioinformatics"
author: "Bernd Klaus"
date: "December 11, 2017"
output:
slidy_presentation:
df_print: paged
font_adjustment: +1
fig_width: 6
fig_height: 3.71
---
```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
options(digits = 2, width = 80, scipen = 40)
golden_ratio <- (1 + sqrt(5)) / 2
opts_chunk$set(echo = TRUE, tidy = FALSE, include = TRUE,
dev = 'png', fig.height = 5, fig.width = 4 * golden_ratio,
comment = ' ', dpi = 300,
cache = TRUE, fig.show = "hide", include = FALSE, echo = FALSE, message = FALSE)
data_dir <- file.path("../data")
library("readxl")
library("BiocStyle")
library("knitr")
library("RColorBrewer")
library("stringr")
library("pheatmap")
library("matrixStats")
library("purrr")
library("readr")
library("magrittr")
library("entropy")
library("forcats")
library("DESeq2")
library("broom")
library("limma")
library("ggthemes")
library("corpcor")
library("sva")
library("zinbwave")
library("clusterExperiment")
library("clue")
library("sda")
library("crossval")
library("randomForest")
library("tidyverse")
theme_set(theme_solarized(base_size = 18))
data_dir <- file.path("../data")
load(file.path(data_dir, "mtec_counts.RData"))