Newer
Older
---
title: "Visual Exploration of High--Throughput--Microscopy Data"
author: "Bernd Klaus, Andrzej Oles, Mike Smith"
date: "`r doc_date()`"
output:
BiocStyle::html_document:
toc: true
toc_float: true
highlight: tango
code_folding: hide
BiocStyle::pdf_document2:
toc: true
highlight: tango
---
<!--
To compile this document
graphics.off();rm(list=ls());rmarkdown::render('Tutorial_HTM_2016.Rmd');purl('Tutorial_HTM_2016.Rmd')
rmarkdown::render('Tutorial_HTM_2016.Rmd', BiocStyle::pdf_document())
-->
```{r options, include=FALSE}
library(knitr)
options(digits=3, width=80)
opts_chunk$set(echo=TRUE,tidy=FALSE,include=TRUE,
dev='png', fig.width = 6, fig.height = 3.5, comment = ' ', dpi = 300,
cache = TRUE)
```
# Required packages and other preparations
devtools::install_github("aoles/cellh5-R")
```
```{r required packages and data, echo = TRUE}
library(rmarkdown)
library(tidyverse)
library(openxlsx)
library(cellh5)
library(psych)
library(stringr)
library(splots)
```
# Annotation import
```{r}
data_path <- "~/p12_data"
plate_map <- read.xlsx(xlsxFile = file.path(data_path, "plate_mapping.xlsx"))
head(plate_map)
```
# Importing the raw data
* importing using `r Biocpkg("rhdf5")`
* possibly discuss the hdf5 format
```{r readingCellH5, eval=FALSE}
path <- file.path(data_path, "_all_positions.ch5")
c5f <- CellH5(path)
c5_pos <- C5Positions(c5f, C5Plates(c5f))
predictions <- C5Predictions(c5f, c5_pos[[1]], mask = "primary__primary3", as = "name")
c5_pos[["WB08_P1"]] <- NULL
```
# Extract raw data
```{r, eval=FALSE}
raw_data <- sapply(c5_pos, function(pos){
predictions <- C5Predictions(c5f, pos, mask = "primary__primary3", as = "name")
table(predictions)}
)
save(raw_data, file = "raw_data.RData")
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
* discuss score computation from phenotype classification results
# The concept of tidy data
A lot of analysis time is spent on the process of cleaning and preparing
the data. Data preparation is not just a first step, but must be
repeated many over the course of analysis as new problems come to light or new data is
collected. An often neglected, but important aspect of data cleaning is
data tidying: structuring datasets to facilitate analysis.
This "data tidying" includes the ability to move
data between different different shapes.
In a nutshell, a dataset is a collection of values, usually either numbers
(if quantitative) or
strings (if qualitative). Values are organized in two ways. Every value belongs to a
variable and an observation. A variable contains all values that measure the same
underlying attribute (like height, temperature, duration) across units.
An observation contains all values measured on the same unit (like a person, or a day,
or a race) across attributes.
A tidy data frame now organizes the data in such a way that
each observation corresponds to an single line in the data set.
This is in general the most appropriate format for downstream analysis, although
it might not be the most appropriate form for viewing the data.
For a thorough discussion of this topic see the paper by
[Hadley Wickham - tidy data](\href{http://www.jstatsoft.org/v59/i10/paper).
# Reshaping the screen data
* [regex tutorial](http://www.zytrax.com/tech/web/regex.htm)
```{r}
load("raw_data.RData")
tidy_raw_data <- rownames_to_column(as.data.frame(raw_data), var = "class") %>%
gather(key = "well", value = "count", WA01_P1:WC07_P1)
tidy_raw_data$well <- str_replace(tidy_raw_data$well, "^W([A-H][0-9]{2})_P1", "\\1_01")
input_data <- left_join(tidy_raw_data, plate_map, by = c("well" = "Position"))
## Plotting in R: ggplot2
## Creating the PCA plot
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
```{r}
no_cells_per_well <- input_data %>%
group_by(well) %>%
summarize(no_cells = sum(count))
data_with_sums <- left_join(input_data, no_cells_per_well)
# size_factors <- no_cells_per_well$no_cells / geometric.mean(no_cells_per_well$no_cells)
data_for_PCA <- mutate(data_with_sums, perc = count / no_cells,
z_score = logit(perc))
data_for_PCA <- data_for_PCA %>%
select(class, well, z_score) %>%
spread(key = class, value = z_score)
PCA <- prcomp(data_for_PCA[, -1], center = TRUE, scale. = TRUE)
genes <- input_data %>%
group_by(well) %>%
summarize(gene = unique(Gene.Symbol))
genes <- ifelse(is.na(genes$gene), "empty", genes$gene)
dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],
PC3 = PCA$x[,3], PC4 = PCA$x[,4],
genes)
(qplot(PC1, PC2, data = dataGG, color = genes, geom = "text",
label = genes, asp = 1,
main = "PC1 vs PC2, top variable genes", size = I(6))
)
```
# Heatmap of apoptosis z--scores
```{r heatmap_apoptosis}
dat_rows = toupper(letters[1:8])
dat_cols = c(paste0("0",seq(1:9)),seq(10,12))
wells <- data.frame( well = paste0(outer(dat_rows, dat_cols, paste0), "_01"))
full_data <- arrange(full_join(data_for_PCA, wells), well)
plotScreen(list(logistic (full_data$Apoptosis)), ncol = 1, nx = 12, ny = 8,
main = "Apoptosis percentages",
do.names = FALSE, legend.label = "percentage of apoptotic cells ", zrange = c(0,.4) )
```