---
title: "Visual Exploration of High-Throughput-Microscopy Data"
author: "Bernd Klaus, Andrzej Oleś, Mike Smith"
date: "`r doc_date()`"
bibliography: HTM_2016.bib
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    highlight: tango
    code_folding: hide
  BiocStyle::pdf_document2:
    toc: true
    highlight: tango
---
```{r setup, include=FALSE}
library(knitr)
options(digits = 3, width = 80)
opts_chunk$set(echo = TRUE, tidy = FALSE, include = TRUE,
               dev = 'png', fig.width = 6, fig.height = 3.5,
               comment = ' ', dpi = 300, cache = TRUE)
set.seed(2016)
```
# Required packages and other preparations
```{r install_cellh5, eval=FALSE, echo=FALSE}
## this code chunk is disabled from running and displaying
## run in case you need to install/update the 'cellh5' package
devtools::install_github("CellH5/cellh5-R")
```
```{r libraries, message=FALSE, warning=FALSE}
library("rmarkdown")
library("tidyverse")
library("openxlsx")
library("cellh5")
library("psych")
library("stringr")
library("factoextra")
```
# About the tutorial
In this tutorial, we will import a single plate from a high content screen performed on 96-well plates. The input data that we are going to use are class labels for each single cell. These classification results were obtained by a machine learning algorithm trained on the original image features. The data are similar to those of @Neumann_2010: each cell is classified into a mitotic phenotype class.
# Annotation import
We first import the annotation of the plate. This consists of a table that tells us the content of each well on the plate: a well can be transfected with an siRNA targeting a certain gene, it can be empty, or it can contain scrambled siRNAs or negative controls.
```{r import_annotation}
plate_map <- read.xlsx(xlsxFile = file.path("plate_mapping.xlsx"))
head(plate_map)
```
# Importing the raw data
We will now import the raw data. This data is stored in a variant of the HDF5 format called "CellH5", which defines a more restricted sub-format designed specifically to store data from high content screens. More information can be found in the paper by @Sommer_2013.
In the code below, we use the cellh5 R package to import the data. The file `_all_positions.ch5` contains links to the other ch5 files that contain the full data of the plate. We are only interested in the predictions produced by the machine learning algorithm, so we extract only this part of the file.
```{r import_raw}
path <- file.path(switch(.Platform$OS.type, unix = "/g", windows = "Z:"),
                  "embo2016/htm2016/P12_Visual_Exploration/hdf5/_all_positions.ch5")
c5f <- CellH5(path)
c5_pos <- C5Positions(c5f, C5Plates(c5f))
predictions <- C5Predictions(c5f, c5_pos[[1]], mask = "primary__primary", as = "name")
```
# Tabulate the raw data
We now tabulate the raw data: we compute how many cells are assigned to each class for each well. The result is a data matrix, which contains the wells of the screen plate in the columns and the counts for the respective classes in the rows.
This is a typical example of a "wide" data table, where the variables contained in the data set are spread across multiple columns (here we show only the first six columns).
```{r tabulate_raw_data}
raw_data <- sapply(c5_pos, function(pos) {
  predictions <- C5Predictions(c5f, pos, mask = "primary__primary", as = "name")
  table(predictions)
})
raw_data[, 1:6]
```
# Reshaping the screen data and joining the plate annotation
We now reshape the input data, which is in a wide format, into a long format. For this, we first turn the row names into an explicit column and then "gather" all the columns representing wells. This turns all the columns that contain the per-well data into a single "measurement" column that is paired with a "key" column containing the well identifiers.

The result is a "long" data table, which contains one observation per row: in our case, the number of times a cell was assigned to a specific class in every single well. Class in combination with well serves as our "key" here, and the class count is the associated value.
We now want to join the annotation to this data table in the long format. Before we can do this, however, we need to solve a subtle problem: our well identifiers in the imported data are different from the identifiers in the annotation table, so we cannot easily join the two tables together.

We first need to strip the leading "W" from the well identifiers and replace the "P1" suffix by a "01" suffix. We do this using a regular expression. Regular expressions are a powerful tool for handling strings, and one can find a nice tutorial about them here.
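To see what the substitution does in isolation, here is a small sketch on made-up identifiers (the values below are hypothetical examples in the style of the imported data, not actual wells from the screen):

```r
library("stringr")

## hypothetical well identifiers in the style of the imported data
ids <- c("WA01_P1", "WH12_P1")

## "^W" matches the leading "W"; "([A-H][0-9]{2})" captures the row letter
## and two-digit column number; the replacement "\\1_01" keeps the captured
## part and substitutes the "_P1" suffix by "_01"
str_replace(ids, "^W([A-H][0-9]{2})_P1", "\\1_01")
#> [1] "A01_01" "H12_01"
```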
We can now join the annotation to our long table and use the well as the joining key.
```{r reshape_and_join}
tidy_raw_data <- rownames_to_column(as.data.frame(raw_data), var = "class") %>%
  gather(key = "well", value = "count", -class)
sample_n(tidy_raw_data, 6)

tidy_raw_data$well <- str_replace(tidy_raw_data$well, "^W([A-H][0-9]{2})_P1", "\\1_01")

## join annotation
input_data <- left_join(tidy_raw_data, plate_map, by = c("well" = "Position"))
sample_n(input_data, 6)
```
# Using ggplot to create a PCA plot for the data
The data we have live in a ten-dimensional space, as every well contains cells classified into one of ten different classes. In order to produce a succinct overview of the data, one tries to reduce its dimensionality. A popular way to do this is to compute new, artificial variables that are weighted sums of the original variables. The weights are chosen in such a way that the new variables are uncorrelated with each other and retain as much of the original variability (a proxy for information content) as possible. These new variables are called the "principal components" (PCs) of the data.
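As a minimal sketch of this idea, the base R function `prcomp()` computes principal components; the matrix below is a made-up example of class percentages for four wells and three classes, not the actual screen data:

```r
## hypothetical 4 wells x 3 classes matrix of percentages
m <- matrix(c(60, 30, 10,
              55, 35, 10,
              20, 70, 10,
              25, 65, 10),
            nrow = 4, byrow = TRUE)

pca <- prcomp(m, center = TRUE)

## pca$rotation holds the weights defining each PC, pca$x the coordinates
## of the wells in PC space; the PCs are uncorrelated and ordered by the
## amount of variance they retain
round(pca$sdev^2 / sum(pca$sdev^2), 2)
```

In this toy example the first PC captures essentially all of the variability, since the first two classes vary in opposite directions while the third is constant.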
Before we can compute the PCs, we have to make sure that the variables that we have obtained for every single well are normalized. As our data consists of the number of cells in each phenotypic category, a straightforward normalization consists of transforming the counts into percentages by dividing the data for each well by its total number of cells.
## Grouping, summarizing and data transformation
In the code chunk below, we use the `group_by()` function to split the dataset into groups according to the well ID. We then apply the function `sum()` to the counts of each well and use `summarize()` to obtain a table of counts per well. This is an example of a split-apply-combine strategy.
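The strategy can be sketched on a made-up long table of the same shape as `input_data` (the wells, classes and counts below are hypothetical):

```r
library("dplyr")

## hypothetical long table: one row per (well, class) combination
toy <- data.frame(
  well  = rep(c("A01_01", "A02_01"), each = 2),
  class = rep(c("interphase", "mitotic"), times = 2),
  count = c(30, 10, 45, 5)
)

## split by well, apply sum() to the counts, combine into one table per well
toy %>%
  group_by(well) %>%
  summarize(n_cells = sum(count))
```

The resulting per-well totals are exactly what we need to turn the raw counts into percentages.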