---
title: "Visual Exploration of High-Throughput-Microscopy Data"
author: "Bernd Klaus, Andrzej Oleś, Mike Smith"
date: "`r doc_date()`"
bibliography: HTM_2016.bib
output:
BiocStyle::html_document:
toc: true
toc_float: true
highlight: tango
code_folding: hide
BiocStyle::pdf_document2:
toc: true
highlight: tango
```{r setup, include=FALSE}
library(knitr)
options(digits = 3, width = 80)
opts_chunk$set(echo = TRUE, tidy = FALSE, include = TRUE,
               dev = 'png', fig.width = 6, fig.height = 3.5,
               comment = ' ', dpi = 300, cache = TRUE)
set.seed(2016)
```
# Required packages and other preparations
```{r install_cellh5, eval=FALSE, echo=FALSE}
## this code chunk is disabled from running and displaying
## run in case you need to install/update the 'cellh5' package
devtools::install_github("CellH5/cellh5-R")
```
library("rmarkdown")
library("tidyverse")
library("openxlsx")
library("cellh5")
library("psych")
library("stringr")
library("factoextra")
# About the tutorial
In this tutorial, we will import a single plate from a high content screen performed on 96-well plates. The input data that we are going to use are class labels for each single cell. These classification results have been obtained using a machine learning algorithm based on the original image features. The data produced are similar to those in @Neumann_2010: each cell is classified into a mitotic phenotype class.
# Annotation import
We first import the annotation of the plate. This consists of a table that informs us about the content of each well on the plate. A well can be transfected with an siRNA targeting a certain gene, it can be empty, or it can contain scrambled siRNAs or negative controls.
```{r plate_annotation}
plate_map <- read.xlsx(xlsxFile = file.path("plate_mapping.xlsx"))
head(plate_map)
```
# Importing the raw data
We will now import the raw data. This data is stored in a variant of the HDF5 format called "CellH5", a restricted sub-format designed specifically to store data from high content screens. More information can be found in the paper by @Sommer_2013.
In the code below, we use the `cellh5` R package to import the data. The file `_all_positions.ch5` contains links to the other `.ch5` files that contain the full data of the plate. We are only interested in the predictions produced by the machine learning algorithm, so we extract only this part of the file.
```{r import_raw_data}
path <- file.path(switch(.Platform$OS.type, unix = "/g", windows = "Z:"),
                  "embo2016/htm2016/P12_Visual_Exploration/hdf5/_all_positions.ch5")
c5f <- CellH5(path)
c5_pos <- C5Positions(c5f, C5Plates(c5f))
predictions <- C5Predictions(c5f, c5_pos[[1]], mask = "primary__primary", as = "name")
```
# Tabulate the raw data
We now tabulate the raw data: we compute how many cells are assigned to each class for each well. The result is a data matrix, which contains the wells of the screen plate in the columns and the counts for the respective classes in the rows.
This is a typical example of a "wide" data table, where the variables contained in the data set are spread across multiple columns (here we show only the first six).
```{r tabulate_raw_data}
raw_data <- sapply(c5_pos, function(pos) {
  predictions <- C5Predictions(c5f, pos, mask = "primary__primary", as = "name")
  table(predictions)
})
raw_data[, 1:6]
```
# Reshaping the screen data and joining the plate annotation
We now reshape the input data, which is in a wide format, into a long format. For this, we first turn the row names into an explicit column and then "gather" all the columns representing wells. This turns all the columns that contain the per--well data into a single "measurement" column that is paired with a "key" column containing the well identifiers.
The result is a "long" data table, which contains one observation per row: in our case the number of times a cell was assigned to a specific class in every single well. Class in combination with well serves as our "key" here, and the class--count is the associated value.
We now want to join the annotation to this data table in the long format. Before we can do this, however, we need to solve a subtle problem: the well identifiers in the imported data differ from the identifiers in the annotation table, so we cannot easily join the two tables together.
We first need to strip the leading "W" from the well identifiers and replace the "P1" suffix with a "01" suffix. We do this using a regular expression. Regular expressions are a powerful tool for handling strings, and many good tutorials about them are available online.
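As a quick illustration (the well identifier below is made up for this purpose), the regular expression turns an identifier such as "WA01_P1" into "A01_01":

```{r regex_illustration}
str_replace("WA01_P1", "^W([A-H][0-9]{2})_P1", "\\1_01")
#> [1] "A01_01"
```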
We can now join the annotation to our long table and use the well as the joining key.
```{r reshape_data}
tidy_raw_data <- rownames_to_column(as.data.frame(raw_data), var = "class") %>%
  gather(key = "well", value = "count", -class)
sample_n(tidy_raw_data, 6)
```

```{r join_annotation}
tidy_raw_data$well <- str_replace(tidy_raw_data$well, "^W([A-H][0-9]{2})_P1", "\\1_01")

# join annotation
input_data <- left_join(tidy_raw_data, plate_map, by = c("well" = "Position"))
sample_n(input_data, 6)
```
# Using ggplot to create a PCA plot for the data
The data we have lives in a ten-dimensional space, as every well contains cells classified into one of ten different classes. In order to produce a succinct overview of the data, one tries to reduce its dimensionality. A popular way to do this is to compute new, artificial variables that are weighted sums of the original variables. The weights are chosen in such a way that the new variables are uncorrelated with each other and retain as much of the original variability (a proxy for information content) as possible. These new variables are called "principal components" (PCs) of the data.
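In symbols (the notation here is introduced purely for illustration), with original variables $x_1, \dots, x_{10}$, the $k$-th principal component is the weighted sum

$$\mathrm{PC}_k = w_{1k}\,x_1 + w_{2k}\,x_2 + \dots + w_{10,k}\,x_{10},$$

where the weight vectors are orthonormal and each successive component captures as much of the remaining variance as possible.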
Before we can compute the PCs, we have to make sure that the variables that we have obtained for every single well are normalized. As our data consists of the number of cells in each phenotypic category, a straightforward normalization consists of transforming the counts into percentages by dividing the data for each well by its total number of cells.
## Grouping, summarizing and data transformation
In the code chunk below, we use the `group_by()` function to split the dataset into groups according to the well ID. We then apply the function `sum()` to the counts of each well and use `summarize()` to obtain a table of counts per well. This is an example of a split-apply-combine strategy.
We can now join the table containing the sums to the original data and compute percentages using the sums.
As PCA works best on data that is on a normal distribution (z-score) scale, we perform a logit transformation to turn the percentages into z-scores. This is similar in spirit to a log transformation of intensity values.
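As a quick illustration (the proportions below are chosen arbitrarily), the `logit()` function from the `r CRANpkg("psych")` package maps proportions in $(0, 1)$ onto the whole real line via $\mathrm{logit}(p) = \log\big(p / (1 - p)\big)$:

```{r logit_illustration}
logit(c(0.01, 0.5, 0.99))
# -> approximately -4.6, 0.0, 4.6
```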
```{r group_and_transform}
no_cells_per_well <- input_data %>%
  group_by(well) %>%
  summarize(no_cells = sum(count))
head(no_cells_per_well)

data_with_sums <- left_join(input_data, no_cells_per_well)
data_for_PCA <- mutate(data_with_sums, perc = count / no_cells,
                       z_score = logit(perc))
head(data_for_PCA)
```
Here, we use the chaining/piping operator `%>%` to "pipe" the result of one computation into the next. This leads to more digestible code compared to, e.g., loops.
## Creating the PCA plot using ggplot2
We are now ready to create the PCA plot. For this, we first need to turn the input data into a wide data frame again by spreading the z-scores across columns.
We can then use the function `prcomp()` to compute the actual principal components. We also create a vector `genes`, giving us the gene that each of our siRNAs is targeting.
We then create a ggplot object by mapping the first principal component to the x--axis and the second one to the y--axis. We use the gene names as plotting symbols and color them according to the gene name (as we have multiple empty wells as well as multiple siRNAs targeting the same gene).
Furthermore, we specify that the aspect ratio of the y--axis to the x--axis should be equal to the ratio of the variance explained by PC2 to the variance explained by PC1, so that the axes represent the same units. This facilitates a correct interpretation of the PCA plot: PC1 has more variance than PC2, so the x--axis should be longer than the y--axis to reflect the difference in scale.
```{r pca_plot}
data_for_PCA <- data_for_PCA %>%
  dplyr::select(class, well, z_score) %>%
  spread(key = class, value = z_score)

PCA <- prcomp(data_for_PCA[, -1], center = TRUE, scale. = TRUE)

genes <- input_data %>%
  group_by(well) %>%
  dplyr::summarize(gene = unique(Gene.Symbol))
genes <- ifelse(is.na(genes$gene), "empty", genes$gene)

dataGG <- data.frame(PC1 = PCA$x[, 1], PC2 = PCA$x[, 2],
                     PC3 = PCA$x[, 3], PC4 = PCA$x[, 4],
                     genes)

pl <- (ggplot(dataGG, aes(x = PC1, y = PC2, color = genes))
       + geom_text(aes(label = genes), size = I(2))
       + coord_fixed(ratio = (PCA$sdev^2)[2] / (PCA$sdev^2)[1])
       + ggtitle("Principal components plot")
)
pl
```
We can see that the control wells cluster together, as do the wells targeting the same gene. Note that it is easy to turn this plot into an interactive version using `ggplotly()` from the `r CRANpkg("plotly")` package.
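For instance, a minimal sketch (assuming the plotly package is installed; not evaluated here):

```{r interactive_pca, eval=FALSE}
library("plotly")
# turn the static ggplot object into an interactive widget
ggplotly(pl)
```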
# Variable importance for the principal components
The first PC nicely separates wells containing various controls from the ones treated with siRNAs. As every component is simply a weighted sum of the original variables, we can inspect these weights (called "loadings") to see which classes "drive" the components and try to interpret what we find.
```{r loadings_plot}
loadings <- PCA$rotation[, 1:2]

loadings_gg <- loadings %>%
  as.data.frame() %>%
  rownames_to_column(var = "class") %>%
  dplyr::select(class, PC1, PC2) %>%
  gather(key = "comp", value = "loading", PC1:PC2)

ggplot(loadings_gg, aes(x = class, y = loading, fill = class)) +
  facet_wrap(~ comp) +
  geom_bar(stat = "identity", position = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(type = "qual", palette = "Set3")
```
We can see that, e.g., the "inter" and "map"/"prometa" classes as well as the "apo" class are important for PC1. The "map" and "prometa" classes combined define cells that are in mitotic delay/arrest, while the interphase class defines a control category.
So a possible explanation for PC1 would be that it separates wells with cells in mitotic arrest/delay (or apoptotic cells) from control wells with many cells in interphase (cf. Figure 1 of @Neumann_2010).
The second principal component seems to separate wells that contain mainly cells in ana--/metaphase from wells that predominantly contain cells with strange shape phenotypes.
# Correlation circles and PCs for hit calling
The loadings are closely related (up to a scaling factor) to the correlations of the original variables with the computed principal components. We can investigate this relationship by plotting the correlations of the original variables with the components in a circle. Arrows that are orthogonal to one of the axes correspond to variables that are uncorrelated with the respective principal component.
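To make the scaling factor explicit (a quick check added for illustration): since the input variables were standardized, the correlation of variable $j$ with component $k$ is the loading $w_{jk}$ multiplied by the standard deviation of PC$k$:

```{r variable_correlations}
# correlations of the (standardized) class variables with PC1 and PC2
var_cor <- PCA$rotation[, 1:2] %*% diag(PCA$sdev[1:2])
colnames(var_cor) <- c("PC1", "PC2")
head(var_cor)
```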
```{r correlation_circle}
fviz_pca_var(PCA, col.circle = "black", title = "Correlation Circle for the PCA") +
  coord_equal()
```
We see that "apo", "artefact", "map" and "meta" are highly positively correlated with PC1, while "inter" is strongly negatively correlated with PC1, confirming that PC1 identifies wells were there is a large proportion of cells which are stuck in mitosis or undergoing apoptosis.
PC2 is positively correlated with the strange shape phenotypes. Thus, we could potentially use the first and second PCs to call hits. For example, PLK1, which is very far from the controls in PC1 coordinates, is known to be important during the M--phase of the cell cycle. Thus, if this gene is inhibited, mitosis no longer works properly and the cells are stuck in the cell cycle or undergo apoptosis.
# Plate heatmap of apoptosis proportions
Heatmaps are a powerful way of visualizing large, matrix-like datasets and giving a quick overview of the patterns they might contain. There are a number of heatmap drawing functions in R; one that is convenient and produces good-looking output is the function `pheatmap` from the eponymous package.
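For completeness, a minimal sketch of how `pheatmap` could be applied to our count matrix (not evaluated here):

```{r pheatmap_sketch, eval=FALSE}
library("pheatmap")
# rows are classes, columns are wells; scale each class across wells
pheatmap(raw_data, scale = "row")
```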
However, here we will use ggplot to create the heatmap. We first join all the well IDs to the data, so that we can plot missing values in a uniform fashion. Then we extract the row (letters) and the column (numbers) from the well ID and finally map the rows to the y--axis, the columns to the x--axis and the color fill to the percentage values for apoptosis.
The heatmap shows that well D08 contains a high percentage of apoptotic cells compared to the other wells.
```{r plate_heatmap}
dat_rows <- toupper(letters[1:8])
dat_cols <- c(paste0("0", 1:9), 10:12)
wells <- data.frame(well = paste0(outer(dat_rows, dat_cols, paste0), "_01"),
                    stringsAsFactors = FALSE)

data_for_heatmap <- arrange(full_join(data_for_PCA, wells), well) %>%
  select(well, apo) %>%
  extract(well, into = c("row", "column"),
          regex = "([A-Z])([0-9]{2})", remove = FALSE) %>%
  mutate(perc_apoptosis = logistic(apo)) %>%
  mutate(row = factor(row,
                      levels = sort(unique(row), decreasing = TRUE)))

theme_set(theme_bw())
heatmap <- (ggplot(data_for_heatmap, aes(column, row))
            + geom_tile(aes(fill = perc_apoptosis))
            + scale_fill_distiller(type = "seq", palette = "RdYlBu"))
heatmap
```
# Background information
The following sections contain additional background information on the techniques used in this tutorial.
## Long and wide data tables, tidy data, split--apply--combine
In a nutshell, a dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways: Every value belongs to a variable and an observation.
A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
A tidy data frame organizes the data in such a way that each observation corresponds to a single line in the data set. This representation is often referred to as a "long" data table.
In general, a long data table is the most appropriate format for downstream analysis, although a wide representation is better for viewing the data. For a thorough discussion of this topic, see the paper by Hadley Wickham [@greycite53870].
In any case, we often need to move between the different shapes. The `r CRANpkg("tidyr")` package has two main functions that allow us to go back and forth (a toy example follows the list):

- `gather()` takes multiple columns and gathers them into key--value pairs: it makes "wide" data longer.
- `spread()` takes two columns (key and value) and spreads them into multiple columns: it makes "long" data wider.
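A toy example (the class names and counts below are made up, not from our screen):

```{r tidyr_toy_example}
wide <- tibble(well = c("A01", "A02"), inter = c(10, 20), mito = c(3, 7))
long <- gather(wide, key = "class", value = "count", -well)   # wide -> long
long
spread(long, key = "class", value = "count")                  # long -> wide again
```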
Having tidy data frames, we can employ a split--apply--combine strategy for data analysis. What this means is that we first group our individual observations by one or more factors (the split operation), then apply computations to each group, and finally combine the results into a new data frame that contains one line per group.
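Continuing with the toy `long` table created above, a minimal sketch of this pattern:

```{r split_apply_combine_toy}
long %>%
  group_by(well) %>%               # split: one group per well
  summarize(total = sum(count))    # apply sum(), combine into one row per group
```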
## The chaining operator
For this, we use the chaining (or piping) operator `%>%`. `x %>% f(y)` is simply `f(x, y)`, so one can use it to rewrite multiple operations in such a way that they can be read from left--to--right, top--to--bottom. A simple example will make that clear: we create two vectors and calculate the Euclidean distance between them. Instead of the usual way:
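A minimal sketch of the example the text describes (the vector values are chosen for illustration):

```{r euclidean_nested}
x1 <- 1:5
x2 <- 2:6
# the usual, nested way: read inside-out
sqrt(sum((x1 - x2)^2))
```

with the pipe, the same computation reads from left to right:

```{r euclidean_piped}
(x1 - x2)^2 %>% sum() %>% sqrt()
```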