Skip to content
Snippets Groups Projects
Select Git revision
  • 397e4a6933d1d7c17f21927227525ffa7f5e566b
  • master default
  • Malvika-patches
  • use-just-the-docs
4 results

release-drafter.yml

Blame
  • Tutorial_HTM_2016.Rmd 5.42 KiB
    title: "Visual Exploration of High--Throughput--Microscopy Data"
    author: "Bernd Klaus, Andrzej Oles, Mike Smith"
    date: "`r doc_date()`"
    output: 
        BiocStyle::html_document:
            toc: true
            toc_float: true
            highlight: tango
            code_folding: hide
        BiocStyle::pdf_document2:
            toc: true
            highlight: tango
    library(knitr)
    options(digits=3, width=80)
    opts_chunk$set(echo=TRUE,tidy=FALSE,include=TRUE,
                   dev='png', fig.width = 6, fig.height = 3.5, comment = '  ', dpi = 300,
    cache = TRUE)

    Required packages and other preparations

    devtools::install_github("aoles/cellh5-R")
    library(rmarkdown)
    library(tidyverse)
    library(openxlsx)
    library(cellh5)
    library(psych)
    library(stringr)
    library(splots)

    Annotation import

    data_path <- "~/p12_data"
    plate_map <- read.xlsx(xlsxFile = file.path(data_path, "plate_mapping.xlsx"))
    head(plate_map)

    Importing the raw data

    • importing using r Biocpkg("rhdf5")
    • possibly discuss the hdf5 format
    path <- file.path(data_path, "_all_positions.ch5")
    c5f <- CellH5(path)
    c5_pos <- C5Positions(c5f, C5Plates(c5f))
    predictions <- C5Predictions(c5f, c5_pos[[1]], mask = "primary__primary3", as = "name")
    
    c5_pos[["WB08_P1"]] <- NULL
    

    Extract raw data

    
    raw_data <- sapply(c5_pos, function(pos){
                    predictions <- C5Predictions(c5f, pos, mask = "primary__primary3", as = "name")
                    table(predictions)}
                   )                     
    save(raw_data, file = "raw_data.RData")
    
    • discuss score computation from phenotype classification results

    The concept of tidy data

    A lot of analysis time is spent on the process of cleaning and preparing the data. Data preparation is not just a first step, but must be repeated many over the course of analysis as new problems come to light or new data is collected. An often neglected, but important aspect of data cleaning is data tidying: structuring datasets to facilitate analysis.

    This "data tidying" includes the ability to move data between different different shapes.

    In a nutshell, a dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units.

    An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

    A tidy data frame now organizes the data in such a way that each observation corresponds to an single line in the data set. This is in general the most appropriate format for downstream analysis, although it might not be the most appropriate form for viewing the data.

    For a thorough discussion of this topic see the paper by Hadley Wickham - tidy data.

    Reshaping the screen data

    load("raw_data.RData")
    
    tidy_raw_data  <- rownames_to_column(as.data.frame(raw_data), var = "class") %>%
                       gather(key = "well", value = "count", WA01_P1:WC07_P1)
    
    tidy_raw_data$well <- str_replace(tidy_raw_data$well, "^W([A-H][0-9]{2})_P1", "\\1_01")
    
    #join annotation
    
    input_data <- left_join(tidy_raw_data, plate_map, by = c("well" = "Position"))
    
    

    Plotting in R: ggplot2

    Creating the PCA plot

    
    no_cells_per_well <- input_data %>%
                        group_by(well) %>%
                        summarize(no_cells = sum(count))
    
    data_with_sums <-  left_join(input_data, no_cells_per_well)
    
    # size_factors <- no_cells_per_well$no_cells /  geometric.mean(no_cells_per_well$no_cells)          
    
    data_for_PCA <- mutate(data_with_sums, perc = count / no_cells, 
                           z_score = logit(perc))
    
    data_for_PCA <- data_for_PCA %>% 
                    select(class, well, z_score) %>%
                    spread(key = class, value = z_score)
    
    PCA <- prcomp(data_for_PCA[, -1], center = TRUE, scale. = TRUE)
    
    
    genes <- input_data %>%
             group_by(well) %>%
             summarize(gene = unique(Gene.Symbol))
    
    genes <- ifelse(is.na(genes$gene), "empty", genes$gene)
    
    dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],
                        PC3 = PCA$x[,3], PC4 = PCA$x[,4],
                        genes)
    (qplot(PC1, PC2, data = dataGG, color =  genes, geom = "text",
           label = genes, asp = 1,
           main = "PC1 vs PC2, top variable genes", size = I(6))
    )
    

    Heatmap of apoptosis z--scores

    dat_rows = toupper(letters[1:8])
    dat_cols = c(paste0("0",seq(1:9)),seq(10,12))
    wells <- data.frame( well = paste0(outer(dat_rows, dat_cols, paste0), "_01"))
    full_data <- arrange(full_join(data_for_PCA, wells), well)
    
    plotScreen(list(logistic (full_data$Apoptosis)), ncol = 1, nx = 12, ny = 8, 
               main = "Apoptosis percentages",
               do.names = FALSE, legend.label = "percentage of apoptotic cells ", zrange = c(0,.4) )
    

    Other clustering methods, changing the ggplot2 plot

    sessionInfo()