--- title: "Introduction to _Bioconductor_" author: "Thomas Schwarzl adapted from Valerie Obenchain (valerie.obenchain@roswellpark.org)
Lori Shepherd (lori.shepherd@roswellpark.org)
Martin Morgan (martin.morgan@roswellpark.org)
Stanford University, Stanford, CA" output: slidy_presentation: theme: paper --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r packages, eval=TRUE, echo=FALSE, warning=FALSE, message=FALSE} suppressPackageStartupMessages({ library(BioC2016Introduction) library(SummarizedExperiment) library(airway) }) ``` # _Bioconductor_ ### Physically - Collection of 1211 software, 916 annotation and 293 experimental data R packages. - Web site (http://bioconductor.org) for package distribution and other resources. - Support site (https://support.bioconductor.org) for user questions. # _Bioconductor_ ### Conceptually - Analysis and comprehension of high throughput genomic data # Core principles: Volume of data ### Type of research question - Designed experiments - Population samples - ... # Core principles: Volume of data ### Technological artifacts - Differences in sequencing depth between samples - Bias in the genomic regions sampled # Reproducibility - Cisplatin-resistant non-small-cell lung cancer gene sets - Hsu et al. 2007 J Clin Oncol 25: [4350-4357](http://jco.ascopubs.org/content/25/28/4350.abstract) [retracted](http://jco.ascopubs.org/content/28/35/5229.long) ![](our_figures/HsuEtAl-F1-large-a.jpg) # Reproducibility - Baggerly & Coombes 2009 Ann Appl Stat [3: 1309-1334](http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1267453942) ![](our_figures/BaggerlyCoombes2009-fig2a.jpg) # Lessons - Record each step of the analysis - Coordinated manipulation of feature, sample, and assay data - Informative labels on visualizations # How to be reproducible? - Use software 'objects' that take care of some of the tedious book-keeping - Document our analysis in scripts and 'markdown' documents # Example: `SummarizedExperiment` ![](our_figures/SE_Description.png) # Example: `SummarizedExperiment` Underlying data is a matrix - Regions of interest (e.g., genes) x samples - `assay()` -- e.g., matrix of counts of reads overlapping genes # Example: `SummarizedExperiment` Include information about rows - `rowRanges()` -- gene identifiers, or _genomic ranges_ describing the coordinates of each gene # Example: `SummarizedExperiment` Include information about columns - `colData()` -- describing samples, experimental design, ... # Example: `SummarizedExperiment` ```{r airway-SummarizedExperiment, eval=F} library(airway) # An 'ExperimentData' package... data(airway) # ...with a sample data set... airway # ...that is a SummarizedExperiment head(assay(airway)) # contains a matrix of counts head(rowRanges(airway)) # information about the genes... colData(airway)[, 1:3] # ...and samples ## coordinated subsetting untrt <- airway[, airway$dex == 'untrt'] head(assay(untrt)) colData(untrt)[, 1:3] ``` # Packages! - Discover and navigate via [biocViews][] - Package 'landing page' - Title, author / maintainer, short description, citation, installation instructions, ..., download statistics - All user-visible functions have help pages, most with runnable examples - 'Vignettes' an important feature in Bioconductor -- narrative documents illustrating how to use the package, with integrated code - 'Workflows' make use of multiple packages for complete end-to-end analysis - 'Release' (every six months) and 'devel' branches - [Support site](https://support.bioconductor.org); [videos](https://www.youtube.com/user/bioconductor), [recent courses](http://bioconductor.org/help/course-materials/) # Visualization Inter-operability between packages made easier by using similar data structures Examples (details later) - `SummarizedExperiment` - `DNAStringSet` - `GenomicRanges` # Comprehension is more than statistical analysis ### Annotation - Mapping from technical to user-friendly identifiers - Assigning genes to pathways - Placing our results in the context of large-scale analyses # Comprehension is more than statistical analysis ### Objects - Represent complicated data types - Foster interoperability - S4 object system - Introspection: `methods()`, `getClass()`, `selectMethod()` - 'accessors' and other documented functions / methods for manipulation, rather than direct access to the object structure - Interactive help - `method?"substr,"` to select help on methods, `class?D` for help on classes # A sequence analysis package tour ![Alt Sequencing Ecosystem](our_figures/SequencingEcosystem.png) # BiocViews - Bioconductor packages are listed on the [biocViews][] page. - tags from a controlled vocabulary # Landing page 'landing page', e.g., for [GenomicRanges][]. - description, authors, and installation instructions. - references to scientific literature - vignettes and reference manual - indication of cross-platform availability - download statistics. # Installation A package needs to be installed once, using the instructions on the landing page. Once installed, the package can be loaded into an R session ```{r require} library(GenomicRanges) ``` and the help system queried interactively, as outlined above: ```{r help, eval=FALSE} help(package="GenomicRanges") vignette(package="GenomicRanges") vignette(package="GenomicRanges", "GenomicRangesHOWTOs") ?GRanges ``` # Types of Packages Now examples for popular packages for high-throughput analyses # Differential expression - Important packages for analysis of differential expression include [edgeR][] and [DESeq2][]; both have excellent vignettes for exploration. Additional research methods embodied in Bioconductor packages can be discovered by visiting the [biocViews][] web page, searching for the 'DifferentialExpression' view term, and narrowing the selection by searching for 'RNA seq' and similar. # ChIP-seq - Popular ChIP-seq packages include [csaw][] an d[DiffBind][] for comparison of peaks across samples, [ChIPQC][] for quality assessment, and [ChIPseeker][] for annotating results (e.g., discovering nearby genes). What other ChIP-seq packages are listed on the [biocViews][] page? # Variants - Working with called variants (VCF files) is facilitated by packages such as [VariantAnnotation][], [VariantFiltering][], [ensemblVEP][], and [SomaticSignatures][]; packages for calling variants include, e.g., [h5vc][] and [VariantTools][]. # Copy number variants - Several packages identify copy number variants from sequence data, including [cn.mops][]; from the [biocViews][] page, what other copy number packages are available? The [CNTools][] package provides some useful facilities for comparison of segments across samples. # Microbiome and Metagenomic analysis - Microbiome and metagenomic analysis is facilitated by packages such as [phyloseq][] and [metagenomeSeq][]. # Metabolomics, Chemoinformatics, Image analysis - Metabolomics, chemoinformatics, image analysis, and many other high-throughput analysis domains are also represented in Bioconductor; explore these via biocViews and title searches. # Genomic Ranges Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the [IRanges][] / [GenomicRanges][] infrastructure that we will encounter later in the course. # Sequences - The [Biostrings][] package is used to represent DNA and other sequences, with many convenient sequence-related functions. Check out the functions documented on the help page `?consensusMatrix`, for instance. Also check out the [BSgenome][] package for working with whole genome sequences, e.g., `?"getSeq,BSgenome-method"` # Alignments - The [GenomicAlignments][] package is used to input reads aligned to a reference genome. See for instance the `?readGAlignments` help page and `vigentte(package="GenomicAlignments", "summarizeOverlaps")` - [rtracklayer][]'s `import` and `export` functions can read in many common file types, e.g., BED, WIG, GTF, ..., in addition to querying and navigating the UCSC genome browser. Check out the `?import` page for basic usage. - The [ShortRead][] and [Rsamtools][] packages can be used for lower-level access to FASTQ and BAM files, respectively. Explore the [ShortRead vignette](http://bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf) and Scalable Genomics labs to see approaches to effectively processing the large files. # Visualization - The [Gviz][] package provides great tools for visualizing local genomic coordinates and associated data. - [epivizr][] drives the [epiviz](http://epiviz.cbcb.umd.edu/) genome browser from within R; [rtracklayer][] provides easy ways to transfer data to and manipulate UCSC browser sessions. - Additionl packages include [ggbio][], [OmicCircos][], ... # DNA or amino acid sequences _Biostrings_, _ShortRead_, _BSgenome_ # Classes - XString, XStringSet, e.g., DNAString (genomes), DNAStringSet (reads) # Methods - [Cheat sheat](http://bioconductor.org/packages/release/bioc/vignettes/Biostrings/inst/doc/BiostringsQuickOverview.pdf) - Manipulation, e.g., `reverseComplement()` - Summary, e.g., `letterFrequency()` - Matching, e.g., `matchPDict()`, `matchPWM()` # Related packages - [BSgenome][] - Whole-genome representations - Model and custom - [ShortRead][] - FASTQ files # Example - Whole-genome sequences are distrubuted by ENSEMBL, NCBI, and others as FASTA files; model organism whole genome sequences are packaged into more user-friendly `BSgenome` packages. The following calculates GC content across chr14. ```{r BSgenome-require, message=FALSE} require(BSgenome.Hsapiens.UCSC.hg19) chr14_range = GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"])) chr14_dna <- getSeq(Hsapiens, chr14_range) letterFrequency(chr14_dna, "GC", as.prob=TRUE) ``` # Ranges: _GenomicRanges_, _IRanges_ Ranges represent: - Data, e.g., aligned reads, ChIP peaks, SNPs, CpG islands, ... - Annotations, e.g., gene models, regulatory elements, methylated regions - Ranges are defined by chromosome, start, end, and strand - Often, metadata is associated with each range, e.g., quality of alignment, strength of ChIP peak Many common biological questions are range-based - What reads overlap genes? - What genes are ChIP peaks nearest? - ... # GenomicRanges The [GenomicRanges][] package defines essential classes and methods ### `GRanges` ![Alt ](our_figures/GRanges.png) # GenomicRanges ### `GRangesList` ![Alt ](our_figures/GRangesList.png) # GenomicRanges ### Range operations ![Alt Ranges Algebra](our_figures/RangeOperations.png) # Ranges - IRanges - `start()` / `end()` / `width()` - List-like -- `length()`, subset, etc. - 'metadata', `mcols()` - GRanges - 'seqnames' (chromosome), 'strand' - `Seqinfo`, including `seqlevels` and `seqlengths` # Intra-range methods - Independent of other ranges in the same object - GRanges variants strand-aware - `shift()`, `narrow()`, `flank()`, `promoters()`, `resize()`, `restrict()`, `trim()` - See `?"intra-range-methods"` # Inter-range methods - Depends on other ranges in the same object - `range()`, `reduce()`, `gaps()`, `disjoin()` - `coverage()` (!) - see `?"inter-range-methods"` # Between-range methods - Functions of two (or more) range objects - `findOverlaps()`, `countOverlaps()`, ..., `%over%`, `%within%`, `%outside%`; `union()`, `intersect()`, `setdiff()`, `punion()`, `pintersect()`, `psetdiff()` # Example ```{r ranges, message=FALSE, eval=F} require(GenomicRanges) gr <- GRanges("A", IRanges(c(10, 20, 22), width=5), "+") shift(gr, 1) # 1-based coordinates! range(gr) # intra-range reduce(gr) # inter-range coverage(gr) setdiff(range(gr), gr) # 'introns' ``` # IRangesList, GRangesList - List: all elements of the same type - Many *List-aware methods, but a common 'trick': apply a vectorized function to the unlisted representaion, then re-list grl <- GRangesList(...) orig_gr <- unlist(grl) transformed_gr <- FUN(orig) transformed_grl <- relist(, grl) # Aligned reads: _GenomicAlignments_, _Rsamtools_ Classes -- GenomicRanges-like behaivor - GAlignments, GAlignmentPairs, GAlignmentsList - SummarizedExperiment - Matrix where rows are indexed by genomic ranges, columns by a DataFrame. Methods - `readGAlignments()`, `readGAlignmentsList()` - Easy to restrict input, iterate in chunks - `summarizeOverlaps()` # Example - Find reads supporting the junction identified above, at position 19653707 + 66M = 19653773 of chromosome 14 ```{r bam-require, eval=F} require(GenomicRanges) require(GenomicAlignments) require(Rsamtools) ## our 'region of interest' roi <- GRanges("chr14", IRanges(19653773, width=1)) ## sample data require('RNAseqData.HNRNPC.bam.chr14') bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE) ## alignments, junctions, overlapping our roi paln <- readGAlignmentsList(bf) j <- summarizeJunctions(paln, with.revmap=TRUE) j_overlap <- j[j %over% roi] ## supporting reads paln[j_overlap$revmap[[1]]] ``` # Called variants: _VariantAnnotation_, _VariantFiltering_ Classes -- GenomicRanges-like behavior - VCF -- 'wide' - VRanges -- 'tall' Functions and methods - I/O and filtering: `readVcf()`, `readGeno()`, `readInfo()`, `readGT()`, `writeVcf()`, `filterVcf()` - Annotation: `locateVariants()` (variants overlapping ranges), `predictCoding()`, `summarizeVariants()` - SNPs: `genotypeToSnpMatrix()`, `snpSummary()` # Example - Read variants from a VCF file, and annotate with respect to a known gene model ```{r vcf, message=FALSE} ## input variants require(VariantAnnotation) fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") seqlevels(vcf) <- "chr22" ## known gene model require(TxDb.Hsapiens.UCSC.hg19.knownGene) coding <- locateVariants(rowRanges(vcf), TxDb.Hsapiens.UCSC.hg19.knownGene, CodingVariants()) head(coding) ``` # Related packages - [ensemblVEP][] - Forward variants to Ensembl Variant Effect Predictor - [VariantTools][], [h5vc][] - Call variants - [VariantFiltering][] - Filter variants using criteria such as coding consequence, MAF, ..., inheritance model # Integrated data representations: _SummarizedExperiment_ ![](our_figures/SE_Description.png) [SummarizedExperiment][] - 'feature' x 'sample' `assays()` - `colData()` data frame for desciption of samples - `rowRanges()` _GRanges_ / _GRangeList_ or data frame for description of features - `exptData()` to describe the entire object ```{r SummarizedExperiment} library(SummarizedExperiment) library(airway) data(airway) airway colData(airway) airway[, airway$dex %in% "trt"] ``` # Annotation: _org_, _TxDb_, _AnnotationHub_, _biomaRt_, ... - _Bioconductor_ provides extensive access to 'annotation' resources (see the [AnnotationData][] biocViews hierarchy); some interesting examples to explore during this lab include: - [biomaRt][], [PSICQUIC][], [KEGGREST][] and other packages for querying on-line resources; each of these have informative vignettes. - [AnnotationDbi][] is a cornerstone of the [Annotation Data][AnnotationData] packages provided by Bioconductor. - **org** packages (e.g., [org.Hs.eg.db][]) contain maps between different gene identifiers, e.g., ENTREZ and SYMBOL. The basic interface to these packages is described on the help page `?select` - **TxDb** packages (e.g., [TxDb.Hsapiens.UCSC.hg19.knownGene][]) contain gene models (exon coordinates, exon / transcript relationships, etc) derived from common sources such as the hg19 knownGene track of the UCSC genome browser. These packages can be queried, e.g., as described on the `?exonsBy` page to retrieve all exons grouped by gene or transcript. - **BSgenome** packages (e.g., [BSgenome.Hsapiens.UCSC.hg19][]) contain whole genomes of model organisms. - [VariantAnnotation][] and [ensemblVEP][] provide access to sequence annotation facilities, e.g., to identify coding variants; see the [Introduction to VariantAnnotation](http://bioconductor.org/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf) vignette for a brief introduction. - Take a quick look at the [annotation work flow](http://bioconductor.org/help/workflows/annotation/annotation/) on the Bioconductor web site. # Scalable computing 1. Efficient _R_ code - Vectorize! - Reuse others' work Know -- [DESeq2][], [GenomicRanges][], [Biostrings][], [dplyr][], [data.table][], [Rcpp][] 2. Iteration - Chunk-wise - `open()`, read chunk(s), `close()`. - e.g., `yieldSize` argument to `Rsamtools::BamFile()` 3. Restriction - Limit to columns and / or rows of interest - Exploit domain-specific formats, e.g., BAM files and `Rsamtools::ScanBamParam()` - Use a data base 4. Sampling - Iterate through large data, retaining a manageable sample, e.g., `ShortRead::FastqSampler()` 5. Parallel evaluation - **After** writing efficient code - Typically, `lapply()`-like operations - Cores on a single machine ('easy'); clusters (more tedious); clouds # Parallel evaluation in _Bioconductor_ - [BiocParallel][] -- `bplapply()` for `lapply()`-like functions, increasingly used by package developers to provide easy, standard way of gaining parallel evaluation. - [GenomicFiles][] -- Framework for working on groups of files, ranges, or ranges x files - Bioconductor [AMI][] (Amazon Machine Instance) including pre-configured StarCluster, and [docker] containers. # Resources _R_ / _Bioconductor_ - [Web site][Bioconductor] -- install, learn, use, develop _R_ / _Bioconductor_ packages - [Support](http://support.bioconductor.org) -- seek help and guidance; also [StackOverflow](http://stackoverflow.com/questions/tagged/r) for _R_ programming questions - [biocViews](http://bioconductor.org/packages/release/BiocViews.html) -- discover packages - Package landing pages, e.g., [GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html), including title, description, authors, installation instructions, vignettes (e.g., GenomicRanges '[How To](http://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)'), etc. - [Course](http://bioconductor.org/help/course-materials/) and other [help](http://bioconductor.org/help/) material (e.g., videos, EdX course, community blogs, ...) # Publications (General _Bioconductor_) - Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi: [10.1371/journal.pcbi.1003118][GRanges.bib] - Lawrence, M, and Morgan, M. 2014. Scalable Genomics with R and Bioconductor. Statistical Science 2014, Vol. 29, No. 2, 214-226. [http://arxiv.org/abs/1409.2864v1][Scalable.bib] Other - Lawrence, M. 2014. Software for Enabling Genomic Data Analysis. Bioc2014 conference [slides][Lawrence.bioc2014.bib]. [R]: http://r-project.org [Bioconductor]: http://bioconductor.org [GRanges.bib]: https://doi.org/10.1371/journal.pcbi.1003118 [Scalable.bib]: http://arxiv.org/abs/1409.2864 [Lawrence.bioc2014.bib]: http://bioconductor.org/help/course-materials/2014/BioC2014/Lawrence_Talk.pdf [AnnotationData]: http://bioconductor.org/packages/release/BiocViews.html#___AnnotationData [biocViews]: http://bioconductor.org/packages/release/BiocViews.html#___Software [AnnotationDbi]: http://bioconductor.org/packages/AnnotationDbi [AnnotationHub]: http://bioconductor.org/packages/AnnotationHub [BSgenome.Hsapiens.UCSC.hg19]: http://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19 [BSgenome]: http://bioconductor.org/packages/BSgenome [BiocParallel]: http://bioconductor.org/packages/BiocParallel [Biostrings]: http://bioconductor.org/packages/Biostrings [CNTools]: http://bioconductor.org/packages/CNTools [ChIPQC]: http://bioconductor.org/packages/ChIPQC [ChIPseeker]: http://bioconductor.org/packages/ChIPseeker [DESeq2]: http://bioconductor.org/packages/DESeq2 [DiffBind]: http://bioconductor.org/packages/DiffBind [GenomicAlignments]: http://bioconductor.org/packages/GenomicAlignments [GenomicFiles]: http://bioconductor.org/packages/GenomicFiles [GenomicRanges]: http://bioconductor.org/packages/GenomicRanges [Homo.sapiens]: http://bioconductor.org/packages/Homo.sapiens [IRanges]: http://bioconductor.org/packages/IRanges [KEGGREST]: http://bioconductor.org/packages/KEGGREST [PSICQUIC]: http://bioconductor.org/packages/PSICQUIC [Rsamtools]: http://bioconductor.org/packages/Rsamtools [Rsubread]: http://bioconductor.org/packages/Rsubread [ShortRead]: http://bioconductor.org/packages/ShortRead [SomaticSignatures]: http://bioconductor.org/packages/SomaticSignatures [SummarizedExperiment]: http://bioconductor.org/packages/SummarizedExperiment [TxDb.Hsapiens.UCSC.hg19.knownGene]: http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene [VariantAnnotation]: http://bioconductor.org/packages/VariantAnnotation [VariantFiltering]: http://bioconductor.org/packages/VariantFiltering [VariantTools]: http://bioconductor.org/packages/VariantTools [biomaRt]: http://bioconductor.org/packages/biomaRt [cn.mops]: http://bioconductor.org/packages/cn.mops [csaw]: http://bioconductor.org/packages/csaw [edgeR]: http://bioconductor.org/packages/edgeR [ensemblVEP]: http://bioconductor.org/packages/ensemblVEP [h5vc]: http://bioconductor.org/packages/h5vc [limma]: http://bioconductor.org/packages/limma [metagenomeSeq]: http://bioconductor.org/packages/metagenomeSeq [org.Hs.eg.db]: http://bioconductor.org/packages/org.Hs.eg.db [org.Sc.sgd.db]: http://bioconductor.org/packages/org.Sc.sgd.db [phyloseq]: http://bioconductor.org/packages/phyloseq [rtracklayer]: http://bioconductor.org/packages/rtracklayer [snpStats]: http://bioconductor.org/packages/snpStats [Gviz]: http://bioconductor.org/packages/Gviz [epivizr]: http://bioconductor.org/packages/epivizr [ggbio]: http://bioconductor.org/packages/ggbio [OmicCircos]: http://bioconductor.org/packages/OmicCircos [dplyr]: https://cran.r-project.org/package=dplyr [data.table]: https://cran.r-project.org/package=data.table [Rcpp]: https://cran.r-project.org/package=Rcpp [AMI]: http://bioconductor.org/help/bioconductor-cloud-ami/ [docker]: http://bioconductor.org/help/docker/