MosaiCatcher pipeline
Structural variant calling from single-cell Strand-seq data - summarized in a Snakemake pipeline.
For Info on Strand-seq see
- Falconer E et al., 2012 (doi: 10.1038/nmeth.2206)
- Sanders AD et al., 2017 (doi: 10.1038/nprot.2017.029)
Overview of this workflow
This workflow uses Snakemake to execute all steps of MosaiCatcher in order. The starting point are single-cell BAM files from Strand-seq experiments and the final output are SV predictions in a tabular format as well as in a graphical representation. To get to this point, the workflow goes through the following steps:
- Read binning in fixed-width genomic windows of 100kb via mosaicatcher
- Normalization of coverage with respect to a reference sample (included)
- Strand state detection (included)
- Haplotype resolution via StrandPhaseR
- Multi-variate segmentation of cells (mosaicatcher)
- Bayesian classification of segmentation to find SVs using mosaiClassifier (included)
- Visualization of results using custom R plots
Setup
While there are different options for the installation/execution (see below), the following steps are shared by all of them:
-
Download this pipeline.
git clone https://github.com/friendsofstrandseq/pipeline
-
Configuration. In the config file
Snake.config.json
, specify the reference genome, the chromosomes you would like to analyse. -
Add your data. Create a subdirectory
bam/sampleName/
and place the single-cell BAM files (one file per cell) in there:cd pipeline mkdir -p bam/NA12878 cp /path/to/my/cells/*.bam bam/NA12878/ cp /path/to/my/cells/*.bam.bai bam/NA12878
-
BAM requirements. Make sure that all BAM files belonging to the same sample contain the same read group sample tag (
@RG SM:thisIsTheSampleName
), that they are indexed and have PCR duplicates marked (the latter is especially relevant for single-cell data). -
SNP call set. If available, specify SNV calls (VCF) in
Snake.config.json
. Note that the sample name in the VCF must match the one in the BAM files.
Note: Multiple samples can be run simultaneously. Just create different subfolders
below bam/
. The same settings from the Snake.config.json
config files are
applied to all samples.
Installation using the Bioconda environment
-
Install MiniConda: In case you do not have Conda yet, it is easiest to just install MiniConda.
-
Create environment:
conda env create -n strandseqnation -f conda-environment.yml source activate strandseqnation
-
Install mosaicatcher and update the file paths pointing to it (and to several R scripts) in
Snake.config.json
. -
Run
snakemake
SNP calls
The pipeline will run simple SNV calling using samtools and bcftools on Strand-seq. If you already have
SNV calls, you can avoid that by entering your VCF files into the pipeline.
To so, make sure the files are tabix-indexed
and specifigy them inside the Snake.config.json
file:
"snv_calls" : {
"NA12878" : "path/to/snp/calls.vcf.gz"
},
Note on how Conda environment was created
The provided environment conda-environment.yml
is the result of running:
conda create -y -n strandseqnation snakemake=5.3 r-ggplot2=2.2 python bcftools bioconductor-biobase bioconductor-biocgenerics bioconductor-biocinstaller bioconductor-biocparallel bioconductor-biostrings bioconductor-bsgenome bioconductor-bsgenome.hsapiens.ucsc.hg38 bioconductor-delayedarray bioconductor-fastseg bioconductor-genomeinfodb bioconductor-genomeinfodbdata bioconductor-genomicalignments bioconductor-genomicranges bioconductor-iranges bioconductor-rsamtools bioconductor-rtracklayer bioconductor-s4vectors bioconductor-summarizedexperiment bioconductor-xvector bioconductor-zlibbioc htslib r-data.table r-peer samtools boost boost-cpp cairo certifi cmake pandas pango pip pixman python-dateutil r-assertthat r-base r-bh r-bitops r-colorspace r-cowplot r-curl r-devtools r-dichromat r-digest r-futile.logger r-futile.options r-git2r r-gridextra r-gtable r-hexbin r-httr r-jsonlite r-labeling r-lambda.r r-lattice r-lazyeval r-magrittr r-mass r-matrix r-matrixstats r-memoise r-mime r-munsell r-openssl r-plyr r-r6 r-rcolorbrewer r-rcpp r-rcurl r-reshape2 r-rlang r-rstudioapi r-scales r-snow r-stringi r-stringr r-tibble r-viridislite r-whisker r-withr r-xml setuptools numpy r-dplyr scipy whatshap freebayes r-mc2d r-pheatmap bioconductor-complexheatmap r-gplots
conda env export > conda-environment.yml
This avoids some annoying version downgrades that sometimes happen when creating an environment incrementally.