MosaiCatcher pipeline
Structural variant calling from single-cell Strand-seq data - summarized in a Snakemake pipeline.
Overview of this workflow
This workflow uses Snakemake to execute all steps of MosaiCatcher in order. The starting point is a set of single-cell BAM files from Strand-seq experiments, and the final output consists of SV predictions in a tabular format as well as in a graphical representation. To get there, the workflow goes through the following steps:
- Read binning in fixed-width genomic windows of 100kb via mosaicatcher
- Normalization of coverage with respect to a reference sample (included)
- Strand state detection (included)
- Haplotype resolution via StrandPhaseR
- Multivariate segmentation of cells (mosaicatcher)
- Bayesian classification of segmentation to find SVs using mosaiClassifier (included)
- Visualization of results using custom R plots (included)
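To illustrate the first step, here is a minimal sketch of strand-specific read counting in fixed-width 100 kb windows. It is not mosaicatcher's implementation: the function name `bin_reads` and the `(chrom, pos, is_reverse)` input format are assumptions made for this example only.

```python
from collections import defaultdict

BIN_WIDTH = 100_000  # 100 kb windows, as in the binning step listed above


def bin_reads(reads, bin_width=BIN_WIDTH):
    """Count reads per fixed-width window, separately for each strand.

    `reads` is an iterable of (chrom, pos, is_reverse) tuples with 0-based
    start positions (a hypothetical input format for this sketch).
    Returns {(chrom, bin_start): [forward_count, reverse_count]}.
    """
    counts = defaultdict(lambda: [0, 0])
    for chrom, pos, is_reverse in reads:
        # Integer division maps every position to the start of its window.
        bin_start = (pos // bin_width) * bin_width
        counts[(chrom, bin_start)][1 if is_reverse else 0] += 1
    return dict(counts)
```

Keeping the two strands in separate counters is what makes the later strand state detection possible; a real implementation would read alignments from the BAM files and would typically also filter duplicates and low-quality reads.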
Installation
Choose one of three ways to install and run this workflow:
- Install software using Bioconda
  - Installation instructions here
  - Configure Snake.config.json to match your software installation
  - Add your data and configuration as described below (Setup)
- Run Snakemake together with a Singularity image
  - Instructions here
  - Requires Snakemake and Singularity. No further installations needed.
  - Add your data and configuration as described below (Setup)
- Run a complete example data set via Docker
  - Instructions here
  - Requires Docker (tested in version 18.09)
  - Runs out of the box. No further setup required
Setup
- Download this pipeline
    git clone https://github.com/friendsofstrandseq/pipeline
    cd pipeline
- Add your single-cell data
  Create a subdirectory bam/sampleName/. Your Strand-seq BAM files of this sample go into two folders:
  - all/ for the total set of BAM files
  - selected/ for the subset of successful Strand-seq libraries (possibly hard-linked to all/)
  It is important to follow these rules for single-cell data:
  - One BAM file per cell
  - Sorted and indexed
  - Timestamp of index files must be newer than that of the BAM files
  - Each BAM file must contain a read group (@RG) with a common sample name (SM), which must match the folder name (sampleName above)
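The rules for single-cell BAM files can be checked before starting the workflow. The following is a minimal sketch using only the Python standard library; it is not part of the pipeline, and the function names are made up for this example. The header text is assumed to come from `samtools view -H cell.bam`. Note that the `SO` header tag is only a declaration of the sort order, and that this sketch accepts an index with the same timestamp as its BAM file.

```python
import os


def find_index(bam_path):
    """Return the path of an existing index for bam_path, or None.

    Both common naming schemes are tried: cell.bam.bai and cell.bai.
    """
    candidates = [bam_path + ".bai", os.path.splitext(bam_path)[0] + ".bai"]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None


def check_index(bam_path):
    """Check that an index exists and is not older than the BAM file."""
    index_path = find_index(bam_path)
    if index_path is None:
        return [f"no index found for {bam_path}"]
    if os.path.getmtime(index_path) < os.path.getmtime(bam_path):
        return [f"index is older than {bam_path}"]
    return []


def check_header(header_text, sample_name):
    """Check declared sort order and read-group sample name in SAM header text."""
    problems = []
    sort_order = None
    samples = set()
    for line in header_text.splitlines():
        fields = line.split("\t")
        # Header fields after the record type have the form TAG:value.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@HD":
            sort_order = tags.get("SO")
        elif fields[0] == "@RG" and "SM" in tags:
            samples.add(tags["SM"])
    if sort_order != "coordinate":
        problems.append("header does not declare coordinate sort order")
    if not samples:
        problems.append("no @RG line with an SM tag")
    elif samples != {sample_name}:
        problems.append(
            f"SM tag(s) {sorted(samples)} do not match folder name {sample_name}"
        )
    return problems
```

Both functions return a list of problems, so an empty list means the file passed. Running them over every file in bam/sampleName/all/ and bam/sampleName/selected/ would catch the most common setup mistakes early.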
- Adapt the config file
  Set options to describe your data in Snake.config.json. If you use Singularity, please use Snake.config-singularity.json instead. Here is a digest of the relevant variables:
  - reference: The path to the reference genome (FASTA file). Must be indexed (FAI)
  - chromosomes: Specify which chromosomes should be analyzed. By default chr1-chr22 and chrX
  - R_reference: The version of the reference genome being used by R scripts. The version should match the one in reference. Note that the Singularity/Docker images only ship with BSgenome.Hsapiens.UCSC.hg38
  - paired_end: Specifies whether you are using paired-end reads or not (default: true)
  - snv_calls: SNV call set for your sample - see below. Must be vcf.gz and indexed via tabix
  - snv_sites_to_genotype: SNV positions to be genotyped - see below. Must be vcf.gz and indexed via tabix
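The file requirements in this list (FAI index for the reference, vcf.gz plus tabix index for the SNV files) can be verified with a short script. This is a sketch under assumptions, not part of the pipeline: the function names are hypothetical, and since the exact value type of the SNV entries is not spelled out here, the code accepts an empty value, a single path, or a `{sample: path}` mapping.

```python
import json
import os


def as_paths(value):
    """Normalize a config entry (empty, one path, or {sample: path}) to a list."""
    if not value:
        return []
    if isinstance(value, dict):
        return list(value.values())
    return [value]


def validate_config(config):
    """Return a list of problems found in a parsed config dictionary."""
    problems = []
    reference = config.get("reference", "")
    if not os.path.exists(reference):
        problems.append(f"reference not found: {reference!r}")
    elif not os.path.exists(reference + ".fai"):
        problems.append("reference is not indexed (missing .fai)")
    for key in ("snv_calls", "snv_sites_to_genotype"):
        for path in as_paths(config.get(key)):
            if not path.endswith(".vcf.gz"):
                problems.append(f"{key}: {path} is not a .vcf.gz file")
            # tabix writes .tbi by default and .csi when asked to.
            elif not any(os.path.exists(path + ext) for ext in (".tbi", ".csi")):
                problems.append(f"{key}: {path} has no tabix index")
    return problems


def validate_config_file(path="Snake.config.json"):
    """Load a JSON config file and validate it."""
    with open(path) as handle:
        return validate_config(json.load(handle))
```

Calling `validate_config_file()` from the pipeline directory returns an empty list if all referenced files and indexes are in place.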
- SNV calls for haplotype separation
  The MosaiCatcher pipeline requires a set of genotyped single nucleotide variants (SNVs) to distinguish haplotypes, including the assignment of individually sequenced strands of a chromosome to a certain chromosome-length haplotype.
  Depending on which prior information is available, the workflow will
  - Phase SNVs using StrandPhaseR (when given a set of SNV calls)
  - Genotype SNVs in your data (when given potential SNV sites)
  - Call and genotype SNVs on your sample (when not given SNV sites)
  To provide a list of SNV sites, set snv_sites_to_genotype in the config file; to provide final SNV calls, set snv_calls. Must be set per sample:
    "snv_calls" : {
        "NA12878" : "path/to/snp/calls.vcf.gz"
    },
  Make sure the files are tabix-indexed.
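The three cases above amount to a simple per-sample decision. The sketch below shows that decision; it is not the rule logic used in the Snakefile. The function name and the returned labels are hypothetical, and the handling of `snv_sites_to_genotype` as either a single path or a per-sample mapping is an assumption.

```python
def snv_strategy(sample, config):
    """Pick the SNV handling step for one sample, following the three cases."""
    calls = config.get("snv_calls") or {}
    if calls.get(sample):
        # Final SNV calls are available: only phasing is needed.
        return "phase_given_calls"
    sites = config.get("snv_sites_to_genotype")
    if isinstance(sites, dict):
        sites = sites.get(sample)
    if sites:
        # Candidate positions are available: genotype them in the data.
        return "genotype_given_sites"
    # No prior information: call and genotype SNVs from the sample itself.
    return "call_and_genotype"
```

With the config excerpt shown above, `snv_strategy("NA12878", config)` selects the phasing branch, while a sample missing from `snv_calls` falls through to genotyping or de novo calling.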
References
For information on Strand-seq, see Falconer E et al., 2012 (doi: 10.1038/nmeth.2206)