MosaiCatcher pipeline
Structural variant calling from single-cell Strand-seq data, implemented as a Snakemake pipeline.
For information on Strand-seq, see:
- Falconer E et al., 2012 (doi: 10.1038/nmeth.2206)
- Sanders AD et al., 2017 (doi: 10.1038/nprot.2017.029)
Overview of this workflow
This workflow uses Snakemake to execute all steps of MosaiCatcher in order. The starting point is a set of single-cell BAM files from Strand-seq experiments, and the final output is a set of SV predictions, both in tabular format and as a graphical representation. The workflow goes through the following steps:
- Read binning in fixed-width genomic windows of 100 kb via mosaicatcher
- Normalization of coverage with respect to a reference sample (included)
- Strand state detection (included)
- Haplotype resolution via StrandPhaseR
- Multivariate segmentation of cells (mosaicatcher)
- Bayesian classification of segmentation to find SVs using mosaiClassifier (included)
- Visualization of results using custom R plots (included)
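To make the first step more concrete, here is a minimal Python sketch of strand-specific read binning in fixed-width 100 kb windows. It is not the mosaicatcher implementation: the function name `bin_reads` and the input format (a list of `(position, strand)` tuples for a single chromosome) are assumptions made purely for illustration.

```python
from collections import Counter

BIN_SIZE = 100_000  # fixed window width of 100 kb, as in the step list above


def bin_reads(reads, bin_size=BIN_SIZE):
    """Count reads per (window index, strand).

    `reads` is an iterable of (position, strand) tuples for one chromosome,
    with 0-based positions and strand given as "+" or "-".
    This input format is a hypothetical one chosen for the example.
    """
    counts = Counter()
    for position, strand in reads:
        window = position // bin_size  # index of the window containing the read
        counts[(window, strand)] += 1
    return counts


if __name__ == "__main__":
    example = [(10, "+"), (99_999, "-"), (100_000, "+"), (250_000, "+")]
    for (window, strand), n in sorted(bin_reads(example).items()):
        start = window * BIN_SIZE
        print(f"{start}-{start + BIN_SIZE}\t{strand}\t{n}")
```

Keeping the two strands in separate counters matters here because the later steps (strand state detection, segmentation) work on per-strand coverage rather than on total coverage.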
Installation
Choose one of three ways to install and run this workflow:
- Install software using Bioconda
  - Installation instructions here
  - Configure Snake.config.json according to your installation
  - Add your single-cell data according to the specifications given below (Setup)
- Run Snakemake together with a Singularity image
  - Instructions here
  - Requires Snakemake and Singularity. No further installations required
  - Add your single-cell data according to the specifications given below (Setup)
- Run a complete example data set via Docker
  - Requires Docker (tested in version 18.09)
  - Includes a whole data set of 96 RPE-1 cells
  - Example shown here
Setup
- Download this pipeline
      git clone https://github.com/friendsofstrandseq/pipeline
      cd pipeline
- Add your single-cell data
  Create a subdirectory bam/sampleName/. Your Strand-seq BAM files of this sample go into two folders:
  - all/ for the total set of BAM files
  - selected/ for the subset of successful Strand-seq libraries (possibly hard-linked to all/)
  It is important to follow these rules for single-cell data:
  - One BAM file per cell
  - Sorted and indexed
  - Timestamp of the index files must be newer than that of the BAM files
  - Each BAM file must contain a read group (@RG) with a common sample name (SM), which must match the folder name (sampleName above)
- Adapt the config file
  In Snake.config.json you can specify:
  - SNV call set (VCF), if available. Note that the sample name in the VCF must match the one in the BAM files.
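Since the sample name has to agree between the BAM read groups, the folder name and any VCF, it can help to read the SM tag directly from a BAM header. The following is a minimal sketch using only the Python standard library; it relies on the BAM layout from the SAM/BAM specification (magic bytes `BAM\1`, a 32-bit header length, then the plain-text header) and on BGZF being readable by `gzip`. The function names are hypothetical and not part of the pipeline.

```python
import gzip
import struct


def bam_sample_names(bam_path):
    """Return the set of SM values found in the @RG header lines of a BAM file."""
    with gzip.open(bam_path, "rb") as handle:
        if handle.read(4) != b"BAM\x01":
            raise ValueError(f"{bam_path} does not look like a BAM file")
        (text_length,) = struct.unpack("<i", handle.read(4))
        raw_text = handle.read(text_length)
    # The header text may be NUL-padded, so strip trailing NUL bytes.
    header_text = raw_text.rstrip(b"\x00").decode("utf-8", errors="replace")

    samples = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                samples.add(field[3:])
    return samples


def check_sample_name(bam_path, expected):
    """True if the @RG lines carry exactly one SM value and it equals `expected`."""
    return bam_sample_names(bam_path) == {expected}
```

For a file such as bam/sampleName/all/cell1.bam, `check_sample_name(path, "sampleName")` should then return `True`. In practice the same information is available from `samtools view -H`.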
Note: Multiple samples can be run simultaneously. Just create different subfolders below bam/. The same settings from the Snake.config.json config file are applied to all samples.
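The folder layout and index rules from the Setup steps above can be checked before starting a run. The sketch below walks every sample folder under bam/ and reports missing folders, missing or outdated index files, and files in selected/ that do not appear in all/. It assumes that index files are named like `cell.bam.bai`; that naming, and the function names, are assumptions for this example and not something the pipeline prescribes here.

```python
import os


def check_sample_dir(sample_dir):
    """Return a list of human-readable problems found in one bam/<sample>/ folder."""
    problems = []
    bam_names = {}
    for sub in ("all", "selected"):
        folder = os.path.join(sample_dir, sub)
        if not os.path.isdir(folder):
            problems.append(f"missing folder: {folder}")
            bam_names[sub] = set()
            continue
        names = sorted(n for n in os.listdir(folder) if n.endswith(".bam"))
        bam_names[sub] = set(names)
        for name in names:
            bam = os.path.join(folder, name)
            index = bam + ".bai"  # assumed index naming (cell.bam.bai)
            if not os.path.exists(index):
                problems.append(f"missing index: {index}")
            elif os.path.getmtime(index) < os.path.getmtime(bam):
                problems.append(f"index older than BAM: {index}")
    # selected/ is meant to be a subset of all/
    for name in sorted(bam_names["selected"] - bam_names["all"]):
        problems.append(f"in selected/ but not in all/: {name}")
    return problems


def check_all_samples(bam_root="bam"):
    """Map each sample folder below `bam_root` to its list of problems."""
    report = {}
    for sample in sorted(os.listdir(bam_root)):
        sample_dir = os.path.join(bam_root, sample)
        if os.path.isdir(sample_dir):
            report[sample] = check_sample_dir(sample_dir)
    return report


if __name__ == "__main__":
    if os.path.isdir("bam"):
        for sample, problems in check_all_samples("bam").items():
            print(sample, "OK" if not problems else problems)
```

An index that is older than its BAM file can usually be fixed by re-running the indexing step or by touching the index file.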
SNP calls
The pipeline will run simple SNV calling using samtools and bcftools on the Strand-seq data. If you already have SNV calls, you can skip that step by providing your VCF files to the pipeline.
To do so, make sure the files are tabix-indexed and specify them inside the Snake.config.json file:
    "snv_calls" : {
        "NA12878" : "path/to/snp/calls.vcf.gz"
    },
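Before launching the workflow, the `snv_calls` entries can be sanity-checked with a short script. The sketch below assumes that each key under `snv_calls` is the sample name, that the tabix index sits next to the VCF with a `.tbi` suffix, and that the VCF is bgzip-compressed (which Python's `gzip` module can read). The function names are hypothetical and not part of the pipeline.

```python
import gzip
import json
import os


def vcf_samples(vcf_path):
    """Return the sample names from the #CHROM header line of a compressed VCF."""
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed (CHROM..FORMAT); samples follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without seeing a #CHROM line
    return []


def check_snv_calls(config_path="Snake.config.json"):
    """Return a list of problems found in the snv_calls section of the config."""
    with open(config_path) as handle:
        config = json.load(handle)

    problems = []
    for sample, vcf in config.get("snv_calls", {}).items():
        if not os.path.exists(vcf):
            problems.append(f"{sample}: VCF not found: {vcf}")
            continue
        if not os.path.exists(vcf + ".tbi"):  # assumed index name
            problems.append(f"{sample}: missing tabix index for {vcf}")
        if sample not in vcf_samples(vcf):
            problems.append(f"{sample}: sample name not present in {vcf}")
    return problems
```

An empty list from `check_snv_calls()` means that every listed VCF exists, has an index file next to it, and contains a sample column named like its config key.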
Installation using Singularity/Docker
Will be updated soon.