Newer
Older

====================================
Structural variant calling from single-cell Strand-seq data - summarized in a [Snakemake](https://bitbucket.org/snakemake/snakemake) pipeline.
* Falconer E *et al.*, 2012 (doi: [10.1038/nmeth.2206](https://doi.org/10.1038/nmeth.2206))
* Sanders AD *et al.*, 2017 (doi: [10.1038/nprot.2017.029](https://doi.org/10.1038/nprot.2017.029))
## Overview of this workflow
This workflow uses [Snakemake](https://bitbucket.org/snakemake/snakemake) to
execute all steps of MosaiCatcher in order. The starting point are single-cell
BAM files from Strand-seq experiments and the final output are SV predictions in
a tabular format as well as in a graphical representation. To get to this point,
the workflow goes through the following steps:
1. Read binning in fixed-width genomic windows of 100kb via [mosaicatcher](https://github.com/friendsofstrandseq/mosaicatcher)
2. Normalization of coverage with respect to a reference sample (included)
3. Strand state detection (included)
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
4. Haplotype resolution via [StrandPhaseR](https://github.com/daewoooo/StrandPhaseR)
5. Multi-variate segmentation of cells ([mosaicatcher](https://github.com/friendsofstrandseq/mosaicatcher))
6. Bayesian classification of segmentation to find SVs using mosaiClassifier (included)
7. Visualization of results using custom R plots
## Setup
While there are different options for the installation/execution (see below),
the following steps are shared by all of them:
1. **Download this pipeline.**
```
git clone https://github.com/friendsofstrandseq/pipeline
```
2. **Configuration.** In the config file `Snake.config.json`, specify the reference genome,
the chromosomes you would like to analyse.
3. **Add your data.** Create a subdirectory `bam/sampleName/` and place the single-cell BAM files (one file per cell) in there:
```
cd pipeline
mkdir -p bam/NA12878
cp /path/to/my/cells/*.bam bam/NA12878/
cp /path/to/my/cells/*.bam.bai bam/NA12878
```
4. **BAM requirements.** Make sure that all BAM files belonging to the same sample
contain the same read group sample tag (`@RG SM:thisIsTheSampleName`), that they
are indexed and have PCR duplicates marked (the latter is especially relevant for
single-cell data).
5. **SNP call set.** If available, specify SNV calls (VCF) in `Snake.config.json`.
Note that the sample name in the VCF must match the one in the BAM files.
**Note:** Multiple samples can be run simultaneously. Just create different subfolders
below `bam/`. The same settings from the `Snake.config.json` config files are
applied to all samples.
## Installation using the Bioconda environment
1. **Install MiniConda:**
In case you do not have Conda yet, it is easiest to just install
[MiniConda](https://conda.io/miniconda.html).
2. **Create environment:**
```
conda env create -n strandseqnation -f conda-environment.yml
source activate strandseqnation
```
3. **Install [mosaicatcher](https://github.com/friendsofstrandseq/mosaicatcher)**
and update the file paths pointing to it (and to several R scripts) in
`Snake.config.json`.
4. **Run** `snakemake`
The pipeline will run simple SNV calling using [samtools](https://github.com/samtools/samtools) and [bcftools](https://github.com/samtools/bcftools) on Strand-seq. If you **already have
SNV calls**, you can avoid that by entering your VCF files into the pipeline.
To so, make sure the files are [tabix](https://github.com/samtools/tabix)-indexed
and specifigy them inside the `Snake.config.json` file:
```
"snv_calls" : {
"NA12878" : "path/to/snp/calls.vcf.gz"
},
```
## Note on how Conda environment was created
The provided environment `conda-environment.yml` is the result of running:
```
conda create -y -n strandseqnation snakemake=5.3 r-ggplot2=2.2 python bcftools bioconductor-biobase bioconductor-biocgenerics bioconductor-biocinstaller bioconductor-biocparallel bioconductor-biostrings bioconductor-bsgenome bioconductor-bsgenome.hsapiens.ucsc.hg38 bioconductor-delayedarray bioconductor-fastseg bioconductor-genomeinfodb bioconductor-genomeinfodbdata bioconductor-genomicalignments bioconductor-genomicranges bioconductor-iranges bioconductor-rsamtools bioconductor-rtracklayer bioconductor-s4vectors bioconductor-summarizedexperiment bioconductor-xvector bioconductor-zlibbioc htslib r-data.table r-peer samtools boost boost-cpp cairo certifi cmake pandas pango pip pixman python-dateutil r-assertthat r-base r-bh r-bitops r-colorspace r-cowplot r-curl r-devtools r-dichromat r-digest r-futile.logger r-futile.options r-git2r r-gridextra r-gtable r-hexbin r-httr r-jsonlite r-labeling r-lambda.r r-lattice r-lazyeval r-magrittr r-mass r-matrix r-matrixstats r-memoise r-mime r-munsell r-openssl r-plyr r-r6 r-rcolorbrewer r-rcpp r-rcurl r-reshape2 r-rlang r-rstudioapi r-scales r-snow r-stringi r-stringr r-tibble r-viridislite r-whisker r-withr r-xml setuptools numpy r-dplyr scipy whatshap freebayes r-mc2d r-pheatmap bioconductor-complexheatmap r-gplots
conda env export > conda-environment.yml
```
This avoids some annoying version downgrades that sometimes happen when creating an environment incrementally.