Commit 6677c7ab authored by Robin Erich Muench's avatar Robin Erich Muench

Merge branch 'master' of git.embl.de:rmuench/metaSNP

parents 9dbc0e7e ed95c0bc
@@ -39,11 +39,6 @@ a) Change the SAMTOOLS variable in the **SETUPFILE** to your path to Samtools.
http://samtools.sourceforge.net/
We assume that ``samtools`` is in your system path variable.
You can set this variable by putting the following line into your .bashrc:
export PATH=/path/2/samtools:$PATH
b) Change the BOOST variable in the **SETUPFILE** to your path to BOOST library.
If you don't have Boost, you can download it [here](http://www.boost.org/users/download/):
@@ -60,69 +55,110 @@ II. Compilation:
1) run `make` in the parent directory of metaSNP to compile qaTools and the snpCaller.
usage: make
III. Environment Variables:
----------------
We assume the metaSNP parent directory, ``samtools`` and ``python`` are in your system path variable.
You can set these variables temporarily by running the following commands, or permanently by putting these lines into your .bashrc:
export PATH=/path/2/metaSNP:$PATH
export PATH=/path/2/python:$PATH
export PATH=/path/2/samtools:$PATH
Note: Replace '/path/2' with the corresponding absolute path.
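For instance, assuming metaSNP was cloned to /home/user/tools/metaSNP and samtools sits in /home/user/tools/samtools (hypothetical paths), the .bashrc lines and a quick check could look like this:

    # hypothetical install locations; adjust to your system
    export PATH=/home/user/tools/metaSNP:$PATH
    export PATH=/home/user/tools/python/bin:$PATH
    export PATH=/home/user/tools/samtools:$PATH

    # verify that the tools are now found on the PATH
    command -v samtools
    command -v python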
Required Files:
===============
* **'all\_samples'** = a list of all BAM files, one /path/2/sample.bam per line (avoid duplicates!)
* **'ref_db'** = the reference database in fasta format (e.g. multi-sequence fasta)
* **'gen\_pos'** = a list with start and end positions for each sequence (format: sequence_id start end)
* **TODO! Format 'db\_ann'** = [optional] a gene annotation file for the reference in [general feature format](http://www.ensembl.org/info/website/upload/gff.html) (GFF).
Workflow:
=========
**__Note:__ Each part is dependent on the successful completion of the previous one.**
## Required Files:
* **'all\_samples'** = a list of all BAM files, one /path/2/sample.bam per line (avoid duplicates!); see the sketch below
* **'ref_db'** = the reference database in fasta format (e.g. multi-sequence fasta)
* **'gen\_pos'** = a list with start and end positions for each sequence (format: sequence_id start end)
* **TODO! Format 'db\_ann'** = [optional] a gene annotation file for the reference in [general feature format](http://www.ensembl.org/info/website/upload/gff.html) (GFF).
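As an illustration, the 'all\_samples' file is plain text with one absolute BAM path per line (the paths below are hypothetical); the 'gen\_pos' format is sketched under the workload-balancing step further down:

    /data/my_project/sample1.bam
    /data/my_project/sample2.bam
    /data/my_project/sample3.bam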
## 1. Initiate a new project
usage: ./NewProject.sh project_dir
usage: metaSNP_NEW project_dirname
Generates a structured results directory for your project.
## 2. Part I: Pre-processing [optional]
Note: This part can be skipped if the workload balancing has already been computed.
Note: Subsequent SNP filtering depends on these coverage estimations.
This part can be skipped if you only want the 'raw' SNP output and do not intend to balance the workload, or if pre-processing has already been performed for this dataset.
### a) run createCoverageComputationRun.sh
### a) run metaSNP_COV
usage: ./createCoverageComputationRun.sh all_samples project_dir/ > toRunCov
usage: metaSNP_COV project_dir/ all_samples
The script generates a list of commandline jobs.
Run these commandline jobs before proceeding with the next step.
Run each commandline job before proceeding with the next step.
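A minimal sketch of this step, assuming a project directory created as my_project/ and the job list captured in a file called toRunCov (the tutorial below follows the same pattern):

    $ metaSNP_COV my_project/ all_samples > toRunCov
    $ bash toRunCov    # or submit each line of toRunCov to your cluster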
### b) run createOptSplit
### b) run metaSNP_OPT
Helper script for workload balancing (reference genome splitting).
usage: ./createOptSplit project_dir/ [nr_splits]
usage: metaSNP_OPT project_dir/ genome_def nr_splits[int]
Parameters
Required:
project_dir/ = the projects result directory
genome_def FILE = Contig ranges in BED format. (Fields: Contig_id, contigStart, contigEnd)
Optional:
nr_splits INT = number of splits for job parallelization (range: 1-100) [default: 10]
Note: This step requires an appropriate reference genome_definitions file.
Note: This step requires an appropriate reference genome_definitions file (included in our database).
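For orientation, a genome definitions file lists one contig range per line; contig names and coordinates below are made up, and the file shipped with our database (e.g. db/Genomev9_definitions) already has the correct format:

    contig_0001    0    153432
    contig_0002    0    98211
    contig_0003    0    201876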
## 3. Part II: SNP calling
Note: Requires pre-processing to be computed or you use an already existing one (e.g. src/bestsplits).
Note: Can be run as a single job independently of the pre-processing, or with an already existing genome split.
### a) createSNPrunningBatch.sh
usage: ./createSNPrunningBatch.sh [options] project_dir/ all_samples > toRunSNP
### a) metaSNP_SNP
Parameters:
Required:
project_dir DIR = the project directory.
all_samples FILE = a list of bam files, one file per line.
Optional:
-l, splits/ DIR = use existing bestsplits DIR if pre-processing was omitted.
-f, ref_db FILE = the reference sequence FASTA file.
-a, db_ann FILE = the gene annotation file.
usage: metaSNP_SNP [options] project_dir/ all_samples ref_db
The script generates a list of commandline jobs.
Submit each commandline to your GE or compute locally.
Parameters:
Required:
project_dir DIR = the project directory.
all_samples FILE = a list of bam files, one file per line.
ref_db FILE = the reference multi-sequence FASTA file used for the alignments.
Optional:
-a, db_ann FILE = database annotation.
-l, splits/ DIR = bestsplits DIR for job parallelization (pre-processing).
Note: Alternatively, use the ``-l splits/`` option to call SNPs for specific species, contig regions (BED) or single positions (contig_id pos). Unlisted contigs/pos are skipped.
The script generates a list of commandline jobs. Submit each commandline to your GE or compute locally.
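As a sketch, the resulting job list can either be run locally with bash or submitted job by job to a Grid Engine queue; the qsub loop below is a generic SGE example, not a metaSNP command:

    $ metaSNP_SNP tutorial/ tutorial/all_samples db/RepGenomesv9.fna > runSNPcall
    $ bash runSNPcall                                                      # run everything locally
    $ while read cmd; do echo "$cmd" | qsub -cwd -V; done < runSNPcall     # or one SGE job per line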
## 4. Part III: Post-Processing (Filtering & Analysis)
### a) Filtering:
Note: requires SNP calling (Part II) to be done!
> the script **src/filtering.py** comes with an elaborate help message
usage: python src/filtering.py
Caution: Perform this step separately for individual SNPs and population SNPs.
usage: metaSNP_filtering.py
positional arguments:
perc_FILE input file with horizontal genome (taxon) coverage (breadth) per sample (percentage covered)
cov_FILE input file with average genome (taxon) coverage
(depth) per sample (average number reads per site)
snp_FILE input files from SNP calling
all_samples list of input BAM files, one per line
output_dir/ output folder
optional arguments:
-h, --help show this help message and exit
-p PERC, --perc PERC Coverage breadth: Horizontal coverage cutoff
(percentage taxon covered) per sample (default: 40.0)
-c COV, --cov COV Coverage depth: Average vertical coverage cutoff per
taxon, per sample (default: 5.0)
-m MINSAMPLES, --minsamples MINSAMPLES
Minimum number of samples required to pass the coverage
cutoffs per genome (default: 2)
-s SNPC, --snpc SNPC FILTERING STEP II: SNP coverage cutoff (default: 5.0)
-i SNPI, --snpi SNPI FILTERING STEP II: SNP occurrence (incidence) cutoff
within samples_of_interest (default: 0.5)
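A usage sketch with non-default cutoffs, reusing the file names from the tutorial below:

    $ metaSNP_filtering.py -p 60 -c 10 -m 4 -s 10 -i 0.8 \
        tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab \
        tutorial/snpCaller/called_SNPs.best_split_* \
        tutorial/all_samples tutorial/filtered/pop/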
### b) Analysis:
> TODO: include R scripts for pairwise distance and subspecies computation
@@ -133,6 +169,8 @@ Example Tutorial
## 1. Run the setup & compilation steps and download the provided reference database.
$ getRefDB.sh
## 2. Go to the EXAMPLE directory and download the samples with the getSamplesScript.sh
$ cd EXAMPLE
@@ -140,7 +178,7 @@ Example Tutorial
## 3. Initiate a new project in the parent directory
$ ./NewProject.sh tutorial
$ metaSNP_New tutorial
## 4. Generate the 'all_samples' file
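One hypothetical way to create it, assuming the downloaded BAM files ended up in a samples/ folder inside EXAMPLE (adjust the path to wherever getSamplesScript.sh placed them):

    $ find "$(pwd)/samples" -name "*.bam" > tutorial/all_samples
    $ cat tutorial/all_samples    # sanity check: one absolute path per line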
@@ -148,41 +186,38 @@ Example Tutorial
## 5. Prepare and run the coverage estimation
$ ./createCovComputationRun.sh tutorial/ tutorial/all_samples > runCoverage
$ metaSNP_COV tutorial/ tutorial/all_samples > runCoverage
$ bash runCoverage
## 6. Perform a workload balancing step for runtime optimization.
$ ./createOptSplit tutorial/ db/Genomev9_definitions 5
$ metaSNP_OPT tutorial/ db/Genomev9_definitions 5
$ bash runCoverage
## 7. Prepare and run the SNP calling step
$ ./createSNPrunningBatch.sh tutorial/ tutorial/all_samples > runSNPcall
$ metaSNP_SNP tutorial/ tutorial/all_samples db/RepGenomesv9.fna -a db/RefOrganismDB_v9_gene.clean -l tutorial/bestsplits/ > runSNPcall
$ bash runSNPcall
## 8. Run the post processing / filtering steps
### a) Compute allele frequencies for each position that passes the given thresholds.
$ ./filtering.py tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab tutorial/snpCaller/called_SNPs.best_split_* tutorial/all_samples tutorial/filtered/pop/
$ metaSNP_filtering.py tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab tutorial/snpCaller/called_SNPs.best_split_* tutorial/all_samples tutorial/filtered/pop/
### b) Compute pair-wise distances between samples on their SNP profiles and create a PCoA plot.
TODO!
Basic usage
===========
If you are interested in using the pipeline in a more manual way (for example the metaSNP caller stand-alone and without a gene annotation file), feel free to explore the src/ directory.
You will find scripts as well as the binaries for qaCompute and the metaSNP caller in their corresponding directories (src/qaCompute and src/snpCaller) post compilation via make.
Basic usage (tools and scripts)
===============================
If you are interested in using the pipeline in a more manual way (for example the metaSNP caller stand-alone), feel free to explore the src/ directory.
You will find scripts as well as the binaries for qaCompute and the metaSNP caller in their corresponding directories (src/qaCompute and src/snpCaller) post compilation.
metaSNP caller
--------------
Calls SNPs from samtools pileup format and generates two outputs.
Additional information and options (tools and scripts):
---------------------------------------------------------
### metaSNP caller
Calls SNPs from samtools pileup format and generates two outputs:
usage: ./snpCall [options] < stdin.mpileup > std.out.popSNPs
Options:
-f, faidx indexed reference genome.
@@ -191,7 +226,7 @@ Calls SNPs from samtools pileup format and generates two outputs:
Note: Expecting samtools mpileup as standard input
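A minimal stand-alone sketch, assuming a faidx-indexed reference called ref_db.fna and a single BAM file (the exact option set may differ; run ./snpCall without arguments for the full list):

    $ samtools faidx ref_db.fna
    $ samtools mpileup -f ref_db.fna sample.bam | ./snpCall -f ref_db.fna > sample.popSNPs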
#### __Output__
### __Output__
1. Population SNPs (pSNPs):
Population-wide variants that occur with a frequency of 1 % at positions with at least 4x coverage.
@@ -199,11 +234,12 @@ Population wide variants that occur with a frequency of 1 % at positions with at
Non-population variants that occur with a frequency of 10 % at positions with at least 10x coverage.
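For example, at a position covered by 200 reads, a variant needs at least 2 supporting reads (1 %) to qualify as a pSNP, and at least 20 supporting reads (10 %) to qualify as an iSNP.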
### [qaCompute](https://github.com/CosteaPaul/qaTools)
[qaCompute](https://github.com/CosteaPaul/qaTools)
-------------------------------------------------
Computes normal and span coverage from a bam/sam file.
Also counts unmapped and sub-par quality reads.
#### __Parameters:__
### __Parameters:__
m - Compute median coverage for each contig/chromosome.
Will make running a bit slower. Off by default.
@@ -238,8 +274,8 @@ Also counts unmapped and sub-par quality reads.
For more info on the parameters, run ./qaCompute
### filtering.py
metaSNP_filtering.py
--------------------
usage: metaSNP filtering [-h] [-p PERC] [-c COV] [-m MINSAMPLES] [-s SNPC]
[-i SNPI]
perc_FILE cov_FILE snp_FILE [snp_FILE ...]