Commit 6012fddc authored by Robin Erich Muench

Merge branch 'master' of git.embl.de:rmuench/metaSNP

parents 95089c6b 25f9cb25
......@@ -28,73 +28,76 @@ Dependencies
Setup & Compilation
==================
===================
MetaSNP is mainly implemented in C/C++ and needs to be compiled.
I. Setup:
---------
I. Setup
--------
a) Change the SAMTOOLS variable in the **SETUPFILE** to your path to Samtools.
If you don't have samtools, you can find it [here](http://samtools.sourceforge.net/):
http://samtools.sourceforge.net/
We assume that ``samtools`` is in your system path variable.
You can set this variable by putting the following line into your bashrc.
export PATH=/path/2/samtools:$PATH
b) Change the BOOST variable in the **SETUPFILE** to your path to BOOST library.
If you don't have boost, you can find boost [here](http://www.boost.org/users/download/):
http://www.boost.org/users/download/
c) Set up your own reference database or acquire the database we use.
In order to acquire our database, run the provided script:
c) Set up your own **reference database** or acquire the database we use.
In order to acquire our database, run the provided script in the parent directory:
usage: ./getRefDB.sh
II. Compilation:
----------------
1) run *make* in the parent directory of metaSNP to compile qaTools and the snpCaller.
1) run `make` in the parent directory of metaSNP to compile qaTools and the snpCaller.
usage: make
Usage:
======
Required Files:
----------------
* a list of all BAM files to be used, one /path/to/sample.bam per line ('all_samples')
* the reference genome (fasta or multi-sequence fasta)
* genome/contig positions list (the provided one in 'src/genome_definitions' corresponds to our freeze9!)
* [optional] a gene annotation file for the reference genome (f.i. db/RefOrganismDB\_v9\_gene.clean)
===============
* **'all\_samples'** = a list of all BAM files, one /path/2/sample.bam per line (avoid duplicates!)
* **'ref_db'** = the reference database in fasta format (f.i. multi-sequence fasta)
* **'gen\_pos'** = a list with start and end positions for each sequence (format: sequence_id start end)
* **TODO! Format 'db\_ann'** = [optional] a gene annotation file for the reference in [general feature format](http://www.ensembl.org/info/website/upload/gff.html) (GFF).
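The file formats listed above can be checked before starting a run. The following is a minimal sketch, not part of metaSNP: the function names are hypothetical, and it assumes that 'gen\_pos' fields are separated by whitespace (tab or space).

```python
from collections import Counter


def read_sample_list(text):
    """Return BAM paths from an 'all_samples' listing, one path per line."""
    # Blank lines are skipped; surrounding whitespace is ignored.
    return [line.strip() for line in text.splitlines() if line.strip()]


def find_duplicates(paths):
    """Return the paths that are listed more than once, in first-seen order."""
    counts = Counter(paths)
    return [p for p in dict.fromkeys(paths) if counts[p] > 1]


def parse_gen_pos(text):
    """Parse 'sequence_id start end' lines into (id, start, end) tuples.

    Assumption: fields are whitespace-separated.
    """
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        seq_id, start, end = line.split()
        records.append((seq_id, int(start), int(end)))
    return records
```

Calling `find_duplicates(read_sample_list(open("all_samples").read()))` should return an empty list for a clean sample list.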
Workflow:
---------
=========
**__Note:__ Each part depends on the successful completion of the previous one.**
### 1. Initiate a NewProject with NewProject.sh
## 1. Initiate a NewProject with NewProject.sh
usage: ./NewProject.sh 'project_dir'
usage: ./NewProject.sh project_dir
Generates a structured results directory for your project.
### 2. Part I: Pre-processing [optional]
## 2. Part I: Pre-processing [optional]
Note: This part can be skipped if the workload balancing has already been computed.
#### a) run createCoverageComputationRun.sh
### a) run createCoverageComputationRun.sh
usage: ./createCoverageComputationRun.sh all_samples project_dir/ > toRunCov
The script generates a list of commandline jobs.
Run these commandline jobs before proceeding with the next step.
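If no grid engine is available, the generated job list can be run locally, one line at a time. The sketch below is an illustration only: `run_job_list` is a hypothetical helper, and it assumes that every non-empty line of the job file is a complete shell command.

```python
import subprocess


def run_job_list(path):
    """Run each non-empty line of a job list through the shell, in order.

    Returns a list of (command, return_code) pairs and stops at the
    first command that exits with a non-zero status.
    """
    results = []
    with open(path) as handle:
        for line in handle:
            command = line.strip()
            if not command:
                continue
            # shell=True so that redirections inside a job line still work.
            completed = subprocess.run(command, shell=True)
            results.append((command, completed.returncode))
            if completed.returncode != 0:
                break
    return results
```

For example, `run_job_list("toRunCov")` would execute the coverage jobs sequentially and report which command failed, if any.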
#### b) run createOptSplit
### b) run createOptSplit
Helper script for workload balancing (reference genome splitting).
usage: ./createOptSplit project_dir/ [nr_splits]
Note: This step requires an appropriate reference genome_definitions file in src/.
Note: This step requires an appropriate reference genome_definitions file.
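To illustrate what workload balancing means here, the sketch below distributes weighted reference sequences over a fixed number of splits so that the totals are roughly equal. It is a generic greedy heuristic (heaviest item first, always into the lightest split) and is not claimed to be the algorithm that createOptSplit implements; the function name and the choice of weights are assumptions.

```python
import heapq


def balance_splits(weights, nr_splits):
    """Greedily assign named items to nr_splits bins, heaviest first.

    weights: dict mapping a sequence id to its (assumed) workload weight.
    Returns a list of nr_splits lists of sequence ids.
    """
    # Each heap entry is (current total weight of the bin, bin index).
    heap = [(0, i) for i in range(nr_splits)]
    heapq.heapify(heap)
    bins = [[] for _ in range(nr_splits)]
    # Sort by descending weight; ties are broken by name for determinism.
    for name, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        total, index = heapq.heappop(heap)
        bins[index].append(name)
        heapq.heappush(heap, (total + weight, index))
    return bins
```

With weights `{"a": 10, "b": 7, "c": 5, "d": 4}` and two splits, the bins end up with totals of 14 and 12.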
### 3. Part II: SNP calling
## 3. Part II: SNP calling
Note: Requires the pre-processing to have been computed; alternatively, use an already existing one (e.g. src/bestsplits).
### a) createSNPrunningBatch.sh
......@@ -113,31 +116,77 @@ Note: Requires pre-processing to be computed or you use an already existing one
The script generates a list of commandline jobs.
Submit each commandline to your GE or compute locally.
### 4. Part III: Post-Processing (Filtering & Analysis)
## 4. Part III: Post-Processing (Filtering & Analysis)
#### a) Filtering:
### a) Filtering:
Note: requires SNP calling (Part II) to be done!
> script in **src/filtering.py** with elaborate help message
usage: python src/filtering.py
#### b) Analysis:
### b) Analysis:
> TODO: include R scripts for pairwise distance and subspecies computation
Tools and scripts (additional information):
==========================================
__metaSNP caller__
-------------------
Example Tutorial
================
## 1. Run the setup & compilation steps and acquire the provided reference database.
## 2. Go to the EXAMPLE directory and acquire the samples with the getSamplesScript.sh
$ cd EXAMPLE
$ ./getSamplesScript.sh
## 3. Initiate a new project in the parent directory
$ ./NewProject.sh tutorial
## 4. Generate the 'all_samples' file
$ find `pwd`/EXAMPLE/samples -name "*.bam" > tutorial/all_samples
## 5. Prepare and run the coverage estimation
$ ./createCovComputationRun.sh tutorial/ tutorial/all_samples > runCoverage
$ bash runCoverage
## 6. Perform the work load balancing for parallelization into five jobs.
$ ./createOptSplit tutorial/ db/Genomev9_definitions 5
$ bash runCoverage
## 7. Prepare and run the SNP calling step
$ ./createSNPrunningBatch.sh tutorial/ tutorial/all_samples > runSNPcall
$ bash runSNPcall
## 8. Run the post processing / filtering steps
Compute allele frequencies for each position that passes the given thresholds.
$ ./filtering.py tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab tutorial/snpCaller/called_SNPs.best_split_* tutorial/all_samples tutorial/filtered/pop/
Basic usage
===========
If you want to use the pipeline in a more manual way (for example, the metaSNP caller standalone and without a gene annotation file), feel free to explore the src/ directory.
After compilation via make, you will find the scripts as well as the binaries for qaCompute and the metaSNP caller in their corresponding directories (src/qaCompute and src/snpCaller).
Additional information and options (tools and scripts):
---------------------------------------------------------
### metaSNP caller
Calls SNPs from samtools pileup format and generates two outputs:
usage: ./snpCall [options] < stdin.mpileup
usage: ./snpCall [options] < stdin.mpileup > std.out.popSNPs
Options:
-f, faidx indexed reference genome .
-f, faidx indexed reference genome.
-g, gene annotation file.
-i, the gene annotation file.
-i, individual SNPs.
Note: Expecting samtools mpileup as standard input
#### __Output__
......@@ -147,14 +196,12 @@ Population wide variants that occur with a frequency of 1 % at positions with at
2. Individual specific SNPs (iSNPs):
Non population variants, that occur with a frequency of 10 % at positions with at least 10x coverage.
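The frequency and coverage criteria can be read as a simple two-part test per position. The sketch below is an illustration, not the snpCaller code: it assumes that "frequency" means the variant allele count divided by the coverage at that position, and that both cutoffs are inclusive. The iSNP values (10 % at 10x) come from the text above.

```python
def passes_threshold(allele_count, coverage, min_freq, min_cov):
    """True if coverage reaches min_cov and the allele frequency reaches min_freq."""
    if coverage <= 0 or coverage < min_cov:
        return False
    # Assumed definition: frequency = variant allele count / coverage.
    return allele_count / coverage >= min_freq


def is_isnp_candidate(allele_count, coverage):
    """Apply the iSNP criteria stated above: 10 % frequency at >= 10x coverage."""
    return passes_threshold(allele_count, coverage, min_freq=0.10, min_cov=10)
```

A position with 20x coverage and 2 variant reads would pass under these assumptions, while the same 2 reads at 9x coverage would not.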
__qaCompute__
--------------
### [qaCompute](https://github.com/CosteaPaul/qaTools)
Computes normal and span coverage from a bam/sam file.
Also counts unmapped and sub-par quality reads.
### __Parameters:__
#### __Parameters:__
m - Compute median coverage for each contig/chromosome.
Will make running a bit slower. Off by default.
......@@ -186,4 +233,39 @@ Also counts unmapped and sub-par quality reads.
Because mappers sometimes break the headers or simply don't output them,
this is provided as a non-kosher way around it. Use with care!
For more info on the parameters try ./qaCompute
\ No newline at end of file
For more info on the parameters try ./qaCompute
### filtering.py
usage: metaSNP filtering [-h] [-p PERC] [-c COV] [-m MINSAMPLES] [-s SNPC]
[-i SNPI]
perc_FILE cov_FILE snp_FILE [snp_FILE ...]
all_samples output_dir/
metaSNP filtering
positional arguments:
perc_FILE input file with horizontal genome (taxon) coverage
(breadth) per sample (percentage covered)
cov_FILE input file with average genome (taxon) coverage
(depth) per sample (average number reads per site)
snp_FILE input files from SNP calling
all_samples list of input BAM files, one per line
output_dir/ output folder
optional arguments:
-h, --help show this help message and exit
-p PERC, --perc PERC Coverage breadth: Horizontal coverage cutoff
(percentage taxon covered) per sample (default: 40.0)
-c COV, --cov COV Coverage depth: Average vertical coverage cutoff per
taxon, per sample (default: 5.0)
-m MINSAMPLES, --minsamples MINSAMPLES
Minimum number of sample that have to pass the
filtering criteria in order to write an output for the
representative Genome (default: 2)
-s SNPC, --snpc SNPC FILTERING STEP II: SNP coverage cutoff (default: 5.0)
-i SNPI, --snpi SNPI FILTERING STEP II: SNP occurence (incidence) cutoff
within samples_of_interest (default: 0.5)
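The first filtering step described by `-p`, `-c` and `-m` can be sketched as follows. This is a simplified illustration, not the code of filtering.py: the function names are hypothetical, the cutoffs are assumed to be inclusive, and the defaults are the ones shown in the help message above.

```python
def samples_passing(breadth, depth, min_perc=40.0, min_cov=5.0):
    """Return the samples whose breadth and depth both reach the cutoffs.

    breadth: dict sample -> percentage of the genome covered.
    depth:   dict sample -> average coverage depth.
    """
    return [
        sample
        for sample in breadth
        if breadth[sample] >= min_perc and depth.get(sample, 0.0) >= min_cov
    ]


def keep_genome(breadth, depth, min_samples=2, **cutoffs):
    """True if at least min_samples samples pass the coverage cutoffs."""
    return len(samples_passing(breadth, depth, **cutoffs)) >= min_samples
```

A genome covered well in only one sample would be dropped with the default `min_samples=2`, since no output is written for it.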