Setup & Compilation
===================

MetaSNP is mainly implemented in C/C++ and needs to be compiled.

I. Setup
--------

a) Change the SAMTOOLS variable in the **SETUPFILE** to your path to Samtools.
If you don't have samtools, you can find it [here](http://samtools.sourceforge.net/).

b) Change the BOOST variable in the **SETUPFILE** to your path to the BOOST library.
If you don't have boost, you can find it [here](http://www.boost.org/users/download/).
An example SETUPFILE is sketched after step c).

c) Set up your own **reference database** or acquire the database we use.
In order to acquire our database, run the provided script in the parent directory:

    usage: ./getRefDB.sh

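For orientation, the two entries in the **SETUPFILE** from steps a) and b) might look like the following; the shell-style assignment syntax and the paths are assumptions, so adjust them to your local installations:

    # hypothetical SETUPFILE contents
    SAMTOOLS=/path/2/samtools
    BOOST=/path/2/boost
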
II. Compilation
---------------

1) Run `make` in the parent directory of metaSNP to compile qaTools and the snpCaller.

    usage: make

Required Files
==============

* **'all\_samples'** = a list of all BAM files, one /path/2/sample.bam per line (avoid duplicates!); format examples for these files follow this list
* **'ref\_db'** = the reference database in fasta format (e.g. a multi-sequence fasta)
* **'gen\_pos'** = a list with start and end positions for each sequence (format: sequence_id start end)
* **'db\_ann'** = [optional] a gene annotation file for the reference in [general feature format](http://www.ensembl.org/info/website/upload/gff.html) (GFF)

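To make the expected formats concrete, here are minimal examples of 'all_samples' and 'gen_pos'; all paths, sequence IDs, and coordinates are invented for illustration:

    # 'all_samples': one BAM path per line
    /path/2/samples/sampleA.bam
    /path/2/samples/sampleB.bam

    # 'gen_pos': sequence_id start end
    genome1.contig1 1 153471
    genome2.contig1 1 2984210
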
Workflow
========

**Note:** Each part depends on the successful completion of the previous one.

## 1. Initiate a new project with NewProject.sh

    usage: ./NewProject.sh project_dir

Generates a structured results directory for your project.

## 2. Part I: Pre-processing [optional]

Note: This part can be skipped if the workload balancing has already been computed.

### a) run createCoverageComputationRun.sh

    usage: ./createCoverageComputationRun.sh all_samples project_dir/ > toRunCov

The script generates a list of command-line jobs.
Run these command-line jobs before proceeding with the next step, for example as sketched below.

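A minimal sketch of executing the job list, assuming (as in the tutorial below) that each line of toRunCov is an independent shell command:

    $ bash toRunCov                            # run all jobs serially on the local machine
    $ xargs -P 4 -I{} bash -c '{}' < toRunCov  # or, e.g., 4 jobs in parallel (GNU xargs)
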
### b) run createOptSplit

Helper script for workload balancing (reference genome splitting).

    usage: ./createOptSplit project_dir/ [nr_splits]

Note: This step requires an appropriate reference genome_definitions file.

## 3. Part II: SNP calling

Note: Requires the pre-processing step to be completed, or an already existing workload split (e.g. src/bestsplits).

### a) createSNPrunningBatch.sh

The script generates a list of command-line jobs.
Submit each command line to your grid engine (GE) or compute locally, for example as follows.

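A sketch of both options, assuming an SGE-style scheduler (qsub) and that the job list was written to a file named toRunSNPcall (the file name is illustrative):

    # submit each line as a separate grid engine job
    $ while read -r cmd; do echo "$cmd" | qsub -cwd -V; done < toRunSNPcall
    # or simply run everything locally
    $ bash toRunSNPcall
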
## 4. Part III: Post-Processing (Filtering & Analysis)

### a) Filtering:

Note: requires SNP calling (Part II) to be done!

> script in **src/filtering.py** with an elaborate help message

    usage: python src/filtering.py

### b) Analysis:

> TODO: include R scripts for pairwise distance and subspecies computation

Example Tutorial
================

## 1. Run the setup & compilation steps and acquire the provided reference database.

## 2. Go to the EXAMPLE directory and acquire the samples with getSamplesScript.sh

    $ cd EXAMPLE
    $ ./getSamplesScript.sh

## 3. Initiate a new project in the parent directory

    $ ./NewProject.sh tutorial

## 4. Generate the 'all_samples' file

    $ find `pwd`/EXAMPLE/samples -name "*.bam" > tutorial/all_samples

## 5. Prepare and run the coverage estimation

    $ ./createCovComputationRun.sh tutorial/ tutorial/all_samples > runCoverage
    $ bash runCoverage

## 6. Perform the workload balancing for parallelization into five jobs

    $ ./createOptSplit tutorial/ db/Genomev9_definitions 5

## 7. Prepare and run the SNP calling step

    $ ./createSNPrunningBatch.sh tutorial/ tutorial/all_samples > runSNPcall
    $ bash runSNPcall

## 8. Run the post-processing / filtering steps

Compute allele frequencies for each position that passes the given thresholds:

    $ ./filtering.py tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab tutorial/snpCaller/called_SNPs.best_split_* tutorial/all_samples tutorial/filtered/pop/

Basic usage
===========

If you are interested in using the pipeline in a more manual way (for example, running the metaSNP caller stand-alone without a gene annotation file), feel free to explore the src/ directory.
After compilation via make, you will find the scripts as well as the binaries for qaCompute and the metaSNP caller in their corresponding directories (src/qaCompute and src/snpCaller).

Additional information and options (tools and scripts)
-------------------------------------------------------

### metaSNP caller

Calls SNPs from samtools pileup format and generates two outputs:

    usage: ./snpCall [options] < stdin.mpileup > std.out.popSNPs

    Options:
        -f, faidx indexed reference genome.
        -g, gene annotation file.
        -i, individual SNPs.

Note: Expecting samtools mpileup as standard input.

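A minimal end-to-end sketch of feeding the caller, assuming -f takes the faidx-indexed reference and -i names the output file for individual SNPs, with population SNPs going to standard output; all file names and the mpileup flags are illustrative:

    # pile up one or more BAM files against the indexed reference and call SNPs
    $ samtools mpileup -f ref_db.fasta sampleA.bam sampleB.bam \
        | ./snpCall -f ref_db.fasta -g db_ann.gff -i indiv.snps > pop.snps
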
#### __Output__

1. Population SNPs (pSNPs):
Population-wide variants that occur with a frequency of 1 % at positions with at least 4x coverage.

2. Individual specific SNPs (iSNPs):
Non-population variants that occur with a frequency of 10 % at positions with at least 10x coverage.

### [qaCompute](https://github.com/CosteaPaul/qaTools)

Computes normal and span coverage from a bam/sam file.
Also counts unmapped and sub-par quality reads.

#### __Parameters:__

    m - Compute median coverage for each contig/chromosome.
        Will make running a bit slower. Off by default.

    Because mappers sometimes break the headers or simply don't output them,
    this is provided as a non-kosher way around it. Use with care!

For more info on the parameters, try ./qaCompute.

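A hypothetical invocation, assuming the common qaTools call shape of an input BAM followed by an output statistics file; run ./qaCompute without arguments for the authoritative usage:

    # compute coverage with per-contig medians (-m); file names are placeholders
    $ ./qaCompute -m sampleA.bam sampleA.cov
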
### filtering.py

    usage: metaSNP filtering [-h] [-p PERC] [-c COV] [-m MINSAMPLES] [-s SNPC]
                             [-i SNPI]
                             perc_FILE cov_FILE snp_FILE [snp_FILE ...]
                             all_samples output_dir/

    metaSNP filtering

    positional arguments:
      perc_FILE             input file with horizontal genome (taxon) coverage
                            (breadth) per sample (percentage covered)
      cov_FILE              input file with average genome (taxon) coverage
                            (depth) per sample (average number of reads per site)
      snp_FILE              input files from SNP calling
      all_samples           list of input BAM files, one per line
      output_dir/           output folder

    optional arguments:
      -h, --help            show this help message and exit
      -p PERC, --perc PERC  Coverage breadth: horizontal coverage cutoff
                            (percentage taxon covered) per sample (default: 40.0)
      -c COV, --cov COV     Coverage depth: average vertical coverage cutoff per
                            taxon, per sample (default: 5.0)
      -m MINSAMPLES, --minsamples MINSAMPLES
                            Minimum number of samples that have to pass the
                            filtering criteria in order to write an output for the
                            representative genome (default: 2)
      -s SNPC, --snpc SNPC  FILTERING STEP II: SNP coverage cutoff (default: 5.0)
      -i SNPI, --snpi SNPI  FILTERING STEP II: SNP occurrence (incidence) cutoff
                            within samples_of_interest (default: 0.5)

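For example, to tighten the cutoffs to 60 % breadth and 10x depth and to require at least three passing samples (values chosen arbitrarily here), reusing the tutorial layout:

    $ python src/filtering.py -p 60 -c 10 -m 3 \
        tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab \
        tutorial/snpCaller/called_SNPs.best_split_* \
        tutorial/all_samples tutorial/filtered/pop/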