@@ -69,67 +69,97 @@ You can set this variable temporarily by runing the following commands and perma
Note: Replace '/path/2' with the corresponding global path.
Required Files:
===============
* **'all\_samples'** = a list of all BAM files, one /path/2/sample.bam per line (avoid duplicates!)
* **'ref\_db'** = the reference database in FASTA format (e.g. a multi-sequence FASTA file)
* **'gen\_pos'** = a list with start and end positions for each sequence (format: sequence_id start end)
* **TODO! Format 'db\_ann'** = [optional] a gene annotation file for the reference in [general feature format](http://www.ensembl.org/info/website/upload/gff.html) (GFF).
Workflow:
=========
**__Note:__ Each part depends on the successful completion of the previous one.**
## Required Files:
* **'all\_samples'** = a list of all BAM files, one /path/2/sample.bam per line (avoid duplicates!)
* **'ref\_db'** = the reference database in FASTA format (e.g. a multi-sequence FASTA file)
* **'gen\_pos'** = a list with start and end positions for each sequence (format: sequence_id start end)
* **TODO! Format 'db\_ann'** = [optional] a gene annotation file for the reference in [general feature format](http://www.ensembl.org/info/website/upload/gff.html) (GFF).
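
Both list files are plain text, so they can be sanity-checked before a project is started. The sketch below is not part of metaSNP: the function names are hypothetical, and it assumes that the fields in 'gen\_pos' are separated by whitespace and that start and end are integers with start <= end.

```python
from collections import Counter


def duplicated_entries(lines):
    """Return entries that occur more than once, ignoring blank lines."""
    entries = [line.strip() for line in lines if line.strip()]
    counts = Counter(entries)
    return sorted(entry for entry, n in counts.items() if n > 1)


def invalid_gen_pos_lines(lines):
    """Return 1-based numbers of lines that are not 'sequence_id start end'."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        try:
            # Assumption: start and end are integers and start <= end.
            ok = len(fields) == 3 and int(fields[1]) <= int(fields[2])
        except ValueError:
            ok = False
        if not ok:
            bad.append(number)
    return bad
```

Both functions accept any iterable of lines, so an open file handle can be passed directly, for example `duplicated_entries(open("all_samples"))`.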
## 1. Initiate a new project
usage: ./NewProject.sh project_dir
usage: metaSNP_NEW project_dirname
Generates a structured results directory for your project.
## 2. Part I: Pre-processing [optional]
Note: This part can be skipped if the workload balancing has already been computed.
Note: This part can be skipped if you do not want to balance the workload or you already performed the pre-processing for this dataset (samples).
Submit each commandline to your GE or compute locally.
Parameters:
Required:
project_dir DIR = the project directory.
all_samples FILE = a list of bam files, one file per line.
ref_db FILE = the reference multi-sequence FASTA file used for the alignments.
Optional:
-a, db_ann FILE = database annotation.
-l, splits/ DIR = bestsplits DIR for job parallelization (pre-processing).
Note: Alternatively, use the ``-l splits/`` option to call SNPs for specific species, contig regions (BED) or single positions (contig_id pos). Unlisted contigs and positions are skipped.
The script generates a list of command-line jobs. Submit each command line to your grid engine (GE) or run it locally.
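
If no grid engine is available, the generated job list can be worked through locally, one line at a time. The following is a minimal sketch rather than a metaSNP script: `run_jobs_locally` and the job file it reads are hypothetical names, and it assumes the file holds one complete shell command per line.

```python
import subprocess


def run_jobs_locally(job_file):
    """Run each non-empty line of job_file as a shell command, in order.

    Returns a list of (command, exit_code) pairs so failures can be inspected.
    """
    results = []
    with open(job_file) as handle:
        for line in handle:
            command = line.strip()
            if not command:
                continue  # skip blank lines
            # shell=True because a job line may use shell syntax such as
            # redirection; only run job files you generated yourself.
            completed = subprocess.run(command, shell=True)
            results.append((command, completed.returncode))
    return results
```

The jobs run sequentially here, which keeps the output easy to follow; any non-zero exit code in the returned list marks a job that should be re-run before moving on to the next part.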
## 4. Part III: Post-Processing (Filtering & Analysis)
### a) Filtering:
Note: requires SNP calling (Part II) to be done!
> script in **src/filtering.py** with elaborate help message
usage: python src/filtering.py
Caution: Perform this step separately for individual SNPs and population SNPs.
usage: metaSNP_filtering.py
positional arguments:
perc_FILE input file with horizontal genome (taxon) coverage (breadth) per sample (percentage covered)
cov_FILE input file with average genome (taxon) coverage
(depth) per sample (average number reads per site)
snp_FILE input files from SNP calling
all_samples list of input BAM files, one per line
output_dir/ output folder
optional arguments:
-h, --help show this help message and exit
-p PERC, --perc PERC Coverage breadth: Horizontal coverage cutoff
(percentage taxon covered) per sample (default: 40.0)
-c COV, --cov COV Coverage depth: Average vertical coverage cutoff per
taxon, per sample (default: 5.0)
-m MINSAMPLES, --minsamples MINSAMPLES
Minimum number of sample required to pass the coverage
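
To illustrate how the three cutoffs could interact, here is a simplified sketch. It is not the code of the filtering script: the function names are hypothetical, and it assumes that a sample passes when it meets both the breadth and the depth cutoff (a value equal to the cutoff counts as passing) and that a taxon is kept when enough samples pass.

```python
def samples_passing(perc, cov, min_perc=40.0, min_cov=5.0):
    """Return sorted sample names that meet both coverage cutoffs.

    perc and cov map sample name -> coverage breadth (%) and average
    depth for a single taxon. Assumption: values equal to a cutoff pass.
    """
    return sorted(
        sample
        for sample in perc
        if sample in cov
        and perc[sample] >= min_perc
        and cov[sample] >= min_cov
    )


def taxon_passes(perc, cov, min_samples, min_perc=40.0, min_cov=5.0):
    """Return True if at least min_samples samples meet both cutoffs."""
    return len(samples_passing(perc, cov, min_perc, min_cov)) >= min_samples
```

The defaults for `min_perc` and `min_cov` mirror the values quoted in the help message above; `min_samples` has no default in this sketch and must be supplied by the caller.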
### b) Compute pair-wise distances between samples on their SNP profiles and create a PCoA plot.
...
...
@@ -179,17 +211,17 @@ Example Tutorial
Basic usage
===========
If you are interested in using the pipeline in a more manual way (for example the metaSNP caller stand alone and without a gene annotation file) feel free to explore the src/ directory.
You will find scripts as well as the binaries for qaCompute and the metaSNP caller in their corresponding directories (src/qaCompute and src/snpCaller) post compilation via make.
If you are interested in using the pipeline in a more manual way (for example, running the metaSNP caller as a standalone tool), feel free to explore the src/ directory.
After compilation, you will find the scripts as well as the binaries for qaCompute and the metaSNP caller in their corresponding directories (src/qaCompute and src/snpCaller).
Additional information and options (tools and scripts):