Commit b85b8f82 authored by Robin Erich Muench's avatar Robin Erich Muench
Browse files

Update README.md

parent af5d11a4
......@@ -69,67 +69,97 @@ You can set this variable temporarily by runing the following commands and perma
Note: Replace '/path/2' with the corresponding global path.
Required Files:
===============
* **'all\_samples'** = a list of all BAM files, one /path/2/sample.bam per line (avoid duplicates!)
* **'ref_db'** = the reference database in fasta format (f.i. multi-sequence fasta)
* **'gen\_pos'** = a list with start and end positions for each sequence (format: sequence_id start end)
* **TODO! Format 'db\_ann'** = [optional] a gene annotation file for the reference in [general feature format](http://www.ensembl.org/info/website/upload/gff.html) (GFF).
Workflow:
=========
**__Note:__ Each part is dependened on the successful completion of the previous one.**
## Required Files:
* **'all\_samples'** = a list of all BAM files, one /path/2/sample.bam per line (avoid duplicates!)
* **'ref_db'** = the reference database in fasta format (f.i. multi-sequence fasta)
* **'gen\_pos'** = a list with start and end positions for each sequence (format: sequence_id start end)
* **TODO! Format 'db\_ann'** = [optional] a gene annotation file for the reference in [general feature format](http://www.ensembl.org/info/website/upload/gff.html) (GFF).
## 1. Initiate a new project
usage: ./NewProject.sh project_dir
usage: metaSNP_NEW project_dirname
Generates a structured results directory for your project.
## 2. Part I: Pre-processing [optional]
Note: This part can be skipped if the workload balancing has already been computed.
Note: This part can be skipped if you do not want to balance the workload or you already performed the pre-processing for this dataset (samples).
### a) run createCoverageComputationRun.sh
### a) run metaSNP_COV
usage: ./createCoverageComputationRun.sh all_samples project_dir/ > toRunCov
usage: metaSNP_COV project_dir/ all_samples
The script generates a list of commandline jobs.
Run these commandline jobs before proceeding with the next step.
Run each commandline jobs before proceeding with the next step.
### b) run createOptSplit
### b) run metaSNP_OPT
Helper script for workload balancing (reference genome splitting).
usage: ./createOptSplit project_dir/ [nr_splits]
usage: metaSNP_OPT project_dir/ genome_def nr_splits[int]
Parameters
Required:
project_dir/ = the projects result directory
genome_def FILE = Contig ranges in BED format. (Fields: Contig_id, contigStart, contigEnd)
Optional:
nr_splits INT = INT for job parallelization (range: 1-100) [10]
Note: This step requires an appropriate reference genome_definitions file.
Note: This step requires an appropriate reference genome_definitions file (included in our database).
## 3. Part II: SNP calling
Note: Requires pre-processing to be computed or you use an already existing one (e.g. src/bestsplits).
Note: Can be run as a single job, independently of the pre-processing or you use an already existing genome split.
### a) createSNPrunningBatch.sh
usage: ./createSNPrunningBatch.sh [options] project_dir/ all_samples > toRunSNP
### a) metaSNP_SNP
Parameters:
Required:
project_dir DIR = the project directory.
all_samples FILE = a list of bam files, one file per line.
Optional:
-l, splits/ DIR = use existing bestsplits DIR if pre-processing was omitted.
-f, ref_db FILE = the reference sequence FASTA file.
-a, db_ann FILE = the gene annotation file.
usage: metaSNP_SNP [options] project_dir/ all_samples ref_db
The script generates a list of commandline jobs.
Submit each commandline to your GE or compute locally.
Parameters:
Required:
project_dir DIR = the project directory.
all_samples FILE = a list of bam files, one file per line.
ref_db FILE = the reference multi-sequence FASTA file used for the alignments.
Optional:
-a, db_ann FILE = database annotation.
-l, splits/ DIR = bestsplits DIR for job parallelization (pre-processing).
Note: Alternatively, use the ``-l splits/`` option to call SNPs for specific species, contig regions (BED) or single positions (contig_id pos). Unlisted contigs/pos are skipped.
The script generates a list of commandline jobs. Submit each commandline to your GE or compute locally.
## 4. Part III: Post-Processing (Filtering & Analysis)
### a) Filtering:
Note: requires SNP calling (Part II) to be done!
> script in **src/filtering.py** with elaborate help message
usage: python src/filtering.py
Caution: Perform this step seperately for individual SNPs and population SNPs.
usage: metaSNP_filtering.py
positional arguments:
perc_FILE input file with horizontal genome (taxon) coverage (breadth) per sample (percentage covered)
cov_FILE input file with average genome (taxon) coverage
(depth) per sample (average number reads per site)
snp_FILE input files from SNP calling
all_samples list of input BAM files, one per line
output_dir/ output folder
optional arguments:
-h, --help show this help message and exit
-p PERC, --perc PERC Coverage breadth: Horizontal coverage cutoff
(percentage taxon covered) per sample (default: 40.0)
-c COV, --cov COV Coverage depth: Average vertical coverage cutoff per
taxon, per sample (default: 5.0)
-m MINSAMPLES, --minsamples MINSAMPLES
Minimum number of sample required to pass the coverage
cutoffs per genome (default: 2)
-s SNPC, --snpc SNPC FILTERING STEP II: SNP coverage cutoff (default: 5.0)
-i SNPI, --snpi SNPI FILTERING STEP II: SNP occurence (incidence) cutoff
within samples_of_interest (default: 0.5)
### b) Analysis:
> TODO: include R scripts for pairwise distance and subspecies computation
......@@ -140,6 +170,8 @@ Example Tutorial
## 1. Run the setup & compilation steps and download the provided reference database.
$ getRefDB.sh
## 2. Go to the EXAMPLE directory and download the samples with the getSamplesScript.sh
$ cd EXAMPLE
......@@ -147,7 +179,7 @@ Example Tutorial
## 3. Initiate a new project in the parent directory
$ ./NewProject.sh tutorial
$ metaSNP_New tutorial
## 4. Generate the 'all_samples' file
......@@ -155,23 +187,23 @@ Example Tutorial
## 5. Prepare and run the coverage estimation
$ ./createCovComputationRun.sh tutorial/ tutorial/all_samples > runCoverage
$ metaSNP_COV tutorial/ tutorial/all_samples > runCoverage
$ bash runCoverage
## 6. Perform a work load balancing step for run time optimization.
$ ./createOptSplit tutorial/ db/Genomev9_definitions 5
$ metaSNP_OPT tutorial/ db/Genomev9_definitions 5
$ bash runCoverage
## 7. Prepare and run the SNP calling step
$ ./createSNPrunningBatch.sh tutorial/ tutorial/all_samples > runSNPcall
$ metaSNP_SNP tutorial/ tutorial/all_samples db/RepGenomesv9.fna -a db/RefOrganismDB_v9_gene.clean -l tutorial/bestsplits/ > runSNPcall
$ bash runSNPcall
## 8. Run the post processing / filtering steps
### a) Compute allele frequencies for each position that pass the given thresholds.
$ ./filtering.py tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab tutorial/snpCaller/called_SNPs.best_split_* tutorial/all_samples tutorial/filtered/pop/
$ metaSNP_filtering.py tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab tutorial/snpCaller/called_SNPs.best_split_* tutorial/all_samples tutorial/filtered/pop/
### b) Compute pair-wise distances between samples on their SNP profiles and create a PCoA plot.
......@@ -179,17 +211,17 @@ Example Tutorial
Basic usage
===========
If you are interested in using the pipeline in a more manual way (for example the metaSNP caller stand alone and without a gene annotation file) feel free to explore the src/ directory.
You will find scripts as well as the binaries for qaCompute and the metaSNP caller in their corresponding directories (src/qaCompute and src/snpCaller) post compilation via make.
If you are interested in using the pipeline in a more manual way (for example the metaSNP caller stand alone) feel free to explore the src/ directory.
You will find scripts as well as the binaries for qaCompute and the metaSNP caller in their corresponding directories (src/qaCompute and src/snpCaller) post compilation.
Additional information and options (tools and scripts):
---------------------------------------------------------
### metaSNP caller
Calls SNPs from samtools pileup format and generates two outputs:
Calls SNPs from samtools pileup format and generates two outputs.
usage: ./snpCall [options] < stdin.mpileup > std.out.popSNPs
usage: ./snpCall [options] < stdin.mpileup > std.out.popSNPs
Options:
-f, faidx indexed reference genome.
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment