README.md

# MetaSNV, a metagenomic SNV calling pipeline


The metaSNV pipeline performs variant calling on aligned metagenomic samples.


Download
========

Via Git:

    git clone git@git.embl.de:costea/metaSNV.git
    
or [download](https://git.embl.de/rmuench/metaSNP/repository/archive.zip?ref=master) a zip file of the repository.

Dependencies
============

* [Boost-1.53.0 or above](http://www.boost.org/users/download/)

* [htslib](http://www.htslib.org/)
 
* Python-2.7 or above

#### Installing dependencies on Ubuntu/debian

On an Ubuntu/debian system, the following sequence of commands will install all
required packages (the first two are only necessary if you have not enabled the
universe repository before):


    sudo add-apt-repository "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc) universe"
    sudo apt-get update

    sudo apt-get install libhts-dev libboost-dev

### Installing dependencies using anaconda

If you use [anaconda](https://www.continuum.io/downloads), you can create an
environment with all necessary dependencies using the following commands:

    conda create --name metaSNV boost htslib pkg-config
    source activate metaSNV
    export CFLAGS=-I$CONDA_ENV_PATH/include
    export LD_LIBRARY_PATH=$CONDA_ENV_PATH/lib:$LD_LIBRARY_PATH

If you do not have a C++ compiler, anaconda can also install G++:

    conda create --name metaSNV boost htslib pkg-config
    source activate metaSNV

    # Add this command:
    conda install gcc

    export CFLAGS=-I$CONDA_ENV_PATH/include
    export LD_LIBRARY_PATH=$CONDA_ENV_PATH/lib:$LD_LIBRARY_PATH

Setup & Compilation
===================

    make

Workflow:
=========
## Required Files:

* **'all\_samples'**    = a list of all BAM files, one /path/2/sample.bam per line (no duplicates)
* **'ref\_db'**  = the reference database in fasta format (f.i. multi-sequence fasta)

## Optional Files:
* **'db\_ann'** = a gene annotation file for the reference database.

## To use one of the provided reference databases:

    ./getRefDB.sh
    
## 2. Run metaSNV

    metaSNV.py project_dir/ all_samples ref_db ref_fasta [options]

## 3. Part II: Post-Processing (Filtering & Analysis)

### a) Filtering:
Note: requires SNP calling (Part II) to be done!
Caution: Perform this step seperately for individual SNPs and population SNPs.

    usage: metaSNP_filtering.py

        positional arguments:
            perc_FILE               input file with horizontal genome (taxon) coverage (breadth) per sample (percentage covered)
            cov_FILE                input file with average genome (taxon) coverage
                                    (depth) per sample (average number reads per site)
            snp_FILE                input files from SNP calling
            all_samples             list of input BAM files, one per line
            output_dir/             output folder

        optional arguments:
            -h, --help              show this help message and exit
            -p PERC, --perc PERC    Coverage breadth: Horizontal coverage cutoff
                                    (percentage taxon covered) per sample (default: 40.0)
            -c COV, --cov COV       Coverage depth: Average vertical coverage cutoff per
                                    taxon, per sample (default: 5.0)
            -m MINSAMPLES, --minsamples MINSAMPLES
                                    Minimum number of sample required to pass the coverage
                                    cutoffs per genome (default: 2)
            -s SNPC, --snpc SNPC    FILTERING STEP II: SNP coverage cutoff (default: 5.0)
            -i SNPI, --snpi SNPI    FILTERING STEP II: SNP occurence (incidence) cutoff
                                    within samples_of_interest (default: 0.5)

### b) Analysis:
>   TODO: include R scripts for computing pairwise distances and visualization


Example Tutorial
================

## 1. Run the setup & compilation steps and download the provided reference database.

    ./getRefDB.sh

## 2. Go to the EXAMPLE directory and download the samples with the getSamplesScript.sh

    $ cd EXAMPLE
    $ ./getSamplesScript.sh

## 3. Initiate a new project in the parent directory

    $ metaSNV_New tutorial

## 4. Generate the 'all_samples' file

    $ find `pwd`/EXAMPLE/samples -name “*.bam” > tutorial/all_samples

## 5. Prepare and run the coverage estimation

    $ metaSNV_COV tutorial/ tutorial/all_samples > runCoverage
    $ bash runCoverage

## 6. Perform a work load balancing step for run time optimization.

    $ metaSNV_OPT tutorial/ db/Genomev9_definitions 5
    $ bash runCoverage

## 7. Prepare and run the SNV calling step

    $ metaSNV_SNP tutorial/ tutorial/all_samples db/RepGenomesv9.fna -a db/RefOrganismDB_v9_gene.clean -l tutorial/bestsplits/ > runSNPcall
    $ bash runSNPcall

## 8. Run the post processing / filtering steps
### a) Compute allele frequencies for each position that pass the given thresholds.

    $ metaSNV_filtering.py tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab tutorial/snpCaller/called_SNVs.best_split_* tutorial/all_samples tutorial/filtered/pop/

### b) Compute pair-wise distances between samples on their SNP profiles and create a PCoA plot.
    

Advanced usage (tools and scripts)
==================================

If you are interested in using the pipeline in a more manual way (for example
the metaSNV caller stand alone), you will find the executables for the
individual steps in the `src/` directory.

metaSNV caller
--------------
Calls SNVs from samtools pileup format and generates two outputs.

    usage: ./snpCall [options] < stdin.mpileup > std.out.popSNPs

    Options:
        -f,     faidx indexed reference genome.
        -g,     gene annotation file.
        -i,     individual SNVs.

Note: Expecting samtools mpileup as standard input

### __Output__
1. Population SNVs (pSNVs): 
Population wide variants that occur with a frequency of 1 % at positions with at least 4x coverage.

2. Individual specific SNVs (iSNVs):
Non population variants, that occur with a frequency of 10 % at positions with at least 10x coverage.


[qaCompute](https://github.com/CosteaPaul/qaTools)
-------------------------------------------------
Computes normal and span coverage from a bam/sam file. 
Also counts unmapped and sub-par quality reads.

### __Parameters:__
   m	    -	Compute median coverage for each contig/chromosome. 
   		Will make running a bit slower. Off by default.
   
   q [INT]  -   Quality threshold. Any read with a mapping quality under
                INT will be ignored when computing the coverage.
                
		NOTE: bwa outputs mapping quality 0 for reads that map with
		equal quality in multiple places. If you want to condier this,
		set q to 0.
		
   d        -   Print coverage histrogram over each individual contig/chromosome.
   	        These details will be printed in file <output>.detail
   	        
   p [INT]  -   Print coverage profile to bed file, averaged over given window size.  
   
   i        -   Silent run. Will not print running info to stdout.
    
   s [INT]  -   Compute span coverage. (Use for mate pair libs)
                Instead of actual read coverage, using the options will consider
                the entire span of the insert as a read, if insert size is
		lower than INT. 
 		For an accurate estimation of span coverage, I recommend
		setting an insert size limit INT around 3*std_dev of your lib's 
		insert size distribution.
    
   c [INT]  -   Maximum X coverage to consider in histogram.
    
   h [STR]  -   Use different header. 
                Because mappers sometimes break the headers or simply don't output them, 
		this is provieded as a non-kosher way around it. Use with care!
    
   For more info on the parameteres try ./qaCompute
   

metaSNV_filtering.py
--------------------   
usage: metaSNV filtering [-h] [-p PERC] [-c COV] [-m MINSAMPLES] [-s SNPC]
                         [-i SNPI] 
                         perc_FILE cov_FILE snp_FILE [snp_FILE ...]
                         all_samples output_dir/

metaSNV filtering

positional arguments:
  perc_FILE             input file with horizontal genome (taxon) coverage
                        (breadth) per sample (percentage covered)
  cov_FILE              input file with average genome (taxon) coverage
                        (depth) per sample (average number reads per site)
  snp_FILE              input files from SNP calling
  all_samples           list of input BAM files, one per line
  output_dir/           output folder

optional arguments:
  -h, --help            show this help message and exit
  -p PERC, --perc PERC  Coverage breadth: Horizontal coverage cutoff
                        (percentage taxon covered) per sample (default: 40.0)
  -c COV, --cov COV     Coverage depth: Average vertical coverage cutoff per
                        taxon, per sample (default: 5.0)
  -m MINSAMPLES, --minsamples MINSAMPLES
                        Minimum number of sample that have to pass the
                        filtering criteria in order to write an output for the
                        representative Genome (default: 2)
  -s SNPC, --snpc SNPC  FILTERING STEP II: SNP coverage cutoff (default: 5.0)
  -i SNPI, --snpi SNPI  FILTERING STEP II: SNP occurence (incidence) cutoff
                        within samples_of_interest (default: 0.5)