README.md 9.19 KB
Newer Older
Paul Costea's avatar
Paul Costea committed
1 2 3 4 5 6 7 8 9 10 11
# MetaSNV, a metagenomic SNV calling pipeline


The metaSNV pipeline performs variant calling on aligned metagenomic samples.


Download
========

Via Git:

Paul Igor Costea's avatar
Paul Igor Costea committed
12
    git clone git@git.embl.de:costea/metaSNV.git
Paul Costea's avatar
Paul Costea committed
13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259
    
or [download](https://git.embl.de/rmuench/metaSNP/repository/archive.zip?ref=master) a zip file of the repository.

Dependencies
============

* [Boost-1.53.0 or above](http://www.boost.org/users/download/)

* [htslib](http://www.htslib.org/)
 
* Python-2.7 or above

#### Installing dependencies on Ubuntu/debian

On an Ubuntu/debian system, the following sequence of commands will install all
required packages (the first two are only necessary if you have not enabled the
universe repository before):


    sudo add-apt-repository "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc) universe"
    sudo apt-get update

    sudo apt-get install libhts-dev libboost-dev

### Installing dependencies using anaconda

If you use [anaconda](https://www.continuum.io/downloads), you can create an
environment with all necessary dependencies using the following commands:

    conda create --name metaSNV boost htslib pkg-config
    source activate metaSNV
    export CFLAGS=-I$CONDA_ENV_PATH/include
    export LD_LIBRARY_PATH=$CONDA_ENV_PATH/lib:$LD_LIBRARY_PATH

If you do not have a C++ compiler, anaconda can also install G++:

    conda create --name metaSNV boost htslib pkg-config
    source activate metaSNV

    # Add this command:
    conda install gcc

    export CFLAGS=-I$CONDA_ENV_PATH/include
    export LD_LIBRARY_PATH=$CONDA_ENV_PATH/lib:$LD_LIBRARY_PATH

Setup & Compilation
===================

    make

Workflow:
=========
## Required Files:

* **'all\_samples'**    = a list of all BAM files, one /path/2/sample.bam per line (no duplicates)
* **'ref\_db'**  = the reference database in fasta format (f.i. multi-sequence fasta)

## Optional Files:
* **'db\_ann'** = a gene annotation file for the reference database.

## To use one of the provided reference databases:

    ./getRefDB.sh
    
## 2. Run metaSNV

    metaSNV.py project_dir/ all_samples ref_db ref_fasta [options]

## 3. Part II: Post-Processing (Filtering & Analysis)

### a) Filtering:
Note: requires SNP calling (Part II) to be done!
Caution: Perform this step seperately for individual SNPs and population SNPs.

    usage: metaSNP_filtering.py

        positional arguments:
            perc_FILE               input file with horizontal genome (taxon) coverage (breadth) per sample (percentage covered)
            cov_FILE                input file with average genome (taxon) coverage
                                    (depth) per sample (average number reads per site)
            snp_FILE                input files from SNP calling
            all_samples             list of input BAM files, one per line
            output_dir/             output folder

        optional arguments:
            -h, --help              show this help message and exit
            -p PERC, --perc PERC    Coverage breadth: Horizontal coverage cutoff
                                    (percentage taxon covered) per sample (default: 40.0)
            -c COV, --cov COV       Coverage depth: Average vertical coverage cutoff per
                                    taxon, per sample (default: 5.0)
            -m MINSAMPLES, --minsamples MINSAMPLES
                                    Minimum number of sample required to pass the coverage
                                    cutoffs per genome (default: 2)
            -s SNPC, --snpc SNPC    FILTERING STEP II: SNP coverage cutoff (default: 5.0)
            -i SNPI, --snpi SNPI    FILTERING STEP II: SNP occurence (incidence) cutoff
                                    within samples_of_interest (default: 0.5)

### b) Analysis:
>   TODO: include R scripts for computing pairwise distances and visualization


Example Tutorial
================

## 1. Run the setup & compilation steps and download the provided reference database.

    ./getRefDB.sh

## 2. Go to the EXAMPLE directory and download the samples with the getSamplesScript.sh

    $ cd EXAMPLE
    $ ./getSamplesScript.sh

## 3. Initiate a new project in the parent directory

    $ metaSNV_New tutorial

## 4. Generate the 'all_samples' file

    $ find `pwd`/EXAMPLE/samples -name “*.bam” > tutorial/all_samples

## 5. Prepare and run the coverage estimation

    $ metaSNV_COV tutorial/ tutorial/all_samples > runCoverage
    $ bash runCoverage

## 6. Perform a work load balancing step for run time optimization.

    $ metaSNV_OPT tutorial/ db/Genomev9_definitions 5
    $ bash runCoverage

## 7. Prepare and run the SNV calling step

    $ metaSNV_SNP tutorial/ tutorial/all_samples db/RepGenomesv9.fna -a db/RefOrganismDB_v9_gene.clean -l tutorial/bestsplits/ > runSNPcall
    $ bash runSNPcall

## 8. Run the post processing / filtering steps
### a) Compute allele frequencies for each position that pass the given thresholds.

    $ metaSNV_filtering.py tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab tutorial/snpCaller/called_SNVs.best_split_* tutorial/all_samples tutorial/filtered/pop/

### b) Compute pair-wise distances between samples on their SNP profiles and create a PCoA plot.
    



Advanced usage (tools and scripts)
==================================

If you are interested in using the pipeline in a more manual way (for example
the metaSNV caller stand alone), you will find the executables for the
individual steps in the `src/` directory.

metaSNV caller
--------------
Calls SNVs from samtools pileup format and generates two outputs.

    usage: ./snpCall [options] < stdin.mpileup > std.out.popSNPs

    Options:
        -f,     faidx indexed reference genome.
        -g,     gene annotation file.
        -i,     individual SNVs.

Note: Expecting samtools mpileup as standard input

### __Output__
1. Population SNVs (pSNVs): 
Population wide variants that occur with a frequency of 1 % at positions with at least 4x coverage.

2. Individual specific SNVs (iSNVs):
Non population variants, that occur with a frequency of 10 % at positions with at least 10x coverage.


[qaCompute](https://github.com/CosteaPaul/qaTools)
-------------------------------------------------
Computes normal and span coverage from a bam/sam file. 
Also counts unmapped and sub-par quality reads.

### __Parameters:__
   m	    -	Compute median coverage for each contig/chromosome. 
   		Will make running a bit slower. Off by default.
   
   q [INT]  -   Quality threshold. Any read with a mapping quality under
                INT will be ignored when computing the coverage.
                
		NOTE: bwa outputs mapping quality 0 for reads that map with
		equal quality in multiple places. If you want to condier this,
		set q to 0.
		
   d        -   Print coverage histrogram over each individual contig/chromosome.
   	        These details will be printed in file <output>.detail
   	        
   p [INT]  -   Print coverage profile to bed file, averaged over given window size.  
   
   i        -   Silent run. Will not print running info to stdout.
    
   s [INT]  -   Compute span coverage. (Use for mate pair libs)
                Instead of actual read coverage, using the options will consider
                the entire span of the insert as a read, if insert size is
		lower than INT. 
 		For an accurate estimation of span coverage, I recommend
		setting an insert size limit INT around 3*std_dev of your lib's 
		insert size distribution.
    
   c [INT]  -   Maximum X coverage to consider in histogram.
    
   h [STR]  -   Use different header. 
                Because mappers sometimes break the headers or simply don't output them, 
		this is provieded as a non-kosher way around it. Use with care!
    
   For more info on the parameteres try ./qaCompute
   

metaSNV_filtering.py
--------------------   
usage: metaSNV filtering [-h] [-p PERC] [-c COV] [-m MINSAMPLES] [-s SNPC]
                         [-i SNPI] 
                         perc_FILE cov_FILE snp_FILE [snp_FILE ...]
                         all_samples output_dir/

metaSNV filtering

positional arguments:
  perc_FILE             input file with horizontal genome (taxon) coverage
                        (breadth) per sample (percentage covered)
  cov_FILE              input file with average genome (taxon) coverage
                        (depth) per sample (average number reads per site)
  snp_FILE              input files from SNP calling
  all_samples           list of input BAM files, one per line
  output_dir/           output folder

optional arguments:
  -h, --help            show this help message and exit
  -p PERC, --perc PERC  Coverage breadth: Horizontal coverage cutoff
                        (percentage taxon covered) per sample (default: 40.0)
  -c COV, --cov COV     Coverage depth: Average vertical coverage cutoff per
                        taxon, per sample (default: 5.0)
  -m MINSAMPLES, --minsamples MINSAMPLES
                        Minimum number of sample that have to pass the
                        filtering criteria in order to write an output for the
                        representative Genome (default: 2)
  -s SNPC, --snpc SNPC  FILTERING STEP II: SNP coverage cutoff (default: 5.0)
  -i SNPI, --snpi SNPI  FILTERING STEP II: SNP occurence (incidence) cutoff
                        within samples_of_interest (default: 0.5)