# MetaSNV, a metagenomic SNV calling pipeline


The metaSNV pipeline performs variant calling on aligned metagenomic samples.


Download
========

Via Git:

    git clone https://git.embl.de/costea/metaSNV.git
    
or [download](https://git.embl.de/costea/metaSNV/repository/archive.zip?ref=master) a zip file of the repository.

Dependencies
============

* [Boost-1.53.0 or above](http://www.boost.org/users/download/)

* [htslib](http://www.htslib.org/)
 
* Python-2.7 or above
    * numpy
    * pandas
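
Before compiling, it can help to confirm that the Python packages listed above are importable. The following is a minimal sketch, assuming a Python 3 interpreter; the function name `missing_packages` is made up for this illustration and is not part of metaSNV.

```python
import importlib.util

# Python packages named in the dependency list above.
REQUIRED_PACKAGES = ("numpy", "pandas")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing Python packages: " + ", ".join(missing))
    else:
        print("All required Python packages are importable.")
```

This only checks the Python side; Boost and htslib still have to be installed as described below.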

### Installing dependencies on Ubuntu/Debian

On an Ubuntu/Debian system, the following sequence of commands will install all
required packages (the first two are only necessary if you have not enabled the
universe repository before):


    sudo add-apt-repository "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc) universe"
    sudo apt-get update
    sudo apt-get install libhts-dev libboost-dev

### Installing dependencies using anaconda

If you use [anaconda](https://www.continuum.io/downloads), you can create an
environment with all necessary dependencies using the following commands:

    conda create --name metaSNV boost htslib pkg-config numpy pandas
    source activate metaSNV
    export CFLAGS=-I$CONDA_ENV_PATH/include
    export LD_LIBRARY_PATH=$CONDA_ENV_PATH/lib:$LD_LIBRARY_PATH

If you do not have a C++ compiler, anaconda can also install G++:

    conda create --name metaSNV boost htslib pkg-config numpy pandas
    source activate metaSNV
    # Add this command:
    conda install gcc
    export CFLAGS=-I$CONDA_ENV_PATH/include
    export LD_LIBRARY_PATH=$CONDA_ENV_PATH/lib:$LD_LIBRARY_PATH

Setup & Compilation
===================

    make

Workflow:
=========
## Required Files:

* **'all\_samples'**  = a list of all BAM files, one /path/to/sample.bam per line (no duplicates)
* **'ref\_db'**       = the reference database in FASTA format (e.g. a multi-sequence FASTA file)
* **'gen\_pos'**      = a list with start and end positions for each sequence in the reference (format: `sequence_id  start end`)
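
If you need to build a `gen_pos` list for your own reference, the sketch below derives one row per sequence from FASTA lines. It rests on assumptions that this README does not confirm: that each row should span the whole sequence, that coordinates are 1-based and inclusive, and that fields are tab-separated. The function names are hypothetical, so check the output against the provided reference databases before relying on it.

```python
def fasta_lengths(lines):
    """Yield (sequence_id, length) for each record in an iterable of FASTA lines."""
    seq_id, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, length
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            seq_id, length = (tokens[0] if tokens else ""), 0
        elif line and seq_id is not None:
            length += len(line)
    if seq_id is not None:
        yield seq_id, length


def gen_pos_rows(lines, first_position=1):
    """Format one 'sequence_id start end' row per sequence.

    ASSUMPTION: whole-sequence span, inclusive coordinates starting at
    first_position, tab-separated fields.
    """
    for seq_id, length in fasta_lengths(lines):
        end = first_position + length - 1
        yield "{}\t{}\t{}".format(seq_id, first_position, end)
```

Passing an open file handle for the reference FASTA as `lines` and writing each yielded row to a file gives a candidate `gen_pos` list.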

## Optional Files:
* **'db\_ann'** = a gene annotation file for the reference database (format: ).

## To use one of the provided reference databases:

    ./getRefDB.sh
    
## 2. Run metaSNV

    metaSNV.py project_dir/ all_samples ref_db [options]
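
If you prefer to drive the pipeline from a Python script, the call above can be wrapped as in the sketch below. It uses only the options that appear elsewhere in this README (`--threads`, `--n_splits`, `--print-commands`); the helper names `metasnv_command` and `run_metasnv` are invented for this example, and the sketch assumes `metaSNV.py` is in the current directory.

```python
import subprocess


def metasnv_command(project_dir, all_samples, ref_db, threads=None,
                    n_splits=None, print_commands=False):
    """Build the argument list for metaSNV.py from options shown in this README."""
    cmd = ["python", "metaSNV.py", project_dir, all_samples, ref_db]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if n_splits is not None:
        cmd += ["--n_splits", str(n_splits)]
    if print_commands:
        cmd.append("--print-commands")
    return cmd


def run_metasnv(*args, **kwargs):
    """Run metaSNV.py and raise CalledProcessError on a non-zero exit code."""
    subprocess.check_call(metasnv_command(*args, **kwargs))
```

Building the argument list separately from running it makes it easy to print or log the exact command before it is executed.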

## 3. Part II: Post-Processing (Filtering & Analysis)
Note: requires SNV calling (Part I) to be done!

    metaSNV_post.py project_dir [options]

Example Tutorial
================

## 1. Run the setup & compilation steps and download the provided reference database. 

    $ ./getRefDB.sh
    Select freeze9, as the tutorial files have been mapped against this freeze. 

## 2. Go to the EXAMPLE directory and download the samples with the getSamplesScript.sh

    $ cd EXAMPLE
    $ ./getSamplesScript.sh
    $ cd ..

## 3. Make sample list

    $ find `pwd`/EXAMPLE/samples -name "*.bam" > sample_list
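
The same list can be produced from Python, which makes it easy to guarantee sorted, duplicate-free absolute paths. This is a minimal sketch, not part of metaSNV; the function names are hypothetical and the matching is a plain case-sensitive `.bam` suffix test on regular file names.

```python
import os


def find_bam_files(root):
    """Return sorted, de-duplicated absolute paths of all *.bam files under root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".bam"):
                found.add(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(found)


def write_sample_list(root, out_path="sample_list"):
    """Write one BAM path per line and return the number of samples written."""
    paths = find_bam_files(root)
    with open(out_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return len(paths)
```

For the tutorial, `write_sample_list("EXAMPLE/samples")` run from the repository root would write `sample_list` in the current directory.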

## 4. Run the SNV calling step

    $ python metaSNV.py tutorial sample_list db/freeze9.genomes.RepGenomesv9.fna --threads 8

## 5. Run filtering and post processing

    $ python metaSNV_post.py tutorial
    
    Voila! Your distances will be in the tutorial/distances folder. Enjoy!

Advanced usage 
==================================

If you want to run many samples on a cluster, metaSNV can print the commands that need to be
run, and you can decide how to schedule and manage them.

## 1. Get the first set of commands
    
    $ python metaSNV.py tutorial sample_list db/freeze9.genomes.RepGenomesv9.fna --n_splits 8 --print-commands
    
    Note the addition of "--print-commands". This prints the one-liners that you need to run. When they have finished, run the same command again.
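
How the printed one-liners are executed is up to you. As one possibility, the sketch below runs them concurrently on a single machine. It assumes that you saved the printed output to a file (for example a hypothetical `commands.txt`) and that every non-empty line in it is a complete shell command; the function names are made up for this example. On a real cluster you would submit each line to your scheduler instead.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_commands(lines):
    """Keep non-empty lines; each one is treated as a single shell command."""
    return [line.strip() for line in lines if line.strip()]


def run_commands(commands, max_workers=4):
    """Run shell commands concurrently and return their exit codes in input order."""
    def run_one(command):
        # shell=True because each line is a complete shell one-liner.
        return subprocess.call(command, shell=True)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))
```

Any non-zero exit code in the returned list marks a command that should be inspected and rerun before moving on to the next step.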

## 2. Get the second set of commands
 
    $ python metaSNV.py tutorial sample_list db/freeze9.genomes.RepGenomesv9.fna --n_splits 8 --print-commands
    
    This will calculate the "load balancing" and give you the commands for running the SNV calling.
    
## 3. Run post-processing as usual

    $ python metaSNV_post.py tutorial