# MetaSNV, a metagenomic SNV calling pipeline


The metaSNV pipeline performs variant calling on aligned metagenomic samples.


Download
========

Via Git:

    git clone https://git.embl.de/costea/metaSNV.git
    
or [download](https://git.embl.de/costea/metaSNV/repository/archive.zip?ref=master) a zip file of the repository.

Dependencies
============

* [Boost-1.53.0 or above](http://www.boost.org/users/download/)

* [htslib](http://www.htslib.org/)
 
* Python-2.7 or above
    * numpy
    * pandas
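
Before compiling, it can help to confirm that the Python modules listed above are importable. The following is a minimal sketch and not part of metaSNV; the helper name `missing_modules` is made up for illustration, and the code assumes a Python 3 interpreter (it relies on `importlib.util.find_spec`).

```python
import importlib.util


def missing_modules(names):
    """Return the entries of `names` that cannot be found by this interpreter."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(["numpy", "pandas"])` returns an empty list when both modules are available in the active environment; any name it returns still needs to be installed.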

### Installing dependencies on Ubuntu/Debian

On an Ubuntu/Debian system, the following sequence of commands will install the
required C/C++ libraries, htslib and Boost (the first two commands are only
necessary if you have not enabled the universe repository before):


    sudo add-apt-repository "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc) universe"
    sudo apt-get update
    sudo apt-get install libhts-dev libboost-dev

### Installing dependencies using anaconda

If you use [anaconda](https://www.continuum.io/downloads), you can create an
environment with all necessary dependencies using the following commands:

    conda create --name metaSNV boost htslib pkg-config numpy pandas
    source activate metaSNV
    export CFLAGS=-I$CONDA_ENV_PATH/include
    export LD_LIBRARY_PATH=$CONDA_ENV_PATH/lib:$LD_LIBRARY_PATH

If you do not have a C++ compiler, anaconda can also install G++:

    conda create --name metaSNV boost htslib pkg-config numpy pandas
    source activate metaSNV
    # Add this command:
    conda install gcc
    export CFLAGS=-I$CONDA_ENV_PATH/include
    export LD_LIBRARY_PATH=$CONDA_ENV_PATH/lib:$LD_LIBRARY_PATH

Setup & Compilation
===================

    make

Workflow:
=========
## Required Files:

* **'all\_samples'**  = a list of all BAM files, one /path/to/sample.bam per line (no duplicates)
* **'ref\_db'**       = the reference database in FASTA format (e.g. a multi-sequence FASTA file)
* **'gen\_pos'**      = a list with start and end positions for each sequence in the reference (format: `sequence_id start end`)
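
The formats above can be checked before starting a long run. The sketch below is not part of metaSNV; the function names are hypothetical, and it assumes that sample paths end in `.bam`, that the fields of `gen_pos` are separated by whitespace, and that start and end are integers with start not larger than end.

```python
def check_sample_list(lines):
    """Return a list of problems found in an 'all_samples' listing."""
    problems = []
    seen = set()
    for number, raw in enumerate(lines, start=1):
        path = raw.strip()
        if not path:
            continue  # blank lines are ignored
        if not path.endswith(".bam"):
            problems.append("line %d: not a .bam path: %s" % (number, path))
        if path in seen:
            problems.append("line %d: duplicate entry: %s" % (number, path))
        seen.add(path)
    return problems


def check_positions(lines):
    """Return a list of problems found in a 'gen_pos' listing."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        fields = raw.split()
        if not fields:
            continue  # blank lines are ignored
        if len(fields) != 3:
            problems.append("line %d: expected 3 fields, got %d" % (number, len(fields)))
            continue
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            problems.append("line %d: start/end are not integers" % number)
            continue
        if start > end:
            problems.append("line %d: start is larger than end" % number)
    return problems
```

Both functions accept any iterable of lines, so an open file handle works directly, for example `check_sample_list(open("all_samples"))`. An empty result means no problem was detected under the assumptions stated above.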

## Optional Files:
* **'db\_ann'** = a gene annotation file for the reference database (format: ).

## 1. To use one of the provided reference databases:

    ./getRefDB.sh
    
## 2. Run metaSNV

    metaSNV.py project_dir/ all_samples ref_db [options]

## 3. Part II: Post-Processing (Filtering & Analysis)
Note: this requires the SNV calling step (step 2) to have finished.

    metaSNV_post.py project_dir [options]
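
Because post-processing depends on the SNV calling having finished, the two commands can be chained so that the second only starts if the first succeeds. This is a minimal sketch, not part of metaSNV: the function names are hypothetical, the scripts are assumed to sit in the current directory, and `--threads` is the only option used because it is the one shown in the tutorial of this README.

```python
import subprocess
import sys


def build_commands(project_dir, sample_list, ref_db, threads=None):
    """Return the SNV calling and post-processing commands as argument lists."""
    calling = [sys.executable, "metaSNV.py", project_dir, sample_list, ref_db]
    if threads is not None:
        calling += ["--threads", str(threads)]
    post = [sys.executable, "metaSNV_post.py", project_dir]
    return [calling, post]


def run_pipeline(commands):
    """Run the commands in order and stop at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so post-processing never runs after a failed SNV calling step.
        subprocess.run(command, check=True)
```

A call such as `run_pipeline(build_commands("project_dir", "all_samples", "ref_db", threads=8))` would then execute both parts back to back.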

Example Tutorial
================

## 1. Run the setup & compilation steps and download the provided reference database. 

    $ ./getRefDB.sh

Select freeze9, as the tutorial files have been mapped against this freeze.

## 2. Go to the EXAMPLE directory and download the samples with the getSamplesScript.sh

    $ cd EXAMPLE
    $ ./getSamplesScript.sh

## 3. Make sample list

    $ find `pwd`/EXAMPLE/samples -name "*.bam" > sample_list
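
For readers who prefer to build the list from Python, the sketch below roughly mirrors the `find` command: it collects every file ending in `.bam` below a directory and writes one absolute path per line. It is an illustration only; the function names are hypothetical, and sorting and de-duplicating the paths is a choice made here to satisfy the "no duplicates" requirement on the sample list.

```python
import os


def find_bam_files(root):
    """Return sorted absolute paths of all *.bam files below `root`."""
    found = set()  # a set guarantees that no path is listed twice
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            if name.endswith(".bam"):
                found.add(os.path.abspath(os.path.join(directory, name)))
    return sorted(found)


def write_sample_list(root, output_path):
    """Write one BAM path per line to `output_path`; return the number written."""
    paths = find_bam_files(root)
    with open(output_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return len(paths)
```

For the tutorial this would be called as `write_sample_list("EXAMPLE/samples", "sample_list")`.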

## 4. Run the SNV calling step

    $ python metaSNV.py tutorial sample_list db/freeze9.genomes.RepGenomesv9.fna --threads 8

## 5. Run filtering and post processing

    $ python metaSNV_post.py tutorial
    
Voila! Your distances will be in the tutorial/distances folder. Enjoy!

Advanced usage
==============

If you want to process many samples on a compute cluster, metaSNV can print the commands that need to be run instead of
running them itself, so that you can decide how to schedule and manage them.

## 1. Get the first set of commands
    
    $ python metaSNV.py tutorial sample_list db/freeze9.genomes.RepGenomesv9.fna --n_splits 8 --print-commands
    
Note the addition of `--print-commands`. This prints the one-liners that you need to run. Once they have all finished, run the same command again.

## 2. Get the second set of commands
 
    $ python metaSNV.py tutorial sample_list db/freeze9.genomes.RepGenomesv9.fna --n_splits 8 --print-commands
    
This will calculate the "load balancing" and give you the commands for running the SNV calling.
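
How the printed one-liners are scheduled is left to you. As one possibility for a single multi-core machine, the sketch below runs a list of shell commands with a bounded number of workers and reports each exit code. It is not part of metaSNV: the function names are hypothetical, and it assumes you have saved the printed commands to a file of your own choosing, one command per line.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_commands(lines):
    """Return the non-empty lines, stripped of surrounding whitespace."""
    return [line.strip() for line in lines if line.strip()]


def run_commands(commands, workers=4):
    """Run shell one-liners, at most `workers` at a time.

    Returns (command, exit_code) pairs in the same order as the input.
    """
    def run_one(command):
        # shell=True because each entry is a complete shell one-liner.
        return command, subprocess.call(command, shell=True)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))
```

Every command in the first set should report exit code 0 before the second set is requested, since the second invocation builds on the results of the first. On a real cluster the same list would instead be handed to the site's job scheduler.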
    
## 3. Run post-processing as usual

    $ python metaSNV_post.py tutorial