Commit 6012fddc authored by Robin Erich Muench's avatar Robin Erich Muench

Merge branch 'master' of git.embl.de:rmuench/metaSNP

parents 95089c6b 25f9cb25
Setup & Compilation
===================
MetaSNP is mainly implemented in C/C++ and needs to be compiled.
I. Setup
--------
a) Change the SAMTOOLS variable in the **SETUPFILE** to your path to Samtools.
If you don't have samtools, you can find it [here](http://samtools.sourceforge.net/).
We assume that ``samtools`` is in your system path variable.
You can set this variable by adding the following line to your `.bashrc`:

    export PATH=/path/2/samtools:$PATH
b) Change the BOOST variable in the **SETUPFILE** to your path to the BOOST library.
If you don't have boost, you can find it [here](http://www.boost.org/users/download/).
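For orientation, the corresponding **SETUPFILE** entries might then look like this (the assignment syntax and both paths are assumptions based on the variable names above, not real locations):

    SAMTOOLS=/path/2/samtools
    BOOST=/path/2/boost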
c) Setup your own **reference database** or acquire the database we use.
In order to acquire our database run the provided script in the parent directory:
usage: ./getRefDB.sh
II. Compilation
---------------
1) run `make` in the parent directory of metaSNP to compile qaTools and the snpCaller.
usage: make
Usage:
======
Required Files:
===============
* **'all\_samples'** = a list of all BAM files, one /path/2/sample.bam per line (avoid duplicates!)
* **'ref_db'** = the reference database in fasta format (e.g. a multi-sequence fasta)
* **'gen\_pos'** = a list with start and end positions for each sequence (format: sequence_id start end)
* **TODO! Format 'db\_ann'** = [optional] a gene annotation file for the reference in [general feature format](http://www.ensembl.org/info/website/upload/gff.html) (GFF).
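For illustration, the first two inputs might look like this (all paths and IDs below are made up):

    # all_samples: one BAM path per line
    /path/2/samples/sampleA.bam
    /path/2/samples/sampleB.bam

    # gen_pos: sequence_id start end
    420247.contig00001 1 245672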
Workflow:
=========
**__Note:__ Each part depends on the successful completion of the previous one.**
## 1. Initiate a NewProject with NewProject.sh
usage: ./NewProject.sh project_dir
Generates a structured results directory for your project.
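The exact layout is created by the script; judging only from the paths used later in this README, it contains at least something like:

    project_dir/
    ├── all_samples        (provided by you, see 'Required Files')
    ├── snpCaller/         (SNP calls, e.g. called_SNPs.best_split_*)
    └── filtered/pop/      (filtered population SNPs)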
## 2. Part I: Pre-processing [optional]
Note: This part can be skipped if the workload balancing has already been computed.
### a) run createCoverageComputationRun.sh
usage: ./createCoverageComputationRun.sh all_samples project_dir/ > toRunCov
The script generates a list of command-line jobs.
Run these jobs before proceeding with the next step, for example as shown below.
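One simple way to execute the generated job list, assuming it was redirected into `toRunCov` as in the usage line above (GNU parallel is optional and just one of many ways to use several cores):

    # run all jobs sequentially
    bash toRunCov
    # or run four jobs at a time with GNU parallel
    parallel -j 4 < toRunCov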
### b) run createOptSplit
Helper script for workload balancing (reference genome splitting).
usage: ./createOptSplit project_dir/ [nr_splits]
Note: This step requires an appropriate reference genome_definitions file.
## 3. Part II: SNP calling
Note: Requires the pre-processing step (Part I) to be completed, or an already existing workload split (e.g. src/bestsplits).
### a) createSNPrunningBatch.sh
The script generates a list of command-line jobs.
Submit each command line to your grid engine (GE) or compute locally, e.g. as sketched below.
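As a sketch, assuming a Sun/Univa Grid Engine and the job list `runSNPcall` from the tutorial below (adapt the qsub flags to your cluster):

    while read cmd; do
        echo "$cmd" | qsub -cwd -V -N snpcall
    done < runSNPcall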
## 4. Part III: Post-Processing (Filtering & Analysis)
### a) Filtering:
Note: requires SNP calling (Part II) to be done!
> script in **src/filtering.py** with an elaborate help message
usage: python src/filtering.py
### b) Analysis:
> TODO: include R scripts for pairwise distance and subspecies computation
Example Tutorial
================
## 1. Run the setup & compilation steps and acquire the provided reference database.
## 2. Go to the EXAMPLE directory and acquire the samples with getSamplesScript.sh
$ cd EXAMPLE
$ ./getSamplesScript.sh
## 3. Initiate a new project in the parent directory
$ ./NewProject.sh tutorial
## 4. Generate the 'all_samples' file
$ find `pwd`/EXAMPLE/samples -name "*.bam" > tutorial/all_samples
## 5. Prepare and run the coverage estimation
$ ./createCoverageComputationRun.sh tutorial/ tutorial/all_samples > runCoverage
$ bash runCoverage
## 6. Perform the workload balancing for parallelization into five jobs.
$ ./createOptSplit tutorial/ db/Genomev9_definitions 5
## 7. Prepare and run the SNP calling step
$ ./createSNPrunningBatch.sh tutorial/ tutorial/all_samples > runSNPcall
$ bash runSNPcall
## 8. Run the post processing / filtering steps
Compute allele frequencies for each position that passes the given thresholds.
$ ./filtering.py tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab tutorial/snpCaller/called_SNPs.best_split_* tutorial/all_samples tutorial/filtered/pop/
Basic usage
===========
If you are interested in using the pipeline in a more manual fashion (for example, running the metaSNP caller stand-alone and without a gene annotation file), feel free to explore the src/ directory.
You will find the scripts as well as the binaries for qaCompute and the metaSNP caller in their corresponding directories (src/qaCompute and src/snpCaller) after compilation via make.
Additional information and options (tools and scripts):
---------------------------------------------------------
### metaSNP caller
Calls SNPs from samtools pileup format and generates two outputs:
usage: ./snpCall [options] < stdin.mpileup > std.out.popSNPs
Options:
-f, faidx indexed reference genome.
-g, gene annotation file.
-i, individual SNPs.
Note: Expects samtools mpileup as standard input, for example as shown below.
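A minimal sketch of such a pipe; the file names are placeholders, and the exact mpileup flags may vary with your samtools version:

    samtools mpileup -f ref_db.fasta -b all_samples | ./snpCall -f ref_db.fasta -i indiv.snp > pop.snp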
#### __Output__
1. Population SNPs (pSNPs):
Population wide variants that occur with a frequency of 1 % at positions with at least 4x coverage.
2. Individual specific SNPs (iSNPs):
Non-population variants that occur with a frequency of 10 % at positions with at least 10x coverage.
### [qaCompute](https://github.com/CosteaPaul/qaTools)
Computes normal and span coverage from a bam/sam file.
Also counts unmapped and sub-par quality reads.
#### __Parameters:__
m - Compute median coverage for each contig/chromosome.
Will make running a bit slower. Off by default.
Because mappers sometimes break the headers or simply don't output them,
this is provided as a non-kosher way around it. Use with care!
For more info on the parameters try ./qaCompute
### filtering.py
usage: metaSNP filtering [-h] [-p PERC] [-c COV] [-m MINSAMPLES] [-s SNPC]
                         [-i SNPI]
                         perc_FILE cov_FILE snp_FILE [snp_FILE ...]
                         all_samples output_dir/

metaSNP filtering

positional arguments:
  perc_FILE             input file with horizontal genome (taxon) coverage
                        (breadth) per sample (percentage covered)
  cov_FILE              input file with average genome (taxon) coverage
                        (depth) per sample (average number of reads per site)
  snp_FILE              input files from SNP calling
  all_samples           list of input BAM files, one per line
  output_dir/           output folder

optional arguments:
  -h, --help            show this help message and exit
  -p PERC, --perc PERC  Coverage breadth: Horizontal coverage cutoff
                        (percentage taxon covered) per sample (default: 40.0)
  -c COV, --cov COV     Coverage depth: Average vertical coverage cutoff per
                        taxon, per sample (default: 5.0)
  -m MINSAMPLES, --minsamples MINSAMPLES
                        Minimum number of samples that have to pass the
                        filtering criteria in order to write an output for
                        the representative Genome (default: 2)
  -s SNPC, --snpc SNPC  FILTERING STEP II: SNP coverage cutoff (default: 5.0)
  -i SNPI, --snpi SNPI  FILTERING STEP II: SNP occurrence (incidence) cutoff
                        within samples_of_interest (default: 0.5)
import os # interacting with operating system
import sys # interact with system commandline
import time # print current date/time
import argparse # manage command line options and arguments
#directory = os.path.realpath(__file__)
cwdir = os.getcwd()
print("\n\nmetaSNP post-processing:")
#############################
# Parse Commandline Arguments
#############################
def get_arguments():
'''
Get commandline arguments and return namespace
'''
## Initialize Parser
parser = argparse.ArgumentParser(prog='metaSNP post-processing', description='metaSNP post-processing', epilog='''Note:''', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--version', action='version', version='%(prog)s 1.0')
## Verbose or Quiet -- More or Less information (Mutually Exclusive Options) --- Not Used!
#group = parser.add_mutually_exclusive_group()
#group.add_argument("-e", "--verbose", action="store_true")
#group.add_argument("-q", "--quiet", action="store_true")
# Flags:
## debugging flag
parser.add_argument("-d", "--debug", action="store_true")
## VCF format
parser.add_argument("-v", "--vcf", action="store_true")
## Population Statistics (Fst - ) TODO:
parser.add_argument("-f", "--fst", action="store_true")
## pNpS ratio TODO:
parser.add_argument("-n", "--pnps", action="store_true")
# OPTIONAL arguments:
## Optional Cutoffs - SAMPLE FILTERING:
parser.add_argument('-p','--perc', type=float, help="Coverage breadth: Horizontal coverage cutoff (percentage taxon covered) per sample", default=40.0)
parser.add_argument('-c','--cov', type=float, help="Coverage depth: Average vertical coverage cutoff per taxon, per sample", default=5.0)
parser.add_argument('-m','--minsamples', type=int, help="Minimum number of samples that have to pass the filtering criteria in order to write an output for the representative Genome", default=10)
## Optional Cutoffs - SNP FILTERING:
parser.add_argument('-s','--snpc', type=float, help="FILTERING STEP II: SNP coverage cutoff", default=5.0)
parser.add_argument('-i','--snpi', type=float, help="FILTERING STEP II: SNP occurrence (incidence) cutoff within samples_of_interest", default=0.50)
# FILES:
## Required Input Files
# Horizontal Genome Coverage
parser.add_argument('percentage_file', help='input file with horizontal genome (taxon) coverage (breadth) per sample (percentage covered)', metavar='perc_FILE')
# Vertical Genome Coverage
parser.add_argument('coverage_file', help='input file with average genome (taxon) coverage (depth) per sample (average number of reads per site)', metavar='cov_FILE')
# SNP FILE (list of one or more files)
parser.add_argument('snp_files', nargs='+', help='input files from SNP calling', metavar='snp_FILE')
## Optional OUTPUT DIRECTORY
parser.add_argument('-out', '--out_folder', nargs='?', const='./filtered/', default='./filtered/', help='output folder', metavar='FILE_OUT')
#parser.add_argument('outfile', nargs='?', type=argparse.FileType('w'), default=sys.stdout)
# File with all samples (one per line) TODO: snpCall with headers!
parser.add_argument('all_samples', help='list of input BAM files, one per line', metavar='all_samples')
return parser.parse_args()
#args = get_arguments() # Namespace with commandline arguments
##############################
def debugging(samples_of_interest, header_cov):
'''Sanity check for the dict of TaxID:samples (uses the dict passed in rather than a global)'''
nr_keys = 0
taxids_of_interest = []
for key in samples_of_interest:
nr_keys += 1
taxids_of_interest.append(key)
taxids_of_interest.sort()
print("DEBUGGING:\nNr_TaxIDs_of_Interest.keys: {}".format(nr_keys))
print(taxids_of_interest)
# print("\n")
# print("nr_TaxIDs_of_Interest.keys: {}".format(nr_keys))
header = "\t".join(header_cov)
print("DEBUGGING: no. samples in header: {}\n".format(len(header_cov)))
# print("header: {}".format(header))
sys.exit("only filter 1")
def file_check():
''' Check if required files exist (True / False)'''
print("\nchecking for files...")
if os.path.isfile(args.coverage_file) and os.path.isfile(args.percentage_file):
print("found: '{}' \nfound:'{}'".format(args.coverage_file, args.percentage_file))
else:
sys.exit("\nERROR: No such file '{}',\nERROR: No such file '{}'".format(args.coverage_file, args.percentage_file))
if os.path.isfile(args.all_samples):
print("found: '{}'\n".format(args.all_samples))
else:
sys.exit("\nERROR: No such file '{}'".format(args.all_samples))
def print_arguments():
# for argument in args:
# print(argument)
# print("end-loop")
## TODO write nice loop over args!
## Print Defined Thresholds:
print("Options:")
if args.perc:
print("threshold: percentage covered (breadth) {}".format(args.perc) )
if args.cov:
print("threshold: average coverage (depth) {}".format(args.cov) )
if args.minsamples:
print("threshold: Min. number samples_of_interest per taxid_of_interest {}".format(args.minsamples) )
if args.snpc:
print("threshold: Min. coverage per position within a sample_of_interest {}".format(args.snpc) )
if args.snpi:
print("threshold: Min. position incidence among samples_of_interest {}".format(args.snpi) )
#if args.quiet:
# print("threshold: quiet {}".format(args.quiet) )
if args.vcf:
print("threshold: vcf {}".format(args.vcf) )
if args.out_folder:
print("Output: {}".format(args.out_folder) )
# Check VCF meta info:
if args.vcf:
print(write_vcf_meta())
# print(write_vcf_header())
# sys.exit("DEBUG: vcf_test")
else:
print("Printing no vcf")
print("\n")
# VCF format representation
def write_vcf_meta():
"function that generates meta-information lines for VCF version 4.2"
vcf_meta = """##fileformat=VCFv4.2
##fileDate={0}
##source=metaSNP_frequency.py
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##INFO=<ID=PS,Number=1,Type=Integer,Description="Total Number of Samples in Sample Pool">
##INFO=<ID=NT,Number=1,Type=Integer,Description="Total Number of Samples of Interest (SoI) that Passed Filtering)">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of SoIs With Data at the current Position">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth (only SoIs)">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=MA,Number=1,Type=String,Description="Major Allele">
##INFO=<ID=RF,Number=0,Type=Flag,Description="Available Gene Annotation">
##FILTER=<ID=bCV,Description="SoI: Genome (Species) Coverage below {1}%">
##FILTER=<ID=dCV,Description="SoI: Average Genome Coverage (Depth) below {2}X">
##FILTER=<ID=mSoI,Description="Taxon of Interest: SoI below {3}">
##FILTER=<ID=pCov,Description="Position Coverage below {4}">
##FILTER=<ID=pInc,Description="Position Incidence below {5}">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Reads with ALT base">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Position Read Depth">
##FORMAT=<ID=AF,Number=2,Type=Integer,Description="Haplotype Quality">""".format(time.strftime("%Y%m%d"),args.perc,args.cov,args.minsamples,args.snpc,args.snpi)
return vcf_meta
# VCF format line header
def write_vcf_header():
"function that generates the line header for VCF version 4.2"
vcf_header = """#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT"""
return vcf_header
# VCF format line
def write_vcf_line(chrom, pos, id, ref, alt, qual, filt, info, format):
"function that generates the line for each variant (VCFv4.2)"
# Define Function Arguments
# chromosom = contig
# position = pos
# id = (We put the gene annotations here)
# ref = refBase
# alt = altBase
# qual = qualScore
# filter = (Hack: put PASS and define filters in meta!)
# infO = (info string - global variables)
# format =
# foreach sample append(string\t)
# vcf_fields = ("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003")
vcf_line = """{0} {1} {2} {3} {4} {5} {6} {7} {8}
""".format(chrom, pos, id, ref, alt, qual, filt, info, format)
return vcf_line
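# Usage sketch (values are hypothetical; fields are joined exactly as in the
# template string above):
# write_vcf_line("420247.contig7", 1234, ".", "C", "T", 30, "PASS", "DP=24;AF=0.5", "GT:DP")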
###############################
## FILTER I: Count SAMPLES_OF_INTEREST per TAXON_OF_INTEREST
# --Coverage Conditions: (Example: "Which taxon has at least 10 SoI?")
# Sample of Interest:
# 1. breadth > 40 %
# 2. depth > 5 X
# --Min. number Samples:
# 3. min samples/TaxID > 10
#
## .Open both files and discard first and second row.
## ..first row = sample_names, second_row = garbage
## .Check Requirements Line by Line
## ..store samples meeting the requirements in a list
## .Map list with samples_of_interest to the TaxID
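# Illustration of the expected table layout for both files (values hypothetical):
# first row = sample names, second row = garbage (skipped), then one row per
# taxon with the TaxID in the first column:
#   TaxID    sampleA.bam    sampleB.bam
#   420247   6.4            0.9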
def relevant_taxa(args):
'''function that goes through the coverage files and determines taxa and samples of interest'''
samples_of_interest = {}
with open(args.coverage_file, 'r') as COV, open(args.percentage_file, 'r') as PER:
header_cov = COV.readline().split()
header_per = PER.readline().split()
COV.readline() # skip second row in COV_FILE
PER.readline() # skip second row in PER_FILE
# Check if header match:
if header_cov != header_per:
sys.exit("ERROR: Coverage file headers do not match!") #Exit with error message
# Read taxon by taxon (line by line) check coverage conditions (thresholds)
for cov,perc in zip(COV,PER): # Line_cov: TaxID \t cov_valueS1 \t cov_value_S2[...]\n (no white spaces)
coverage = list(map(float, cov.split())) # convert list_of_strings 2 list_of_floats (TaxID=INT!)
percentage = list(map(float, perc.split())) # convert list_of_strings 2 list_of_floats
cov_taxID = int(coverage.pop(0)) # get and remove taxID (TaxID = INT)
perc_taxID = int(percentage.pop(0)) # get and remove taxID (TaxID = INT)
sample_count = 0
sample_names = []
if cov_taxID != perc_taxID: # check if taxIDs match!
sys.exit("ERROR: TaxIDs are not in the same order!") #Exit with error message
# Now loop through the samples and find those matching the criteria:
else:
for c,p in zip(coverage,percentage): # two arrays with floats (taxID pop[removed])
sample_count += 1 # which sample are we at?
# Predefined Threshold defaults: coverage[5X], percentage[40%]
if c >= args.cov and p >= args.perc:
# print("sample_count{}, TaxID: {}. Sample: {} matched the criteria. cov: {}, perc: {}".format(sample_count,cov_taxID, header_cov[sample_count-1], c, p))
sample_names.append(header_cov[sample_count-1]) # starting at 0 in array
# print ('COVERAGE:',cov,'PERCENTAGE',perc)
# sys.exit(0) # Exit with 0 (=Success
# Generate dict of lists: { TaxID : [samples_of_interest] }
# Only if the counter went through all samples and sample_names is not empty
if sample_count == len(header_cov) and len(sample_names) >= args.minsamples:
samples_of_interest[cov_taxID] = sample_names
return {'SoI': samples_of_interest, 'h': header_cov} # return dict(); the 'with' statement closes both files
#################################
# FILTER I: END (returns dict 'samples_of_interest')
#################################
def header_comparison(header_cov):
'''Comparing the sample order in the coverage and all_samples file'''
# TODO: Header comparison
#
# sort snp_indices by sample_of_interest names
#
# Header from all.cov and all.perc files
header = "\t".join(header_cov)
print("\nNr.Samples_cov_header: {}".format(len(header_cov)))
print("cov_header: {}\n\n".format(header_cov[:10]))
#print("\ncov_header: {}\n {}\n\n".format(len(header_cov),header))
### read <all_samples> for snp_file header:
all_samples = open(args.all_samples,'r')
snp_header = all_samples.read().splitlines()
snp_header = [i.split('/')[-1] for i in snp_header] #get name /trim/off/path/to/sample.name.bam
snp_header_joined = "\t".join(snp_header)
print("Nr.Samples in (all_samples): {}".format(len(snp_header)))
print("all_samples: {}\n".format(snp_header[:10]))
# Compare the sample order in ALL.COV/ALL.PERC and all_samples
print("Comparing headers..")
if header == snp_header_joined:
print("Headers match nicely\n")
# sys.exit("Debugging stop")
else:
print("CAUTION: Header in COV_FILE does not match the order of samples in the SNP_FILES,\n\t no problem, we took care of it!\n")
# sys.exit("ERROR: SNP and COV/PERC Headers do not match")
return snp_header
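# Note: a mismatching order is tolerated because filter_two() looks each sample
# up by name via snp_header.index(name) instead of relying on the column order.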
#################################
## FILTER II: ACQUIRE SNPs WITH SUFFICIENT OCCURRENCE WITHIN SAMPLES_OF_INTEREST
# --SNP Conditions (Default):
# 1. Position covered by at least (5) reads
# 2. Position present in at least 50 % of the accepted samples_of_interest
#
#
##.Open SNP files (parse list of best_splits_[0-9])
##..compare header
##..filter for SNPs with >=5X coverage and >=50% occurrence within the per-TaxID filtered samples
##..compute allele frequencies for SNPs_OF_Interest for each genome (TaxID)
##..Write output file (one per Genome=TaxID)
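# Illustration of a SNP line as parsed below (values hypothetical): column 0
# looks like "TaxID.contig" and column 4 holds per-sample coverages joined by
# '|'; the columns in between are not interpreted by this filter:
#   420247.contig7  c1  c2  c3  12|0|7|5  ...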
def filter_two(args, snp_header):
'''position wise filtering'''
# Initiate for writing appropriate header (according to SoIs)
header_taxID = 0
## FILTER II:
for best_split_x in args.snp_files:
# print(best_split_x, type(best_split_x))
with open(best_split_x, 'r') as file:
for snp_line in file: #go through the file snp_position wise
# print(snp_line)
# TODO: Compare Header !!! --- Fix snpCall to output Header for each best_split from all_samples
# get taxID
snp_taxID = int(snp_line.split()[0].split('.')[0])
# print(snp_taxID)
# Find Taxon of Interest (Taxon with sufficient samples_of_interest)
if snp_taxID not in samples_of_interest.keys():
# print("nope, snp_taxID:{} not in dictionary".format(snp_taxID))
continue #Taxon is not relevant, NEXT!
else:
# Load SAMPLE_LIST: Get SoI's within ToI
sample_list = samples_of_interest[snp_taxID]#list of sample_names
# Get indices for samples according to SNP header (for now, from all_samples file):TODO!
# !!! INDICES based on ORDER in COV/PERC file!!!
# print("sample_list: {}".format(sample_list))
sample_indices = []
for name in sample_list:
# print("name: {} and snp_header.index(name): {}".format(name,snp_header.index(name)))
# Get only indices for SoI's!
sample_indices.append(snp_header.index(name))
# print(snp_header.index(name))
# SANITY CHECKS:
# print(len(sample_indices),len(sample_list))
# print("\t".join(sample_list))
# sys.exit("only filter 1")
##
# Do FST computation here or after position wise filtering?!
#
#
#OPEN outfiles and write header:
# Write the appropriate Header (only Samples_of_interest)
if header_taxID != snp_taxID:#first of new taxID
outfile = open(args.out_folder+'%i.filtered.freq' % snp_taxID, 'a')
outfile.write('\t' + "\t".join(sample_list) + '\n')
if args.vcf: # VCF format
outfile_vcf = open(args.out_folder+'%i.filtered.vcf' % snp_taxID, 'a')
print("Generating: {}".format(args.out_folder+'%i.filtered.vcf' % snp_taxID))
outfile_vcf.write(write_vcf_meta() + '\n')#Write VCF meta header
outfile_vcf.write(write_vcf_header()+'\n')#Write VCF line header
header_taxID = snp_taxID#Jump to current TaxID
# CONDITIONS:
#.Loop through samples_of_interest and compare criteria:
#..IF (Example)
#..coverage at position < 5 X (=we consider the position not seen)
#..occurrence < 50% within considered samples (SoIs)
#..->Drop SNPs
whole_line = snp_line.split()
site_coverage = list(map(int, whole_line[4].split('|'))) # Site coverages as list of ints
#FILTER: Position coverage below threshold -> no increment:
nr_good = 0
for index in sample_indices:
if site_coverage[index] < args.snpc or site_coverage[index] == 0:
continue#NOT enough coverage depth at position in sample, no increment
else:
nr_good += 1
#FILTER: Position incidence with sufficient coverage below threshold: