Compare revisions

Changes are shown as if the source revision was being merged into the target revision.

*.o
@@ -9,9 +9,9 @@ Download

Via Git:

-    git clone git@git.embl.de:rmuench/metaSNP.git
+    git clone https://git.embl.de/costea/metaSNV.git

-or [download](https://git.embl.de/rmuench/metaSNP/repository/archive.zip?ref=master) a zip file of the repository.
+or [download](https://git.embl.de/costea/metaSNV/repository/archive.zip?ref=master) a zip file of the repository.

Dependencies
============

@@ -21,6 +21,8 @@ Dependencies

* [htslib](http://www.htslib.org/)
* Python-2.7 or above
+* numpy
+* pandas
#### Installing dependencies on Ubuntu/debian

@@ -31,7 +33,6 @@ universe repository before):

    sudo add-apt-repository "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc) universe"
    sudo apt-get update
    sudo apt-get install libhts-dev libboost-dev
### Installing dependencies using anaconda

@@ -39,19 +40,17 @@ universe repository before):

If you use [anaconda](https://www.continuum.io/downloads), you can create an
environment with all necessary dependencies using the following commands:

-    conda create --name metaSNV boost htslib pkg-config
+    conda create --name metaSNV boost htslib pkg-config numpy pandas
    source activate metaSNV
    export CFLAGS=-I$CONDA_ENV_PATH/include
    export LD_LIBRARY_PATH=$CONDA_ENV_PATH/lib:$LD_LIBRARY_PATH

If you do not have a C++ compiler, anaconda can also install G++:

-    conda create --name metaSNV boost htslib pkg-config
+    conda create --name metaSNV boost htslib pkg-config numpy pandas
    source activate metaSNV
    # Add this command:
    conda install gcc
    export CFLAGS=-I$CONDA_ENV_PATH/include
    export LD_LIBRARY_PATH=$CONDA_ENV_PATH/lib:$LD_LIBRARY_PATH
@@ -64,11 +63,12 @@ Workflow:
=========

## Required Files:

* **'all\_samples'** = a list of all BAM files, one /path/2/sample.bam per line (no duplicates)
* **'ref\_db'** = the reference database in fasta format (f.i. multi-sequence fasta)
* **'gen\_pos'** = a list with start and end positions for each sequence in the reference (format: `sequence\_id start end`)

## Optional Files:

-* **'db\_ann'** = a gene annotation file for the reference database.
+* **'db\_ann'** = a gene annotation file for the reference database (format: ).
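For illustration only (the paths are hypothetical), an 'all\_samples' file is simply a plain-text list of absolute BAM paths, one per line:

    /path/2/sample1.bam
    /path/2/sample2.bam
    /path/2/sample3.bam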
## To use one of the provided reference databases:

@@ -76,184 +76,59 @@ Workflow:

## 2. Run metaSNV

-    metaSNV.py project_dir/ all_samples ref_db ref_fasta [options]
+    metaSNV.py project_dir/ all_samples ref_db [options]

## 3. Part II: Post-Processing (Filtering & Analysis)
-### a) Filtering:

Note: requires SNP calling (Part II) to be done!

-Caution: Perform this step separately for individual SNPs and population SNPs.

-    usage: metaSNP_filtering.py

-    positional arguments:
-      perc_FILE             input file with horizontal genome (taxon) coverage (breadth) per sample (percentage covered)
-      cov_FILE              input file with average genome (taxon) coverage (depth) per sample (average number reads per site)
-      snp_FILE              input files from SNP calling
-      all_samples           list of input BAM files, one per line
-      output_dir/           output folder

-    optional arguments:
-      -h, --help            show this help message and exit
-      -p PERC, --perc PERC  Coverage breadth: Horizontal coverage cutoff (percentage taxon covered) per sample (default: 40.0)
-      -c COV, --cov COV     Coverage depth: Average vertical coverage cutoff per taxon, per sample (default: 5.0)
-      -m MINSAMPLES, --minsamples MINSAMPLES
-                            Minimum number of samples required to pass the coverage cutoffs per genome (default: 2)
-      -s SNPC, --snpc SNPC  FILTERING STEP II: SNP coverage cutoff (default: 5.0)
-      -i SNPI, --snpi SNPI  FILTERING STEP II: SNP occurrence (incidence) cutoff within samples_of_interest (default: 0.5)

-### b) Analysis:

-> TODO: include R scripts for computing pairwise distances and visualization

    metaSNV_post.py project_dir [options]
Example Tutorial
================

## 1. Run the setup & compilation steps and download the provided reference database.

-    ./getRefDB.sh
+    $ ./getRefDB.sh

+Select freeze9, as the tutorial files have been mapped against this freeze.

## 2. Go to the EXAMPLE directory and download the samples with the getSamplesScript.sh

    $ cd EXAMPLE
    $ ./getSamplesScript.sh

-## 3. Initiate a new project in the parent directory

-    $ metaSNV_New tutorial

-## 4. Generate the 'all_samples' file

-    $ find `pwd`/EXAMPLE/samples -name "*.bam" > tutorial/all_samples

-## 5. Prepare and run the coverage estimation

-    $ metaSNV_COV tutorial/ tutorial/all_samples > runCoverage
-    $ bash runCoverage

-## 6. Perform a work load balancing step for run time optimization.

-    $ metaSNV_OPT tutorial/ db/Genomev9_definitions 5
-    $ bash runCoverage

-## 7. Prepare and run the SNV calling step

-    $ metaSNV_SNP tutorial/ tutorial/all_samples db/RepGenomesv9.fna -a db/RefOrganismDB_v9_gene.clean -l tutorial/bestsplits/ > runSNPcall
-    $ bash runSNPcall

-## 8. Run the post processing / filtering steps

-### a) Compute allele frequencies for each position that pass the given thresholds.

-    $ metaSNV_filtering.py tutorial/tutorial.all_perc.tab tutorial/tutorial.all_cov.tab tutorial/snpCaller/called_SNVs.best_split_* tutorial/all_samples tutorial/filtered/pop/

-### b) Compute pair-wise distances between samples on their SNP profiles and create a PCoA plot.

+## 3. Make sample list

+    $ find `pwd`/EXAMPLE/samples -name "*.bam" > sample_list

+## 4. Run the SNV calling step

+    $ python metaSNV.py tutorial sample_list db/freeze9.genomes.RepGenomesv9.fna --threads 8

+## 5. Run filtering and post processing

+    $ python metaSNV_post.py tutorial

Voila! Your distances will be in the tutorial/distances folder. Enjoy!
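For orientation, and going by the post-processing code shown further down in this diff, the distance step writes two matrices per species into the distances folder, named like this (placeholder taxon ID):

    tutorial/distances/<taxID>.filtered.mann.dist
    tutorial/distances/<taxID>.filtered.allele.dist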
-Advanced usage
+Advanced usage (tools and scripts)
==================================

-If you are interested in using the pipeline in a more manual way (for example
-the metaSNV caller stand alone), you will find the executables for the
-individual steps in the `src/` directory.
+If you want to run a lot of samples and would like to use the power of your cluster, we will print out the commands you need to
+run and you can decide on how to schedule and manage them.
-metaSNV caller
---------------
-
-Calls SNVs from samtools pileup format and generates two outputs.
-
-    usage: ./snpCall [options] < stdin.mpileup > std.out.popSNPs
-
-    Options:
-        -f,     faidx indexed reference genome.
-        -g,     gene annotation file.
-        -i,     individual SNVs.
-
-Note: Expecting samtools mpileup as standard input
-
-### __Output__
-
-1. Population SNVs (pSNVs):
-    Population wide variants that occur with a frequency of 1 % at positions with at least 4x coverage.
-
-2. Individual specific SNVs (iSNVs):
-    Non-population variants that occur with a frequency of 10 % at positions with at least 10x coverage.
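As a minimal sketch of how the caller is typically fed (file names are placeholders, and it is assumed here that `-i` names the output file for individual SNVs, as the option list above suggests):

    samtools mpileup -f ref_db.fna sample1.bam sample2.bam | ./snpCall -f ref_db.fna -g db_ann.txt -i out.indiv > out.pop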
-[qaCompute](https://github.com/CosteaPaul/qaTools)
--------------------------------------------------
-
-Computes normal and span coverage from a bam/sam file.
-Also counts unmapped and sub-par quality reads.
-
-### __Parameters:__
-
-    m        - Compute median coverage for each contig/chromosome.
-               Will make running a bit slower. Off by default.
-
-    q [INT]  - Quality threshold. Any read with a mapping quality under
-               INT will be ignored when computing the coverage.
-               NOTE: bwa outputs mapping quality 0 for reads that map with
-               equal quality in multiple places. If you want to consider this,
-               set q to 0.
-
-    d        - Print coverage histogram over each individual contig/chromosome.
-               These details will be printed in file <output>.detail
-
-    p [INT]  - Print coverage profile to bed file, averaged over given window size.
-
-    i        - Silent run. Will not print running info to stdout.
-
-    s [INT]  - Compute span coverage. (Use for mate pair libs)
-               Instead of actual read coverage, using this option will consider
-               the entire span of the insert as a read, if insert size is
-               lower than INT.
-               For an accurate estimation of span coverage, I recommend
-               setting an insert size limit INT around 3*std_dev of your lib's
-               insert size distribution.
-
-    c [INT]  - Maximum X coverage to consider in histogram.
-
-    h [STR]  - Use different header.
-               Because mappers sometimes break the headers or simply don't output them,
-               this is provided as a non-kosher way around it. Use with care!
-
-For more info on the parameters try ./qaCompute
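A hedged example invocation, assuming the usual `./qaCompute [options] <in.bam> <out.stats>` argument order (check `./qaCompute` on your build before relying on it):

    ./qaCompute -m -q 20 -d sample1.bam sample1.cov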
-metaSNV_filtering.py
---------------------
-
-    usage: metaSNV filtering [-h] [-p PERC] [-c COV] [-m MINSAMPLES] [-s SNPC]
-                             [-i SNPI]
-                             perc_FILE cov_FILE snp_FILE [snp_FILE ...]
-                             all_samples output_dir/
-
-    metaSNV filtering
-
-    positional arguments:
-      perc_FILE             input file with horizontal genome (taxon) coverage
-                            (breadth) per sample (percentage covered)
-      cov_FILE              input file with average genome (taxon) coverage
-                            (depth) per sample (average number reads per site)
-      snp_FILE              input files from SNP calling
-      all_samples           list of input BAM files, one per line
-      output_dir/           output folder
-
-    optional arguments:
-      -h, --help            show this help message and exit
-      -p PERC, --perc PERC  Coverage breadth: Horizontal coverage cutoff
-                            (percentage taxon covered) per sample (default: 40.0)
-      -c COV, --cov COV     Coverage depth: Average vertical coverage cutoff per
-                            taxon, per sample (default: 5.0)
-      -m MINSAMPLES, --minsamples MINSAMPLES
-                            Minimum number of samples that have to pass the
-                            filtering criteria in order to write an output for the
-                            representative Genome (default: 2)
-      -s SNPC, --snpc SNPC  FILTERING STEP II: SNP coverage cutoff (default: 5.0)
-      -i SNPI, --snpi SNPI  FILTERING STEP II: SNP occurrence (incidence) cutoff
-                            within samples_of_interest (default: 0.5)

+## 1. Get the first set of commands
+
+    $ python metaSNV.py tutorial sample_list db/freeze9.genomes.RepGenomesv9.fna --n_splits 8 --print-commands
+
+Note the addition of "--print-commands". This will print out one-liners that you need to run. When done, run the same command again.
+
+## 2. Get the second set of commands
+
+    $ python metaSNV.py tutorial sample_list db/freeze9.genomes.RepGenomesv9.fna --n_splits 8 --print-commands
+
+This will calculate the "load balancing" and give you the commands for running the SNV calling.
+
+## 3. Run post-processing as usual
+
+    $ python metaSNV_post.py tutorial
/net/mraid08/export/jafar/Microbiome/Analyses/Costea/metaSNP/EXAMPLE/samples/s5.sorted.bam
/net/mraid08/export/jafar/Microbiome/Analyses/Costea/metaSNP/EXAMPLE/samples/s2.sorted.bam
/net/mraid08/export/jafar/Microbiome/Analyses/Costea/metaSNP/EXAMPLE/samples/s1.sorted.bam
/net/mraid08/export/jafar/Microbiome/Analyses/Costea/metaSNP/EXAMPLE/samples/s6.sorted.bam
/net/mraid08/export/jafar/Microbiome/Analyses/Costea/metaSNP/EXAMPLE/samples/s3.sorted.bam
/net/mraid08/export/jafar/Microbiome/Analyses/Costea/metaSNP/EXAMPLE/samples/s4.sorted.bam
@@ -4,12 +4,12 @@ download_9() {

    [ -d "db" ] || mkdir -p db
    cd db
    echo "Downloading the archive."
-    wget vm-lux.embl.de/~rmuench/files/db_f9.tar.bz2
+    wget vm-lux.embl.de/~rmuench/files/freeze9.tar.bz2
    echo "This might take up to 10 minutes."
    echo "Extracting files!"
    echo "Patience please..."
-    tar xjvf db_f9.tar.bz2
-    rm db_f9.tar.bz2
+    tar xjvf freeze9.tar.bz2
+    rm freeze9.tar.bz2
    exit
}

@@ -17,12 +17,12 @@ download_11() {

    [ -d "db" ] || mkdir -p db
    cd db
    echo "Downloading the archive."
-    wget vm-lux.embl.de/~rmuench/files/db_f11.tar.bz2
+    wget vm-lux.embl.de/~rmuench/files/freeze11.tar.bz2
    echo "This might take up to 20 minutes."
    echo "Extracting files!"
    echo "Patience please..."
-    tar xjvf db_f11.tar.bz2
-    rm db_f11.tar.bz2
+    tar xjvf freeze11.tar.bz2
+    rm freeze11.tar.bz2
    exit
}
......
@@ -51,10 +51,7 @@ def run_sample(sample, command):

def compute_opt(args):
    out_dir = path.join(args.project_dir, 'cov')
-    try:
-        os.makedirs(out_dir)
-    except:
-        pass
+    mkdir_p(out_dir)
    p = multiprocessing.Pool(args.threads, init_worker)
    results = []
    for line in open(args.all_samples):
@@ -76,30 +73,38 @@ def compute_opt(args):

            stderr.write("Failure in sample {}".format(sample))
            stderr.write("Call to {} failed.".format(' '.join(cmd)))
            exit(1)

-def split_opt(args):
-    if args.n_splits > 100:
-        stderr.write("Maximum number of splits is 100.\n")
-        exit(1)
-
-    older_files = glob(args.project_dir + '/bestsplits/*')
-    if older_files:
-        stderr.write("\nremoving old splits.\n")
-        for f in older_files:
-            os.unlink(f)
+def get_header(args):
+    use = open(args.all_samples).readline().rstrip()
+    o = subprocess.check_output(["samtools","view","-H",use]).decode("utf-8")
+    f = open(args.project_dir + '/bed_header','w')
+    for line in o.split('\n')[1:]:
+        line = line.rstrip().split('\t')
+        if len(line) != 3:
+            continue
+        if line[0] == "@SQ":
+            line[1] = line[1].replace('SN:','')
+            line[2] = line[2].replace('LN:','')
+            f.write(line[1]+'\t1\t'+line[2]+'\n')
+    f.close()
+    args.ctg_len = args.project_dir + '/bed_header'
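For orientation (names and lengths are made up), a BAM header line such as `@SQ SN:contig_1 LN:15000` becomes the following tab-separated `bed_header` entry:

    contig_1    1    15000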
+def compute_summary(args):
+    '''This information is required by metaSNV_post.py'''
+    project_name = path.basename(args.project_dir)
    cov_dir = path.join(args.project_dir, 'cov')
    cov_files = glob(cov_dir + '/*.cov')
-    project_name = path.basename(args.project_dir)

    if not cov_files:
        if not args.print_commands:
            stderr.write("Coverage files not found.\n")
        else:
-            stderr.write("Coverage files not found.\nFinish running the commands prinnted above and then run this command again.\n")
+            stderr.write("Coverage files not found.\nFinish running the commands printed above and then run this command again.\n")
        exit(1)

    for f in cov_files:
        cmd = ['python',
               path.join(basedir, 'src/computeGenomeCoverage.py'),
               f,
               f + '.detail',
               f + '.summary']
@@ -107,11 +112,25 @@ def split_opt(args):

    print("\nCoverage summary here: {}".format(args.project_dir))
    print(" Average vertical genome coverage: '{}/{}.all_cov.tab'".format(args.project_dir, project_name))
    print(" Horizontal genome coverage (1X): '{}/{}.all_perc.tab'".format(args.project_dir, project_name))
    print("")

    cmd = ['python',
           '{}/src/collapse_coverages.py'.format(basedir),
           args.project_dir]
    subprocess.call(cmd)

+def split_opt(args):
+    if args.n_splits > 100:
+        stderr.write("Maximum number of splits is 100.\n")
+        args.n_splits = 100
+    older_files = glob(args.project_dir + '/bestsplits/*')
+    if older_files:
+        stderr.write("\nremoving old splits.\n")
+        for f in older_files:
+            os.unlink(f)
+    project_name = path.basename(args.project_dir)

    print("\nCalculating best database split:")
    # usage createOptimumSplit.sh <all_cov.tab> <all_perc.tab> <geneDefinitions> <INT_NrSplits> <.outfile>
@@ -206,7 +225,7 @@ def main():

                        help='File with an input list of bam files, one file per line')
    parser.add_argument("ref_db", metavar='REF_DB_FILE',
                        help='reference multi-sequence FASTA file used for the alignments.')
-    parser.add_argument('--db_ann', metavar='BED_FILE', default='',
+    parser.add_argument('--db_ann', metavar='DB_ANN_FILE', default='',
                        help='Database gene annotation.')
    parser.add_argument('--print-commands', default=False, action='store_true',
                        help='Instead of executing the commands, simply print them out')

@@ -214,9 +233,9 @@ def main():

                        help='Number of jobs to run simmultaneously. Will create same number of splits, unless n_splits set differently.')
    parser.add_argument('--n_splits', metavar='INT', default=1, type=int,
                        help='Number of bins to split ref into')
-    parser.add_argument('--ctg_len', metavar='BED_FILE', default='',
-                        help='Per contig lengths, required for splitting.')
    args = parser.parse_args()
+    args.project_dir = args.project_dir.rstrip('/')
    if not path.isfile(args.ref_db):
        stderr.write('''
ERROR: No reference database or annotation file found!"

@@ -228,31 +247,27 @@ SOLUTION: run getRefDB.sh or set up a custom database before running metaSNP cal

        exit(1)

    if not path.isfile(basedir+"/src/qaTools/qaCompute") or not path.isfile(basedir+"/src/snpCaller/snpCall"):
        stderr.write('''
ERROR: No binaries found

SOLUTION: make\n\n'''.format(basedir))
        exit(1)

-    if (args.n_splits > 1 or args.threads > 1) and args.ctg_len=='':
-        stderr.write('''
-ERROR: No contig information supplied. Cannot compute splits
-
-SOLUTION: Supply contig information through --ctg_len\n\n'''.format(basedir))
-        exit(1)

    if args.threads > 1 and args.n_splits==1:
        args.n_splits=args.threads

    if path.exists(args.project_dir) and not args.print_commands:
        stderr.write("Project directory '{}' already exists\n\n\n".format(args.project_dir))
        exit(1)

    create_directories(args.project_dir)
    compute_opt(args)
+    compute_summary(args)
+    get_header(args)
    if args.n_splits > 1:
        split_opt(args)
    snp_call(args)

if __name__ == '__main__':
    main()
#!/usr/bin/env python

-import os        # interacting with operating system
-import sys       # interact with system commandline
-import time      # print current date/time
-import argparse  # manage command line options and arguments
+import os
+import sys
+import argparse
import glob
-import numpy as np
-import pandas as pd
+try:
+    import numpy as np
+except ImportError:
+    sys.stderr.write("Numpy is necessary to run postfiltering.\n")
+    sys.exit(1)
+try:
+    import pandas as pd
+except ImportError:
+    sys.stderr.write("Pandas is necessary to run postfiltering.\n")
+    sys.exit(1)

basedir = os.path.dirname(os.path.abspath(__file__))
@@ -16,98 +23,101 @@ basedir = os.path.dirname(os.path.abspath(__file__))

#############################
# Parse Commandline Arguments
#############################

def get_arguments():
    '''
    Get commandline arguments and return namespace
    '''
    ## Initialize Parser
    parser = argparse.ArgumentParser(prog='metaSNV_post.py', description='metaSNV post processing', epilog='''Note:''', formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    # Not Showns:
    parser.add_argument('--version', action='version', version='%(prog)s 2.0', help=argparse.SUPPRESS)
    parser.add_argument("--debug", action="store_true", help=argparse.SUPPRESS)

    # TODO: POPULATION STATISTICS
    ## Population Statistics (Fst - ):
    # parser.add_argument("-f", "--fst", action="store_true")
    ## pNpS ratio:
    # parser.add_argument("-n", "--pnps", action="store_true")

    # OPTIONAL arguments:
-    parser.add_argument('-b', metavar='FLOAT', type=float, help="Coverage breadth: Horizontal genome coverage percentage per sample per species", default=40.0)
-    parser.add_argument('-d', metavar='FLOAT', type=float, help="Coverage depth: Average vertical genome coverage per sample per species", default=5.0)
+    parser.add_argument('-b', metavar='FLOAT', type=float, default=40.0,
+                        help="Coverage breadth: minimal horizontal genome coverage percentage per sample per species")
+    parser.add_argument('-d', metavar='FLOAT', type=float, default=5.0,
+                        help="Coverage depth: minimal average vertical genome coverage per sample per species")
    parser.add_argument('-m', metavar='INT', type=int, help="Minimum number of samples per species", default=2)
    parser.add_argument('-c', metavar='FLOAT', type=float, help="FILTERING STEP II: minimum coverage per position per sample per species", default=5.0)
    parser.add_argument('-p', metavar='FLOAT', type=float, help="FILTERING STEP II: required proportion of informative samples (coverage non-zero) per position", default=0.50)
+    parser.add_argument('-div', action='store_true', help="Compute diversity measures")

    # REQUIRED arguments:
    parser.add_argument('projdir', help='project name', metavar='Proj')

    return parser.parse_args()
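A hedged usage sketch for the post-processing step; the values shown are simply the defaults from the argument parser above, and `-div` additionally triggers the diversity/FST computation:

    $ python metaSNV_post.py tutorial -b 40 -d 5 -m 2 -c 5 -p 0.5 -div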
##############################

def debugging(taxids_of_interest, header_cov):
    '''Sanity check for the Dict of TaxID:Samples'''
    nr_keys = 0
    taxids_of_interest = []
    for key in samples_of_interest:
        nr_keys += 1
        taxids_of_interest.append(key)
    taxids_of_interest.sort()
    print("DEBUGGING:\nNr_TaxIDs_of_Interest.keys: {}".format(nr_keys))
    print(taxids_of_interest)
    # print("\n")
    # print("nr_TaxIDs_of_Interest.keys: {}".format(nr_keys))
-    header = "\t".join(header_cov)
    print("DEBUGGING: no. samples in header: {}\n".format(len(header_cov)))
+    # header = "\t".join(header_cov)
    # print("header: {}".format(header))
    sys.exit("only filter 1")

def file_check():
    ''' Check if required files exist (True / False)'''
-    args.coverage_file = args.projdir+'/'+args.projdir+'.all_cov.tab'
-    args.percentage_file = args.projdir+'/'+args.projdir+'.all_perc.tab'
+    args.projdir = args.projdir.rstrip('/')
+    args.coverage_file = args.projdir+'/'+args.projdir.split('/')[-1]+'.all_cov.tab'
+    args.percentage_file = args.projdir+'/'+args.projdir.split('/')[-1]+'.all_perc.tab'
    args.all_samples = args.projdir+'/'+'all_samples'

    print("Checking for necessary input files...")
    if os.path.isfile(args.coverage_file) and os.path.isfile(args.percentage_file):
        print("found: '{}' \nfound:'{}'".format(args.coverage_file, args.percentage_file))
    else:
        sys.exit("\nERROR: No such file '{}',\nERROR: No such file '{}'".format(args.coverage_file, args.percentage_file))

    if os.path.isfile(args.all_samples):
        print("found: '{}'\n".format(args.all_samples))
    else:
        sys.exit("\nERROR: No such file '{}'".format(args.all_samples))
def print_arguments():
    ## Print Defined Thresholds:
    print("Options:")
    if args.b:
        print("threshold: percentage covered (breadth) {}".format(args.b))
    if args.d:
        print("threshold: average coverage (depth) {}".format(args.d))
    if args.m:
        print("threshold: Min. number samples_of_interest per taxid_of_interest {}".format(args.m))
    if args.c:
        print("threshold: Min. position coverage per sample within samples_of_interest {}".format(args.c))
    if args.p:
        print("threshold: Min. proportion of covered samples in samples_of_interest {}".format(args.p))
    print("")

###############################
## FILTER I: Count SAMPLES_OF_INTEREST per TAXON_OF_INTEREST
# Example: "Which taxon has at least 10 SoI?"
# --Coverage Conditions:
#   Sample of Interest:
#     1. breadth > 40 %
#     2. depth   > 5 X
# --Min. number Samples:
#     3. min samples/TaxID > 10
@@ -126,9 +136,9 @@ def relevant_taxa(args):

            sys.exit("ERROR: Coverage file headers do not match!")  #Exit with error message

    # Read taxon by taxon (line by line) check coverage conditions (thresholds)
    for cov,perc in zip(COV,PER):  # Line_cov: TaxID \t cov_valueS1 \t cov_value_S2[...]\n (no white spaces)
        cstring = cov.split()
        pstring = perc.split()
        cov_taxID = cstring.pop(0)
        perc_taxID = pstring.pop(0)
        coverage = list(map(float, cstring))  # convert list_of_strings 2 list_of_floats (TaxID=INT!)

@@ -141,10 +151,10 @@ def relevant_taxa(args):

        if cov_taxID != perc_taxID:  # check if taxIDs match!
            sys.exit("ERROR: TaxIDs in the coverage files are not in the same order!")

        for c,p in zip(coverage,percentage):  # two arrays with floats (taxID pop[removed])
            sample_count += 1
            if c >= args.d and p >= args.b:
                sample_names.append(header_cov[sample_count-1])  # starting at 0 in array
@@ -165,7 +175,7 @@ def header_comparison(header_cov):

    '''Comparing the sample order in the coverage and all_samples file'''

    # sort snp_indices by sample_of_interest names
-    header = "\t".join(header_cov)
+    #header = "\t".join(header_cov)
    #print("\nNr.Samples_cov_header: {}".format(len(header_cov)))
    #print("cov_header: {}\n\n".format(header_cov[:10]))
    #print("\ncov_header: {}\n {}\n\n".format(len(header_cov),header))

@@ -174,21 +184,21 @@ def header_comparison(header_cov):

    all_samples = open(args.all_samples,'r')
    snp_header = all_samples.read().splitlines()
    snp_header = [i.split('/')[-1] for i in snp_header]  #get name /trim/off/path/to/sample.name.bam
-    snp_header_joined = "\t".join(snp_header)
+    #snp_header_joined = "\t".join(snp_header)

    #print("Nr.Samples in (all_samples): {}".format(len(snp_header)))
    #print("all_samples: {}\n".format(snp_header[:10]))

    # Compare the sample order in ALL.COV/ALL.PERC and all_samples
    #print("Comparing headers..")
    #if header == snp_header_joined:
    #    print("Headers match nicely\n")
    #else:
    #    print("CAUTION: Header in COV_FILE does not match the order of samples in the SNP_FILES,\n\t no problem, we took care of it!\n")

    return snp_header
#################################
## FILTER II: ACQUIRE SNPs WITH SUFFICIENT OCCURRENCE WITHIN SAMPLES_OF_INTEREST
# --SNP Conditions (Default):

@@ -197,21 +207,21 @@ def header_comparison(header_cov):

def filter_two(args, snp_header, snp_files, outdir):
    '''position wise filtering'''

    header_taxID = ''
    snp_taxID = '_'
    # print(locals())
    for best_split_x in snp_files:
        with open(best_split_x, 'r') as file:
            for snp_line in file:  #position wise loop
                snp_taxID = snp_line.split()[0].split('.')[0]  #Name of Genome change from . to ]

                ## SPECIES FILTER:
                if snp_taxID not in samples_of_interest.keys():  #Check if Genome is of interest
                    continue  #Taxon is not relevant, NEXT!
                else:
                    ## SAMPLE FILTER: only load samples with enough coverage
                    sample_list = samples_of_interest[snp_taxID]  #Load Sample List

@@ -235,9 +245,9 @@ def filter_two(args, snp_header, snp_files, outdir):

                    nr_good += 1
                    #FILTER: Position incidence with sufficient coverage:
                    if float(nr_good)/len(sample_indices) < args.p:  #if snp_incidence < x % drop SNP
                        continue  #mainly uninformative position, SNP incidence < x %, SNP droped!
                    else:
                        # CALCULATE SNP ALLELE FREQUENCY:
                        # If at least one position passed cutoffs open file (if new TaxID):
@@ -248,42 +258,42 @@ def filter_two(args, snp_header, snp_files, outdir):

                            outfile = open(outdir+'/'+'%s.filtered.freq' % snp_taxID, 'w')
                            print("Generating: {}".format(outdir+'/'+'%s.filtered.freq' % snp_taxID))
                            outfile.write('\t' + "\t".join(sample_list) + '\n')
                            header_taxID = snp_taxID  #Jump to current TaxID

                        #Loop through alternative alleles [5](comma separated):
                        line_id = ":".join(whole_line[:4])  #line_id composed of CHROM:REFGENE:POS:REFBASE
                        reference_base = whole_line[3]
                        alt_bases_totalcov = []  #total coverage per alt allele
                        #VCF format

                        #LOOP Start:
                        for snp in whole_line[5].split(','):
                            xS = snp.split('|')
                            snp_coverage = list(map(float, xS[3:]))  #coverage string (unfiltered!)

                            #Sanity check:
                            if len(site_coverage) != len(snp_coverage):
                                print("ERROR: SNP FILE {} is corrupted".format(best_split_x))
                                sys.exit("ERROR: Site coverage and SNP coverage string have uneven length!")

                            #Compute allele frequency tables:
                            alt_base = snp.split('|')[1]  #alternative base

                            #FREQUENCY Computation
                            total_reads = 0
                            snp_frq = []  #frequencies for SNPs (pos >5X in at least 50% of the SoIs)
                            for index in sample_indices:
                                #prevent division by zero! TODO: Notify usr about Nonsense Value (f.i. -c = 0!)
                                if site_coverage[index] >= args.c and site_coverage[index] != 0:
                                    snp_frq.append(snp_coverage[index]/site_coverage[index])
                                else:
                                    snp_frq.append(-1)

                            # WRITE OUTPUT Allele Frequencies (Default)
                            outfile.write(":".join(snp_line.split()[:4])+'>'+alt_base +':'+ xS[2] + '\t' + "\t".join(str(x) for x in snp_frq) + '\n')
@@ -300,20 +310,95 @@ def alleledist(d1,d2, threshold=.6):

def computeAllDist(args):
-    print "Computing distances"
+    print("Computing distances")
    allFreq = glob.glob(args.projdir + '/filtered/pop/*.freq')
    for f in allFreq:

        data = pd.read_table(f, index_col=0, na_values=['-1']).T
        dist = [[l1nonans(data.iloc[i], data.iloc[j]) for i in range(len(data))] for j in range(len(data))]
        dist = pd.DataFrame(dist, index=data.index, columns=data.index)
        dist.to_csv(args.projdir+'/distances/'+'%s.mann.dist' % f.split('/')[-1].replace('.freq',''), sep='\t')

        dist = [[alleledist(data.iloc[i], data.iloc[j]) for i in range(len(data))] for j in range(len(data))]
        dist = pd.DataFrame(dist, index=data.index, columns=data.index)
        dist.to_csv(args.projdir+'/distances/'+'%s.allele.dist' % f.split('/')[-1].replace('.freq',''), sep='\t')

+def genetic_distance(sample1, sample2):
'''Pairwise genetic distance'''
# The expression used to compute dist_nd would compute all the necessary
# values if applied to sample[12]. However, the case where there are no
# duplicates is the majority (often >90% of cases) and can be done much
# faster, so it is special cased here:
sample1nd = sample1.reset_index().drop_duplicates(subset='index', keep=False).set_index('index')
sample2nd = sample2.reset_index().drop_duplicates(subset='index', keep=False).set_index('index')
sample2nd = sample2nd.reindex(index=sample1nd.index)
s1 = sample1nd.values
s2 = sample2nd.values
valid = ~(np.isnan(s1) | np.isnan(s2))
s1 = s1[valid]
s2 = s2[valid]
s1 = np.vstack([s1, 1 - s1])
s2 = np.vstack([s2, 1 - s2])
dist_nd = (s1[0]*s2[1]+s1[1]*s2[0]).sum()
def compute_diversity(x):
out = np.outer(x.s1.values, x.s2.values)
return np.nansum(out) - np.nansum(out.diagonal())
sample1d = sample1.ix[sample1.index[sample1.index.duplicated()]]
sample2d = sample2.ix[sample2.index[sample2.index.duplicated()]]
if not len(sample1d) or not len(sample2d):
# No duplicates
return dist_nd
both = pd.DataFrame({'s1' : sample1d, 's2' : sample2d})
both = both.reset_index()
both = pd.concat([both,(1. - both.groupby('index').sum()).reset_index()])
dist_d = both.groupby('index', group_keys=False).apply(compute_diversity).sum()
return dist_d + dist_nd
def computeAllDiv(args):
# Load external info : Coverage, genomes size, genes size
horizontal_coverage = pd.read_table(args.percentage_file, skiprows=[1], index_col=0)
vertical_coverage = pd.read_table(args.coverage_file, skiprows=[1], index_col=0)
bedfile_tab = pd.read_table(args.projdir + '/bed_header', index_col=0, header=None)
bed_index = [i.split('.')[0] for i in list(bedfile_tab.index)]
bedfile_tab = bedfile_tab.set_index(pd.Index(bed_index))
allFreq = glob.glob(args.projdir + '/filtered/pop/*.freq')
for f in allFreq:
species = int(f.split('/')[-1].split('.')[0])
genome_length = bedfile_tab.loc[str(species), 2].sum()
data = pd.read_table(f, index_col=0, na_values=['-1'])
pre_index = [i.split(':') for i in list(data.index)]
data = data.set_index(pd.Index([item[0] + ':' + item[1] + ':' + item[2] for item in pre_index]))
data = data.sort_index()
########
# Correcting the coverage by the genome length observed in each pairwise comparison
correction_coverage = [[(min(horizontal_coverage.loc[species, i], horizontal_coverage.loc[species, j]) * genome_length) / 100 for i in data.columns] for j in data.columns]
########
# Vertical coverage in pi within : AvgCov / (AvgCov - 1)
for j,i in enumerate(data.columns):
correction_within = vertical_coverage.loc[species,i] / (vertical_coverage.loc[species,i] - 1)
correction_coverage[j][j] = correction_coverage[j][j] / correction_within
dist = [[genetic_distance(data.iloc[:, i], data.iloc[:, j]) / correction_coverage[j][i] for i in range(j + 1)] for j in range(len(data.columns))]
FST = [[(1-(dist[i][i]+dist[j][j])/(2*dist[j][i])) for i in range(j + 1)] for j in range(len(dist))]
dist = pd.DataFrame(dist, index=data.columns, columns=data.columns)
FST = pd.DataFrame(FST, index=data.columns, columns=data.columns)
dist.to_csv(args.projdir + '/distances/' + '%s.diversity' % species, sep='\t')
FST.to_csv(args.projdir + '/distances/' + '%s.FST' % species, sep='\t')
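For readers of the block above: the nested list comprehensions first compute pairwise, coverage-corrected genetic distances and then derive FST from them. In formula form (a sketch of what the code expresses, not an addition to it):

    FST(i, j) = 1 - (pi_ii + pi_jj) / (2 * pi_ij)

where pi_ij is the corrected distance between samples i and j, and pi_ii, pi_jj are the within-sample diversities on the diagonal.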
if __name__ == "__main__":
    # print("globals: {}".format(globals()))

@@ -346,3 +431,6 @@ if __name__ == "__main__":

    computeAllDist(args)

+    if args.div:
+        computeAllDiv(args)
@@ -28,7 +28,7 @@ def write_matrix(cov, header, ofile):

    out.write('TaxId\t')
    out.write('\t'.join([header for _ in bamfiles]))
    out.write('\n')
-    for taxid in sorted(avg_cov.keys(), key=int):
+    for taxid in sorted(avg_cov.keys()):
        c = cov[taxid]
        out.write('{}\t'.format(taxid))
        out.write('\t'.join([c[bf] for bf in bamfiles]))
......
computeInsertSizeHistogram
doBWAQualTrimming
qaCompute
removeUnmapped
@@ -526,13 +526,37 @@ int main(int argc, char *argv[])

                ++duplicates;
            } else {
                if (!userOpt.spanCov) {
                    //All entries in SAM file are represented on the forward strand! (See specs of SAM format for details)
-                    ++entireChr[core->pos];
-                    ++usedReads;
-                    if ((uint32_t)(core->pos+core->l_qseq) >= chrSize)
-                        --entireChr[chrSize-1];
-                    else
-                        --entireChr[core->pos+core->l_qseq];
+                    uint32_t* cigar = bam_get_cigar(b);
+                    uint32_t pp = core->pos+1;
+                    int i = 0;
+                    if(((*cigar) & BAM_CIGAR_MASK) == BAM_CSOFT_CLIP || ((*cigar) & BAM_CIGAR_MASK) == BAM_CHARD_CLIP) {//BWA is actively fucking with me.
+                        //The start of the alignment is reported without this clip.
+                        ++cigar; ++i;
+                    }
+                    while (i<core->n_cigar) {
+                        ++i;
+                        if (((*cigar) & BAM_CIGAR_MASK) != BAM_CMATCH) {
+                            pp = pp + ((*cigar) >> BAM_CIGAR_SHIFT);
+                        } else {
+                            ++entireChr[pp];
+                            pp = pp + ((*cigar) >> BAM_CIGAR_SHIFT);
+                            if (pp >= chrSize) {//These are mismatches that actually hang over the end. Silly mapper!
+                                //fprintf(stderr,"you are fucked %d %d\n",pp,chrSize);
+                                --entireChr[chrSize-1];
+                            } else {
+                                --entireChr[pp];
+                            }
+                        }
+                        ++cigar;
+                    }
+                    //++entireChr[core->pos];
+                    ++usedReads;
+                    //if ((uint32_t)(core->pos+core->l_qseq) >= chrSize)
+                    //    --entireChr[chrSize-1];
+                    //else
+                    //    --entireChr[core->pos+core->l_qseq];
                } else {
                    //Computing span coverage.
                    //Only consider first read in pair! and extend a bit to the end of the insert
@@ -593,23 +617,23 @@ int main(int argc, char *argv[])

    bam_destroy1(b);
    free(entireChr);

-    fprintf(stdout,"\nDuplicates:%u \n", duplicates);
+    //fprintf(stdout,"\nDuplicates:%u \n", duplicates);

    //Print header for next table in output file
    fprintf(outputFile,"\nCov*X\tPercentage\tNr. of bases\n");

-    fprintf(stdout,"Total genome lenght %lu \n", totalGenomeLength);
+    //fprintf(stdout,"Total genome lenght %lu \n", totalGenomeLength);

    //Compute procentages of genome cover!.
    int i;
    for (i=0; i<=userOpt.maxCoverage; ++i) {
        if (i == 0) {
            //Non-covered!
-            fprintf(stdout,"%3.2f of genome has not been covered\n", (double)(coverageHist[i])/totalGenomeLength*100);
+            //fprintf(stdout,"%3.2f of genome has not been covered\n", (double)(coverageHist[i])/totalGenomeLength*100);
        } else {
            uint64_t coverage = 0;
            //All that has been covered i, had been covered i+1, i+2 and so on times. Thus, do this addition
            for (int x = i; x<=userOpt.maxCoverage; ++x) coverage += coverageHist[x];
-            fprintf(stdout,"%3.2f of genome has been covered at least %dX \n", (double)(coverage)/totalGenomeLength*100, i);
+            //fprintf(stdout,"%3.2f of genome has been covered at least %dX \n", (double)(coverage)/totalGenomeLength*100, i);
            fprintf(outputFile,"%d\t%3.5f\t%lu\n",i, (double)(coverage)/totalGenomeLength*100, coverage);
        }

@@ -632,7 +656,7 @@ int main(int argc, char *argv[])

        fprintf(outputFile, "Number of interchromosomal pairs: %u\n",interChr);
    }

-    printf("Out of %u reads, you have %3.5f unmapped reads\n and %3.5f sub-par quality mappings\n", totalNumberOfReads ,procentageOfUnmapped, procentageOfZeroQuality);
+    //printf("Out of %u reads, you have %3.5f unmapped reads\n and %3.5f sub-par quality mappings\n", totalNumberOfReads ,procentageOfUnmapped, procentageOfZeroQuality);

    free(coverageHist);
......
snpCall
@@ -130,7 +130,6 @@ bool indexGenomeAndGenes(FILE* refGenome, FILE* refGenes) {

    filePosStart = strlen(line);//This includes the \n
    filePosition p1;
    while (fgets(line,10000,refGenes)) {
-        fileConsumed += strlen(line);
        int pos = 0;
        const char* rest = toksplit(line,'\t',tok,10000);
        while (*rest) {

@@ -152,6 +151,7 @@ bool indexGenomeAndGenes(FILE* refGenome, FILE* refGenes) {

            ++pos;
            rest = toksplit(rest,'\t',tok,10000);
        }
+        fileConsumed += strlen(line);
        lineCount += 1;
    }
    //Add the last one!

@@ -248,7 +248,7 @@ bool loadGenome(std::string gName, FILE* refGenes, bool* hasGenes) {

        }
        if (pos == 2) {//This is the name!
            if (gName.compare(tok) != 0) {
-                fprintf(stderr,"Reading wrong gene defintions");
+                fprintf(stderr,"Reading wrong gene defintion for %s\n. Scafold supposed to be %s, but is %s\n.",geneName.c_str(),tok,gName.c_str());
                break;
            }
        } else if (pos == 6) {//Start
......