Commit 94647e9c authored by Martin Larralde's avatar Martin Larralde
Browse files

Merge branch 'paper' after paper acceptance in openjournals/joss-papers#3171

parents be141692 d0629237
......@@ -375,6 +375,8 @@ jobs:
steps:
- name: Checkout code
uses: actions/checkout@v1
with:
submodules: true
- name: Release a Changelog
uses: rasmus-saks/release-a-changelog-action@v1.0.1
with:
......
name: Paper
on:
push:
branches:
- paper
jobs:
paper:
runs-on: ubuntu-latest
name: Paper Draft
steps:
- name: Checkout
uses: actions/checkout@v2
- name: Build draft PDF
uses: openjournals/openjournals-draft-action@master
with:
journal: joss
paper-path: paper/paper.md
- name: Upload
uses: actions/upload-artifact@v1
with:
name: paper.pdf
path: paper/paper.pdf
......@@ -10,6 +10,10 @@ sys.path.append(os.path.realpath(os.path.join(__file__, "..", "..", "..")))
import tqdm
sys.path.append(os.path.realpath(os.path.join(__file__, "..", "..", "..")))
from pyrodigal import Nodes, Sequence
from pyrodigal._pyrodigal import METAGENOMIC_BINS, ConnectionScorer
from pyrodigal.tests.fasta import parse
......
......@@ -10,8 +10,6 @@ import matplotlib.pyplot as plt
import scipy.stats
from palettable.cartocolors.qualitative import Bold_4
plt.rcParams["svg.fonttype"] = "none"
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input", required=True)
......@@ -44,6 +42,7 @@ for color, (backend, group) in zip(
plt.plot([ 0, max(X) ], [ reg.intercept, reg.slope*max(X) + reg.intercept ], color=color, linestyle="--", marker="")
# ci = [1.96 * r["stddev"] / math.sqrt(len(r["times"])) for r in group]
plt.scatter(X, Y, marker="+", color=color, label=f"{backend} (R²={reg.rvalue**2:.3f})")
plt.legend()
plt.xlabel("Node count")
plt.ylabel("Time (s)")
......@@ -61,11 +60,13 @@ for color, (backend, group) in zip(
plt.plot([ 0, max(X) ], [ reg.intercept, reg.slope*max(X) + reg.intercept ], color=color, linestyle="--", marker="")
# ci = [1.96 * r["stddev"] / math.sqrt(len(r["times"])) for r in group]
plt.scatter(X, Y, marker="+", color=color, label=f"{backend} (R²={reg.rvalue**2:.3f})")
plt.legend()
plt.xlabel("Nucleotide count (Mbp)")
plt.ylabel("Time (s)")
plt.tight_layout()
output = args.output or args.input.replace(".json", ".svg")
plt.savefig(output, transparent=True)
if args.show:
......
This source diff could not be displayed because it is too large. You can view the blob instead.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
../benches/connection_scoring/v1.0.0.svg
\ No newline at end of file
@article{Bioconda:2018,
title = {Bioconda: Sustainable and Comprehensive Software Distribution for the Life Sciences},
shorttitle = {Bioconda},
author = {Gr{\"u}ning, Bj{\"o}rn and Dale, Ryan and Sj{\"o}din, Andreas and Chapman, Brad A. and Rowe, Jillian and {Tomkins-Tinch}, Christopher H. and Valieris, Renan and K{\"o}ster, Johannes},
year = {2018},
month = jul,
journal = {Nature Methods},
volume = {15},
number = {7},
pages = {475--476},
publisher = {{Nature Publishing Group}},
issn = {1548-7105},
doi = {10.1038/s41592-018-0046-7},
copyright = {2018 The Author(s)},
langid = {english},
keywords = {Publishing,Software},
}
@inproceedings{Valgrind:2007,
author = {Nethercote, Nicholas and Seward, Julian},
title = {Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation},
year = {2007},
isbn = {9781595936332},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1250734.1250746},
doi = {10.1145/1250734.1250746},
booktitle = {Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation},
pages = {89–100},
numpages = {12},
keywords = {dynamic binary instrumentation, Memcheck, shadow values, Valgrind, dynamic binary analysis},
location = {San Diego, California, USA},
series = {PLDI '07}
}
@article{proGenomes2:2020,
title = {{{proGenomes2}}: An Improved Database for Accurate and Consistent Habitat, Taxonomic and Functional Annotations of Prokaryotic Genomes},
shorttitle = {{{proGenomes2}}},
author = {Mende, Daniel R and Letunic, Ivica and Maistrenko, Oleksandr M and Schmidt, Thomas S B and Milanese, Alessio and Paoli, Lucas and {Hern{\'a}ndez-Plaza}, Ana and Orakov, Askarbek N and Forslund, Sofia K and Sunagawa, Shinichi and Zeller, Georg and {Huerta-Cepas}, Jaime and Coelho, Luis Pedro and Bork, Peer},
year = {2020},
month = jan,
journal = {Nucleic Acids Research},
volume = {48},
number = {D1},
pages = {D621-D625},
issn = {0305-1048},
doi = {10.1093/nar/gkz1002},
}
@article{Prodigal:2010,
title = {Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification},
shorttitle = {Prodigal},
author = {Hyatt, Doug and Chen, Gwo-Liang and LoCascio, Philip F and Land, Miriam L and Larimer, Frank W and Hauser, Loren J},
year = {2010},
month = mar,
journal = {BMC Bioinformatics},
volume = {11},
pages = {119},
issn = {1471-2105},
doi = {10.1186/1471-2105-11-119},
pmcid = {PMC2848648},
pmid = {20211023},
}
@techreport{GECCO:2021,
type = {Preprint},
title = {Accurate de Novo Identification of Biosynthetic Gene Clusters with {{GECCO}}},
author = {Carroll, Laura M. and Larralde, Martin and Fleck, Jonas Simon and Ponnudurai, Ruby and Milanese, Alessio and Cappio, Elisa and Zeller, Georg},
year = {2021},
month = may,
pages = {2021.05.03.442509},
institution = {{bioRxiv}},
doi = {10.1101/2021.05.03.442509},
chapter = {New Results},
copyright = {\textcopyright{} 2021, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/},
langid = {english},
}
@article{PhageBoost:2021,
title = {Rapid Discovery of Novel Prophages Using Biological Feature Engineering and Machine Learning},
author = {Sir{\'e}n, Kimmo and Millard, Andrew and Petersen, Bent and Gilbert, M~Thomas~P and Clokie, Martha R J and {Sicheritz-Pont{\'e}n}, Thomas},
year = {2021},
month = mar,
journal = {NAR Genomics and Bioinformatics},
volume = {3},
number = {1},
pages = {lqaa109},
issn = {2631-9268},
doi = {10.1093/nargab/lqaa109},
}
@techreport{hafeZ:2021,
type = {Preprint},
title = {{{hafeZ}}: {{Active}} Prophage Identification through Read Mapping},
shorttitle = {{{hafeZ}}},
author = {Turkington, Christopher J. R. and Abadi, Neda Nezam and Edwards, Robert A. and Grasis, Juris A.},
year = {2021},
month = jul,
institution = {{Bioinformatics}},
doi = {10.1101/2021.07.21.453177},
langid = {english},
}
@article{AssessORF:2019,
title = {{{AssessORF}}: Combining Evolutionary Conservation and Proteomics to Assess Prokaryotic Gene Predictions},
shorttitle = {{{AssessORF}}},
author = {Korandla, Deepank R and Wozniak, Jacob M and Campeau, Anaamika and Gonzalez, David J and Wright, Erik S},
year = {2019},
month = sep,
journal = {Bioinformatics},
volume = {36},
number = {4},
pages = {1022--1029},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btz714},
pmcid = {PMC7998711},
pmid = {31532487},
}
@software{Cython,
author = {Behnel, Stefan. and Bradshaw, Robert. and Dalcín, Lisandro. and Florisson, Mark. and Makarov, Vitja. and Seljebotn, Dag Sverre, and {the Cython contributors}.},
title = {Cython: {C}-{E}xtensions for {P}ython},
year = {2020},
url = {https://cython.org}
}
@software{AlphaMine:2021,
author = {Hernández, Joel R. and Guirado, Isaac C. and Sánchez, Marc B. and Verdaguer, Francesc C. and {the iGem 2021 ARIA team}.},
title = {AlphaMine: An alignment-free genomic analysis system that can build core, shell, and cloud prokaryotic pangenomes by applying set-theory operations to genome collections.},
year = {2021},
url = {https://2021.igem.org/Team:UPF_Barcelona/Software_AMine},
}
---
title: 'Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes'
tags:
- Python
- Cython
- bioinformatics
- gene prediction
authors:
- name: Martin Larralde
orcid: 0000-0002-3947-4444
affiliation: 1
affiliations:
- name: Structural and Computational Biology Unit, EMBL, Heidelberg, Germany
index: 1
date: 24 February 2022
bibliography: paper.bib
---
# Summary
Improvements in sequencing technologies have seen the amount of available
genomic data expand considerably over the last twenty years. One of the key
steps for analysing is the prediction of protein-coding regions in genomic
sequences, known as Open Reading Frames (ORFs), which span between
a start and a stop codon. A recent comparison of several ORF prediction methods
[@AssessORF:2019] has shown that Prodigal [@Prodigal:2010], a prokaryotic gene
finder that uses dynamic programming, is one of the highest performing
*ab initio* ORF finders. Pyrodigal is a Python package that provides bindings
and an interface to Prodigal to make it easier to use in Python applications.
# Statement of need
Prodigal is used in thousands of applications as the gene calling method
for processing genomic sequences. It is implemented in ANSI C, making it
extremely efficient and relatively easy to compile on different platforms.
However, the only way to use Prodigal is through an executable, making it
inconvenient for bioinformaticians who rely on the increasingly popular
Python language. In addition to the hassle caused by the invocation of
the executable from Python code, the distribution of Python programs relying
on Prodigal is also problematic, since they now require an external binary
that cannot be installed from the [Python Package Index](https://pypi.org) (PyPI).
To address these issues, we developed Pyrodigal, a Python module implemented
in Cython [@Cython] that binds to the Prodigal internals, resulting in identical
predictions and similar performance through a friendly object-oriented interface.
The predicted genes are returned as Python objects, with properties for retrieving
confidence scores or coordinates, and methods for translating the gene sequence
with a default or user-provided translation table. Prodigal is compiled from
source and statically linked into a compiled Python extension, which allows
it to be installed with a single `pip install` command, even on a target
machine that requires compilation.
Pyrodigal has already been used as the implementation for the initial ORF
finding stage in several domains, including biosynthetic gene cluster
prediction [@GECCO:2021], prophage identification [@PhageBoost:2021; @hafeZ:2021],
and pangenome analysis [@AlphaMine:2021].
# Method
Internally, Prodigal identifies start and stop codons throughout a genomic
sequence, which are represented as nodes. It then uses a dynamic programming
approach to compute scores for every pair of nodes (i.e. putative genes) as
shown in \autoref{fig:method}. Scores assigned to predicted genes are based
primarily on the frequency of nucleotide hexamers inside the gene sequence.
![Graphical depiction of the Prodigal method for identifying genes in a sequence.
First, the sequence is analysed to find start and stop codons in the 6 reading frames (1).
Dynamic programming nodes are then created for each codon, each storing the strand
(shown with the triangle direction, right for the direct strand, left for the reverse strand)
and the type (green for start, red for stop) of the codon they were obtained from.
Nodes are then scored on biological criteria (2). Putative genes are identified between all
start and stop codons in a given window (a gene cannot span between any pair of nodes;
some invalid connections are shown with dashed lines as examples). Then putative
genes are scored using a dynamic programming approach (3). Once all connections
have been processed, the dynamic programming matrix is traversed to find the highest scoring
path, giving the final predictions (4). \label{fig:method}](figure1.svg){width=100%}
Pyrodigal adapts the first two steps so that the dynamic programming nodes
can be extracted directly from a Python string containing the sequence data,
rather than requiring formatting to an external file. In addition, the node
storage has been reworked to use reallocating buffers, saving memory on
smaller sequences. The node and connection scoring steps use the original
Prodigal code.
# Optimization
Prodigal was profiled with Valgrind [@Valgrind:2007] to identify critical
parts of the original code. Using bacterial genomes from the proGenomes v2.1
database [@proGenomes2:2020], we found that for long enough sequences,
a large fraction of the CPU cycles was spent in the `score_connection`
function.
There are pairs of codons between which a gene can never span, such as two
stop codons, or a forward start codon and a reverse stop codon, as shown in
\autoref{fig:method}. Upon inspection, we realized the `score_connection`
was still called in invalid cases that could be labelled as such beforehand.
Identifying these invalid connections is feasible by checking the strand, type
and reading frame of a node pair. Considering two nodes $i$ and $j$, the
connection between them is invalid if any of these boolean equations is true:
- $(T_i \ne STOP) \land (T_j \ne STOP) \land (S_i = S_j)$
- $(S_i = 1) \land (T_i \ne STOP) \land (S_j = -1)$
- $(S_i = -1) \land (T_i = STOP) \land (S_j = 1)$
- $(S_i = -1) \land (T_i \ne STOP) \land (S_j = 1) \land (T_j = STOP)$
- $(S_i = S_j) \land (S_i = 1) \land (T_i \ne STOP) \land (T_j = STOP) \land (F_i \ne F_j)$
- $(S_i = S_j) \land (S_i = -1) \land (T_i = STOP) \land (T_j \ne STOP) \land (F_i \ne F_j)$
where $T_i$, $S_i$ and $F_i$ are respectively the type, strand, and reading
frame of the node $i$, with forward and reverse strands encoded as $+1$ and
$-1$ respectively.
We developed a heuristic filter to quickly identify node pairs forming an
invalid connection prior to the scoring step using the above formulas.
Since all these attributes have a small number of possible values
($+1$ or $-1$ for the forward or reverse strand; $ATG$, $GTG$, $TTG$ or $STOP$
for the codon type; $-1$, $-2$, $-3$, $+1$, $+2$, $+3$ for the reading frame),
they can all be stored in a single byte. Use of the SIMD features of modern CPUs
allows several nodes to be processed at once (8 nodes with NEON and SSE2 features,
16 nodes with AVX2). This first pass produces a look-up table used to bypass the
scoring of invalid connections.
The performance of the connection scoring was evaluated on 50 bacterial
sequences of various length, as shown in \autoref{fig:benchmark}. It suggests
that even with the added cost of the additional pass for each node, enabling
the heuristic filter in Pyrodigal saves about half of the time needed to score
connections between all the nodes of a sequence.
![Evaluation of the connection scoring performance with different heuristic
filter SIMD backends (SSE2 or AVX2), with a generic backend (Generic) or without enabling
the filter (None).
*Each sequence was processed 10 times on a quiet i7-10710U CPU @ 1.10GHz*. \label{fig:benchmark}](figure2.svg){width=100%}
# Availability
Pyrodigal is distributed on PyPI under the
[GNU Lesser General Public License](https://www.gnu.org/licenses/lgpl-3.0).
Pre-compiled distributions are provided for MacOS, Linux and Windows x86-64,
as well as Linux Aarch64 machines. A [Conda](https://conda.io/) package is
also available in the Bioconda channel [@Bioconda:2018].
The source code is available in a git repository on [GitHub](https://github.com/althonos/pyrodigal),
and features a Continuous Integration workflow to run integration tests on changes.
Documentation is hosted on [ReadTheDocs](https://pyrodigal.readthedocs.io) and
built for each new release.
# Acknowledgments
We thank Laura M. Carroll for her input on the redaction of this manuscript,
and Georg Zeller for his supervision. This work was funded by the European
Molecular Biology Laboratory and the German Research Foundation
(Deutsche Forschungsgemeinschaft, DFG, grant no. 395357507 – [SFB 1371](https://www.sfb1371.tum.de/)).
# References
File added
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment