Commit 95b0fdf6 authored by Martin Larralde

Add an initial JOSS paper draft in the `paper` directory

parent be141692
......@@ -375,6 +375,8 @@ jobs:
steps:
- name: Checkout code
uses: actions/checkout@v1
with:
submodules: true
- name: Release a Changelog
uses: rasmus-saks/release-a-changelog-action@v1.0.1
with:
......
@article{Prodigal:2010,
title = {Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification},
shorttitle = {Prodigal},
author = {Hyatt, Doug and Chen, Gwo-Liang and LoCascio, Philip F and Land, Miriam L and Larimer, Frank W and Hauser, Loren J},
year = {2010},
month = mar,
journal = {BMC Bioinformatics},
volume = {11},
pages = {119},
issn = {1471-2105},
doi = {10.1186/1471-2105-11-119},
pmcid = {PMC2848648},
pmid = {20211023},
}
@techreport{GECCO:2021,
title = {Accurate de Novo Identification of Biosynthetic Gene Clusters with {{GECCO}}},
author = {Carroll, Laura M. and Larralde, Martin and Fleck, Jonas Simon and Ponnudurai, Ruby and Milanese, Alessio and Cappio, Elisa and Zeller, Georg},
year = {2021},
month = may,
pages = {2021.05.03.442509},
institution = {{bioRxiv}},
doi = {10.1101/2021.05.03.442509},
chapter = {New Results},
copyright = {\textcopyright{} 2021, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/},
langid = {english},
}
@article{PhageBoost:2021,
title = {Rapid Discovery of Novel Prophages Using Biological Feature Engineering and Machine Learning},
author = {Sir{\'e}n, Kimmo and Millard, Andrew and Petersen, Bent and Gilbert, M~Thomas~P and Clokie, Martha R J and {Sicheritz-Pont{\'e}n}, Thomas},
year = {2021},
month = mar,
journal = {NAR Genomics and Bioinformatics},
volume = {3},
number = {1},
pages = {lqaa109},
issn = {2631-9268},
doi = {10.1093/nargab/lqaa109},
}
@techreport{hafeZ:2021,
type = {Preprint},
title = {{{hafeZ}}: {{Active}} Prophage Identification through Read Mapping},
shorttitle = {{{hafeZ}}},
author = {Turkington, Christopher J. R. and Abadi, Neda Nezam and Edwards, Robert A. and Grasis, Juris A.},
year = {2021},
month = jul,
institution = {{Bioinformatics}},
doi = {10.1101/2021.07.21.453177},
langid = {english},
}
---
title: 'Pyrodigal: Python bindings and interface to Prodigal, an efficient ORF finder for prokaryotes.'
tags:
  - Python
  - Cython
  - bioinformatics
  - gene prediction
authors:
  - name: Martin Larralde
    orcid: 0000-0002-3947-4444
    affiliation: 1
affiliations:
  - name: Structural and Computational Biology Unit, EMBL, Heidelberg, Germany
date: 24 February 2022
bibliography: paper.bib
---
# Summary
Improvements in sequencing technologies have seen the amount of available
genomic data expand considerably in the last twenty years. One of the key
steps in analysing this data is the prediction of protein-coding regions in
the genomic sequences, known as Open Reading Frames (ORFs), which span between
a start and a STOP codon.
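
To make the definition above concrete, here is a minimal sketch of a naive ORF scan. It is an illustration only and not the method used by Prodigal: it assumes `ATG` as the only start codon, looks at the forward strand only, and reports every span it finds without scoring it. The function name `naive_orfs` and the `min_codons` parameter are hypothetical.

```python
# Naive illustration only: report every ATG...stop span on the forward strand.
# Simplifications (assumptions): ATG is the only start codon, the reverse
# strand is ignored, and candidates are not scored in any way.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def naive_orfs(sequence, min_codons=2):
    """Yield (begin, end) 0-based, half-open coordinates of candidate ORFs."""
    sequence = sequence.upper()
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in the current reading frame.
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                # The span includes the stop codon itself.
                if (i + 3 - start) // 3 >= min_codons:
                    yield (start, i + 3)
                start = None


orfs = list(naive_orfs("CCATGAAATTTTAGGG"))
# orfs == [(2, 14)], i.e. the span "ATGAAATTTTAG"
```
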
# Statement of need
Prodigal [@Prodigal:2010] is an ORF finder for prokaryotes used in thousands
of applications as the gene calling method for processing genomic sequences.
It is implemented in ANSI C, making it extremely efficient and relatively
easy to compile on different platforms. However, the only way to use Prodigal
is through its executable, which makes it less convenient for
bioinformaticians who are mostly familiar with the Python language.

`pyrodigal` is a Python module implemented in Cython that binds to the
Prodigal internals, resulting in identical predictions and similar performance
through a friendly object-oriented interface. It removes the need for the
Prodigal binary to be installed on the host machine. The predicted genes are
returned as Python objects, with properties for retrieving confidence scores
or coordinates, and methods for translating the gene sequence.
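
The object-oriented interface described above could be used roughly as follows. This is a sketch under assumptions rather than a reference: it assumes `pyrodigal` is installed, that the finder class accepts a `meta=True` argument for metagenomic mode, and that genes expose `begin`, `end` and `strand` attributes and a `translate()` method. The class is looked up as `OrfFinder` (the name used in this paper) and, as an assumption about other releases, `GeneFinder`. The helpers `describe` and `find_genes` are hypothetical names.

```python
from types import SimpleNamespace


def describe(gene):
    """Format one predicted gene; expects `begin`, `end` and `strand` attributes."""
    strand = "+" if gene.strand == 1 else "-"
    return f"{gene.begin}..{gene.end} ({strand})"


def find_genes(sequence):
    """Predict genes in a nucleotide sequence with pyrodigal (assumed installed)."""
    import pyrodigal  # imported lazily so this sketch loads without the package

    # Assumption: the class is named OrfFinder or GeneFinder depending on the release.
    finder_cls = getattr(pyrodigal, "OrfFinder", None) or getattr(pyrodigal, "GeneFinder")
    finder = finder_cls(meta=True)  # assumed flag for metagenomic (pre-trained) mode
    return finder.find_genes(sequence)


# Stand-in object so the formatting helper can be tried without pyrodigal.
example = SimpleNamespace(begin=3, end=98, strand=-1)
summary = describe(example)

# With pyrodigal installed and a genome held in a string `genome`:
#     for gene in find_genes(genome):
#         print(describe(gene), gene.translate())
```
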

`pyrodigal` has already been used in several works in preparation [@hafeZ:2021; @GECCO:2021]
and scientific publications [@PhageBoost:2021].
# Performance
Although it uses the same data structures and scoring method as Prodigal,
`pyrodigal` improves on several aspects of the original software by
re-implementing key parts of it:
- Sequence data is not stored as a bitmap anymore, which comes at the cost
of slightly increased memory consumption in exchange for faster memory
access. Memory profiling has revealed that the bulk of memory consumption
in Prodigal is not caused by sequence data but node data, so this trade-off
is acceptable.
- Memory management has been reworked to use buffers growing on insertion,
making the memory consumption slightly more conservative on smaller sequences
and addressing edge cases on sequences with a large number of start and
STOP codons.
- Use of local buffers allows for thread-locality in the `OrfFinder.find_genes`
method. In addition, `pyrodigal` makes use of the Cython feature for
releasing the Python Global Interpreter Lock (GIL), which makes the ORF
finder class usable in multi-threaded code.
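
The multi-threading point above can be sketched with the standard library. The helper below is hypothetical and deliberately generic: it takes any callable, so that with `pyrodigal` one would pass the `find_genes` method of a single `OrfFinder` instance. Any speed-up relies on the GIL being released during gene finding, as stated above.

```python
from concurrent.futures import ThreadPoolExecutor


def find_genes_parallel(find_genes, sequences, threads=4):
    """Apply `find_genes` to every sequence using a pool of worker threads.

    Results are returned in the same order as `sequences`.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(find_genes, sequences))


# Stand-in callable (len) so the sketch runs without pyrodigal.
lengths = find_genes_parallel(len, ["ATGAAATAG", "ATGTAA"], threads=2)
```

With `pyrodigal`, the first argument would be something like `orf_finder.find_genes`, where `orf_finder` is an already configured finder shared by all threads.
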

In addition to these changes, `pyrodigal` implements a SIMD-accelerated
heuristic filter for scoring connections between dynamic programming nodes.
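
The memory trade-off mentioned in the first point of the list above can be illustrated with a toy comparison between a packed bitmap and one byte per nucleotide. The 2-bit encoding, the `CODES` table and the helper names are assumptions made for this sketch; they are not taken from the Prodigal or `pyrodigal` sources.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # hypothetical 2-bit encoding
LETTERS = "ACGT"


def pack(sequence):
    """Pack nucleotides four per byte, using 2 bits each."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        packed[i // 4] |= CODES[base] << (2 * (i % 4))
    return packed


def get_packed(packed, i):
    """Read nucleotide `i` from the packed form: index, shift, then mask."""
    return LETTERS[(packed[i // 4] >> (2 * (i % 4))) & 0b11]


sequence = "ATGCGTTAA"
packed = pack(sequence)        # 3 bytes for 9 nucleotides
unpacked = sequence.encode()   # 9 bytes, one per nucleotide

# Packed access needs arithmetic on every read; unpacked access is a plain index.
same = all(get_packed(packed, i) == chr(unpacked[i]) for i in range(len(sequence)))
```

The unpacked form uses about four times more memory for the sequence, but each lookup is a single indexing operation.
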
# Distribution
`pyrodigal` is distributed on the Python Package Index (PyPI), which makes it
extremely simple to install. Pre-compiled distributions are made available
for macOS, Linux and Windows x86-64, as well as Linux AArch64 machines.
Overall, distribution through PyPI makes it easier to develop a Python
workflow or application relying on `pyrodigal` for the gene calling, since
end-users are not required to handle the setup of the Prodigal binaries
themselves.
# Acknowledgments
This work was funded by the European Molecular Biology Laboratory and the
German Research Foundation (Deutsche Forschungsgemeinschaft, DFG, grant no. 395357507 – SFB 1371).