Cython bindings and Python interface to FastANI, a method for fast whole-genome similarity estimation.
🗺️ Overview
FastANI is a method published in 2018 by Jain et al. for high-throughput computation of whole-genome Average Nucleotide Identity (ANI). It uses MashMap to compute orthologous mappings without the need for expensive alignments.
pyfastani is a Python module, implemented using the Cython language, that provides bindings to FastANI. It interacts directly with the FastANI internals, which has the following advantages over CLI wrappers:
- simpler compilation: FastANI requires several additional libraries, which make compilation of the original binary non-trivial. In pyfastani, libraries that were needed for threading or I/O are provided as stubs, so you only need boost::math to build. Or even better, just install from one of the provided wheels!
- single dependency: If your software or your analysis pipeline is distributed as a Python package, you can add pyfastani as a dependency to your project, and stop worrying about the FastANI binary being present on the end-user machine.
- sans I/O: Everything happens in memory, in Python objects you control, making it easier to pass your sequences to FastANI without needing to write them to a temporary file.
This library is still a work in progress and at an experimental stage, but it should already pack enough features to run one-to-one computations.
💡 Example
The following snippets show how to compute the ANI between two genomes,
with the reference being a draft genome. For one-to-many or many-to-many
searches, simply add additional references with m.add_draft
before indexing.
Note that any name can be given to the reference sequences; it only affects the name attribute of the hits returned for a query.
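As a rough sketch of the one-to-many setup described above, the helper below adds several named references and then indexes once. `index_references` and `RecordingMapper` are hypothetical names introduced for illustration; the stand-in class only records calls so that the snippet runs without pyfastani installed, and it assumes nothing about the mapper beyond the `add_draft(name, contigs)` and `index()` calls shown in this README.

```python
def index_references(mapper, references):
    """Add every reference to `mapper`, then index it once.

    `references` maps a reference name to an iterable of contigs (bytes).
    Hypothetical helper; it only relies on `add_draft` and `index`.
    """
    for name, contigs in references.items():
        mapper.add_draft(name, contigs)
    mapper.index()
    return mapper


class RecordingMapper:
    """Stand-in for a mapper: records calls instead of computing anything."""

    def __init__(self):
        self.calls = []

    def add_draft(self, name, contigs):
        self.calls.append(("add_draft", name, [bytes(c) for c in contigs]))

    def index(self):
        self.calls.append(("index",))


# made-up reference names and sequences
references = {
    "genome_A": [b"ACGT", b"TTGA"],
    "genome_B": [b"GGCC"],
}

demo = index_references(RecordingMapper(), references)
for call in demo.calls:
    # print the call name, plus the reference name when there is one
    print(call[0], *call[1:2])
```

With pyfastani installed, the same helper could be called as `index_references(pyfastani.Mapper(), references)`; each hit returned by a later query would then carry the name of the reference it matched.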
🔬 Biopython
Biopython does not give direct access to the sequence, so we need to convert it to bytes first with the bytes builtin function. For older versions of Biopython (earlier than 1.79), use record.seq.encode() instead of bytes(record.seq).
import pyfastani
import Bio.SeqIO
m = pyfastani.Mapper()
# add a single draft genome to the mapper, and index it
ref = list(Bio.SeqIO.parse("vendor/FastANI/data/Shigella_flexneri_2a_01.fna", "fasta"))
m.add_draft("Shigella_flexneri_2a_01", (bytes(record.seq) for record in ref))
m.index()
# read the query and query the mapper
query = Bio.SeqIO.read("vendor/FastANI/data/Escherichia_coli_str_K12_MG1655.fna", "fasta")
hits = m.query_sequence(bytes(query.seq))
for hit in hits:
    print("Escherichia_coli_str_K12_MG1655", hit.name, hit.identity, hit.matches, hit.fragments)
🧪 Scikit-bio
Scikit-bio gives direct access to the sequence as a numpy array, but shows the values as byte strings by default. To make them readable as char (for compatibility with the C code), they must be cast with seq.values.view('B').
import pyfastani
import skbio.io
m = pyfastani.Mapper()
ref = list(skbio.io.read("vendor/FastANI/data/Shigella_flexneri_2a_01.fna", "fasta"))
m.add_draft("Shigella_flexneri_2a_01", (seq.values.view('B') for seq in ref))
m.index()
# read the query and query the mapper
query = next(skbio.io.read("vendor/FastANI/data/Escherichia_coli_str_K12_MG1655.fna", "fasta"))
hits = m.query_genome(query.values.view('B'))
for hit in hits:
    print("Escherichia_coli_str_K12_MG1655", hit.name, hit.identity, hit.matches, hit.fragments)
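When several references are indexed, a query may return more than one hit. The sketch below shows one possible way to filter and rank them. The `Hit` namedtuple is a stand-in that only mirrors the four attributes printed in the examples above (name, identity, matches, fragments); the `best_hits` helper, the threshold, and all values are made up for illustration, and the threshold is assumed to be in the same unit as `hit.identity`.

```python
from collections import namedtuple

# Stand-in for the hit objects; real hits would come from a mapper query.
Hit = namedtuple("Hit", ["name", "identity", "matches", "fragments"])


def best_hits(hits, min_identity=0.0):
    """Keep hits at or above `min_identity`, sorted by decreasing identity.

    Hypothetical helper; `min_identity` uses the same unit as `hit.identity`.
    """
    kept = [hit for hit in hits if hit.identity >= min_identity]
    return sorted(kept, key=lambda hit: hit.identity, reverse=True)


# made-up hits, not actual results
hits = [
    Hit("genome_A", 97.6, 1300, 1500),
    Hit("genome_B", 82.1, 400, 1500),
    Hit("genome_C", 99.1, 1450, 1500),
]

ranked = best_hits(hits, min_identity=95.0)
for hit in ranked:
    print(hit.name, hit.identity, hit.matches, hit.fragments)
```

Because the helper only reads attributes, it should work unchanged on any objects exposing `identity`, including the hits returned by the queries shown above.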
💭 Feedback
⚠️ Issue Tracker
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
🏗️ Contributing
Contributions are more than welcome! See
CONTRIBUTING.md
for more details.
⚖️ License
This library is provided under the MIT License.
The FastANI code was written by Chirag Jain and is distributed under the terms of the Apache License 2.0, unless otherwise specified in vendored sources. See vendor/FastANI/LICENSE for more information.
This project is in no way affiliated with, sponsored by, or otherwise endorsed by the original FastANI authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.