......@@ -2,6 +2,9 @@
# Created by https://www.gitignore.io/api/python
# Edit at https://www.gitignore.io/?templates=python
# HMM files
gecco/hmmer/*.h3m
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
......@@ -31,7 +34,6 @@ share/python-wheels/
.installed.cfg
*.egg
MANIFEST
pyproject.toml
# PyInstaller
# Usually these files are written by a python script from a template
......
......@@ -25,6 +25,7 @@ variables:
- python -m coverage combine
- python -m coverage xml
- python -m coverage report
coverage: '/^TOTAL.+?(\d+\%)$/'
artifacts:
expire_in: 1 week
reports:
......@@ -133,6 +134,20 @@ deploy:changelog:
script:
- chandler push --github="zellerlab/GECCO" --changelog="CHANGELOG.md"
deploy:releases:
image: python:3.9
stage: deploy
only:
- tags
before_script:
- python -m pip install -U tqdm pyhmmer
- wget "https://github.com/github-release/github-release/releases/download/v0.10.0/linux-amd64-github-release.bz2" -O- | bunzip2 > ./github-release
- chmod +x ./github-release
script:
- python setup.py build_data --inplace --rebuild
after_script:
- for hmm in gecco/hmmer/*.h3m; do gzip $hmm; ./github-release upload --user zellerlab --repo GECCO --tag "$CI_COMMIT_TAG" --name "$(basename $hmm).gz" --file "$hmm.gz" ; done
deploy:sdist:
image: python:3.9
stage: deploy
......@@ -143,7 +158,7 @@ deploy:sdist:
script:
- python setup.py sdist
- twine check dist/*.tar.gz
- twine upload --repository testpypi --skip-existing dist/*.tar.gz
# - twine upload --repository testpypi --skip-existing dist/*.tar.gz
deploy:wheel:
image: python:3.9
......
......@@ -5,11 +5,22 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
## [Unreleased]
[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.1...master
[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.2...master
## [v0.5.2] - 2021-01-29
[v0.5.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.1...v0.5.2
### Added
- Support for downloading HMM files directly from GitHub releases assets.
- Validation of filtered HMMs with MD5 checksum.
### Fixed
- Invalid coordinates of protein domains in GenBank output files.
- `gecco.interpro` module not being added to wheel distribution.
### Changed
- Bump required `pyhmmer` version to `v0.2.1`.
## [v0.5.1] - 2021-01-15
[v0.5.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.0...v0.5.1
### Fixed
### Fixed
- `--hmm` flag being ignored in `gecco run` command.
- `PyHMMER` using HMM names instead of accessions, causing issues with Pfam HMMs.
......
# Contributing to GECCO
# Contributing
For bug fixes or new features, please file an issue before submitting a pull request.
If the change is not trivial, it may be best to wait for feedback.
......@@ -38,7 +38,7 @@ To bump the version of the internal HMMs (for instance, to switch to a newer
version of Pfam), simply edit the INI file for that HMM in the
``gecco/hmmer`` folder.
The simply clean and rebuild data files to download the latest version of
Then simply clean and rebuild data files to download the latest version of
the HMMs:
```console
......
<img align="left" width="100" height="100" src="static/gecco-square.png">
<img align="right" width="180" height="180" src="static/gecco-square.png">
# Hi, I'm GECCO!
[![GitLabCI](https://img.shields.io/gitlab/pipeline/grp-zeller/GECCO/master?gitlab_url=https%3A%2F%2Fgit.embl.de&logo=gitlab&style=flat-square&maxAge=600)](https://git.embl.de/grp-zeller/GECCO)
[![License](https://img.shields.io/badge/license-GPLv3-blue.svg?style=flat-square&maxAge=2678400)](https://choosealicense.com/licenses/gpl-3.0/)
[![Source](https://img.shields.io/badge/source-GitHub-303030.svg?maxAge=2678400&style=flat-square)](https://github.com/zellerlab/GECCO/)
[![Mirror](https://img.shields.io/badge/mirror-EMBL-009f4d?style=flat-square&maxAge=2678400)](https://git.embl.de/grp-zeller/GECCO/)
[![Changelog](https://img.shields.io/badge/keep%20a-changelog-8A0707.svg?maxAge=2678400&style=flat-square)](https://github.com/zellerlab/GECCO/blob/master/CHANGELOG.md)
<!-- [![Coverage](https://img.shields.io/codecov/c/gh/zellerlab/GECCO?style=flat-square&maxAge=3600)](https://codecov.io/gh/zellerlab/GECCO/) -->
<!-- [![PyPI](https://img.shields.io/pypi/v/GECCO.svg?style=flat-square&maxAge=3600)](https://pypi.org/project/gecco-bgc) -->
<!-- [![Wheel](https://img.shields.io/pypi/wheel/GECCO.svg?style=flat-square&maxAge=3600)](https://pypi.org/project/gecco-bgc/#files) -->
<!-- [![Python Versions](https://img.shields.io/pypi/pyversions/gecco-bgc.svg?style=flat-square&maxAge=3600)](https://pypi.org/project/gecco-bgc/#files) -->
<!-- [![GitHub issues](https://img.shields.io/github/issues/zellerlab/GECCO.svg?style=flat-square&maxAge=600)](https://github.com/zellerlab/GECCO/issues) -->
<!-- [![Docs](https://img.shields.io/readthedocs/gecco/latest?style=flat-square&maxAge=600)](https://gecco.readthedocs.io) -->
<!-- [![Downloads](https://img.shields.io/badge/dynamic/json?style=flat-square&color=303f9f&maxAge=86400&label=downloads&query=%24.total_downloads&url=https%3A%2F%2Fapi.pepy.tech%2Fapi%2Fprojects%2Fgecco)](https://pepy.tech/project/GECCO) -->
## 🦎 ️Overview
GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and
scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs)
in genomic and metagenomic data using Conditional Random Fields (CRFs).
[![GitLabCI](https://img.shields.io/gitlab/pipeline/grp-zeller/GECCO/master?gitlab_url=https%3A%2F%2Fgit.embl.de&style=flat-square&maxAge=600)](https://git.embl.de/grp-zeller/GECCO/-/pipelines/)
[![License](https://img.shields.io/badge/license-GPLv3-blue.svg?style=flat-square&maxAge=2678400)](https://choosealicense.com/licenses/gpl-3.0/)
[![Coverage](https://img.shields.io/codecov/c/gh/zellerlab/GECCO?style=flat-square&maxAge=600)]( https://codecov.io/gh/zellerlab/GECCO/)
[![Docs](https://img.shields.io/badge/docs-gecco.embl.de-green.svg?maxAge=2678400&style=flat-square)](https://gecco.embl.de)
[![Source](https://img.shields.io/badge/source-GitHub-303030.svg?maxAge=2678400&style=flat-square)](https://github.com/zellerlab/GECCO/)
[![Mirror](https://img.shields.io/badge/mirror-EMBL-009f4d?style=flat-square&maxAge=2678400)](https://git.embl.de/grp-zeller/GECCO/)
[![Changelog](https://img.shields.io/badge/keep%20a-changelog-8A0707.svg?maxAge=2678400&style=flat-square)](https://github.com/zellerlab/GECCO/blob/master/CHANGELOG.md)
[![Issues](https://img.shields.io/github/issues/zellerlab/GECCO.svg?style=flat-square&maxAge=600)](https://github.com/zellerlab/GECCO/issues)
## 🔧 Installing GECCO
......@@ -32,11 +26,13 @@ PyPI, the Python Package Index.
Use `pip` to install GECCO on your machine:
```console
$ pip install gecco
$ pip install https://github.com/zellerlab/GECCO/archive/master.zip
```
This will download GECCO, its dependencies, and the data needed to run
predictions. Once done, you will have a ``gecco`` command in your $PATH.
This will install GECCO, its dependencies, and the data needed to run
predictions. This requires around 100MB of data to be downloaded, so
it could take some time depending on your connection. Once done, you will
have a ``gecco`` command available in your $PATH.
*Note that GECCO uses [HMMER3](http://hmmer.org/), which can only run
on PowerPC and recent x86-64 machines running a POSIX operating system.
......
......@@ -38,7 +38,7 @@ def setup(app):
import gecco
project = 'GECCO'
copyright = '2020, {}'.format(gecco.__author__)
copyright = '2020-{}, Zeller group, EMBL'.format(datetime.datetime.now().year)
author = gecco.__author__
# The parsed semantic version
......@@ -122,12 +122,14 @@ html_theme_options = {
"navbar_pagenav": False,
# A list of tuples containing pages or urls to link to.
"navbar_links": [
("GitLab", _parser.get("metadata", "home-page").strip(), True)
] + [
(k, v, True)
for k, v in project_urls.items()
if k not in {"Documentation", "Changelog"}
("GitHub", _parser.get("metadata", "home-page").strip(), True),
("CI", project_urls["Builds"], True)
],
# + [
# (k, v, True)
# for k, v in project_urls.items()
# if k in {"Builds"}
# ],
# Render admonition using panels instead of alerts (this is a PR under
# review there: https://github.com/ryan-roemer/sphinx-bootstrap-theme/pull/199)
# but hopefully it will get merged one day.
......
......@@ -4,10 +4,108 @@ GECCO
*Biosynthetic Gene Cluster prediction with Conditional Random Fields.*
|GitLabCI| |Coverage| |License| |Source| |Mirror| |Issues|
.. |GitLabCI| image:: https://img.shields.io/gitlab/pipeline/grp-zeller/GECCO/master?gitlab_url=https%3A%2F%2Fgit.embl.de&logo=gitlab&style=flat-square&maxAge=600
:target: https://git.embl.de/grp-zeller/GECCO/-/pipelines
.. |Coverage| image:: https://img.shields.io/codecov/c/gh/zellerlab/GECCO?logo=codecov&style=flat-square&maxAge=600
:target: https://codecov.io/gh/zellerlab/GECCO/
.. |License| image:: https://img.shields.io/badge/license-GPLv3-blue.svg?logo=gnu&style=flat-square&maxAge=3600
:target: https://choosealicense.com/licenses/gpl-3.0/
.. |Source| image:: https://img.shields.io/badge/source-GitHub-303030.svg?logo=git&maxAge=2678400&style=flat-square
:target: https://github.com/zellerlab/GECCO/
.. |Mirror| image:: https://img.shields.io/badge/mirror-EMBL-009f4d?logo=git&style=flat-square&maxAge=2678400
:target: https://git.embl.de/grp-zeller/GECCO/
.. |Issues| image:: https://img.shields.io/github/issues/zellerlab/GECCO.svg?logo=github&style=flat-square&maxAge=600
:target: https://github.com/zellerlab/GECCO/issues
Overview
--------
GECCO is a fast and scalable method for identifying putative novel Biosynthetic
Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random
Fields (CRFs).
GECCO is developed in the `Zeller group <https://www.embl.de/research/units/scb/zeller/index.html>`_
and is part of the suite of computational microbiome analysis tools hosted
at `EMBL <https://www.embl.de>`_.
Quickstart
----------
Setup
^^^^^
GECCO is implemented in `Python <https://www.python.org/>`_, and supports
`all versions <https://endoflife.date/python>`_ from Python 3.6. Install
GECCO with ``pip``:
.. code:: console
$ pip install https://github.com/zellerlab/GECCO/archive/master.zip
Predictions
^^^^^^^^^^^
GECCO works with DNA sequences, and loads them using Biopython, allowing it
to support a `large variety of formats <https://biopython.org/wiki/SeqIO>`_,
including the common FASTA and GenBank files.
Run a prediction on a FASTA file named ``sequence.fna`` and output the
predictions to the current directory:
.. code:: console
$ gecco run --genome sequence.fna
Output
^^^^^^
GECCO will create the following files once done (using the same prefix as
the input file):
- ``{sequence}.features.tsv``: The *features* file, containing the identified
proteins and domains in the input sequences.
- ``{sequence}.clusters.tsv``: If any were found, a *clusters* file, containing
the coordinates of the predicted clusters, along with their putative biosynthetic
type.
- ``{sequence}_cluster_{N}.gbk``: If any were found, a GenBank file per cluster,
containing the cluster sequence annotated with its member proteins and domains.
Feedback
--------
Contact
^^^^^^^
If you have any questions about GECCO, if you run into any issue, or if you
would like to make a feature request, please create an
`issue in the GitHub repository <https://github.com/zellerlab/GECCO/issues/new>`_.
You can also directly contact `Martin Larralde via email <mailto:martin.larralde@embl.de>`_.
Contributing
^^^^^^^^^^^^
If you want to contribute to GECCO, please have a look at the
contribution guide first, and feel free to open a pull
request on the `GitHub repository <https://github.com/zellerlab/GECCO>`_.
Documentation
-------------
.. rubric:: Guides
Guides
^^^^^^
.. toctree::
:maxdepth: 1
......@@ -16,10 +114,11 @@ Documentation
Training <training>
Contributing <contributing>
.. rubric:: Library
Library
^^^^^^^
.. toctree::
:maxdepth: 2
:maxdepth: 1
API reference <api/index>
Changelog <changes>
......@@ -33,22 +132,14 @@ GECCO is released under the
*or later*, and is fully open-source. The ``LICENSE`` file distributed with
the software contains the complete license text.
About
-----
GECCO is developed by the Zeller Team at the European Molecular Biology Laboratory
in Heidelberg. The following individuals contributed to the development of
GECCO:
GECCO is developed by the `Zeller group <https://www.embl.de/research/units/scb/zeller/index.html>`_
at the European Molecular Biology Laboratory in Heidelberg. The following
individuals contributed to the development of GECCO:
- `Laura M. Carroll <https://github.com/lmc297>`_
- `Martin Larralde <https://github.com/althonos>`_
- `Jonas S. Fleck <https://github.com/astair>`_
Indices and tables
------------------
- :ref:`genindex`
- :ref:`modindex`
- :ref:`search`
- `Jonas S. Fleck <https://github.com/joschif>`_
Installation
============
PyPi
^^^^
GECCO is hosted on the EMBL Git server, but the easiest way to install it is
to download the latest release from its `PyPi repository <https://pypi.python.org/pypi/gecco>`_.
It will install all dependencies then install the ``gecco`` module:
Requirements
^^^^^^^^^^^^
.. code:: console
GECCO requires additional libraries that can be installed directly from PyPI,
the Python Package Index. Contrary to other tools in the field
(such as DeepBGC or AntiSMASH), it does not require any external binary.
$ pip install gecco
.. PyPi
.. ^^^^
..
.. GECCO is hosted on the EMBL Git server, but the easiest way to install it is
.. to download the latest release from its `PyPi repository <https://pypi.python.org/pypi/gecco>`_.
.. It will install all dependencies then install the ``gecco`` module:
..
.. .. code:: console
..
.. $ pip install gecco
.. Conda
.. ^^^^^
......@@ -27,12 +36,15 @@ It will install all dependencies then install the ``gecco`` module:
Git + ``pip``
^^^^^^^^^^^^^
If, for any reason, you prefer to download the library from the git repository,
you can clone the repository and install the repository by running:
Until GECCO is released on PyPI, you can install it from the GitHub repository
directly with ``pip``:
.. If, for any reason, you prefer to download the library from the git repository,
.. you can clone the repository and install the repository by running:
.. code:: console
$ pip install https://git.embl.de/grp-zeller/GECCO/-/archive/master/GECCO-master.zip
$ pip install git+https://github.com/zellerlab/GECCO
Keep in mind this will install the development version of the library, so not
......
......@@ -9,7 +9,8 @@ dependencies:
.. code-block:: console
$ pip install gecco[train]
$ git clone https://github.com/zellerlab/GECCO
$ pip install ./GECCO[train]
This will install additional Python packages, such as `pandas <https://pandas.pydata.org/>`_
which is needed to process the feature tables, or `fisher <https://pypi.org/project/fisher>`_
......
......@@ -90,7 +90,7 @@ class Run(Command): # noqa: D101
self.args["--threshold"] = float(self.args["--threshold"])
# Check the `--jobs` flag
self.args["--jobs"] = int(self.args["--jobs"]) or multiprocessing.cpu_count()
self.args["--jobs"] = int(self.args["--jobs"])
# Check the input exists
if not os.path.exists(self.args["--genome"]):
......@@ -146,19 +146,12 @@ class Run(Command): # noqa: D101
self.logger.info("Running domain annotation")
# Run all HMMs over ORFs to annotate with protein domains
def annotate(hmm: HMM) -> None:
self.logger.debug(
"Starting annotation with HMM {} v{}", hmm.id, hmm.version
)
hmms = self._custom_hmms() if self.args["--hmm"] else embedded_hmms()
for hmm in hmms:
self.logger.debug("Starting annotation with HMM {} v{}", hmm.id, hmm.version)
features = PyHMMER(hmm, self.args["--jobs"]).run(genes)
self.logger.debug("Finished running HMM {}", hmm.id)
with multiprocessing.pool.ThreadPool(min(self.args["--jobs"], 2)) as pool:
if self.args["--hmm"]:
pool.map(annotate, self._custom_hmms())
else:
pool.map(annotate, embedded_hmms())
# Count number of annotated domains
count = sum(1 for gene in genes for domain in gene.protein.domains)
self.logger.debug("Found {} domains across all proteins", count)
......
*.hmm
*.hmm.h3*
*.hmm.gz
*.xml.gz
......@@ -3,3 +3,5 @@ id = Pfam
version = 33.1
url = ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/Pfam-A.hmm.gz
relabel_with = s/(PF\d+).\d+/\1/
md5 = 7795fabbbbc68b9c93aa14dbec371894
......@@ -2,3 +2,5 @@
id = Tigrfam
version = 15.0
url = http://ftp.ncbi.nlm.nih.gov/hmm/TIGRFAMs/release_15.0/TIGRFAMs_15.0_HMM.LIB.gz
md5 = 050d50759f60d102466c319fe6da69e6
......@@ -85,6 +85,7 @@ class HMM(typing.NamedTuple):
url: str
path: str
relabel_with: Optional[str] = None
md5: Optional[str] = None
def relabel(self, domain: str) -> str:
"""Rename a domain obtained by this HMM to the right accession.
......
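The `relabel_with` entry in the Pfam INI file is a sed-style substitution (`s/(PF\d+).\d+/\1/`) that strips the version suffix from an accession. The body of `HMM.relabel` is not shown in this diff, so the following is only a sketch of how such an expression could be applied in Python; the helper name `apply_sed_substitution` is hypothetical, and the naive parser assumes the `/` delimiter never appears inside the pattern or the replacement.

```python
import re


def apply_sed_substitution(expression: str, text: str) -> str:
    """Apply a simple ``s/pattern/replacement/`` expression to ``text``."""
    # Naive split on the delimiter: yields the command, the pattern,
    # the replacement and the (possibly empty) trailing flags.
    command, pattern, replacement, _flags = expression.split("/")
    if command != "s":
        raise ValueError(f"unsupported expression: {expression!r}")
    # sed-style back-references such as ``\1`` are also valid for re.sub.
    return re.sub(pattern, replacement, text)


# Strip the version suffix from a versioned Pfam accession.
print(apply_sed_substitution(r"s/(PF\d+).\d+/\1/", "PF00001.21"))
```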
......@@ -125,7 +125,7 @@ class Domain:
"""
stride = 1 if protein_coordinates else 3
loc = FeatureLocation(0, (self.end - self.start)*stride)
loc = FeatureLocation((self.start-1)*stride, self.end*stride)
qualifiers = dict(self.qualifiers)
qualifiers.setdefault("standard_name", self.name)
return SeqFeature(location=loc, type="misc_feature", qualifiers=qualifiers)
......@@ -207,8 +207,7 @@ class Gene:
"""
# NB(@althonos): we use inclusive 1-based ranges in the data model
# but Biopython expects 0-based ranges with exclusive ends
end = self.end - self.start + 1
loc = FeatureLocation(start=0, end=end, strand=int(self.strand))
loc = FeatureLocation(start=self.start-1, end=self.end, strand=int(self.strand))
qualifiers = dict(self.qualifiers)
qualifiers.setdefault("locus_tag", self.protein.id)
qualifiers.setdefault("translation", str(self.protein.seq))
......@@ -361,9 +360,8 @@ class Cluster:
for gene in self.genes:
# write gene as a /cds GenBank record
cds = gene.to_seq_feature()
cds.location += gene.start - self.start
bgc.features.append(cds)
# # write domains as /misc_feature annotations
# write domains as /misc_feature annotations
for domain in gene.protein.domains:
misc = domain.to_seq_feature(protein_coordinates=False)
misc.location += cds.location.start
......
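The coordinate fix in this file converts the data model's inclusive 1-based ranges into the 0-based, end-exclusive ranges that Biopython expects, scaling domain positions by 3 when they must be expressed in nucleotides. A minimal sketch of that arithmetic, using plain tuples instead of `FeatureLocation`; the helper names `gene_span` and `domain_span` are hypothetical and not part of GECCO:

```python
from typing import Tuple


def gene_span(start: int, end: int) -> Tuple[int, int]:
    """Convert an inclusive 1-based range to a 0-based half-open range."""
    return (start - 1, end)


def domain_span(
    start: int, end: int, protein_coordinates: bool = False
) -> Tuple[int, int]:
    """Convert an inclusive 1-based domain range, given in residues, to a
    0-based half-open range, in nucleotides unless `protein_coordinates`."""
    stride = 1 if protein_coordinates else 3
    return ((start - 1) * stride, end * stride)


# A gene covering bases 101..400 (inclusive) spans 300 nucleotides.
g_start, g_end = gene_span(101, 400)
assert g_end - g_start == 300

# A domain covering residues 11..50 of the protein, first relative to the
# gene start, then shifted by the gene start as the cluster writer does.
d_start, d_end = domain_span(11, 50)
print(g_start + d_start, g_start + d_end)
```

With half-open ranges, the length of a feature is simply `end - start`, which is why the `+ 1` from the old code disappears.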
[build-system]
requires = ['setuptools >=39.2', 'tqdm ~=4.41', 'pyhmmer ~=0.2.1', 'wheel >=0.30']
build-backend = "setuptools.build_meta"
......@@ -3,7 +3,7 @@ name = gecco
version = file: gecco/_version.txt
author = Martin Larralde
author-email = martin.larralde@embl.de
home-page = https://git.embl.de/grp-zeller/GECCO
home-page = https://github.com/zellerlab/GECCO
description = Gene cluster prediction with Conditional random fields.
long-description = file: README.md
long_description_content_type = text/markdown
......@@ -22,6 +22,12 @@ classifiers =
Topic :: Scientific/Engineering :: Bio-Informatics
Topic :: Scientific/Engineering :: Medical Science Apps.
Typing :: Typed
project_urls =
Documentation = https://gecco.embl.de/
Bug Tracker = https://github.com/zellerlab/GECCO/issues
Changelog = https://github.com/zellerlab/GECCO/blob/master/CHANGELOG.md
Coverage = https://codecov.io/gh/zellerlab/GECCO/
Builds = https://git.embl.de/grp-zeller/GECCO/-/pipelines
[options]
zip_safe = false
......@@ -31,7 +37,8 @@ python_requires = >= 3.6
setup_requires =
setuptools >=39.2
tqdm ~=4.41
pyhmmer ~=0.1.4
pyhmmer ~=0.2.1
wheel >=0.30
install_requires =
better-exceptions ~=0.2.2
biopython ~=1.78
......@@ -39,7 +46,7 @@ install_requires =
dataclasses ~=0.8 ; python_version < '3.7'
docopt ~=0.6.2
numpy ~=1.16
pyhmmer ~=0.1.4
pyhmmer ~=0.2.1
pyrodigal ~=0.4.1
scikit-learn ~=0.24.0
scipy ~=1.4
......@@ -56,11 +63,12 @@ train =
[options.packages.find]
include =
gecco
gecco.cli
gecco.cli.commands
gecco.crf
gecco.hmmer
gecco.interpro
gecco.types
gecco.cli
gecco.cli.commands
[options.package_data]
gecco = _version.txt, py.typed
......
......@@ -13,6 +13,7 @@ import os
import re
import shutil
import ssl
import struct
import sys
import tarfile
import urllib.request
......@@ -111,8 +112,13 @@ class update_model(setuptools.Command):
with gzip.open(path, "wt") as dest:
json.dump(entries, dest)
# Rebuild the HMMs using the new domains from the model
build_data = self.get_finalized_command("build_data")
build_data.force = build_data.rebuild = True
build_data.run()
def download_interpro_entries(self, db):
next = "https://www.ebi.ac.uk:443/interpro/api/entry/all/{}/?page_size=200".format(db)
next = f"https://www.ebi.ac.uk:443/interpro/api/entry/all/{db}/?page_size=200"
entries = []
context = ssl._create_unverified_context()
with tqdm(desc=db, leave=False) as pbar:
......@@ -132,11 +138,16 @@ class build_data(setuptools.Command):
description = "download the HMM libraries used by GECCO to annotate proteins"
user_options = [
("inplace", "i", "ignore build-lib and put data alongside your Python code")
("force", "f", "force downloading the files even if they exist"),
("inplace", "i", "ignore build-lib and put data alongside your Python code"),
("rebuild", "r", "rebuild the HMM from the source instead of getting "
"the pre-filtered HMM files from GitHub"),
]
def initialize_options(self):
self.force = False
self.inplace = False
self.rebuild = False
def finalize_options(self):
_build_py = self.get_finalized_command("build_py")
......@@ -146,6 +157,7 @@ class build_data(setuptools.Command):
self.announce(msg, level=2)
def run(self):
# make sure the build/lib/ folder exists
self.mkpath(self.build_lib)
# Check `tqdm` and `pyhmmer` are installed
......@@ -156,7 +168,7 @@ class build_data(setuptools.Command):
# Load domain whitelist from the type classifier data
domains_file = os.path.join("gecco", "types", "domains.tsv")
self.info("loading domain accessions from {}".format(domains_file))
self.info(f"loading domain accessions from {domains_file}")
with open(domains_file, "rb") as f:
domains = [line.strip() for line in f]
......@@ -164,33 +176,71 @@ class build_data(setuptools.Command):
for in_ in glob.iglob(os.path.join("gecco", "hmmer", "*.ini")):
local = os.path.join(self.build_lib, in_).replace(".ini", ".h3m")
self.mkpath(os.path.dirname(local))
self.make_file([in_], local, self.download, (in_, domains))
self.download(in_, local, domains)
if self.inplace:
copy = in_.replace(".ini", ".h3m")
self.make_file([local], copy, shutil.copy, (local, copy))
def download(self, in_, domains):
def download(self, in_, local, domains):
# read the configuration file for each HMM database
cfg = configparser.ConfigParser()
cfg.read(in_)
out = os.path.join(self.build_lib, in_.replace(".ini", ".h3m"))
options = dict(cfg.items("hmm"))
# download the HMM to `local`, and delete the file if any error
# or interruption happens during the download
try:
self.download_hmm(out, domains, dict(cfg.items("hmm")))
except:
if os.path.exists(out):
os.remove(out)
self.make_file([in_], local, self.download_hmm, (local, domains, options))
except BaseException:
if os.path.exists(local):
os.remove(local)
raise
# update the MD5 if the HMMs are being rebuilt, otherwise
# check the hashes are consistent
if self.rebuild:
self._rebuild_checksum(in_, local, cfg)
else:
self._validate_checksum(local, options)
def download_hmm(self, output, domains, options):
base = "https://github.com/zellerlab/GECCO/releases/download/v{version}/{id}.hmm.gz"
# try getting the GitHub artifacts first, unless asked to rebuild
if not self.rebuild:
try:
self._download_release_hmm(output, domains, options)
except urllib.error.HTTPError:
pass
else:
return
# fallback to filtering the HMMs from their release location
self._rebuild_hmm(output, domains, options)
def _download_release_hmm(self, output, domains, options):
# build the GitHub releases URL
base = "https://github.com/zellerlab/GECCO/releases/download/v{version}/{id}.h3m.gz"
url = base.format(id=options["id"], version=self.distribution.get_version())
# attempt to use the GitHub releases URL, otherwise fallback to official URL
try:
self.announce("fetching {}".format(url), level=2)
response = urllib.request.urlopen(url)
except urllib.error.HTTPError:
self.announce("using fallback {}".format(options["url"]), level=2)
response = urllib.request.urlopen(options["url"])
# fetch the resource
self.info(f"fetching {url}")
response = urllib.request.urlopen(url)
# use `tqdm` to make a progress bar
pbar = tqdm.wrapattr(
response,
"read",
total=int(response.headers["Content-Length"]),
desc=os.path.basename(output),
leave=False,
)
# download to `output`
with contextlib.ExitStack() as ctx:
dl = ctx.enter_context(pbar)
src = ctx.enter_context(gzip.open(dl))
dst = ctx.enter_context(open(output, "wb"))
shutil.copyfileobj(src, dst)
def _rebuild_hmm(self, output, domains, options):
# fetch the resource from the URL in the ".ini" files
self.info(f"using fallback {options['url']}")
response = urllib.request.urlopen(options["url"])
# use `tqdm` to make a progress bar
pbar = tqdm.wrapattr(
response,
......@@ -199,14 +249,42 @@ class build_data(setuptools.Command):
desc=os.path.basename(output),
leave=False,
)
# download the HMM
with pbar as src:
with open(output, "wb") as dst:
nwritten = 0
for hmm in HMMFile(gzip.open(src)):
if hmm.accession.split(b".")[0] in domains:
hmm.write(dst, binary=True)
nwritten += 1
# download to `output`
nsource = 0
nwritten = 0
with contextlib.ExitStack() as ctx:
dl = ctx.enter_context(pbar)
src = ctx.enter_context(gzip.open(dl))
dst = ctx.enter_context(open(output, "wb"))
for hmm in HMMFile(src):
nsource += 1
if hmm.accession.split(b".")[0] in domains:
nwritten += 1
hmm.write(dst, binary=True)
# log number of HMMs kept in final files
self.info(f"kept {nwritten} HMMs out of {nsource} in the source file")
def _checksum(self, path):
hasher = hashlib.md5()
with HMMFile(path) as hmm_file:
for hmm in hmm_file:
hasher.update(struct.pack("<I", hmm.checksum))
return hasher.hexdigest()
def _validate_checksum(self, local, options):
self.info(f"checking HMM/MD5 signature of {local}")
md5 = self._checksum(local)
if md5 != options["md5"]:
self.info("local HMM/MD5 does not match the expected HMM/MD5 hash")
self.info(f"(expected {options['md5']}, found {md5})")
self.info("rerun `python setup.py build_data --force`")
raise ValueError("HMM/MD5 hash mismatch")
def _rebuild_checksum(self, in_, local, cfg):
self.info(f"updating HMM/MD5 signature in {in_}")
cfg.set("hmm", "md5", self._checksum(local))
with open(in_, "w") as f:
cfg.write(f)
class build(_build):
......
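The `_checksum` method above does not hash the file bytes; it feeds the per-HMM checksum of every profile, packed as a little-endian unsigned 32-bit integer, into a single MD5 digest. The sketch below reproduces that idea on a plain list of integers so it can run without `pyhmmer`; the function name `combined_md5` and the sample values are assumptions made for illustration.

```python
import hashlib
import struct
from typing import Iterable


def combined_md5(checksums: Iterable[int]) -> str:
    """Fold a sequence of 32-bit checksums into one MD5 hex digest."""
    hasher = hashlib.md5()
    for checksum in checksums:
        # "<I" packs each value as 4 little-endian bytes, so the digest
        # does not depend on the byte order of the build machine.
        hasher.update(struct.pack("<I", checksum))
    return hasher.hexdigest()


# The digest depends on both the values and their order, so a reordered
# or truncated HMM file would not match the value stored in the INI file.
digest = combined_md5([0x1234ABCD, 0x00000042])
assert digest != combined_md5([0x00000042, 0x1234ABCD])
print(digest)
```

Hashing the per-profile checksums rather than the raw file presumably keeps the signature independent of how the binary file was serialized, although the diff does not state the motivation.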