Commits (19)
......@@ -2,6 +2,7 @@ stages:
- test
- pages
- lint
- deploy
variables:
TERM: ansi
......@@ -25,9 +26,11 @@ variables:
- python -m coverage xml
- python -m coverage report
artifacts:
expire_in: 1 week
reports:
cobertura: coverage.xml
# --- Stages -------------------------------------------------------------------
test:python3.6:
......@@ -45,15 +48,22 @@ test:python3.8:
test:python3.9:
image: python:3.9
<<: *test
artifacts:
expire_in: 1 week
paths:
- dist/*.whl
- coverage.xml
reports:
cobertura: coverage.xml
docs:
image: python:3.9
stage: pages
dependencies:
- test:python3.9
before_script:
- python -m pip install -U wheel coverage tqdm pyhmmer
- pip install -U -r docs/requirements.txt
- python setup.py build_data --inplace bdist_wheel
- python -m pip install --find-links=dist gecco
- pip install -U --find-links dist gecco[train]
script:
- sphinx-build -b html docs public
artifacts:
......@@ -94,3 +104,56 @@ lint:radon:
- pip install -U radon
script:
- radon cc -a gecco
deploy:codecov:
image: python:3.9
stage: deploy
dependencies:
- test:python3.9
before_script:
- pip install -U codecov
script:
- python -m codecov
deploy:codacy:
image: python:3.9
stage: deploy
dependencies:
- test:python3.9
before_script:
- pip install -U codacy-coverage
script:
- python -m codacy -r coverage.xml
deploy:changelog:
image: ruby
stage: deploy
before_script:
- gem install chandler
script:
- chandler push --github="zellerlab/GECCO" --changelog="CHANGELOG.md"
deploy:sdist:
image: python:3.9
stage: deploy
only:
- tags
before_script:
- python -m pip install -U wheel twine
script:
- python setup.py sdist
- twine check dist/*.tar.gz
- twine upload --repository testpypi --skip-existing dist/*.tar.gz
deploy:wheel:
image: python:3.9
stage: deploy
dependencies:
- test:python3.9
only:
- tags
before_script:
- python -m pip install -U wheel twine
script:
- twine check dist/*.whl
# - twine upload --repository testpypi dist/*.whl
......@@ -5,10 +5,16 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
## [Unreleased]
[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.0...master
[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.1...master
## [v0.5.1] - 2021-01-15
[v0.5.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.0...v0.5.1
### Fixed
- `--hmm` flag being ignored in the `gecco run` command.
- `PyHMMER` using HMM names instead of accessions, causing issues with Pfam HMMs.
## [v0.5.0] - 2021-01-11
[v0.4.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.5...v0.5.0
[v0.5.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.5...v0.5.0
### Added
- Explicit support for Python 3.9.
### Changed
......
# Contributing to GECCO
For bug fixes or new features, please file an issue before submitting a pull request.
If the change is not trivial, it may be best to wait for feedback.
## Running tests
Tests are written as regular Python unit tests using the `unittest` module of the
Python standard library. They can be run as follows:
```console
$ python -m unittest discover -vv
```
## Managing versions
As good project management practice, we should follow
[semantic versioning](https://semver.org/), so remember the following:
* As long as the model predicts the same thing, retraining/updating the model
should be considered a non-breaking change, so you should bump the MINOR
version of the program.
* Upgrading the internal HMMs could change the output but won't break the
program, so it should be treated as a non-breaking change: bump the MINOR
version of the program.
* If the model's predictions change (e.g. the predicted classes change), then you
should bump the MAJOR version of the program as it is a breaking change.
* Changes in the code should follow semantic versioning the usual way.
* Changes in the CLI should be treated like changes in the API (e.g. a new
CLI option or a new command bumps the MINOR version; removing an option
bumps the MAJOR version).
### Upgrading the internal HMMs
To bump the version of the internal HMMs (for instance, to switch to a newer
version of Pfam), simply edit the INI file for that HMM in the
``gecco/hmmer`` folder.
Then simply clean and rebuild the data files to download the latest version of
the HMMs:
```console
$ python setup.py clean build_data --inplace
```
### Upgrading the internal CRF model
After having trained a new version of the model, simply run the following
command to update the internal GECCO model as well as the hash signature file:
```console
$ python setup.py update_model --model <path_to_new_crf.model>
```
![](static/gecco.png)
<img align="left" width="100" height="100" src="static/gecco-square.png">
# Hi, I'm GECCO!
[![GitLabCI](https://img.shields.io/gitlab/pipeline/grp-zeller/GECCO/master?gitlab_url=https%3A%2F%2Fgit.embl.de&logo=gitlab&style=flat-square&maxAge=600)](https://git.embl.de/grp-zeller/GECCO)
[![License](https://img.shields.io/badge/license-GPLv3-blue.svg?style=flat-square&maxAge=2678400)](https://choosealicense.com/licenses/gpl-3.0/)
[![Source](https://img.shields.io/badge/source-GitHub-303030.svg?maxAge=2678400&style=flat-square)](https://github.com/zellerlab/GECCO/)
[![Mirror](https://img.shields.io/badge/mirror-EMBL-009f4d?style=flat-square&maxAge=2678400)](https://git.embl.de/grp-zeller/GECCO/)
[![Changelog](https://img.shields.io/badge/keep%20a-changelog-8A0707.svg?maxAge=2678400&style=flat-square)](https://github.com/zellerlab/GECCO/blob/master/CHANGELOG.md)
<!-- [![Coverage](https://img.shields.io/codecov/c/gh/zellerlab/GECCO?style=flat-square&maxAge=3600)](https://codecov.io/gh/zellerlab/GECCO/) -->
<!-- [![PyPI](https://img.shields.io/pypi/v/GECCO.svg?style=flat-square&maxAge=3600)](https://pypi.org/project/gecco-bgc) -->
<!-- [![Wheel](https://img.shields.io/pypi/wheel/GECCO.svg?style=flat-square&maxAge=3600)](https://pypi.org/project/gecco-bgc/#files) -->
<!-- [![Python Versions](https://img.shields.io/pypi/pyversions/gecco-bgc.svg?style=flat-square&maxAge=3600)](https://pypi.org/project/gecco-bgc/#files) -->
<!-- [![GitHub issues](https://img.shields.io/github/issues/zellerlab/GECCO.svg?style=flat-square&maxAge=600)](https://github.com/zellerlab/GECCO/issues) -->
<!-- [![Docs](https://img.shields.io/readthedocs/gecco/latest?style=flat-square&maxAge=600)](https://gecco.readthedocs.io) -->
<!-- [![Downloads](https://img.shields.io/badge/dynamic/json?style=flat-square&color=303f9f&maxAge=86400&label=downloads&query=%24.total_downloads&url=https%3A%2F%2Fapi.pepy.tech%2Fapi%2Fprojects%2Fgecco)](https://pepy.tech/project/GECCO) -->
## Requirements
* [Python](https://www.python.org/downloads/) 3.6 or higher
## 🦎 Overview
GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and
scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs)
in genomic and metagenomic data using Conditional Random Fields (CRFs).
## Installing GECCO
### Development version
## 🔧 Installing GECCO
To install GECCO from the EMBL git server, run:
```console
$ pip install git+https://git.embl.de/grp-zeller/GECCO/
```
Note that this command can take a long time to complete as it needs to download
around 250MB of data from the EBI FTP server. Once the install is finished, a
`gecco` command should be available in your path.
## Running GECCO
Once `gecco.py` is available in your `PATH`, you can run it from anywhere by
giving it a FASTA or GenBank file with the genome you want to analyze, as well
as an output directory.
GECCO is implemented in [Python](https://www.python.org/), and supports [all
versions](https://endoflife.date/python) from Python 3.6. It requires
additional libraries that can be installed directly from
PyPI, the Python Package Index.
Use `pip` to install GECCO on your machine:
```console
$ gecco run --genome some_genome.fna -o some_output_dir
$ pip install gecco
```
This will download GECCO, its dependencies, and the data needed to run
predictions. Once done, you will have a ``gecco`` command in your $PATH.
## Training GECCO
By default, GECCO is only installed with prediction support; to be able to train GECCO,
you need to install it with its training requirements:
*Note that GECCO uses [HMMER3](http://hmmer.org/), which can only run
on PowerPC and recent x86-64 machines running a POSIX operating system.
Therefore, Linux and OSX are supported platforms, but GECCO will not be able
to run on Windows.*
```console
$ pip install GECCO[train]
```
### Resources
## 🧬 Running GECCO
For this, you need to get the FASTA file from MIBiG containing all the proteins
of BGCs. It can be downloaded [from the MIBiG download page](https://mibig.secondarymetabolites.org/download).
Then you also need some bacterial genomes free of BGCs, also in FASTA format. A
common approach is to download a set of complete genomes from the ENA and to
remove the ones where a BGC is detected with either DeepBGC or antiSMASH.
### Build feature tables
With your training sequences ready, first build a feature table using the
`gecco annotate` subcommand:
Once `gecco` is installed, you can run it from the terminal by giving it a
FASTA or GenBank file with the genomic sequence you want to analyze, as
well as an output directory:
```console
$ gecco annotate --genome reference_genome.fna -o nobgc
$ gecco annotate --mibig mibig_prot_seqs_xx.faa -o bgc
```
Use the `--hmm` flag to provide other HMMs instead of using the internal one
(Pfam v31.0 only at the moment). **Make sure to use the `--mibig` input flag
and not `--proteins` when annotating MIBiG sequences to ensure the additional
metadata are properly extracted from the sequence ID of each protein.**
*This step will probably take a while; expect about 5 minutes to annotate
1M amino acids with Pfam.*
### Create the embedding
When both the BGC and the non-BGC sequences have been annotated, merge them into
a single continuous table to train the CRF on, using the `gecco embed` subcommand:
```console
$ gecco embed --bgc bgc/*.features.tsv --nobgc nobgc/*.features.tsv -o merged.features.tsv
```
If the non-BGC training set is not large enough to fit all BGCs, a warning will
be emitted.
### Train the model
To train the model with default parameters, use the following command:
```console
$ gecco train -i merged.features.tsv -o model
$ gecco run --genome some_genome.fna -o some_output_dir
```
Additional parameters of interest are:
- `--jobs`, which controls the number of threads that will be spawned by
GECCO whenever a step can be parallelized. The default, *0*, will
autodetect the number of CPUs on the machine using
[`multiprocessing.cpu_count`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.cpu_count).
- `--cds`, controlling the minimum number of consecutive genes a BGC region
must have to be detected by GECCO (default is 3).
- `--threshold`, controlling the minimum probability for a gene to be
considered part of a BGC region. Using a lower number will increase the
number (and possibly length) of predictions, but reduce accuracy.
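For instance, the flags above can be combined in a single invocation (a sketch only;
the genome file name and flag values are placeholders to adapt to your own data):

```console
# run a prediction on 8 threads, requiring at least 5 consecutive genes
# per region and lowering the probability threshold to 0.3
$ gecco run --genome some_genome.fna -o some_output_dir --jobs 8 --cds 5 --threshold 0.3
```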
## Housekeeping GECCO
## Versioning
As good project management practice, we should follow
[semantic versioning](https://semver.org/), so remember the following:
<!-- ## 📖 Documentation -->
* As long as the model predicts the same thing, retraining/updating the model
should be considered a non-breaking change, so you should bump the MINOR
version of the program.
* Upgrading the internal HMMs could change the output but won't break the
program, so it should be treated as a non-breaking change: bump the MINOR
version of the program.
* If the model's predictions change (e.g. the predicted classes change), then you
should bump the MAJOR version of the program as it is a breaking change.
* Changes in the code should follow semantic versioning the usual way.
* Changes in the CLI should be treated like changes in the API (e.g. a new
CLI option or a new command bumps the MINOR version; removing an option
bumps the MAJOR version).
## 💭 Feedback
### ⚠️ Issue Tracker
### Upgrading the internal HMMs
Found a bug? Have an enhancement request? Head over to the [GitHub issue
tracker](https://github.com/zellerlab/GECCO/issues) if you need to report
or ask something. If you are filing a bug report, please include as much
information as you can about the issue, and try to recreate the same bug
in a simple, easily reproducible situation.
To bump the version of the internal HMMs (for instance, to switch to a newer
version of Pfam), you will need to do the following:
### 🏗️ Contributing
- edit the `setup.py` file with the new URL to the HMM file.
- update the signature file in `gecco/data/hmms` with the MD5 checksum of the
new file (this can be found online, or computed locally with the `md5sum`
command as shown below, after having downloaded the file from a safe source).
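For example, the checksum can be computed like so (a sketch; the file name is a
placeholder for the HMM archive you actually downloaded):

```console
# compute the MD5 checksum of the downloaded HMM archive
$ md5sum Pfam-A.hmm.gz
```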
Contributions are more than welcome! See [`CONTRIBUTING.md`](https://github.com/zellerlab/GECCO/blob/master/CONTRIBUTING.md)
for more details.
## ⚖️ License
### Upgrading the internal CRF model
After having trained a new version of the model, simply run the following command
to update the internal GECCO model as well as the hash signature file:
```console
$ python setup.py update_model --model <path_to_new_crf.model>
```
This software is provided under the [GNU General Public License v3.0 *or later*](https://choosealicense.com/licenses/gpl-3.0/). GECCO is developed by the [Zeller Team](https://www.embl.de/research/units/scb/zeller/index.html)
at the [European Molecular Biology Laboratory](https://www.embl.de/) in Heidelberg.
_build
changes.md
Type Prediction
===============
Domain Annotation
=================
.. currentmodule:: gecco.hmmer
.. automodule:: gecco.hmmer
......
../CHANGELOG.md
\ No newline at end of file
......@@ -32,10 +32,6 @@ def setup(app):
app.add_css_file("css/main.css")
app.add_js_file("js/apitoc.js")
app.add_js_file("js/example-admonition.js")
# Copy `CHANGELOG.md` from project directory
changelog_src = os.path.join(project_dir, "CHANGELOG.md")
changelog_dst = os.path.join(docssrc_dir, "changes.md")
shutil.copy(changelog_src, changelog_dst)
# -- Project information -----------------------------------------------------
......
../CONTRIBUTING.md
\ No newline at end of file
......@@ -13,6 +13,8 @@ Documentation
:maxdepth: 1
Installation <install>
Training <training>
Contributing <contributing>
.. rubric:: Library
......
......@@ -4,18 +4,18 @@ Installation
PyPi
^^^^
``GECCO`` is hosted on the EMBL Git server, but the easiest way to install it is
GECCO is hosted on the EMBL Git server, but the easiest way to install it is
to download the latest release from its `PyPi repository <https://pypi.python.org/pypi/gecco>`_.
It will install all dependencies and then the ``gecco`` module:
.. code:: console
$ pip install --user gecco
$ pip install gecco
.. Conda
.. ^^^^^
..
.. Pronto is also available as a `recipe <https://anaconda.org/bioconda/GECCO>`_
.. GECCO is also available as a `recipe <https://anaconda.org/bioconda/GECCO>`_
.. in the `bioconda <https://bioconda.github.io/>`_ channel. To install, simply
.. use the `conda` installer:
..
......
Training
========
Requirements
^^^^^^^^^^^^
In order to train GECCO, you will need to have it installed with the *train*
dependencies:
.. code-block:: console
$ pip install gecco[train]
This will install additional Python packages, such as `pandas <https://pandas.pydata.org/>`_,
which is needed to process the feature tables, or `fisher <https://pypi.org/project/fisher/>`_,
which is used to select the most informative features.
Resources
^^^^^^^^^
You are free to use sequences coming from anywhere as your *positive* regions.
GECCO was trained to detect Biosynthetic Gene Clusters, so the
`MIBiG <https://mibig.secondarymetabolites.org/>`_ database was used; you need
to get the FASTA file from MIBiG containing all BGC proteins. It can be found
`on the Download page of MIBiG <https://mibig.secondarymetabolites.org/download>`_.
You also need to have some *negative* regions, such as bacterial genomes, and
make sure they are free of *positives* (e.g., they must not contain any BGCs).
A common approach is to download a set of complete genomes from the
`ENA <https://www.ebi.ac.uk/ena/browser/home>`_ and to remove the regions where
positives are detected (e.g. removing BGCs found by `antiSMASH <https://antismash.secondarymetabolites.org/>`_).
Features
^^^^^^^^
GECCO does not train on sequences directly, but on feature tables. To obtain
a feature table from the sequences, the ``gecco annotate`` subcommand can be
used:
.. code-block:: console
$ gecco annotate --genome reference_genome.fna -o nobgc
$ gecco annotate --mibig mibig_prot_seqs_xx.faa -o bgc
Use the ``--hmm`` flag with the path to a different HMM file to use other HMMs than
the internal ones from GECCO (a subset of Pfam v33.1 and Tigrfam v15.0). **Make
sure the ``--mibig`` input flag is used when annotating MIBiG sequences to ensure
the metadata are properly extracted from the sequence ID of each protein in the
input file.**
*This step takes a long time; expect about 5 minutes to annotate 1M amino acids
with Pfam alone.*
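For instance, assuming a Pfam HMM file was downloaded locally (the file name below
is only a placeholder), the reference genome could be annotated with it instead of
the internal HMMs:

.. code-block:: console

    $ gecco annotate --genome reference_genome.fna --hmm Pfam-A.hmm -o nobgc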
Embedding
^^^^^^^^^
When both the BGC and the non-BGC sequences have been annotated, merge them into
a continuous table to train the CRF on, using the ``gecco embed`` subcommand:
.. code-block:: console
$ gecco embed --bgc bgc/*.features.tsv --nobgc nobgc/*.features.tsv -o embedding.tsv
Fitting
^^^^^^^
To train the model with default parameters, you must have built a feature table
and a cluster table similar to the ones created by GECCO after a prediction.
Once that is done, use the following command to fit the weights of the CRF and
to train the BGC type classifier:
.. code-block:: console
$ gecco train --features embedding.features.tsv --clusters embedding.clusters.tsv -o model
Use the ``--select`` flag to select a fraction of the most informative features
before training to reduce the total feature set (for instance, use ``--select 0.3``
to keep the 30% of features with the lowest Fisher *p-values*).
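For instance, combining it with the command above (a sketch; the file names are
the same placeholders as before):

.. code-block:: console

    $ gecco train --features embedding.features.tsv --clusters embedding.clusters.tsv --select 0.3 -o model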
The command will output a folder (named ``model`` here) containing all the data
files necessary for predicting probabilities with ``gecco run``:
.. code-block:: console
$ gecco run --model model ...
......@@ -64,6 +64,8 @@ class Run(Command): # noqa: D101
Parameters - Debug:
--model <directory> the path to an alternative CRF model
to use (obtained with `gecco train`).
--hmm <hmm> the path to one or more alternative HMM
files to use (in HMMER format).
"""
def _check(self) -> typing.Optional[int]:
......@@ -95,8 +97,28 @@ class Run(Command): # noqa: D101
self.logger.error("could not locate input file {!r}", input)
return 1
# check the HMMs exist
for hmm in self.args["--hmm"]:
if not os.path.exists(hmm):
self.logger.error("could not locate HMM file {!r}", hmm)
return 1
return None
def _custom_hmms(self):
for path in self.args["--hmm"]:
base = os.path.basename(path)
if base.endswith(".gz"):
base, _ = os.path.splitext(base)
base, _ = os.path.splitext(base)
yield HMM(
id=base,
version="?",
url="?",
path=path,
relabel_with=r"s/([^\.]*)(\..*)?/\1/"
)
def __call__(self) -> int: # noqa: D102
# Make output directory
out_dir = self.args["--output-dir"]
......@@ -132,7 +154,10 @@ class Run(Command): # noqa: D101
self.logger.debug("Finished running HMM {}", hmm.id)
with multiprocessing.pool.ThreadPool(min(self.args["--jobs"], 2)) as pool:
pool.map(annotate, embedded_hmms())
if self.args["--hmm"]:
pool.map(annotate, self._custom_hmms())
else:
pool.map(annotate, embedded_hmms())
# Count number of annotated domains
count = sum(1 for gene in genes for domain in gene.protein.domains)
......
......@@ -208,7 +208,7 @@ class PyHMMER(object):
for hit in hits:
target_name = hit.name.decode('utf-8')
for domain in hit.domains:
raw_acc = domain.alignment.hmm_name
raw_acc = domain.alignment.hmm_accession
accession = self.hmm.relabel(raw_acc.decode('utf-8'))
entry = interpro.by_accession.get(accession)
......
......@@ -31,7 +31,7 @@ python_requires = >= 3.6
setup_requires =
setuptools >=39.2
tqdm ~=4.41
pyhmmer ~=0.1.1
pyhmmer ~=0.1.4
install_requires =
better-exceptions ~=0.2.2
biopython ~=1.78
......@@ -39,7 +39,7 @@ install_requires =
dataclasses ~=0.8 ; python_version < '3.7'
docopt ~=0.6.2
numpy ~=1.16
pyhmmer ~=0.1.2
pyhmmer ~=0.1.4
pyrodigal ~=0.4.1
scikit-learn ~=0.24.0
scipy ~=1.4
......@@ -54,7 +54,13 @@ train =
tqdm ~=4.41
[options.packages.find]
include = gecco, gecco.crf, gecco.hmmer, gecco.types, gecco.cli
include =
gecco
gecco.crf
gecco.hmmer
gecco.types
gecco.cli
gecco.cli.commands
[options.package_data]
gecco = _version.txt, py.typed
......
#!/usr/bin/env python
import configparser
import contextlib
import csv
import glob
import gzip
......@@ -19,10 +20,20 @@ from functools import partial
from xml.etree import ElementTree as etree
import setuptools
from distutils import log
from distutils.command.build import build as _build
from distutils.command.clean import clean as _clean
from setuptools.command.sdist import sdist as _sdist
from tqdm import tqdm
from pyhmmer.plan7 import HMMFile
try:
from tqdm import tqdm
except ImportError as err:
tqdm = err
try:
from pyhmmer.plan7 import HMMFile
except ImportError as err:
HMMFile = err
class sdist(_sdist):
......@@ -65,6 +76,10 @@ class update_model(setuptools.Command):
self.announce(msg, level=2)
def run(self):
# Check `tqdm` is installed
if isinstance(tqdm, ImportError):
raise RuntimeError("tqdm is required to run the `update_model` command") from tqdm
# Copy the file to the new in-source location and compute its hash.
hasher = hashlib.md5()
self.info("Copying the trained CRF model to the in-source location")
......@@ -133,11 +148,19 @@ class build_data(setuptools.Command):
def run(self):
self.mkpath(self.build_lib)
# Check `tqdm` and `pyhmmer` are installed
if isinstance(HMMFile, ImportError):
raise RuntimeError("pyhmmer is required to run the `build_data` command") from HMMFile
if isinstance(tqdm, ImportError):
raise RuntimeError("tqdm is required to run the `build_data` command") from tqdm
# Load domain whitelist from the type classifier data
domains_file = os.path.join("gecco", "types", "domains.tsv")
self.info("loading domain accesssions from {}".format(domains_file))
with open(domains_file, "rb") as f:
domains = [line.strip() for line in f]
# Download and binarize required HMMs
for in_ in glob.iglob(os.path.join("gecco", "hmmer", "*.ini")):
local = os.path.join(self.build_lib, in_).replace(".ini", ".h3m")
self.mkpath(os.path.dirname(local))
......@@ -159,7 +182,7 @@ class build_data(setuptools.Command):
raise
def download_hmm(self, output, domains, options):
base = "https://github.com/althonos/GECCO/releases/download/v{version}/{id}.hmm.gz"
base = "https://github.com/zellerlab/GECCO/releases/download/v{version}/{id}.hmm.gz"
url = base.format(id=options["id"], version=self.distribution.get_version())
# attempt to use the GitHub releases URL, otherwise fallback to official URL
try:
......@@ -168,13 +191,16 @@ class build_data(setuptools.Command):
except urllib.error.HTTPError:
self.announce("using fallback {}".format(options["url"]), level=2)
response = urllib.request.urlopen(options["url"])
# download the HMM
format = dict(
# use `tqdm` to make a progress bar
pbar = tqdm.wrapattr(
response,
"read",
total=int(response.headers["Content-Length"]),
desc=os.path.basename(output),
leave=False,
)
with tqdm.wrapattr(response, "read", **format) as src:
# download the HMM
with pbar as src:
with open(output, "wb") as dst:
nwritten = 0
for hmm in HMMFile(gzip.open(src)):
......@@ -192,11 +218,28 @@ class build(_build):
_build.run(self)
class clean(_clean):
def run(self):
# remove HMM files that have been installed inplace
hmm_dir = os.path.join(os.path.dirname(__file__), "gecco", "hmmer")
for ini in glob.iglob(os.path.join(hmm_dir, "*.ini")):
for ext in [".h3m", ".hmm"]:
path = ini.replace(".ini", ext)
if os.path.exists(path):
log.info("Removing %s", path)
if not self.dry_run:
os.remove(path)
# run the rest of the command as normal
_clean.run(self)
if __name__ == "__main__":
setuptools.setup(
cmdclass={
"build": build,
"build_data": build_data,
"clean": clean,
"sdist": sdist,
"update_model": update_model,
},
......