data/pfam/
data/db_config.txt
test/
*.pyc
*.DS_Store
*__pycache__
*.remote-sync.json
conda/
GECCO.egg-info/
dist/
# Created by https://www.gitignore.io/api/python
# Edit at https://www.gitignore.io/?templates=python
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
*.log
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
pyproject.toml
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# Mr Developer
.mr.developer.cfg
.project
.pydevproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# End of https://www.gitignore.io/api/python
*.hmm.gz
stages:
  - test
  - pages
  - lint

variables:
  TERM: ansi

# --- Stage Templates ----------------------------------------------------------

.test: &test
  stage: test
  cache:
    key: ${CI_COMMIT_REF_SLUG}
    paths:
      - ci/cache
  before_script:
    - ./ci/gitlab/test.before_script.sh
  script:
    - python setup.py bdist_wheel
    - pip install -U dist/*.whl
    - python -m coverage run -p -m unittest discover -vv
  after_script:
    - python -m coverage combine
    - python -m coverage report

# --- Stages -------------------------------------------------------------------

test:python3.7:
  image: python:3.7
  <<: *test

test:python3.8:
  image: python:3.8
  <<: *test

docs:
  image: python:3.8
  stage: pages
  before_script:
    - pip install -U -r docs/requirements.txt
  script:
    - sphinx-build -b html docs public
  artifacts:
    paths:
      - public

lint:pydocstyle:
  image: python:3.8
  stage: lint
  before_script:
    - pip install -U pydocstyle
  script:
    - pydocstyle gecco

lint:mypy:
  image: python:3.8
  stage: lint
  allow_failure: true
  before_script:
    - pip install -U mypy
  script:
    - mypy gecco

lint:pylint:
  image: python:3.8
  stage: lint
  allow_failure: true
  before_script:
    - pip install -U pylint
  script:
    - pylint gecco

lint:radon:
  image: python:3.8
  stage: lint
  allow_failure: true
  before_script:
    - pip install -U radon
  script:
    - radon cc -a gecco
include README.md
include LICENSE
include static/gecco.png
include gecco/_version.txt
recursive-include gecco/data *.tsv *.tsv.gz *.model *.md5 *.ini
![](static/gecco.png)
# Hi, I'm GECCO!
## Requirements
* [Python](https://www.python.org/downloads/) 3.6 or higher
* [HMMER](http://hmmer.org/) v3.2 or higher
* [Prodigal](https://github.com/hyattpd/Prodigal) v2.6.3 or higher
Both HMMER and Prodigal can be installed through [conda](https://anaconda.org/).
They have to be in the `$PATH` variable before running GECCO. GECCO also requires
additional Python libraries, but these are normally installed automatically
by `pip` or `conda` when installing GECCO.
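A quick way to verify, before running GECCO, that both external binaries are discoverable (`hmmsearch` ships with HMMER; the check itself is just a suggestion, not part of GECCO):

```shell
# check that the external binaries GECCO calls are on the PATH;
# `command -v` succeeds silently when the binary is found
command -v hmmsearch >/dev/null 2>&1 || echo "HMMER not found in PATH"
command -v prodigal  >/dev/null 2>&1 || echo "Prodigal not found in PATH"
```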
## Installation

To install GECCO from the EMBL git server, run:

```console
$ pip install git+https://git.embl.de/grp-zeller/GECCO/
```

Note that this command can take a long time to complete, as it needs to download
around 250MB of data from the EBI FTP server. You will need write access to the
site-packages folder of your Python installation; if that is not the case, use
`pip` with the `--user` flag to install GECCO to a local folder. Another option
is to use a virtual environment, created either with `virtualenv` or with `conda`.

Once the installation is finished, a `gecco` command will be available in your
`$PATH` automatically.

### Development version

To install the development version, clone the repository and install it from
the local copy:

```console
$ git clone https://git.embl.de/fleck/GECCO.git
$ cd GECCO
$ python setup.py install
```
## Running GECCO

Once the `gecco` command is available in your `$PATH`, you can run it from
anywhere by giving it a FASTA or GenBank file with the genome you want to
analyze, as well as an output directory:

```console
$ gecco run --genome some_genome.fna -o some_output_dir
```
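If you have several genomes to analyze, a plain shell loop over them works fine (the file names and output layout here are hypothetical):

```shell
# run GECCO on every FASTA file in the current directory, writing each
# result to its own subdirectory of results/ (hypothetical layout)
for genome in *.fna; do
    [ -e "$genome" ] || continue   # skip the literal pattern when nothing matches
    gecco run --genome "$genome" -o "results/${genome%.fna}"
done
```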
If you're on a remote server without root access, it might be useful to install GECCO inside a conda environment, as `pip` will otherwise likely be unable to install the Python dependencies.
## Training GECCO
### Resources
For this, you need to get the FASTA file from MIBiG containing all the proteins
of BGCs. It can be downloaded [from the MIBiG website](https://mibig.secondarymetabolites.org/download).
You also need some bacterial genomes free of BGCs, in FASTA format as well. A
common approach is to download a set of complete genomes from the ENA and to
remove the ones where a BGC is detected with either DeepBGC or antiSMASH.
### Build feature tables
With your training sequences ready, first build a feature table using the
`gecco annotate` subcommand:
```console
$ gecco annotate --genome reference_genome.fna -o nobgc
$ gecco annotate --mibig mibig_prot_seqs_xx.faa -o bgc
```
Use the `--hmm` flag to provide other HMMs instead of using the internal ones
(Pfam v31.0 only at the moment). **Make sure to use the `--mibig` input flag
and not `--proteins` when annotating MIBiG sequences, to ensure the additional
metadata are properly extracted from the sequence ID of each protein.**

*This step will probably take a long time: count about 5 minutes to annotate
1M amino acids with Pfam.*
### Create the embedding
When both the BGC and the non-BGC sequences have been annotated, merge them into
a single contiguous table to train the CRF on, using the `gecco embed` subcommand:
```console
$ gecco embed --bgc bgc/*.features.tsv --nobgc nobgc/*.features.tsv -o merged.features.tsv
```
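Conceptually, the embedding interleaves the annotated BGC regions into the non-BGC background, so the CRF sees positive regions in a realistic genomic context. A toy sketch of the idea in plain Python (not GECCO's actual implementation; the domain names and labels are made up):

```python
# Toy sketch: "embed" labeled BGC feature rows into a background of
# non-BGC rows, producing one contiguous training table.
# Each row is (domain, label); label 1 = BGC, 0 = background.

def embed(nobgc_rows, bgc_rows, position):
    """Insert the BGC rows into the background at the given row index."""
    return nobgc_rows[:position] + bgc_rows + nobgc_rows[position:]

background = [("PF00001", 0), ("PF00002", 0), ("PF00003", 0), ("PF00004", 0)]
cluster = [("PF00005", 1), ("PF00006", 1)]

table = embed(background, cluster, position=2)
print([domain for domain, _ in table])
# → ['PF00001', 'PF00002', 'PF00005', 'PF00006', 'PF00003', 'PF00004']
```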
For more info, you could check the wiki... if there was one. So far you're out of luck, because I haven't gotten around to writing one yet ;)
If the non-BGC training set is not large enough to fit all the BGCs, a warning
will be raised.
### Train the model
To train the model with default parameters, use the following command:
```console
$ gecco train -i merged.features.tsv -o model
```
## Housekeeping GECCO
### Versioning
As is good project management practice, we should follow
[semantic versioning](https://semver.org/), so remember the following:

* As long as the model predicts the same thing, retraining/updating the model
  should be considered a non-breaking change, so you should bump the MINOR
  version of the program.
* Upgrading the internal HMMs could potentially change the output but won't
  break the program, so it should also be treated as a non-breaking change,
  and you should bump the MINOR version of the program.
* If the model changes predictions (e.g. the predicted classes change), then you
  should bump the MAJOR version of the program, as it is a breaking change.
* Changes in the code should be treated following semver the usual way.
* Changes in the CLI should be treated as changes in the API (e.g. a new
  CLI option or a new command bumps the MINOR version, removal of an option
  bumps the MAJOR version).
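The policy above can be summarized as a small decision helper; this is purely illustrative (the change-kind names are hypothetical, not anything GECCO defines):

```python
# Illustration of the versioning policy: given the kind of change,
# compute the next semantic version string.

def bump(version, change):
    """Return the next version according to the policy above."""
    major, minor, patch = map(int, version.split("."))
    if change in ("prediction-classes-changed", "cli-option-removed"):
        return f"{major + 1}.0.0"              # breaking change
    elif change in ("model-retrained", "hmms-upgraded", "cli-option-added"):
        return f"{major}.{minor + 1}.0"        # non-breaking change
    else:
        return f"{major}.{minor}.{patch + 1}"  # bug fix

print(bump("1.2.3", "model-retrained"))      # → 1.3.0
print(bump("1.2.3", "cli-option-removed"))   # → 2.0.0
```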
### Upgrading the internal HMMs
To bump the version of the internal HMMs (for instance, to switch to a newer
version of Pfam), you will need to do the following:
- edit the `setup.py` file with the new URL to the HMM file.
- update the signature file in `gecco/data/hmms` with the MD5 checksum of the
  new file (this can be found online or computed locally with the `md5sum`
  command after downloading the file from a trusted source).
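Computing the checksum locally looks like this (the file name is hypothetical; the `if` guard just avoids an error when the file is absent):

```shell
# print only the MD5 digest of the downloaded HMM file, ready to paste
# into the signature file (hypothetical file name)
hmm_file="Pfam-A.hmm.gz"
if [ -e "$hmm_file" ]; then
    md5sum "$hmm_file" | cut -d' ' -f1
fi
```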
### Upgrading the internal CRF model
After having trained a new version of the model, simply run the following command
to update the internal GECCO model as well as the hash signature file:
```console
$ python setup.py update_model --model <path_to_new_crf.model>
```
#!/bin/sh
log() {
    tput bold
    tput setaf 2
    printf "%12s " "$1"
    tput sgr0
    shift 1
    echo "$@"
}

error() {
    tput bold
    tput setaf 1
    printf "%12s " "$1"
    tput sgr0
    shift 1
    echo "$@"
}
#!/bin/sh
set -e
. $(dirname $(dirname $0))/functions.sh
# --- Install software dependencies ------------------------------------------
log Installing executable dependencies with apt
apt update
apt install -y hmmer

log Installing Python dependencies with pip
pip install -U coverage

# --- Install data dependencies ----------------------------------------------

mkdir -p ci/cache
mkdir -p build/lib/gecco/data/hmms

for ini_file in gecco/hmmer/*.ini; do
    url=$(grep "url" "$ini_file" | cut -d'=' -f2 | sed 's/ //g')
    hmm=$(grep "id" "$ini_file" | cut -d'=' -f2 | sed 's/ //g')
    version=$(grep "version" "$ini_file" | cut -d'=' -f2 | sed 's/ //g')
    hmm_file=${ini_file%.ini}.hmm.gz
    cache_file="ci/cache/${hmm}.${version}.hmm.gz"
    if ! [ -e "$cache_file" ]; then
        if [ "$hmm" = "Panther" ]; then
            log Extracting $hmm v$version
            wget "$url" -q -O- \
                | tar xz --wildcards --no-wildcards-match-slash --no-anchored PTHR\*/hmmer.hmm -O \
                | gzip > "$cache_file"
        else
            log Downloading $hmm v$version
            wget "$url" -q -O "$cache_file"
        fi
    else
        log Using cached $hmm v$version
    fi
    mkdir -p "$(dirname "build/lib/${hmm_file}")"
    cp "$cache_file" "build/lib/${hmm_file}"
done
#!/bin/sh
set -e
. $(dirname $(dirname $0))/functions.sh
python setup.py bdist_wheel
pip install -U dist/*.whl
python -m coverage run -p -m unittest discover -vv
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
p {
text-align: justify;
}
/* a.reference strong {
font-weight: bold;
font-size: 90%;
color: #c7254e;
box-sizing: border-box;
font-family: Menlo,Monaco,Consolas,"Courier New",monospace;
} */
.field-list a.reference {
font-weight: bold;
font-style: italic;
font-size: 90%;
color: #c7254e;
box-sizing: border-box;
font-family: Menlo,Monaco,Consolas,"Courier New",monospace;
}
$(document).ready(function() {
(function ($) {
if (window.location.href.match("/api/gecco.*") !== null) {
$(".nav-list")
.children()
.filter("li")
.append("<ul id='apitoc' class='nav nav-list'></ul>");
$( "dt" )
.has( ".sig-name" )
.slice(1)
.each(function( index ) {
var html = (
"<li><a href='#"
+ $( this ).attr("id")
+ "'><code>"
+ $( this ).find(".sig-name").text()
+ "</code></a></li>"
);
$("#apitoc").append(html);
});
} else if (window.location.href.match("/api/warnings*") !== null) {
$(".nav-list")
.children()
.filter("li")
.append("<ul id='apitoc' class='nav nav-list'></ul>");
$( "dt" )
.each(function( index ) {
var html = (
"<li><a href='#"
+ $( this ).attr("id")
+ "'><code>"
+ $( this ).find(".sig-name").text()
+ "</code></a></li>"
);
$("#apitoc").append(html);
});
}
})(window.$jqTheme || window.jQuery);
})
$(document).ready(function() {
(function ($) {
$(".admonition-example")
.removeClass("panel-info")
.addClass("panel-success").end()
})(window.$jqTheme || window.jQuery);
})
{{ name }}
{{ underline }}
.. currentmodule:: {{ module }}
.. autoclass:: {{ name }}()
:members:
:special-members: __init__