......@@ -45,5 +45,5 @@ jobs:
- name: Install planemo
run: pip install -U planemo
- name: Deploy repository to Tool Shed
if: startsWith(github.ref, 'refs/tags/v')
run: planemo shed_update ./galaxy --owner althonos --name gecco -t toolshed -m '${{ github.event.head_commit.message }}' --shed_key '${{ secrets.TOOL_SHED_API_KEY }}'
......@@ -45,14 +45,13 @@ $ conda install -c bioconda gecco
```
This will install GECCO, its dependencies, and the data needed to run
predictions. This requires around 40MB of data to be downloaded, so
it could take some time depending on your Internet connection. Once done,
you will have a ``gecco`` command available in your $PATH.
*Note that GECCO uses [HMMER3](http://hmmer.org/), which can only run
on PowerPC and recent x86-64 machines running a POSIX operating system.
Therefore, Linux and OSX are supported platforms, but GECCO will not be able
to run on Windows.*
Therefore, GECCO will work on Linux and OSX, but not on Windows.*
## 🧬 Running GECCO
......@@ -72,10 +71,30 @@ Additional parameters of interest are:
autodetect the number of CPUs on the machine using
[`os.cpu_count`](https://docs.python.org/3/library/os.html#os.cpu_count).
- `--cds`, controlling the minimum number of consecutive genes a BGC region
must have to be detected by GECCO. The default is *3*.
- `--threshold`, controlling the minimum probability for a gene to be
considered part of a BGC region. Using a lower number will increase the
number (and possibly length) of predictions, but reduce accuracy. The
default of *0.3* was selected to optimize precision/recall on a test set
of 364 BGCs from [MIBiG 2.0](https://mibig.secondarymetabolites.org/).
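As a concrete check of the `--jobs` default: GECCO falls back to the value reported by the standard library, which you can inspect with a one-liner (a sketch; it only assumes `python3` is on your `$PATH`):

```shell
# Print the CPU count that os.cpu_count() reports; this is the value
# GECCO autodetects when --jobs is not passed explicitly.
python3 -c 'import os; print(os.cpu_count())'
```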
## 🔎 Results
GECCO will create the following files:
- `{genome}.features.tsv`: The *features* file, containing the identified
proteins and domains in the input sequences, in tabular format.
- `{genome}.clusters.tsv`: If any were found, a *clusters* file, containing
the coordinates of the predicted clusters along their putative biosynthetic
type, in tabular format.
- `{genome}_cluster_{N}.gbk`: If any were found, a GenBank file per cluster,
containing the cluster sequence annotated with its member proteins and domains.
For a more visual way of exploring the predictions, you can open the
GenBank files in a genome editing software like [UGENE](http://ugene.net/),
or load the results into an AntiSMASH report.
Check the [Integrations](https://gecco.embl.de/integrations.html#antismash) page of the
documentation for a step-by-step guide.
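Since the *clusters* file is plain TSV, it is also easy to post-process on the command line. A minimal sketch, assuming a header row with `start`/`end` coordinates in the third and fourth columns (check the header of your own file, as the exact layout may differ):

```shell
# Build a tiny stand-in clusters file with the assumed layout.
printf 'sequence_id\tbgc_id\tstart\tend\ttype\n' > clusters.tsv
printf 'AFPU01000001\tcluster_1\t806243\t865563\tPolyketide\n' >> clusters.tsv

# Keep only clusters longer than 10 kb, printed as "sequence:start-end".
awk -F'\t' 'NR > 1 && ($4 - $3) > 10000 { printf "%s:%s-%s\n", $1, $3, $4 }' clusters.tsv
# -> AFPU01000001:806243-865563
```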
## 🔖 Reference
......
......@@ -78,8 +78,8 @@ Or with Conda, using the `bioconda` channel:
Predictions
^^^^^^^^^^^
GECCO works with DNA sequences, and loads them using `Biopython <http://biopython.org/>`_,
allowing it to support a `large variety of formats <https://biopython.org/wiki/SeqIO>`_,
including the common FASTA and GenBank files.
Run a prediction on a FASTA file named ``sequence.fna`` and output the
......@@ -170,8 +170,8 @@ the software contains the complete license text.
About
-----
GECCO is developed by the `Zeller Team <https://www.embl.de/research/units/scb/zeller/index.html>`_
at the `European Molecular Biology Laboratory <https://www.embl.de/>`_ in Heidelberg. The following
individuals contributed to the development of GECCO:
- `Laura M. Carroll <https://github.com/lmc297>`_
......
......@@ -3,19 +3,22 @@ Installation
Requirements
------------
GECCO requires additional libraries that can be installed directly from PyPI,
the Python Package Index. Unlike other tools in the field
(such as DeepBGC or AntiSMASH), it does not require any external binary.
Installing GECCO locally
------------------------
PyPi
^^^^
The GECCO source is hosted on the EMBL Git server and mirrored on GitHub, but
the easiest way to install it is to download the latest release from
`PyPi <https://pypi.python.org/pypi/gecco>`_ with ``pip``.
.. code:: console
......@@ -27,36 +30,46 @@ Conda
GECCO is also available as a `recipe <https://anaconda.org/bioconda/GECCO>`_
in the `Bioconda <https://bioconda.github.io/>`_ channel. To install, simply
use the ``conda`` installer:
.. code:: console
$ conda install -c bioconda GECCO
.. Git + ``pip``
.. ^^^^^^^^^^^^^
..
.. If, for any reason, you prefer to download the library from the git repository,
.. you can clone the repository and install the repository by running:
..
.. .. code:: console
..
.. $ pip install https://github.com/zellerlab/GECCO/archive/master.zip
..
..
.. Keep in mind this will install the development version of the library, so not
.. everything may work as expected compared to a stable versioned release.
..
..
.. GitHub + ``setuptools``
.. ^^^^^^^^^^^^^^^^^^^^^^^
..
.. If you do not have ``pip`` installed, you can do the following (after
.. having properly installed all the dependencies):
..
.. .. code:: console
..
.. $ git clone https://github.com/zellerlab/GECCO/
.. $ cd GECCO
.. # python setup.py install
Using GECCO in Galaxy
---------------------
GECCO is available as a Galaxy tool in the `Toolshed <https://toolshed.g2.bx.psu.edu/>`_.
It can be used directly on the `Galaxy Europe server <https://usegalaxy.eu/>`_. To
add it to your local Galaxy server, get in touch with the admin and ask them
to add the `Toolshed repository for GECCO <https://toolshed.g2.bx.psu.edu/view/althonos/gecco/88dc16b4f583>`_
to the server tools.
Training
========
Software Requirements
---------------------
In order to train GECCO, you need to have it installed with the *train* optional dependencies.
This can be done with ``pip``:
.. code-block:: console
......@@ -13,77 +14,180 @@ dependencies:
This will install additional Python packages, such as `pandas <https://pandas.pydata.org/>`_
which is needed to process the feature tables, or `fisher <https://pypi.org/project/fisher/>`_
which is used to select the most informative domains.
Domain database
---------------
GECCO needs HMM domains to use as features. Installing the ``gecco-tool`` package
will also install a subset of the Pfam database that can be used for making the
predictions. However, this subset should not be used for training, since a
different subset of domains may be selected with different training data.
You can get the latest version of Pfam (35.0 in December 2021) from the EBI
FTP server:
.. code::
$ wget "ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/Pfam-A.hmm.gz" -O Pfam35.hmm.gz
You are also free to get additional HMMs from other databases, such as
`TIGRFAM <https://www.jcvi.org/research/tigrfams>`_ or `PANTHER <http://www.pantherdb.org/>`_,
or even to build your own HMMs, as long as they are in `HMMER <http://hmmer.org/>`_ format.
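Once downloaded, it is worth checking that the archive is intact and counting the profiles it contains; every profile in an HMMER3 file begins with a ``HMMER3/`` header line. The snippet below demonstrates this on a tiny stand-in file rather than the full Pfam download:

```shell
# Sanity-check an HMM archive: verify the gzip stream and count profiles.
# Demonstrated on a tiny stand-in file; every HMMER3 profile starts with
# a "HMMER3/" header line and ends with "//".
printf 'HMMER3/f [3.1b2 | February 2015]\nNAME  PF00001\n//\n'  > mini.hmm
printf 'HMMER3/f [3.1b2 | February 2015]\nNAME  PF00002\n//\n' >> mini.hmm
gzip -f mini.hmm                    # produces mini.hmm.gz, as with Pfam35.hmm.gz
gzip -t mini.hmm.gz && echo "archive OK"
gunzip -c mini.hmm.gz | grep -c '^HMMER3/'   # -> 2
```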
Training sequences
------------------
Resources
^^^^^^^^^
You are free to use sequences coming from anywhere as your *positive* regions.
GECCO was trained to detect Biosynthetic Gene Clusters, so the
`MIBiG <https://mibig.secondarymetabolites.org/>`_ database was used to get
the positives, with some additional filtering to remove redundancy and entries
with invalid annotations. For the negative regions, we used representative
genomes from the `proGenomes <https://progenomes.embl.de/>`_ database, and masked
known BGCs using `antiSMASH <https://antismash.secondarymetabolites.org/>`_.
Feature tables
--------------
GECCO does not train on sequences directly, but on feature tables. You can build
the feature table yourself (see below for the expected format), but the easiest
way to obtain a feature table from the sequences is the ``gecco annotate`` subcommand.
To build a table from a collection of nucleotide sequences in ``sequences.fna``
and HMMs in ``Pfam35.hmm.gz``, use:
.. code-block:: console
$ gecco annotate --genome sequences.fna --hmm Pfam35.hmm.gz -o features.tsv
If you have more than one HMM file, you can add additional ``--hmm`` flags
so that all of them are used.
The feature table is a TSV file that looks like this, with one row per domain,
per protein, per sequence:
============ ============== ===== ==== ====== ======= ====== ======== =========== ============ ==========
sequence_id protein_id start end strand domain hmm i_evalue pvalue domain_start domain_end
============ ============== ===== ==== ====== ======= ====== ======== =========== ============ ==========
AFPU01000001 AFPU01000001_1 3 2555 ``+`` PF01573 Pfam35 95.95209 0.00500 2 27
AFPU01000001 AFPU01000001_2 2610 4067 ``-`` PF17032 Pfam35 0.75971 3.961e-05 83 142
AFPU01000001 AFPU01000001_2 2610 4067 ``-`` PF13719 Pfam35 4.89304 0.000255 85 98
... ... ... ... ... ... ... ... ... ... ...
============ ============== ===== ==== ====== ======= ====== ======== =========== ============ ==========
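For a quick look at such a table, standard Unix tools suffice. A sketch on a trimmed-down version of the table above (only the first columns are reproduced here; the real file has more):

```shell
# Recreate a trimmed version of the feature table above.
printf 'sequence_id\tprotein_id\tstart\tend\tstrand\tdomain\n'    > features.tsv
printf 'AFPU01000001\tAFPU01000001_1\t3\t2555\t+\tPF01573\n'     >> features.tsv
printf 'AFPU01000001\tAFPU01000001_2\t2610\t4067\t-\tPF17032\n'  >> features.tsv
printf 'AFPU01000001\tAFPU01000001_2\t2610\t4067\t-\tPF13719\n'  >> features.tsv

# Count how many domains were annotated on each protein.
cut -f2 features.tsv | tail -n +2 | sort | uniq -c
```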
.. hint::
If this step takes too long, you can also split the file containing your
input sequences, process them independently in parallel, and combine the
result.
Cluster tables
--------------
The cluster table is used to provide additional information to GECCO: the location
of each positive region in the input data, and the type of each region (when
applicable). You need to build this table manually, but it should be quite straightforward.
.. hint::
    If a region has more than one type, use ``;`` to separate the two types
in the type column. For instance, a Polyketide/NRP hybrid cluster can be
marked with the type ``Polyketide;NRP``.
The cluster table is a TSV file that looks like this, with one row per region:
============ ============ ====== ====== ==========
sequence_id bgc_id start end type
============ ============ ====== ====== ==========
AFPU01000001 BGC0000001 806243 865563 Polyketide
MTON01000024 BGC0001910 129748 142173 Terpene
... ... ... ... ...
============ ============ ====== ====== ==========
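Because this table is built by hand, a quick consistency check before training can save a failed run. A sketch, assuming the column layout above with `start` and `end` in the third and fourth columns:

```shell
# Recreate the example cluster table from above.
printf 'sequence_id\tbgc_id\tstart\tend\ttype\n'                > clusters.tsv
printf 'AFPU01000001\tBGC0000001\t806243\t865563\tPolyketide\n' >> clusters.tsv
printf 'MTON01000024\tBGC0001910\t129748\t142173\tTerpene\n'    >> clusters.tsv

# Flag any region whose start is not strictly before its end.
awk -F'\t' 'NR > 1 && $3 + 0 >= $4 + 0 { print "bad region: " $2; bad = 1 }
            END { exit bad }' clusters.tsv && echo "cluster table OK"
```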
.. hint::
If the concept of "type" makes no sense for the regions you are trying to
detect, you can omit the ``type`` column entirely. This will effectively
mark all the regions from the training sequences as "Unknown".
Fitting the model
-----------------
Now that you have everything needed, it's time to train GECCO! Use the
following method to fit the CRF model and the type classifier:
.. code-block:: console
$ gecco -vv train --features features.tsv --clusters clusters.tsv -o model
GECCO will create a directory named ``model`` containing all the required files
to make predictions later on.
L1/L2 regularisation
^^^^^^^^^^^^^^^^^^^^
Use the ``--c1`` and ``--c2`` flags to control the weight for the L1 and L2
regularisation, respectively. The command line defaults to *0.15* and *0.15*;
however, for training GECCO, we disabled L2 regularisation and selected
a value of *0.4* for :math:`C_1` by optimizing on an external validation dataset.
Feature selection
^^^^^^^^^^^^^^^^^
GECCO supports selecting the most informative features from the training dataset
using simple contingency testing for the presence/absence of each domain in
the regions of interest. Reducing the number of features helps the CRF model
achieve better accuracy. It also greatly reduces the time needed to make
predictions by skipping the HMM annotation step for uninformative domains.
Use the ``--select`` flag to keep only a fraction of the most informative
features before training (for instance, ``--select 0.3`` keeps the 30% of
features with the lowest Fisher *p-values*).
The command will output a folder (named ``model`` here) containing all the data
files necessary for predicting probabilities with ``gecco run``:
.. code-block:: console
$ gecco train --features features.tsv --clusters clusters.tsv -o model --select 0.3
.. hint::
    You will get a warning if you select a *p-value* threshold that is still
    too high, resulting in non-informative domains being included in the
    selected features.
Predicting with the new model
-----------------------------
To make predictions with a model different from the one embedded in GECCO, you
will need the folder from a previous ``gecco train`` run, as well as the HMMs
used to build the feature tables in the first place.
.. code-block:: console
$ gecco run --model model --hmm Pfam35.hmm.gz --genome genome.fa -o ./predictions/
Congratulations, you trained GECCO with your own dataset, and successfully
used it to make predictions!
tool_test_output.html
tool_test_output.json
../CHANGELOG.md
<?xml version='1.0' encoding='utf-8'?>
<tool id="gecco" name="GECCO" version="0.8.5" python_template_version="3.5">
<description>is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).</description>
<requirements>
<requirement type="package" version="0.8.5">gecco</requirement>
</requirements>
......@@ -14,9 +14,9 @@
#end if
ln -s '$input' input_tempfile.$file_extension &&
gecco -vv run
--format $input.ext
--genome input_tempfile.$file_extension
--postproc $postproc
--force-clusters-tsv
#if $cds:
......@@ -29,10 +29,10 @@
--antismash-sideload
#end if
&& mv input_tempfile.features.tsv '$features'
&& mv input_tempfile.clusters.tsv '$clusters'
#if $antismash_sideload
&& mv input_tempfile.sideload.json '$sideload'
#end if
]]></command>
......@@ -62,7 +62,7 @@
<output name="features" file="features.tsv"/>
<output name="clusters" file="clusters.tsv"/>
<output_collection name="records" type="list">
<element name="BGC0001866.1_cluster_1" file="BGC0001866.1_cluster_1.gbk" ftype="genbank" lines_diff="2"/>
<element name="BGC0001866.1_cluster_1" file="BGC0001866.1_cluster_1.gbk" ftype="genbank" compare="diff" lines_diff="4"/>
</output_collection>
</test>
<test>
......@@ -72,52 +72,41 @@
<output name="clusters" file="clusters.tsv"/>
<output name="sideload" file="sideload.json"/>
<output_collection name="records" type="list">
<element name="BGC0001866.1_cluster_1" file="BGC0001866.1_cluster_1.gbk" ftype="genbank" lines_diff="2"/>
<element name="BGC0001866.1_cluster_1" file="BGC0001866.1_cluster_1.gbk" ftype="genbank" compare="diff" lines_diff="4"/>
</output_collection>
</test>
</tests>
<help><![CDATA[
Overview
--------
GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).
It is developed in the Zeller group and is part of the suite of computational microbiome analysis tools hosted at EMBL.
Input
-----
GECCO works with DNA sequences, and loads them using Biopython, allowing it to support a large variety of formats, including the common FASTA and GenBank files.
Output
------
GECCO will create the following files once done (using the same prefix as the input file):
- ``features.tsv``: The features file, containing the identified proteins and domains in the input sequences.
- ``clusters.tsv``: If any were found, a clusters file, containing the coordinates of the predicted clusters, along their putative biosynthetic type.
- ``{sequence}_cluster_{N}.gbk``: If any BGCs were found, a GenBank file per cluster, containing the cluster sequence annotated with its member proteins and domains.
Contact
-------
If you have any question about GECCO, if you run into any issue, or if you would like to make a feature request, please create an issue in the
`GitHub repository <https://github.com/zellerlab/gecco>`_. You can also directly contact `Martin Larralde via email <mailto:martin.larralde@embl.de>`_.
If you want to contribute to GECCO, please have a look at the contribution guide first, and feel free to open a pull request on the GitHub repository.
]]></help>
<citations>
<citation type="bibtex">
@article {Carroll2021.05.03.442509,
author = {Carroll, Laura M. and Larralde, Martin and Fleck, Jonas Simon and Ponnudurai, Ruby and Milanese, Alessio and Cappio, Elisa and Zeller, Georg},
title = {Accurate de novo identification of biosynthetic gene clusters with GECCO},
elocation-id = {2021.05.03.442509},
year = {2021},
doi = {10.1101/2021.05.03.442509},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Biosynthetic gene clusters (BGCs) are enticing targets for (meta)genomic mining efforts, as they may encode novel, specialized metabolites with potential uses in medicine and biotechnology. Here, we describe GECCO (GEne Cluster prediction with COnditional random fields; https://gecco.embl.de), a high-precision, scalable method for identifying novel BGCs in (meta)genomic data using conditional random fields (CRFs). Based on an extensive evaluation of de novo BGC prediction, we found GECCO to be more accurate and over 3x faster than a state-of-the-art deep learning approach. When applied to over 12,000 genomes, GECCO identified nearly twice as many BGCs compared to a rule-based approach, while achieving higher accuracy than other machine learning approaches. Introspection of the GECCO CRF revealed that its predictions rely on protein domains with both known and novel associations to secondary metabolism. The method developed here represents a scalable, interpretable machine learning approach, which can identify BGCs de novo with high precision.Competing Interest StatementThe authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2021/05/04/2021.05.03.442509},
eprint = {https://www.biorxiv.org/content/early/2021/05/04/2021.05.03.442509.full.pdf},
journal = {bioRxiv}
}
</citation>
<citation type="doi">10.1101/2021.05.03.442509</citation>
</citations>
</tool>
......@@ -15,7 +15,7 @@ REFERENCE 1
JOURNAL bioRxiv (2021.05.03.442509)
REMARK doi:10.1101/2021.05.03.442509
COMMENT ##GECCO-Data-START##
version :: GECCO v0.8.5
creation_date :: 2021-11-21T16:33:58.470847
biosyn_class :: Polyketide
alkaloid_probability :: 0.0
......
{
"records": [
{
"name": "BGC0001866.1",
"subregions": [
{
"details": {
"alkaloid_probability": "0.000",
"average_p": "0.997",
"max_p": "1.000",
"nrp_probability": "0.140",
"other_probability": "0.000",
"polyketide_probability": "0.980",
"ripp_probability": "0.000",
"saccharide_probability": "0.000",
"terpene_probability": "0.000"
},
"end": 32979,
"label": "Polyketide",
"start": 347
}
]
}
],
"tool": {
"configuration": {
"cds": "3",
"e-filter": "None",
"postproc": "'gecco'",
"threshold": "0.3"
},
"description": "Biosynthetic Gene Cluster prediction with Conditional Random Fields.",
"name": "GECCO",
"version": "0.8.5"
}
}