Commit 1f5b1e87 authored by Martin Larralde

Rewrite the `Training GECCO` guide in the documentation

Training
========

Software Requirements
---------------------

In order to train GECCO, you need to have it installed with the *train*
optional dependencies. This can be done with ``pip``:

.. code-block:: console

    $ pip install gecco-tool[train]

This will install additional Python packages, such as `pandas <https://pandas.pydata.org/>`_,
which is needed to process the feature tables, or `fisher <https://pypi.org/project/fisher/>`_,
which is used to select the most informative domains.
Domain database
---------------

GECCO needs HMM domains to use as features. Installing the ``gecco-tool`` package
will also install a subset of the Pfam database that can be used for making
predictions. However, this subset should not be used for training, since a
different subset of domains may be selected with different training data.
You can get the latest version of Pfam (35.0 in December 2021) from the EMBL-EBI
FTP server:

.. code-block:: console

    $ wget "ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/Pfam-A.hmm.gz" -O Pfam35.hmm.gz

You are also free to get additional HMMs from other databases, such as
`TIGRFAM <https://www.jcvi.org/research/tigrfams>`_ or `PANTHER <http://www.pantherdb.org/>`_,
or even to build your own HMMs, as long as they are in `HMMER <http://hmmer.org/>`_ format.
Training sequences
------------------

Regions in their genomic context
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The easiest case for training GECCO is when you have entire genomes with regions
of interest. In that case, you can directly use these sequences, and you will
only have to prepare a cluster table with the coordinates of each positive region.
Regions from separate datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the event you don't have the genomic context available for your regions of
interest, you will have to provide a "fake" context by embedding the positive
regions into contigs that don't contain any positives.
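As a minimal illustration of this embedding step (plain Python with made-up sequences; this is *not* how GECCO itself builds embeddings), a positive region can be spliced into a negative contig while recording the coordinates needed later for the cluster table:

```python
# Sketch: embed a positive region into the middle of a negative contig
# and record the resulting 1-based coordinates for the cluster table.
# The sequences used here are placeholders.

def embed_region(negative: str, positive: str) -> tuple[str, int, int]:
    """Insert `positive` at the midpoint of `negative`.

    Returns the chimeric sequence and the 1-based inclusive
    start/end coordinates of the embedded region.
    """
    mid = len(negative) // 2
    chimera = negative[:mid] + positive + negative[mid:]
    start = mid + 1              # 1-based inclusive start
    end = mid + len(positive)    # 1-based inclusive end
    return chimera, start, end

contig, start, end = embed_region("A" * 1000, "GATTACA" * 10)
print(start, end)  # coordinates to record in the cluster table
```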
You are free to use sequences coming from anywhere as your *positive* regions.
GECCO was trained to detect Biosynthetic Gene Clusters, so the
`MIBiG <https://mibig.secondarymetabolites.org/>`_ database was used to get
the positives, with some additional filtering to remove redundancy and entries
with invalid annotations. For the negative regions, we used representative
genomes from the `proGenomes <https://progenomes.embl.de/>`_ database, and masked
known BGCs using `antiSMASH <https://antismash.secondarymetabolites.org/>`_.

You also need to have some *negative* regions, such as bacterial genomes, and
make sure they are free of *positives* (e.g., they must not contain any BGCs).
A common approach is to download a set of complete genomes from the
`ENA <https://www.ebi.ac.uk/ena/browser/home>`_ and to remove the regions where
positives are detected (e.g. removing BGCs found by `antiSMASH <https://antismash.secondarymetabolites.org/>`_).
Feature tables
--------------

GECCO does not train on sequences directly, but on feature tables. You can build
the feature table yourself (see below for the expected format), but the easiest
way to obtain a feature table from the sequences is the ``gecco annotate`` subcommand.
To build a table from a collection of nucleotide sequences in ``sequences.fna``
and HMMs in ``Pfam35.hmm.gz``, use:
.. code-block:: console

    $ gecco annotate --genome sequences.fna --hmm Pfam35.hmm.gz -o features.tsv
Use the ``--hmm`` flag with the path to a different HMM file to use other HMMs
than the internal ones from GECCO (a subset of Pfam v33.1 and TIGRFAM v15.0).
Make sure the ``--mibig`` input flag is used when annotating MIBiG sequences, to
ensure the metadata are properly extracted from the sequence ID of each protein
in the input file.

.. hint::

    This step takes a long time; count about 5 minutes to annotate 1M amino
    acids with Pfam alone.

If you have more than one HMM file, you can add additional ``--hmm`` flags
so that all of them are used.
The feature table is a TSV file that looks like this, with one row per domain,
per protein, per sequence:

============ ============== ===== ==== ====== ======= ====== ======== =========== ============ ==========
sequence_id  protein_id     start end  strand domain  hmm    i_evalue pvalue      domain_start domain_end
============ ============== ===== ==== ====== ======= ====== ======== =========== ============ ==========
AFPU01000001 AFPU01000001_1 3     2555 ``+``  PF01573 Pfam35 95.95209 0.00500     2            27
AFPU01000001 AFPU01000001_2 2610  4067 ``-``  PF17032 Pfam35 0.75971  3.961e-05   83           142
AFPU01000001 AFPU01000001_2 2610  4067 ``-``  PF13719 Pfam35 4.89304  0.000255    85           98
...          ...            ...   ...  ...    ...     ...    ...      ...         ...          ...
============ ============== ===== ==== ====== ======= ====== ======== =========== ============ ==========
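Since ``pandas`` is installed with the *train* extras, it is convenient for inspecting such tables. A small sketch, using the example rows above as inline data (a real table would come from ``gecco annotate``):

```python
import io
import pandas as pd

# Sketch: load a feature table and count annotated domains per protein.
# The inline data mirrors the example rows shown above.
tsv = (
    "sequence_id\tprotein_id\tstart\tend\tstrand\tdomain\thmm\t"
    "i_evalue\tpvalue\tdomain_start\tdomain_end\n"
    "AFPU01000001\tAFPU01000001_1\t3\t2555\t+\tPF01573\tPfam35\t95.95209\t0.00500\t2\t27\n"
    "AFPU01000001\tAFPU01000001_2\t2610\t4067\t-\tPF17032\tPfam35\t0.75971\t3.961e-05\t83\t142\n"
    "AFPU01000001\tAFPU01000001_2\t2610\t4067\t-\tPF13719\tPfam35\t4.89304\t0.000255\t85\t98\n"
)
features = pd.read_csv(io.StringIO(tsv), sep="\t")
print(features.groupby("protein_id")["domain"].count())
```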
.. hint::

    If this step takes too long, you can also split the file containing your
    input sequences, process them independently in parallel, and combine the
    results.
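Splitting the input file can be done with any FASTA-aware tool; as a dependency-free Python sketch (records are split on ``>`` headers, and the chunk count is chosen by the user; file handling is left out for brevity):

```python
# Sketch: split a multi-FASTA string into chunks of whole records so each
# chunk can be annotated independently (e.g. with `gecco annotate`) and
# the resulting feature tables concatenated afterwards.

def split_fasta(text: str, n_chunks: int) -> list[str]:
    """Split FASTA text into at most `n_chunks` chunks of whole records."""
    records = [">" + block for block in text.split(">") if block.strip()]
    size = -(-len(records) // n_chunks)  # ceiling division
    return ["".join(records[i:i + size]) for i in range(0, len(records), size)]

chunks = split_fasta(">a\nACGT\n>b\nGGCC\n>c\nTTAA\n", 2)
print(len(chunks))  # 2 chunks: records {a, b} and {c}
```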
Cluster tables
--------------

The cluster table is used to pass additional information to GECCO: the location
of each positive region in the input data, and the type of each region (when it
makes sense). You need to build this table manually, but it should be quite
straightforward.

.. hint::

    If a region has more than one type, use ``;`` to separate the types
    in the type column. For instance, a Polyketide/NRP hybrid cluster can be
    marked with the type ``Polyketide;NRP``.
The cluster table is a TSV file that looks like this, with one row per region:

============ ============ ====== ====== ==========
sequence_id  bgc_id       start  end    type
============ ============ ====== ====== ==========
AFPU01000001 BGC0000001   806243 865563 Polyketide
MTON01000024 BGC0001910   129748 142173 Terpene
...          ...          ...    ...    ...
============ ============ ====== ====== ==========
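Because this table is built by hand, assembling it with ``pandas`` can save some bookkeeping; a sketch with hypothetical identifiers and coordinates:

```python
import pandas as pd

# Sketch: assemble a cluster table manually and save it as TSV.
# The identifiers and coordinates below are placeholders, not real entries.
clusters = pd.DataFrame(
    [
        ("contig_1", "region_1", 806243, 865563, "Polyketide"),
        ("contig_2", "region_2", 129748, 142173, "Polyketide;NRP"),  # hybrid type
    ],
    columns=["sequence_id", "bgc_id", "start", "end", "type"],
)
clusters.to_csv("clusters.tsv", sep="\t", index=False)
```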
.. hint::

    If the concept of "type" makes no sense for the regions you are trying to
    detect, you can omit the ``type`` column entirely. This will effectively
    mark all the regions from the training sequences as "Unknown".

Fitting the model
-----------------

Now that you have everything needed, it's time to train GECCO! Use the
following command to fit the weights of the CRF model and to train the BGC
type classifier:

.. code-block:: console

    $ gecco -vv train --features features.tsv --clusters clusters.tsv -o model

GECCO will create a directory named ``model`` containing all the required files
to make predictions later on.
L1/L2 regularisation
^^^^^^^^^^^^^^^^^^^^

Use the ``--c1`` and ``--c2`` flags to control the weights of the L1 and L2
regularisation, respectively. Both default to *0.15* on the command line;
however, for training GECCO, we disabled L2 regularisation and selected
a value of *0.4* for :math:`C_1` by optimizing on an external validation dataset.
Feature selection
^^^^^^^^^^^^^^^^^

GECCO supports selecting the most informative features from the training dataset
using simple contingency testing for the presence/absence of each domain in
the regions of interest. Reducing the number of features helps the CRF model
achieve better accuracy. It also greatly reduces the time needed to make
predictions by skipping the HMM annotation step for uninformative domains.

Use the ``--select`` flag to select a fraction of the most informative features
before training to reduce the total feature set (for instance, use ``--select 0.3``
to select the 30% of features with the lowest Fisher *p-value*).
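To picture what this selection does, here is a self-contained sketch of a one-sided Fisher exact test on domain presence/absence, implemented from the hypergeometric distribution to avoid extra dependencies. The counts and domain names are made up, and GECCO's actual implementation (via the ``fisher`` package) may differ in detail:

```python
from math import comb

def fisher_enrichment_p(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for enrichment of a domain
    in positive regions. Contingency table:
        a = positives with domain,  b = positives without,
        c = negatives with domain,  d = negatives without.
    """
    n, with_domain, positives = a + b + c + d, a + c, a + b
    total = comb(n, positives)
    return sum(
        comb(with_domain, k) * comb(n - with_domain, positives - k)
        for k in range(a, min(with_domain, positives) + 1)
    ) / total

# Hypothetical counts: domain seen in 8/10 positives but only 1/90 negatives.
p = fisher_enrichment_p(8, 2, 1, 89)

# Keep the 30% of domains with the lowest p-values (like --select 0.3).
pvalues = {"PF01573": p, "PF99999": 0.8, "PF13719": 0.4}  # hypothetical
k = max(1, int(0.3 * len(pvalues)))
selected = sorted(pvalues, key=pvalues.get)[:k]
print(selected)  # → ['PF01573']
```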
The command will output a folder (named ``model`` here) containing all the data
files necessary for predicting probabilities with ``gecco run``:

.. code-block:: console

    $ gecco train --features features.tsv --clusters clusters.tsv -o model --select 0.3

.. hint::

    You will get a warning in case you select a *p-value* threshold that is
    still too high, resulting in non-informative domains being included in the
    selected features.
Predicting with the new model
-----------------------------

To make predictions with a model different from the one embedded in GECCO, you
will need the folder from a previous ``gecco train`` run, as well as the HMMs
used to build the feature tables in the first place:

.. code-block:: console

    $ gecco run --model model --hmm Pfam35.hmm.gz --genome genome.fa -o ./predictions/

Congratulations, you trained GECCO with your own dataset and successfully
used it to make predictions!