Commits (13)
......@@ -5,7 +5,19 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
## [Unreleased]
[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.3...master
[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.7.0...master
## [v0.7.0] - 2021-05-31
[v0.7.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.3...v0.7.0
### Added
- Support for writing an AntiSMASH sideload JSON file after a `gecco run` workflow.
- Code for converting GenBank files in BiG-SLiCE compatible format with the `gecco convert` subcommand.
- Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE.
### Changed
- Minimum Biopython version to `v1.73` for compatibility with older bioinformatics tooling.
- Internal domain composition shipped in the `gecco.types` module, replaced with a newer composition array obtained directly from MIBiG files.
### Removed
- Outdated notice about `-vvv` verbosity level in the help message of the main `gecco` command.
## [v0.6.3] - 2021-05-10
[v0.6.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.2...v0.6.3
......
......@@ -49,7 +49,7 @@ it could take some time depending on your Internet connection. Once done, you
will have a ``gecco`` command available in your $PATH.
*Note that GECCO uses [HMMER3](http://hmmer.org/), which can only run
on PowerPC and and recent x86-64 machines running a POSIX operating system.
on PowerPC and recent x86-64 machines running a POSIX operating system.
Therefore, Linux and OSX are supported platforms, but GECCO will not be able
to run on Windows.*
......
......@@ -84,7 +84,7 @@ predictions to the current directory:
.. code:: console
$ gecco run --genome sequence.fna
$ gecco -v run --genome sequence.fna
Output
......@@ -141,6 +141,7 @@ Guides
:maxdepth: 1
Installation <install>
Integrations <integrations>
Training <training>
Contributing <contributing>
......
Integrations
============
The files written by GECCO are standard TSV and GenBank files, so they should
be easy to use in downstream analyses. However, some common use-cases are
already covered to reduce the need for custom scripts.
AntiSMASH
^^^^^^^^^
Since ``v0.7.0``, GECCO can natively output JSON files that can be loaded into
the AntiSMASH viewer as *external annotations*. To do so, simply run
your analysis with the ``--antismash-sideload`` option to generate an
additional file:
.. code-block:: console
$ gecco run -g KC188778.1.gbk -o output_dir --antismash-sideload
The output folder will contain an additional JSON file compared to usual
runs:
.. code-block:: console
$ tree output_dir
output_dir
├── KC188778.1_cluster_1.gbk
├── KC188778.1.clusters.tsv
├── KC188778.1.features.tsv
└── KC188778.1.sideload.json
0 directories, 4 files
That JSON file can be loaded into the AntiSMASH result viewer. Check
*Upload extra annotations*, and upload the ``*.sideload.json`` file:
.. image:: /_static/img/integrations/antismash_1.png
When AntiSMASH is done processing your sequences, the Web viewer will display
BGCs found by GECCO as subregions next to the AntiSMASH clusters.
.. image:: /_static/img/integrations/antismash_2.png
GECCO-specific metadata (such as the probability of the predicted type) and
configuration (recording the ``--threshold`` and ``--cds`` values passed to
the ``gecco run`` command) can be seen in the dedicated GECCO tab.
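The sideload file is plain JSON, so it is easy to inspect or post-process. A minimal sketch of its structure, with illustrative values, mirroring the tool metadata and per-record subregions that GECCO records:

```python
import json

# Minimal sketch of the sideload JSON written by
# `gecco run --antismash-sideload`; all values here are illustrative,
# but the field layout (tool metadata plus one subregion per predicted
# cluster) mirrors what GECCO writes.
sideload = {
    "tool": {
        "name": "GECCO",
        "version": "0.7.0",
        "description": "Biosynthetic Gene Cluster prediction with Conditional Random Fields.",
        "configuration": {"cds": "3", "threshold": "0.3"},
    },
    "records": [
        {
            "name": "KC188778.1",
            "subregions": [
                {
                    "start": 1,
                    "end": 42000,
                    "label": "Polyketide",
                    "details": {"average_p": "0.945", "max_p": "0.999"},
                },
            ],
        },
    ],
}

serialized = json.dumps(sideload, sort_keys=True, indent=4)
```

Loading the real ``*.sideload.json`` file with ``json.load`` yields the same top-level ``tool`` and ``records`` keys.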
.. image:: /_static/img/integrations/antismash_3.png
BiG-SLiCE
^^^^^^^^^
GECCO outputs GenBank files that only contain
`standard features <http://www.insdc.org/files/feature_table.html>`_, but
BiG-SLiCE requires additional metadata to load BGCs for analysis.
Since ``v0.7.0``, the ``gecco convert`` subcommand can convert GenBank files
obtained with a typical GECCO run into files that can be loaded by BiG-SLiCE.
Just run the command after ``gecco run`` using the same folder as the input:
.. code-block:: console
$ gecco run -g KY646191.1.gbk -o bigslice_dir/dataset_1/KY646191.1/
$ gecco convert gbk -i bigslice_dir/dataset_1/KY646191.1/ --format bigslice
This will create a new *region* file for each GenBank file, which will be
detected by BiG-SLiCE. Provided you organised the folders in the
`appropriate structure <https://github.com/medema-group/bigslice/wiki/Input-folder>`_,
it should look like this:
.. code-block:: console
$ tree bigslice_dir
bigslice_dir
├── dataset_1
│   └── KC188778.1
│      ├── KC188778.1_cluster_1.gbk
│      ├── KC188778.1.clusters.tsv
│      ├── KC188778.1.features.tsv
│      └── KC188778.1.region1.gbk
├── datasets.tsv
└── taxonomy
└── dataset_1_taxonomy.tsv
3 directories, 6 files
BiG-SLiCE will be able to load and render the BGCs found by GECCO:
.. image:: /_static/img/integrations/bigslice_1.png
.. image:: /_static/img/integrations/bigslice_2.png
.. warning::
Because of the way BiG-SLiCE loads BGCs coming from GECCO, they are always
marked as being *fragmented*.
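Under the hood, besides adding the metadata BiG-SLiCE expects, the conversion renames each ``{id}_cluster_{N}.gbk`` file to ``{id}.region{N}.gbk``, the naming convention BiG-SLiCE scans for. The renaming rule can be sketched as a simple string substitution:

```python
# Sketch of the filename mapping used by `gecco convert gbk --format
# bigslice`: `{id}_cluster_{N}.gbk` becomes `{id}.region{N}.gbk`.
def bigslice_name(gbk_file: str) -> str:
    return gbk_file.replace("_cluster_", ".region")

new_name = bigslice_name("KC188778.1_cluster_1.gbk")
```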
......@@ -9,8 +9,7 @@ dependencies:
.. code-block:: console
$ git clone https://github.com/zellerlab/GECCO
$ pip install ./GECCO[train]
$ pip install gecco-tool[train]
This will install additional Python packages, such as `pandas <https://pandas.pydata.org/>`_,
which is needed to process the feature tables, or `fisher <https://pypi.org/project/fisher>`_
......
......@@ -10,4 +10,4 @@ See Also:
__author__ = "Martin Larralde"
__license__ = "GPLv3"
__version__ = "0.6.3"
__version__ = "0.7.0"
......@@ -88,9 +88,9 @@ class Main(Command):
for a given subcommand.
-q, --quiet silence any output other than errors
(-qq silences everything).
-v, --verbose increase verbosity (-v is minimal,
-vv is verbose, and -vvv shows
debug information).
-v, --verbose increase verbosity (-v is verbose,
-vv is very verbose and makes the
output more suitable for logging).
-V, --version show the program version and exit.
"""
......@@ -165,7 +165,7 @@ class Main(Command):
self.error(
"An unexpected error occurred. Consider opening"
" a new issue on the bug tracker"
" (https://github.com/zellerlab/GECCO/issues/new) if"
" ( https://github.com/zellerlab/GECCO/issues/new ) if"
" it persists, including the traceback below:"
)
traceback = rich.traceback.Traceback.from_exception(type(e), e, e.__traceback__)
......
"""Implementation of the ``gecco convert`` subcommand.
"""
import contextlib
import copy
import errno
import functools
import itertools
import glob
import os
import operator
import multiprocessing
import random
import signal
import typing
from typing import Any, Dict, Union, Optional, List, TextIO, Mapping
from ... import __version__
from ...model import ClusterTable
from ._base import Command, CommandExit, InvalidArgument
from .._utils import patch_showwarnings
class Convert(Command): # noqa: D101
summary = "convert output for compatibility with other tools"
@classmethod
def doc(cls, fast=False): # noqa: D102
return f"""
gecco convert - {cls.summary}
Usage:
gecco convert gbk -i <input> -f <format> [options]
Arguments:
-i <input>, --input-dir <input> the path to the input directory
containing files to convert.
-f <format>, --format <format> the name of the output format to
write.
Parameters:
-o <out>, --output <out> the name of the output file,
default depends on the output
format.
This command helps convert the output files created by
GECCO into other formats, should you want to use the results in
combination with other tools. The supported formats are listed
below:
* ``gecco convert gbk --format=bigslice``: convert and alias the
GenBank files in the given directory so that they can be loaded by
BiG-SLiCE. Output files are named ``*.regionNNN.gbk``.
* ``gecco convert gbk --format=fna``: convert the GenBank files to
FASTA files containing the nucleotide sequence of the cluster.
Output files are named ``*.fna``.
* ``gecco convert gbk --format=faa``: convert the GenBank files to
FASTA files containing the amino-acid sequences of all the proteins
in a cluster. Output files are named ``*.faa``.
"""
_CLUSTERS_FORMATS = set()
_GBK_FORMATS = {"bigslice", "faa", "fna"}
def _check(self) -> typing.Optional[int]:
try:
super()._check()
formats = self._GBK_FORMATS if self.args["gbk"] else self._CLUSTERS_FORMATS
self.format = self._check_flag("--format", str, lambda x: x in formats, hint=", ".join(formats))
self.input_dir = self._check_flag("--input-dir")
except InvalidArgument:
raise CommandExit(1)
# ---
def _convert_gbk_bigslice(self, ctx: contextlib.ExitStack):
import Bio.SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation
# locate original folder
if not os.path.exists(self.input_dir):
self.error("Could not find input folder:", repr(self.input_dir))
raise CommandExit(errno.ENOENT)
# collect `*.clusters.tsv` files
cluster_files = glob.glob(os.path.join(self.input_dir, "*.clusters.tsv"))
unit = "table" if len(cluster_files) == 1 else "tables"
task = self.progress.add_task("Loading", total=len(cluster_files), unit=unit)
# load the original coordinates from the `*.clusters.tsv` files
coordinates = {}
types = {}
for cluster_file in self.progress.track(cluster_files, task_id=task):
cluster_fh = ctx.enter_context(open(cluster_file))
for row in ClusterTable.load(cluster_fh):
ty = ";".join(sorted(ty.name for ty in row.type.unpack()))
coordinates[row.bgc_id] = (row.start, row.end)
types[row.bgc_id] = ty or "Unknown"
# collect `*_clusters_{N}.gbk` files
gbk_files = glob.glob(os.path.join(self.input_dir, "*_cluster_*.gbk"))
unit = "file" if len(gbk_files) == 1 else "files"
task = self.progress.add_task("Converting", total=len(gbk_files), unit=unit)
done = 0
# rewrite GenBank files
for gbk_file in self.progress.track(gbk_files, task_id=task, total=len(gbk_files)):
# load record and ensure it comes from GECCO
record = Bio.SeqIO.read(gbk_file, "genbank")
if "GECCO-Data" not in record.annotations.get('structured_comment', {}):
self.warning(f"GenBank file {gbk_file!r} was not obtained by GECCO")
continue
# mark as AntiSMASH v5
record.annotations['structured_comment']['antiSMASH-Data'] = {
"Version": "5.X",
"Orig. start": coordinates[record.id][0],
"Orig. end": coordinates[record.id][1],
}
# using a /subregion feature will make BiG-SLiCE think the
# BGC is coming from MIBiG, and the predicted type will be
# handled correctly
subregion_feature = SeqFeature(FeatureLocation(0, len(record)), type="subregion")
subregion_feature.qualifiers["contig_edge"] = ["False"]
subregion_feature.qualifiers["aStool"] = ["mibig"]
subregion_feature.qualifiers["label"] = [types[record.id]]
record.features.append(subregion_feature)
# rename the {id}_cluster_{N}.gbk file to {id}.region{N}.gbk
new_name = gbk_file.replace("_cluster_", ".region")
self.info(f"Rewriting {gbk_file!r} to {new_name!r}")
Bio.SeqIO.write(record, new_name, "genbank")
done += 1
self.success(f"Converted {done} GenBank {unit} to BiG-SLiCE format", level=0)
def _convert_gbk_fna(self, ctx: contextlib.ExitStack):
import Bio.SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation
# collect `*_clusters_{N}.gbk` files
gbk_files = glob.glob(os.path.join(self.input_dir, "*_cluster_*.gbk"))
unit = "file" if len(gbk_files) == 1 else "files"
task = self.progress.add_task("Converting", total=len(gbk_files), unit=unit)
done = 0
# rewrite GenBank files
for gbk_file in self.progress.track(gbk_files, task_id=task, total=len(gbk_files)):
# load record and ensure it comes from GECCO
record = Bio.SeqIO.read(gbk_file, "genbank")
if "GECCO-Data" not in record.annotations.get('structured_comment', {}):
self.warning(f"GenBank file {gbk_file!r} was not obtained by GECCO")
continue
# convert to nucleotide FASTA
new_name = gbk_file.replace(".gbk", ".fna")
self.info(f"Converting {gbk_file!r} to FASTA file {new_name!r}")
Bio.SeqIO.write(record, new_name, "fasta")
done += 1
self.success(f"Converted {done} GenBank {unit} to nucleotide FASTA format", level=0)
def _convert_gbk_faa(self, ctx: contextlib.ExitStack):
import Bio.SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
# collect `*_clusters_{N}.gbk` files
gbk_files = glob.glob(os.path.join(self.input_dir, "*_cluster_*.gbk"))
unit = "file" if len(gbk_files) == 1 else "files"
task = self.progress.add_task("Converting", total=len(gbk_files), unit=unit)
done = 0
# rewrite GenBank files
for gbk_file in self.progress.track(gbk_files, task_id=task, total=len(gbk_files)):
# load record and ensure it comes from GECCO
record = Bio.SeqIO.read(gbk_file, "genbank")
if "GECCO-Data" not in record.annotations.get('structured_comment', {}):
self.warning(f"GenBank file {gbk_file!r} was not obtained by GECCO")
continue
# extract proteins
proteins = []
for feat in filter(lambda f: f.type == "CDS", record.features):
if "locus_tag" not in feat.qualifiers or "translation" not in feat.qualifiers:
continue
seq = Seq(feat.qualifiers["translation"][0])
prot = SeqRecord(id=feat.qualifiers["locus_tag"][0], seq=seq)
proteins.append(prot)
# write proteins to FASTA
new_name = gbk_file.replace(".gbk", ".faa")
self.info(f"Converting {gbk_file!r} proteins to FASTA file {new_name!r}")
Bio.SeqIO.write(proteins, new_name, "fasta")
done += 1
self.success(f"Converted {done} GenBank {unit} to protein FASTA format", level=0)
def execute(self, ctx: contextlib.ExitStack) -> int: # noqa: D102
try:
# check the CLI arguments were fine and enter context
self._check()
ctx.enter_context(self.progress)
ctx.enter_context(patch_showwarnings(self._showwarnings))
# run the appropriate method
if self.args["gbk"]:
if self.args["--format"] == "bigslice":
self._convert_gbk_bigslice(ctx)
elif self.args["--format"] == "fna":
self._convert_gbk_fna(ctx)
elif self.args["--format"] == "faa":
self._convert_gbk_faa(ctx)
except CommandExit as cexit:
return cexit.code
except KeyboardInterrupt:
self.error("Interrupted")
return -signal.SIGINT
except Exception:
self.progress.stop()
raise
else:
return 0
......@@ -5,6 +5,7 @@ import contextlib
import errno
import glob
import itertools
import json
import logging
import multiprocessing
import operator
......@@ -18,9 +19,11 @@ from typing import Any, Dict, Union, Optional, List, TextIO, Mapping
import rich.emoji
import rich.progress
from ... import __version__
from ._base import Command, CommandExit, InvalidArgument
from .annotate import Annotate
from .._utils import patch_showwarnings
from ...model import ProductType
class Run(Annotate): # noqa: D101
......@@ -46,12 +49,16 @@ class Run(Annotate): # noqa: D101
Biopython format string. GECCO is able
to recognize FASTA and GenBank files
automatically if this is not given.
-o <out>, --output-dir <out> the directory in which to write the
output files. [default: .]
-j <jobs>, --jobs <jobs> the number of CPUs to use for
multithreading. Use 0 to use all of the
available CPUs. [default: 0]
Parameters - Output:
-o <out>, --output-dir <out> the directory in which to write the
output files. [default: .]
--antismash-sideload write an AntiSMASH v6 sideload JSON
file next to the output files.
Parameters - Domain Annotation:
-e <e>, --e-filter <e> the e-value cutoff for protein domains
to be included. [default: 1e-5]
......@@ -89,6 +96,7 @@ class Run(Annotate): # noqa: D101
self.model = self._check_flag("--model")
self.hmm = self._check_flag("--hmm")
self.output_dir = self._check_flag("--output-dir")
self.antismash_sideload = self._check_flag("--antismash-sideload", bool)
except InvalidArgument:
raise CommandExit(1)
......@@ -105,7 +113,10 @@ class Run(Annotate): # noqa: D101
# Check if output files already exist
base, _ = os.path.splitext(os.path.basename(self.genome))
for ext in ["features.tsv", "clusters.tsv"]:
output_exts = ["features.tsv", "clusters.tsv"]
if self.antismash_sideload:
output_exts.append("sideload.json")
for ext in output_exts:
if os.path.isfile(os.path.join(self.output_dir, f"{base}.{ext}")):
self.warn("Output folder contains files that will be overwritten")
break
......@@ -195,6 +206,55 @@ class Run(Annotate): # noqa: D101
self.info("Writing", f"cluster [bold blue]{cluster.id}[/] to", repr(gbk_out), level=1)
SeqIO.write(cluster.to_seq_record(), gbk_out, "genbank")
def _write_sideload_json(self, clusters):
# record version and important parameters
data = {
"records": [],
"tool": {
"name": "GECCO",
"version": __version__,
"description": "Biosynthetic Gene Cluster prediction with Conditional Random Fields.",
"configuration": {
"cds": repr(self.cds),
"e-filter": repr(self.e_filter),
"postproc": repr(self.postproc),
"threshold": repr(self.threshold),
}
}
}
# record if non-standard HMM or model was used
if self.hmm:
data["tool"]["configuration"]["hmm"] = list(self.hmm)
if self.model:
data["tool"]["configuration"]["model"] = self.model
# create a record per sequence
for seq_id, seq_clusters in itertools.groupby(clusters, key=lambda cluster: cluster.source.id):
data["records"].append({"name": seq_id, "subregions": []})
for cluster in seq_clusters:
ty = ";".join(sorted(ty.name for ty in cluster.type.unpack())) or "Unknown"
data["records"][-1]["subregions"].append({
"start": cluster.start,
"end": cluster.end,
"label": ty,
"details": {
"average_p": f"{cluster.average_probability:.3f}",
"max_p": f"{cluster.maximum_probability:.3f}",
"alkaloid_probability": f"{cluster.type_probabilities.get(ProductType.Alkaloid, 0.0):.3f}",
"polyketide_probability": f"{cluster.type_probabilities.get(ProductType.Polyketide, 0.0):.3f}",
"ripp_probability": f"{cluster.type_probabilities.get(ProductType.RiPP, 0.0):.3f}",
"saccharide_probability": f"{cluster.type_probabilities.get(ProductType.Saccharide, 0.0):.3f}",
"terpene_probability": f"{cluster.type_probabilities.get(ProductType.Terpene, 0.0):.3f}",
"nrp_probability": f"{cluster.type_probabilities.get(ProductType.NRP, 0.0):.3f}",
"other_probability": f"{cluster.type_probabilities.get(ProductType.Other, 0.0):.3f}",
}
})
# write the JSON file to the output folder
base, _ = os.path.splitext(os.path.basename(self.genome))
sideload_out = os.path.join(self.output_dir, f"{base}.sideload.json")
self.info("Writing", "sideload JSON to", repr(sideload_out), level=1)
with open(sideload_out, "w") as out:
json.dump(data, out, sort_keys=True, indent=4)
# ---
def execute(self, ctx: contextlib.ExitStack) -> int: # noqa: D102
......@@ -230,6 +290,8 @@ class Run(Annotate): # noqa: D101
self.info("Writing", "result files to folder", repr(self.output_dir), level=1)
self._write_cluster_table(clusters)
self._write_clusters(clusters)
if self.antismash_sideload:
self._write_sideload_json(clusters)
self.success("Found", len(clusters), "biosynthetic gene clusters", level=0)
except CommandExit as cexit:
return cexit.code
......
......@@ -143,8 +143,9 @@ class PyHMMER(DomainAnnotator):
# add the domain to the protein domains of the right gene
assert domain.env_from < domain.env_to
domain = Domain(accession, domain.env_from, domain.env_to, self.hmm.id, domain.i_evalue, None, qualifiers)
gene_index[target_name].protein.domains.append(domain)
assert domain.i_evalue >= 0
d = Domain(accession, domain.env_from, domain.env_to, self.hmm.id, domain.i_evalue, None, qualifiers)
gene_index[target_name].protein.domains.append(d)
# return the updated list of genes that was given in argument
return list(gene_index.values())
......
......@@ -16,6 +16,7 @@ from collections.abc import Sized
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Mapping, Optional, Sequence, TextIO, NamedTuple, Union, Iterator
import Bio
import numpy
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation, CompoundLocation, Reference
......@@ -164,7 +165,15 @@ class Protein:
"""Convert the protein to a single record.
"""
# FIXME: add domains
return SeqRecord(self.seq, id=self.id, name=self.id)
record = SeqRecord(self.seq, id=self.id, name=self.id)
biopython_version = tuple(map(int, Bio.__version__.split(".")))
if biopython_version < (1, 77):
from Bio import Alphabet
record.seq.alphabet = Alphabet.generic_protein
return record
@dataclass
......@@ -358,18 +367,24 @@ class Cluster:
with patch_locale("C"):
bgc.annotations['date'] = now.strftime("%d-%b-%Y").upper()
biopython_version = tuple(map(int, Bio.__version__.split(".")))
if biopython_version < (1, 77):
from Bio import Alphabet
bgc.seq.alphabet = Alphabet.generic_dna
# add GECCO preprint as a reference
ref = Reference()
ref.title = "Accurate de novo identification of biosynthetic gene clusters with GECCO"
ref.journal = "bioRxiv (2021.05.03.442509)"
ref.comment = "doi:10.1101/2021.05.03.442509"
ref.authors = ", ".join([
"Laura M Carroll",
"Martin Larralde",
"Jonas Simon Fleck",
"Ruby Ponnudurai",
"Alessio Milanese",
"Elisa Cappio Barazzone",
"Laura M Carroll",
"Martin Larralde",
"Jonas Simon Fleck",
"Ruby Ponnudurai",
"Alessio Milanese",
"Elisa Cappio Barazzone",
"Georg Zeller"
])
bgc.annotations.setdefault("references", []).append(ref)
......
......@@ -40,7 +40,7 @@ setup_requires =
pyhmmer ~=0.3.1
wheel >=0.30
install_requires =
biopython ~=1.78
biopython ~=1.73
dataclasses ~=0.8 ; python_version < '3.7'
docopt ~=0.6.2
importlib-metadata >=1.4 ; python_version < '3.8'
......@@ -77,6 +77,7 @@ gecco = _version.txt, py.typed
[options.entry_points]
gecco.cli.commands =
annotate = gecco.cli.commands.annotate:Annotate
convert = gecco.cli.commands.convert:Convert
cv = gecco.cli.commands.cv:Cv [train]
embed = gecco.cli.commands.embed:Embed [train]
run = gecco.cli.commands.run:Run
......