......@@ -5,7 +5,14 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
## [Unreleased]
[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.9.1-alpha4...master
[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.9.1...master
## [v0.9.1] - 2022-04-05
[v0.9.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.9.1-alpha4...v0.9.1
### Changed
- Make the `genes.tsv` and `features.tsv` tables contain all genes, even when they come from a contig too short to be processed by the CRF sliding window.
- Replaced the `--force-clusters-tsv` flag with a `--force-tsv` flag to force writing TSV tables even when no genes or clusters were found in `gecco run` or `gecco annotate`.
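
For downstream scripts, the practical effect of `--force-tsv` is that a table may exist but contain no rows. Below is a minimal sketch of a reader that tolerates this. It assumes, without confirmation from this changelog, that an empty table is either a zero-byte file or a lone header row; `read_table` and the column names are hypothetical.

```python
import csv
import io


def read_table(handle):
    """Return the rows of a tab-separated table as a list of dicts.

    Works for a table with rows, a header-only table, and a zero-byte file
    (the last two both yield an empty list).
    """
    reader = csv.DictReader(handle, dialect="excel-tab")
    return list(reader)


# Hypothetical column names, for illustration only.
header_only = io.StringIO("sequence_id\tprotein_id\n")
zero_bytes = io.StringIO("")
rows_a = read_table(header_only)
rows_b = read_table(zero_bytes)
```

Both `rows_a` and `rows_b` are empty lists, so a caller can check `if not rows:` without treating the two cases differently.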
## [v0.9.1-alpha4] - 2022-03-31
[v0.9.1-alpha4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.9.1-alpha3...v0.9.1-alpha4
......@@ -15,7 +22,7 @@ Retrain internal model with:
$ python -m gecco -vv train --c1 0.4 --c2 0 --select 0.25 --window-size 20 \
-f mibig-2.0.proG2.Pfam-v35.0.features.tsv \
-c mibig-2.0.proG2.clusters.tsv \
-g GECCO-data/data/embeddings/mibig-2.0.proG2.genes.gff \
-g GECCO-data/data/embeddings/mibig-2.0.proG2.genes.tsv \
-o models/v0.9.1-alpha4
```
......
......@@ -96,11 +96,14 @@ Output
GECCO will create the following files once done (using the same prefix as
the input file):
- ``{sequence}.features.tsv``: The *features* file, containing the identified
proteins and domains in the input sequences.
- ``{sequence}.clusters.tsv``: If any were found, a *clusters* file, containing
the coordinates of the predicted clusters, along their putative biosynthetic
type.
- ``{sequence}.genes.tsv``: The *genes* file, containing the genes found by
`Pyrodigal <https://pyrodigal.readthedocs.io>`_ and per-gene BGC probabilities
predicted by the CRF.
- ``{sequence}.features.tsv``: The *features* file, containing the domains
identified in the predicted genes.
- ``{sequence}.clusters.tsv``: If any BGCs were found, a *clusters* file,
containing the coordinates of the predicted clusters, along with their putative
biosynthetic type.
- ``{sequence}_cluster_{N}.gbk``: If any were found, a GenBank file per cluster,
containing the cluster sequence annotated with its member proteins and domains.
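
The naming scheme for the three tables listed above can be sketched as follows. `table_paths` is a hypothetical helper, not part of GECCO, and it assumes the prefix is the input file name with its last extension removed (so a doubly suffixed name such as `x.fna.gz` would need extra handling).

```python
from pathlib import Path


def table_paths(input_file, output_dir="."):
    """Map each table kind to the path where it is expected to be written."""
    # Assumption: the prefix is the input file name minus its last extension.
    prefix = Path(input_file).stem
    return {
        kind: Path(output_dir) / f"{prefix}.{kind}.tsv"
        for kind in ("genes", "features", "clusters")
    }


paths = table_paths("BGC0001866.fna", "out")
```

Here `paths["genes"]` points at `out/BGC0001866.genes.tsv`, which matches the name of the test data file added in this diff.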
......
semantic_version ~=2.8
sphinx ~=2.3
sphinx ~=4.0
sphinx-bootstrap-theme ~=0.7
recommonmark ~=0.6
pygments-style-monokailight ~=0.4
<?xml version='1.0' encoding='utf-8'?>
<tool id="gecco" name="GECCO" version="0.8.10" python_template_version="3.5">
<tool id="gecco" name="GECCO" version="0.9.1" python_template_version="3.5">
<description>is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).</description>
<requirements>
<requirement type="package" version="0.8.10">gecco</requirement>
<requirement type="package" version="0.9.1">gecco</requirement>
</requirements>
<version_command>gecco --version</version_command>
<command detect_errors="aggressive"><![CDATA[
......@@ -18,8 +18,10 @@
--format $input.ext
--genome input_tempfile.$file_extension
--postproc $postproc
--edge-distance $edge_distance
--force-clusters-tsv
--force-tsv
#if $edge_distance
--edge-distance $edge_distance
#end if
#if $mask
--mask
#end if
......@@ -33,6 +35,7 @@
--antismash-sideload
#end if
&& mv input_tempfile.genes.tsv '$genes'
&& mv input_tempfile.features.tsv '$features'
&& mv input_tempfile.clusters.tsv '$clusters'
#if $antismash_sideload
......@@ -49,13 +52,14 @@
<option value="antismash">antiSMASH</option>
<option value="gecco" selected="true">GECCO</option>
</param>
<param argument="--edge-distance" type="integer" min="0" value="10" label="Number of genes from the contig edges to filter out"/>
<param argument="--edge-distance" type="integer" min="0" optional="true" value="" label="Number of genes from the contig edges to filter out"/>
<param argument="--antismash-sideload" type="boolean" checked="false" label="Generate an antiSMASH v6 sideload JSON file"/>
</inputs>
<outputs>
<collection name="records" type="list" label="${tool.name} detected Biosynthetic Gene Clusters on ${on_string} (GenBank)">
<discover_datasets pattern="(?P&lt;designation&gt;.*)\.gbk" ext="genbank" visible="false" />
</collection>
<data name="genes" format="tabular" label="${tool.name} summary of detected genes on ${on_string} (TSV)"/>
<data name="features" format="tabular" label="${tool.name} summary of detected features on ${on_string} (TSV)"/>
<data name="clusters" format="tabular" label="${tool.name} summary of detected BGCs on ${on_string} (TSV)"/>
<data name="sideload" format="json" label="antiSMASH v6 sideload file with ${tool.name} detected BGCs on ${on_string} (JSON)">
......@@ -66,12 +70,14 @@
<test>
<param name="input" value="BGC0001866.fna"/>
<output name="features" file="features.tsv"/>
<output name="genes" file="genes.tsv"/>
<output name="clusters" file="clusters.tsv"/>
</test>
<test>
<param name="input" value="BGC0001866.fna"/>
<param name="edge_distance" value="0"/>
<output name="features" file="features.tsv"/>
<output name="genes" file="genes.tsv"/>
<output name="clusters" file="clusters.tsv"/>
<output_collection name="records" type="list">
<element name="BGC0001866.1_cluster_1" file="BGC0001866.1_cluster_1.gbk" ftype="genbank" compare="diff" lines_diff="4"/>
......@@ -82,6 +88,7 @@
<param name="antismash_sideload" value="True"/>
<param name="edge_distance" value="0"/>
<output name="features" file="features.tsv"/>
<output name="genes" file="genes.tsv"/>
<output name="clusters" file="clusters.tsv"/>
<output name="sideload" file="sideload.json"/>
<output_collection name="records" type="list">
......@@ -107,8 +114,9 @@ Output
GECCO will create the following files once done (using the same prefix as the input file):
- ``features.tsv``: The features file, containing the identified proteins and domains in the input sequences.
- ``clusters.tsv``: If any were found, a clusters file, containing the coordinates of the predicted clusters, along their putative biosynthetic type.
- ``genes.tsv``: The genes file, containing the genes identified in the input sequences.
- ``features.tsv``: The features file, containing the protein domains identified in the input sequences.
- ``clusters.tsv``: A clusters file, containing the coordinates of the predicted clusters, along with their putative biosynthetic type.
- ``{sequence}_cluster_{N}.gbk``: If any BGCs were found, a GenBank file per cluster, containing the cluster sequence annotated with its member proteins and domains.
Contact
......
../../tests/test_cli/data/BGC0001866.genes.tsv
\ No newline at end of file
......@@ -10,4 +10,4 @@ See Also:
__author__ = "Martin Larralde"
__license__ = "GPLv3"
__version__ = "0.9.1-alpha4"
__version__ = "0.9.1"
......@@ -56,6 +56,9 @@ class Annotate(Command): # noqa: D101
Parameters - Output:
-o <out>, --output-dir <out> the directory in which to write the
output files. [default: .]
--force-tsv always write TSV output files even
when they are empty (e.g. because
no genes or no clusters were found).
Parameters - Gene Calling:
......@@ -99,6 +102,7 @@ class Annotate(Command): # noqa: D101
self.hmm = self._check_flag("--hmm", optional=True)
self.output_dir = self._check_flag("--output-dir")
self.mask = self._check_flag("--mask", bool)
self.force_tsv = self._check_flag("--force-tsv", bool)
except InvalidArgument:
raise CommandExit(1)
......@@ -277,6 +281,8 @@ class Annotate(Command): # noqa: D101
if genes:
self.success("Found", "a total of", len(genes), "genes", level=1)
else:
if self.force_tsv:
self._write_feature_table([])
self.warn("No genes were found")
return 0
# annotate domains and write results
......
......@@ -63,8 +63,10 @@ class Run(Annotate): # noqa: D101
output files. [default: .]
--antismash-sideload write an AntiSMASH v6 sideload JSON
file next to the output files.
--force-clusters-tsv always write a ``clusters.tsv`` file
even when no clusters were found.
--force-tsv always write TSV output files even
when they are empty (e.g. because
no genes or no clusters were found).
Parameters - Gene Calling:
-M, --mask Enable unknown region masking to
......@@ -84,7 +86,7 @@ class Run(Annotate): # noqa: D101
valid cluster must contain. [default: 3]
-m <m>, --threshold <m> the probability threshold for cluster
detection. Default depends on the
post-processing method (0.3 for gecco,
post-processing method (0.8 for gecco,
0.6 for antismash).
--postproc <method> the method to use for cluster validation
(antismash or gecco). [default: gecco]
......@@ -122,7 +124,7 @@ class Run(Annotate): # noqa: D101
optional=True,
)
if self.args["--threshold"] is None:
self.threshold = 0.3 if self.args["--postproc"] == "gecco" else 0.6
self.threshold = 0.8 if self.args["--postproc"] == "gecco" else 0.6
else:
self.threshold = self._check_flag("--threshold", float, lambda x: 0 <= x <= 1, hint="number between 0 and 1")
self.jobs = self._check_flag("--jobs", int, lambda x: x >= 0, hint="positive or null integer")
......@@ -134,7 +136,7 @@ class Run(Annotate): # noqa: D101
self.hmm = self._check_flag("--hmm")
self.output_dir = self._check_flag("--output-dir")
self.antismash_sideload = self._check_flag("--antismash-sideload", bool)
self.force_clusters_tsv = self._check_flag("--force-clusters-tsv", bool)
self.force_tsv = self._check_flag("--force-tsv", bool)
self.mask = self._check_flag("--mask", bool)
except InvalidArgument:
raise CommandExit(1)
......@@ -310,6 +312,10 @@ class Run(Annotate): # noqa: D101
if genes:
self.success("Found", "a total of", len(genes), "genes", level=1)
else:
if self.force_tsv:
self._write_genes_table(genes)
self._write_feature_table([])
self._write_cluster_table([])
self.warn("No genes were found")
return 0
# use a whitelist for domain annotation, so that we only annotate
......@@ -326,7 +332,7 @@ class Run(Annotate): # noqa: D101
self.success("Found", len(clusters), "potential gene clusters", level=1)
else:
self.warn("No gene clusters were found")
if self.force_clusters_tsv:
if self.force_tsv:
self._write_cluster_table(clusters)
return 0
# predict types for putative clusters
......
......@@ -167,8 +167,10 @@ class ClusterCRF(object):
if len(feats) < self.window_size:
warnings.warn(
f"Contig {sequence[0].source.id!r} does not contain enough"
f" genes for sliding window of size {self.window_size}"
f" genes ({len(sequence)}) for sliding window of size"
f" {self.window_size}"
)
predicted.extend(sequence)
continue
# predict marginals over a sliding window, storing maximum probabilities
probabilities = numpy.zeros(len(sequence))
......
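
The `ClusterCRF` change above keeps genes from short contigs in the result instead of dropping them. Below is a simplified sketch of that control flow. It uses plain lists and a hypothetical `predict_window` callable in place of the actual CRF, and it is not the real implementation.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

G = TypeVar("G")


def predict_contigs(
    contigs: Iterable[Sequence[G]],
    window_size: int,
    predict_window: Callable[[Sequence[G]], Sequence[G]],
) -> List[G]:
    """Annotate each contig, passing short contigs through unchanged."""
    predicted: List[G] = []
    for genes in contigs:
        if len(genes) < window_size:
            # Too short for the sliding window: keep the genes as they are
            # so they still appear in the output tables.
            predicted.extend(genes)
            continue
        predicted.extend(predict_window(genes))
    return predicted


# Stand-in predictor for illustration: uppercases each gene name.
result = predict_contigs(
    [["a", "b"], ["c", "d", "e"]],
    window_size=3,
    predict_window=lambda genes: [g.upper() for g in genes],
)
```

In this toy run, the two-gene contig is returned untouched while the three-gene contig goes through the predictor.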
......@@ -70,7 +70,7 @@ class ClusterRefiner:
def __init__(
self,
threshold: float = 0.3,
threshold: float = 0.8,
criterion: str = "gecco",
n_cds: int = 5,
n_biopfams: int = 5,
......
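
The default threshold logic touched by this diff (0.8 for the `gecco` criterion, 0.6 for `antismash`, with explicit values restricted to the range 0 to 1) can be summarised in a standalone sketch. `resolve_threshold` is a hypothetical name; the real code lives in the `Run` command and uses `_check_flag`.

```python
from typing import Optional


def resolve_threshold(postproc: str, threshold: Optional[float] = None) -> float:
    """Pick the cluster probability threshold for a post-processing method."""
    if threshold is None:
        # Defaults as of v0.9.1: 0.8 for "gecco", 0.6 otherwise ("antismash").
        return 0.8 if postproc == "gecco" else 0.6
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be a number between 0 and 1")
    return threshold
```

A caller that relied on the previous `gecco` default can still request it explicitly, for example with `resolve_threshold("gecco", 0.3)`.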