# MorF - a pipeline for structure-based annotation transfer
Sufficient sequence similarity is commonly used to infer that an unknown protein is an ortholog of a
well-annotated one, and to transfer structural and functional information to it. In the genomics
era, sequencing far outpaces both functional experiments and experimental protein structure
determination, making sequence-based annotation transfer a critical component of working with
biological data.

Experimentally determined protein structures and functions usually come from one of a few model
species (mostly human, mouse, fly, worm, yeast). For organisms that are phylogenetically distant
from these models, the usefulness of sequence-based annotation transfer is severely limited.

Protein structure is better conserved across long evolutionary distances than sequence, and has a
more direct link to protein function. This has long been recognised (hence efforts like SCOP and
CATH), but because there was no easy way to obtain large numbers of structures, it was never
practical to use protein structure similarity to assess homology in the same manner as sequence
similarity. Protein structure determination is hard and expensive, and protein structure prediction
only worked well enough in homology modelling, where a template structure could be found based on
sequence similarity.

[AlphaFold](https://www.nature.com/articles/s41586-021-03819-2) changed how we think about protein
structures. By leveraging deep learning, multiple sequence alignments, and the ever-expanding
library of solved protein structures, AlphaFold is able to predict three-dimensional
structures at resolutions that rival solved crystal structures, and has immediately found use in
large parts of biological research.

We used [ColabFold](https://github.com/sokrypton/ColabFold) to predict structures for the proteome
of _Spongilla lacustris_, a freshwater sponge, and annotated them via structural similarity to all
available protein structures. Please see the
[manuscript](https://www.biorxiv.org/content/10.1101/2022.07.05.498892) for more details, or peruse
the notebooks to see our analysis.
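
To make the idea of annotation transfer concrete, here is a minimal Python sketch of a best-hit
strategy. It is an illustration only and not the code used in this repository: the `Hit` record,
the `transfer_annotations` function, the score scale (higher means more similar), and the
`min_score` cutoff are all assumptions introduced for this example, and every identifier in the
usage snippet is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One structural-similarity hit (hypothetical record layout)."""
    query: str    # predicted structure of a protein with unknown function
    target: str   # structure that may carry a known annotation
    score: float  # similarity score; assumed here that higher means more similar


def transfer_annotations(hits, annotations, min_score):
    """Return {query: annotation} taken from each query's best-scoring annotated hit.

    Hits below `min_score`, or whose target has no annotation, are ignored.
    """
    best = {}
    for hit in hits:
        if hit.score < min_score or hit.target not in annotations:
            continue
        current = best.get(hit.query)
        if current is None or hit.score > current.score:
            best[hit.query] = hit
    return {query: annotations[hit.target] for query, hit in best.items()}


# Usage with placeholder identifiers and annotations.
hits = [
    Hit("query_1", "target_a", 0.90),
    Hit("query_1", "target_b", 0.95),
    Hit("query_2", "target_c", 0.20),  # below the cutoff, so query_2 stays unannotated
]
annotations = {
    "target_a": "kinase",
    "target_b": "phosphatase",
    "target_c": "transporter",
}
transferred = transfer_annotations(hits, annotations, min_score=0.5)
print(transferred)
```

Taking only the single best hit keeps the sketch short. A real workflow would also need to decide
how to handle ties, conflicting annotations among high-scoring hits, and which similarity measure
and cutoff are appropriate.
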
## Authors and contributions
- Niko Papadopoulos and Fabian Ruperti conceived the project.
- Niko Papadopoulos, Fabian Ruperti, and Jacob Musser designed the project.
- Niko Papadopoulos and Fabian Ruperti performed the main analysis.
- Niko Papadopoulos and Fabian Ruperti performed the additional analysis requested during manuscript
revision.
- Milot Mirdita consulted on ColabFold usage, performed additional analysis on novel fold
candidates, and consulted during the revision process.
- Martin Steinegger consulted on project design and ColabFold usage, and performed additional
analysis on HGT candidates.
- Jacob Musser and Alexandros Pittis consulted on gene naming and phylogenetic assignment.
- Niko Papadopoulos, Fabian Ruperti, Jacob Musser, and Detlev Arendt wrote the manuscript.