MorF - a pipeline for structure-based annotation transfer
Sufficient sequence similarity is routinely used to infer that an unknown protein is an ortholog of a well-annotated one, and to transfer structural and functional information to it. In the genomics era, sequencing far outpaces both functional experiments and experimental protein structure determination, making sequence-based annotation transfer a critical component of working with biological data.
Experimentally determined protein structures and functions usually come from one of a few model species (mostly human, mouse, fly, worm, and yeast). For organisms that are phylogenetically distant from these species, the usefulness of sequence-based annotation transfer is severely limited.
Protein structures are better conserved than sequences across long evolutionary distances, and they have a more direct link to protein function. This has long been recognised (hence efforts like SCOP and CATH), but because there was no easy way to obtain large numbers of structures, it was never practical to use protein structure similarity to assess homology in the same way as sequence similarity. Protein structure determination is hard and expensive, and protein structure prediction only worked well enough in homology modelling, where a template structure could be found based on sequence similarity.
AlphaFold changed how we think about protein structures. By leveraging deep learning, multiple sequence alignments, and the ever-expanding library of solved protein structures, AlphaFold can predict three-dimensional protein structures with an accuracy that rivals experimentally solved crystal structures, and it quickly found use in large parts of biological research.
We used ColabFold to predict structures for the proteome of Spongilla lacustris, a freshwater sponge, and annotated them via structural similarity to all available protein structures. Please see the manuscript for more details, or peruse the notebooks to see our analysis.
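To make the idea of annotation transfer concrete, here is a minimal sketch of a best-hit transfer step. It is not the code used in this repository: the `Hit` record, the `transfer_annotations` function, the `max_evalue` cutoff, and all identifiers in the example are hypothetical, and it assumes the structural search has already produced a list of (query, target, E-value) hits.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Hit:
    """One structural-similarity hit (hypothetical record layout)."""
    query: str     # ID of the predicted structure
    target: str    # ID of the matched reference structure
    evalue: float  # lower means a more significant match


def transfer_annotations(
    hits: Iterable[Hit],
    target_annotations: Dict[str, str],
    max_evalue: float = 1e-3,  # illustrative cutoff, not taken from the manuscript
) -> Dict[str, str]:
    """Give each query the annotation of its best-scoring annotated target."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        # Skip weak hits and targets that carry no annotation to transfer.
        if hit.evalue > max_evalue or hit.target not in target_annotations:
            continue
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return {query: target_annotations[hit.target] for query, hit in best.items()}


# Made-up example data, for illustration only.
example_hits = [
    Hit("query_1", "ref_A", 1e-10),
    Hit("query_1", "ref_B", 1e-20),
    Hit("query_2", "ref_A", 0.5),  # above the cutoff, so query_2 stays unannotated
]
example_annotations = {"ref_A": "annotation A", "ref_B": "annotation B"}
transferred = transfer_annotations(example_hits, example_annotations)
```

Queries without any hit below the cutoff are simply absent from the result, which keeps "no confident annotation" distinct from a transferred one.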
Authors and contributions
- Niko Papadopoulos and Fabian Ruperti conceived the project.
- Niko Papadopoulos, Fabian Ruperti, and Jacob Musser designed the project.
- Niko Papadopoulos and Fabian Ruperti performed the main analysis.
- Niko Papadopoulos and Fabian Ruperti performed the additional analysis requested during manuscript revision.
- Milot Mirdita consulted on ColabFold usage, performed additional analysis on novel fold candidates, and consulted during the revision process.
- Martin Steinegger consulted on project design and ColabFold usage, and performed additional analysis on HGT candidates.
- Jacob Musser and Alexandros Pittis consulted on gene naming and phylogenetic assignment.
- Niko Papadopoulos, Fabian Ruperti, Jacob Musser, and Detlev Arendt wrote the manuscript.
License
GPL 3.0
Project status
Directions pursued currently and in the future:
- rework code, modularize, apply to additional (non-UniProt) species
- add sequence profile comparisons (profile-sequence, profile-profile searches)
- confirm existence of HGT candidates in the Spongilla genome, address possible mechanisms
- apply to proteomics data, other cell types