spongfold

This repository documents Fabi and Niko's idea of using predicted protein structures and structural similarity to annotate proteomes.

Background

AlphaFold changed how we think about protein structures. By leveraging deep learning, multiple sequence alignments, and the ever-expanding library of solved protein structures, AlphaFold can predict three-dimensional protein structures with an accuracy that often rivals experimentally solved crystal structures, and it has quickly found use in large parts of biological research.

One of the older dogmata in molecular biology concerns proteins, and holds that

Sequence defines structure defines function.

In the past this was taken to mean that sequence similarity above a certain threshold implied homology or even orthology. Sufficient sequence similarity was in turn taken to imply high structural similarity (an inference largely borne out by the success of homology modelling in CASP) as well as functional equivalence. In fact, "bidirectional best BLAST hits" remain the de facto baseline for functional annotation.
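
As an illustration of the "bidirectional best BLAST hits" baseline, here is a minimal sketch. It assumes the hits have already been parsed into `(query, subject, bitscore)` tuples; the function names and the toy identifiers are hypothetical and are not part of this repository.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject.

    `hits` is an iterable of (query, subject, bitscore) tuples.
    Ties are resolved in favour of the hit seen first.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)    # proteome A searched against B
    backward = best_hits(b_vs_a)   # proteome B searched against A
    return sorted(
        (a, b) for a, b in forward.items() if backward.get(b) == a
    )


# Toy data (made up for illustration only).
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 50.0),
          ("a2", "b2", 80.0), ("a3", "b1", 120.0)]
b_vs_a = [("b1", "a1", 198.0), ("b2", "a2", 75.0), ("b2", "a1", 40.0)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy example `a3` finds `b1` as its best hit, but `b1` prefers `a1`, so only `("a1", "b1")` and `("a2", "b2")` are reported.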

The question of the threshold turns out to be an important one. Burkhard Rost famously outlined it as the "twilight zone" of protein similarity; briefly, while we can confidently assert structural similarity (at least to the fold level) for sequences with >30% sequence identity, that confidence deteriorates very quickly as identity drops towards 25%. In the years since, sequence search algorithms with increased sensitivity have exploited the mountains of new sequencing data to dive into the twilight zone and detect remote homology; at low sequence identity, however, sequence information alone is still not enough to guarantee homology.
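
To make the thresholds concrete, the following sketch computes percent identity for a pair of already-aligned sequences and bins the result using the 30% and 25% figures quoted above. It is a simplification under stated assumptions: identity is counted over columns where neither sequence has a gap (other definitions exist), alignment length is ignored, and the zone labels and function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Assumes both strings come from the same pairwise alignment and
    therefore have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def identity_zone(pid):
    """Bin a percent identity using the 30% / 25% figures from the text."""
    if pid > 30.0:
        return "safe"        # structural similarity can be asserted
    if pid >= 25.0:
        return "borderline"  # confidence drops quickly in this range
    return "twilight"        # sequence alone does not guarantee homology


pid = percent_identity("ACDE-F", "ACDQWF")
print(pid, identity_zone(pid))
```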

Structure, however, is more conserved than sequence. In theory, predicted structures can be compared against known structures that are already annotated, allowing functional annotations to be transferred (albeit less specific ones than sequence-based transfer provides, since we will be detecting very remote homology at best). This is of particular interest for non-model organisms, especially ones outside the well-studied taxonomic groups (e.g. vertebrates or ecdysozoans).
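
The annotation transfer described above could look roughly like the sketch below. It assumes a structure search has already produced `(query, target, score)` tuples, where the score is a structural similarity in the range 0 to 1 (a TM-score-like measure, for example), and that annotations for the target structures are available as a dictionary. The function name, the `min_score=0.5` default, and all identifiers in the toy data are assumptions made for illustration; they do not describe the scripts in this repository.

```python
def transfer_annotations(hits, target_annotations, min_score=0.5):
    """Assign each query the annotation of its best structural hit.

    hits: iterable of (query, target, score) tuples.
    target_annotations: dict mapping target id to an annotation string.
    min_score: assumed similarity cutoff; hits below it are ignored.
    """
    best = {}
    for query, target, score in hits:
        if score < min_score or target not in target_annotations:
            continue  # too dissimilar, or nothing to transfer
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {
        query: {
            "source": target,
            "score": score,
            "annotation": target_annotations[target],
        }
        for query, (target, score) in best.items()
    }


# Toy data (made up for illustration only).
hits = [
    ("query_001", "target_A", 0.82),
    ("query_001", "target_B", 0.61),
    ("query_002", "target_C", 0.35),  # below the cutoff
    ("query_003", "target_D", 0.70),  # target has no annotation
]
target_annotations = {
    "target_A": "protein kinase",
    "target_B": "ATPase",
    "target_C": "transporter",
}

for query, result in transfer_annotations(hits, target_annotations).items():
    print(query, result)
```

Keeping the source structure and score next to each transferred annotation makes it possible to filter or re-rank the results later, which matters because annotations transferred over very remote homology are expected to be less specific.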

Usage

(Eventually: a tutorial on the order in which to run the scripts, if we don't have a master script.)

Roadmap

We will write this in as general a way as possible so that it can be reused by others and for other species. Platynereis is an obvious candidate once the approach has been proven on Spongilla.

Authors and acknowledgment

Fabian Ruperti, Niko Papadopoulos, Jacob Musser, Alexandros Pittis

License

MIT, probably?

Project status

ongoing