Commit fb68c6cc authored by Niko Papadopoulos

updated README to reflect the stage of the project

parent e619cd50
Merge request !2: updated README to reflect the stage of the project
# spongfold
# CoFFE - a pipeline for structure-based annotation transfer
This repository documents the idea of Fabi and Niko to use predicted protein structures and
structure similarity to annotate proteomes.
Sufficient sequence similarity is commonly used to consider an unknown protein an ortholog of a well-annotated one, and to transfer structural and functional information to it. In the genomics era, sequencing far outpaces both functional experiments and experimental protein structure determination, making sequence-based annotation transfer a critical component of working with biological data.
Experimentally determined protein structures and functions usually come from one of a few model species (mostly human, mouse, fly, worm, yeast). For organisms that are phylogenetically distant from these, the usefulness of sequence-based annotation transfer is severely limited.
Protein structures are better conserved than sequences across long evolutionary distances, and they have a more direct link to protein function. This has long been recognised (hence efforts like SCOP and CATH), but since there was no easy way to obtain large numbers of structures, it was never practical to use protein structure similarity to assess homology in the same manner as sequences. Protein structure determination is hard and expensive, and protein structure prediction only worked well enough in homology modelling, where a template structure could be found based on sequence similarity.
## Background
[AlphaFold](https://www.nature.com/articles/s41586-021-03819-2) changed how we think about protein
structures. By leveraging deep learning, multiple sequence alignments, and the ever-expanding
library of solved protein structures, AlphaFold is able to predict three-dimensional protein
structures with an accuracy that rivals experimentally solved crystal structures, and has
immediately found use in large parts of biological research.
One of the older dogmata in molecular biology concerns proteins, and holds that
> Sequence defines structure defines function.
In the past this was taken to mean that sequence similarity, above a certain threshold, correlated
with sequence homology or even orthology. Sufficient sequence similarity was then inferred to mean
high structural similarity (this was mostly proven by the success of homology modelling in CASP) as
well as functional equivalence. In fact, "bidirectional best BLAST hits" is the _de facto_ baseline
for functional annotation.
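The "bidirectional best hits" baseline can be sketched in a few lines. This is a minimal illustration, not the code used in this project: it assumes the hits of each search direction are already available as `(query, target, bitscore)` tuples, and all protein identifiers and scores below are made-up toy values.

```python
from typing import Dict, Iterable, Set, Tuple

Hit = Tuple[str, str, float]  # (query, target, bitscore); layout is an assumption


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Return the highest-scoring target for every query."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Keep pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


# Toy data: hypothetical identifiers and scores, for illustration only.
a_vs_b = [("spo1", "hum1", 250.0), ("spo1", "hum2", 90.0), ("spo2", "hum2", 60.0)]
b_vs_a = [("hum1", "spo1", 245.0), ("hum2", "spo1", 95.0), ("hum2", "spo2", 55.0)]
rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)
```

In the toy data, `spo2` finds `hum2` as its best hit, but `hum2` prefers `spo1`, so only the `spo1`/`hum1` pair survives the reciprocity check.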
The question of the threshold turns out to be an important one. Burkhard Rost famously outlined it
as the ["twilight zone"](https://pubmed.ncbi.nlm.nih.gov/10195279/) of protein similarity; briefly,
while we can confidently assert structural similarity (at least to the fold level) for sequences
with >30% sequence identity, this confidence deteriorates very quickly once sequence identity drops to around 25%. In the
years since this pronouncement, sequence search algorithms with increased sensitivity exploited the
mountains of new sequencing data to dive into the twilight zone and detect remote homology; however,
at low sequence identity, the sequence information alone is still not enough to guarantee homology.
Structure, however, is more conserved than sequence. In theory, predicted structures can be compared
against known structures that are otherwise annotated, allowing for the transfer of functional
annotations (albeit less specific than sequence-based ones, since we will be detecting very remote
homology at best). This is of particular interest for non-model organisms, especially ones outside
the well-studied taxonomic groups (e.g. vertebrates or ecdysozoans).
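The annotation-transfer idea described above could look roughly like the following sketch. It is not the project's pipeline: it assumes the structural search results come as a 12-column, BLAST-style tab-separated table (the column order in `COLUMNS` is an assumption), that target annotations are available as a plain dictionary, and that an e-value cutoff of `1e-3` is acceptable. The cutoff, the function name `transfer_annotations`, and all identifiers in the toy data are hypothetical.

```python
import csv
import io

# Assumed column order of a 12-column, BLAST-style tabular hit table.
COLUMNS = ["query", "target", "fident", "alnlen", "mismatch", "gapopen",
           "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def transfer_annotations(table, annotations, max_evalue=1e-3):
    """Map each query to the annotation of its best-scoring annotated target."""
    best = {}  # query -> (target, bitscore)
    for row in csv.DictReader(table, fieldnames=COLUMNS, delimiter="\t"):
        evalue, bits = float(row["evalue"]), float(row["bits"])
        # Skip weak hits and targets that carry no annotation.
        if evalue > max_evalue or row["target"] not in annotations:
            continue
        if row["query"] not in best or bits > best[row["query"]][1]:
            best[row["query"]] = (row["target"], bits)
    return {query: annotations[target] for query, (target, _) in best.items()}


# Toy data: hypothetical identifiers, scores, and annotations.
toy_hits = io.StringIO(
    "q1\tt1\t0.21\t180\t140\t2\t1\t180\t5\t186\t1e-12\t310\n"
    "q1\tt2\t0.35\t90\t58\t0\t10\t99\t1\t90\t1e-05\t120\n"
    "q2\tt3\t0.18\t60\t49\t1\t1\t60\t1\t61\t0.5\t40\n"
)
toy_annotations = {
    "t1": "hypothetical annotation A",
    "t2": "hypothetical annotation B",
    "t3": "hypothetical annotation C",
}
transferred = transfer_annotations(toy_hits, toy_annotations)
print(transferred)
```

Here `q1` inherits the annotation of `t1`, its highest-scoring hit, while `q2` stays unannotated because its only hit fails the e-value cutoff. Leaving a protein unannotated rather than transferring from a weak hit is a deliberate choice in this sketch.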
## Follow-up ideas
Having a phylome, scRNAseq/cell type annotation, functional proteomic data and predicted
protein structures for a non-bilaterian metazoan is a unique combination that would allow us to
ask many fundamental questions. Potential follow-up analyses include:
- Correlation between AF prediction accuracy (overall, domain specific, etc.) and sequence
identity/similarity or bitscore of best FoldSeek hit. I.e.: "Does higher sequence identity mean
better prediction accuracy?"
- Relationship between homologs identified through sequence search (OrthoFinder, eggNOG-mapper,
BLAST, phylome) and best hits in FoldSeek for single sponge proteins. I.e.: "Do the best AF hits
also include proteins identified as homologs in the phylome? Is there a sequence identity
threshold to that?"
- Is there biological meaning to the best FoldSeek hits of un-annotated, highly expressed genes in the
scRNAseq dataset or differentially regulated hits in the functional proteomic datasets? I.e.:
"Can we transfer function to / functionally annotate previously un-annotated hits in scRNAseq and
proteomics data and, most importantly, do these hits make sense when taking prior knowledge into
account?"
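
The first question in the list above could be approached with a rank correlation. The sketch below is only an illustration under stated assumptions: it uses the mean pLDDT per protein as a proxy for AlphaFold prediction accuracy and the bitscore of the best FoldSeek hit as the similarity measure, implements Spearman's rho in plain Python, and runs on made-up toy numbers rather than project data.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values (hypothetical): mean pLDDT and best-hit bitscore per protein.
mean_plddt = [92.1, 85.4, 60.3, 45.0, 71.8]
best_bits = [410.0, 300.0, 120.0, 80.0, 150.0]
rho = spearman(mean_plddt, best_bits)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based measure is used here because pLDDT and bitscores are on very different scales and need not be linearly related; whether that is the right choice for the real data is an open question.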
## Usage
(eventually a tutorial on what order to use the scripts in, if we don't have a master script).
## Roadmap
We will write this in as general a way as possible so that it can be reused by others/for other
species. _Platynereis_ is an obvious candidate once we have proven the approach on _Spongilla_.
We used AlphaFold to predict structures for the proteome of _Spongilla lacustris_, a freshwater sponge, and annotated them via structural similarity to all available protein structures. Please see the [manuscript](https://www.biorxiv.org/content/10.1101/2022.07.05.498892v2) for more details, or peruse the notebooks to see our analysis.
## Authors and contributions
- Niko Papadopoulos and Fabian Ruperti conceived the project.
- Niko Papadopoulos, Fabian Ruperti, and Jacob Musser designed the project.
- Milot Mirdita and Martin Steinegger consulted on ColabFold usage.
- Niko Papadopoulos and Fabian Ruperti performed the main analysis.
- Milot Mirdita consulted on ColabFold usage, performed additional analysis on novel fold candidates.
- Martin Steinegger consulted on ColabFold usage, performed additional analysis on HGT candidates.
- Jacob Musser and Alexandros Pittis consulted on gene naming and phylogenetic assignment.
- Niko Papadopoulos, Fabian Ruperti, Jacob Musser, and Detlev Arendt wrote the manuscript.
## License
MIT, probably?