diff --git a/README.md b/README.md
index 8cfc6b74f1298a7179b4a6fdaf37f2ae28cc16d8..e57792bdf2a19d2cb3dbcc62c109bbb5a58e70ff 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,22 @@
-# CoFFE - a pipeline for structure-based annotation transfer
-
-Sufficient sequence similarity is used to consider an unknown protein an ortholog of a well-annotated one, and transfer structural and functional information to it. In the genomics era sequencing far outpaces functional experiments as well as experimental protein structure determination, making sequence-based annotation transfer a critical component of working with biological data.
-
-Experimentally determined protein structures and functions usually come from one of few model species (mostly human, mouse, fly, worm, yeast). For organisms that are phyletically distant, the usefulness of sequence-based annotation transfer is severely limited.
-
-Something that is conserved better across long evolutionary distances are protein structures, which have a more direct link to protein function. This is something that is long recognised (hence efforts like SCOP and CATH), but since there was no easy way to obtain lots of structures it was never practical to use protein structure similarity to assess homology in the same manner as sequences. Protein structure determination is hard and expensive, and protein structure prediction only worked well enough in homology modelling, where it was possible to find a template structure based on sequence similarity.
+# MorF - a pipeline for structure-based annotation transfer
+
+Sufficient sequence similarity is used to consider an unknown protein an ortholog of a
+well-annotated one, and transfer structural and functional information to it. In the genomics era
+sequencing far outpaces functional experiments as well as experimental protein structure
+determination, making sequence-based annotation transfer a critical component of working with
+biological data.
+
+Experimentally determined protein structures and functions usually come from one of few model
+species (mostly human, mouse, fly, worm, yeast). For organisms that are phyletically distant, the
+usefulness of sequence-based annotation transfer is severely limited.
+
+Something that is conserved better across long evolutionary distances are protein structures, which
+have a more direct link to protein function. This is something that is long recognised (hence
+efforts like SCOP and CATH), but since there was no easy way to obtain lots of structures it was
+never practical to use protein structure similarity to assess homology in the same manner as
+sequences. Protein structure determination is hard and expensive, and protein structure prediction
+only worked well enough in homology modelling, where it was possible to find a template structure
+based on sequence similarity.
 
 [AlphaFold](https://www.nature.com/articles/s41586-021-03819-2) changed how we think about protein
 structures. By leveraging deep learning, multiple sequence alignments, and the ever-expanding
@@ -12,15 +24,23 @@ library of solved protein structures, AlphaFold is able to predict three-dimensi
 structures at resolutions that rival solved crystal structures, and has immediately found use in
 large parts of biological research.
 
-We used AlphaFold to predict structures for the proteome of _Spongilla lacustris_, a freshwater sponge, and annotated them via structural similarity to all available protein structures. Please consider the [manuscript](https://www.biorxiv.org/content/10.1101/2022.07.05.498892v2) for more details, or peruse the notebooks to see our analysis.
+We used [colabfold](https://github.com/sokrypton/ColabFold) to predict structures for the proteome
+of _Spongilla lacustris_, a freshwater sponge, and annotated them via structural similarity to all
+available protein structures. Please consider the
+[manuscript](https://www.biorxiv.org/content/10.1101/2022.07.05.498892) for more details, or peruse
+the notebooks to see our analysis.
 
 ## Authors and contributions
 
 - Niko Papadopoulos and Fabian Ruperti conceived the project.
 - Niko Papadopoulos, Fabian Ruperti, and Jacob Musser designed the project.
-- Niko Papadopoulos and Fabian Ruperti performed the main analysis
-- Milot Mirdita consulted on ColabFold usage, performed additional analysis on novel fold candidates.
-- Martin Steinegger consulted on ColabFold usage, performed additional analysis on HGT candidates.
+- Niko Papadopoulos and Fabian Ruperti performed the main analysis.
+- Niko Papadopoulos and Fabian Ruperti performed the additional analysis requested during manuscript
+  revision.
+- Milot Mirdita consulted on ColabFold usage, performed additional analysis on novel fold
+  candidates and consulted during the revision process.
+- Martin Steinegger consulted on project design and ColabFold usage, and performed additional
+  analysis on HGT candidates.
 - Jacob Musser and Alexandros Pittis consulted on gene naming and phylogenetic assignment.
 - Niko Papadopoulos, Fabian Ruperti, Jacob Musser, and Detlev Arendt wrote the manuscript.