Commit 30b67fc7 authored by Niko Papadopoulos

revisit MorF-EggNOG overlap (Table 3)

parent c29d13de
%% Cell type:markdown id:2ab498a2-bfa8-49c7-93ee-e5d37054a460 tags:
# Structure-sequence agreement in model species
To get a sense of how often morphologs are also orthologs, we searched AlphaFoldDB against itself. In Table 3 of the manuscript we reported that proteins with morphologs outside their own species were very often also orthologs of those morphologs. The reviewers pointed out that these orthologs might plausibly come from very closely related species, which would weaken our claim that structural similarity can detect functional similarity or orthology over longer evolutionary distances. Here we revisit this analysis and exclude species closely related to the current query (e.g. exclude all vertebrates when the query is human, mouse, rat, or zebrafish).
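The exclusion step described above can be sketched on toy data. This is a minimal illustration, not the notebook's own code: the `hits` table, the `taxon_of` mapping, and the `exclude_same_taxon` helper are hypothetical names introduced only for this example.

```python
import pandas as pd

# Hypothetical toy data: three hits for a single human query protein.
hits = pd.DataFrame({
    "query_species": ["Homo sapiens"] * 3,
    "target_species": ["Mus musculus", "Danio rerio", "Drosophila melanogaster"],
    "bit": [900, 700, 400],
})

# Hypothetical species-to-taxon mapping (a small subset, for illustration only).
taxon_of = {
    "Homo sapiens": "Vertebrata",
    "Mus musculus": "Vertebrata",
    "Danio rerio": "Vertebrata",
    "Drosophila melanogaster": "Arthropoda",
}


def exclude_same_taxon(hits, taxon_of):
    """Keep only hits whose target lies outside the query's taxon."""
    query_taxon = hits["query_species"].map(taxon_of)
    target_taxon = hits["target_species"].map(taxon_of)
    return hits[query_taxon != target_taxon]


distant_hits = exclude_same_taxon(hits, taxon_of)
print(distant_hits["target_species"].tolist())
```

With this filter the mouse and zebrafish hits are dropped for the human query, so the best remaining hit comes from outside the vertebrates.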
%% Cell type:code id:9e515cdc-8372-42aa-b993-0f9a21d4ce86 tags:
``` python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
```
%% Cell type:code id:48771bad-b5d8-4f81-ae39-d0955f120201 tags:
``` python
afdb = pd.read_csv('/Users/npapadop/Documents/data/coffe/self_score.tsv', sep='\t', header=None)
afdb.columns = ['query', 'target', 'perc_id', 'ali_length', 'no_mismatch', 'no_gapopen',
'q_start', 'q_end', 't_start', 't_end', 'eval', 'bit']
afdb['query'] = afdb['query'].str.split('-').str[1]
afdb['target'] = afdb['target'].str.split('-').str[1]
afdb.drop(columns=["no_mismatch", "no_gapopen", "q_start", "q_end", "t_start", "t_end"], inplace=True)
```
%% Cell type:code id:e795b8d5-8986-46b2-97f5-43b8dae21c05 tags:
``` python
uniprot_to_eggnog = pd.read_csv('/Users/npapadop/Documents/data/coffe/uni2egg_euk.Nov2018.tsv', sep='\t', header=None)
uniprot_to_eggnog.columns = ['uniprot', 'orthogroup']
uniprot_to_eggnog.set_index("uniprot", inplace=True)
```
%% Cell type:code id:06265422-c6e6-4180-bef8-908ca806a894 tags:
``` python
annot = pd.read_csv('/Users/npapadop/Documents/data/coffe/afdb_proteomes_species.tsv', sep='\t', index_col=0)
annot.set_index("id", inplace=True)
```
%% Cell type:code id:892a05c7-35f4-4f92-b379-f0b17551177c tags:
``` python
taxonomy = {
    "Zea mays": "Monocots",
    "Glycine max": "Eudicots",
    "Trypanosoma brucei brucei (strain 927/4 GUTat10.1)": "Trypanosomatida",
    "Staphylococcus aureus (strain NCTC 8325 / PS 47)": "Staphylococcales",
    "Leishmania infantum": "Trypanosomatida",
    "Madurella mycetomatis": "Sordariomycetes",
    "Cladophialophora carrionii": "Chaetothyriales",
    "Drosophila melanogaster": "Arthropoda",
    "Methanocaldococcus jannaschii (strain ATCC 43067 / DSM 2661 / JAL-1 / JCM 10045 / NBRC 100440)": "Methanocaldococcales",
    "Mycobacterium leprae (strain TN)": "Mycobacteriales",
    "Mus musculus": "Vertebrata",
    "Escherichia coli (strain K12)": "Enterobacteriales",
    "Campylobacter jejuni subsp. jejuni serotype O:2 (strain ATCC 700819 / NCTC 11168)": "Campylobacterales",
    "Wuchereria bancrofti": "Nematoda",
    "Dracunculus medinensis": "Nematoda",
    "Homo sapiens": "Vertebrata",
    "Sporothrix schenckii (strain ATCC 58251 / de Perez 2211183)": "Sordariomycetes",
    "Mycobacterium ulcerans str. Harvey": "Mycobacteriales",
    "Dictyostelium discoideum": "Amoebozoa",
    "Arabidopsis thaliana": "Eudicots",
    "Caenorhabditis elegans": "Nematoda",
    "Oryza sativa subsp. japonica": "Monocots",
    "Saccharomyces cerevisiae (strain ATCC 204508 / S288c)": "Saccharomycetales",
    "Pseudomonas aeruginosa (strain ATCC 15692 / DSM 22644 / CIP 104116 / JCM 14847 / LMG 12228 / 1C / PRS 101 / PAO1)": "Pseudomonadales",
    "Schizosaccharomyces pombe (strain 972 / ATCC 24843)": "Schizosaccharomycetes",
    "Strongyloides stercoralis": "Nematoda",
    "Fonsecaea pedrosoi CBS 271.37": "Chaetothyriales",
    "Brugia malayi": "Nematoda",
    "Klebsiella pneumoniae subsp. pneumoniae (strain HS11286)": "Enterobacteriales",
    "Onchocerca volvulus": "Nematoda",
    "Danio rerio": "Vertebrata",
    "Ajellomyces capsulatus (strain G186AR / H82 / ATCC MYA-2454 / RMSCC 2432)": "Eurotiomycetes",
    "Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720)": "Enterobacteriales",
    "Trichuris trichiura": "Nematoda",
    "Schistosoma mansoni": "Platyhelminthes",
    "Candida albicans (strain SC5314 / ATCC MYA-2876)": "Saccharomycetales",
    "Trypanosoma cruzi (strain CL Brener)": "Trypanosomatida",
    "Enterococcus faecium": "Lactobacillales",
    "Helicobacter pylori (strain ATCC 700392 / 26695)": "Campylobacterales",
    "Plasmodium falciparum (isolate 3D7)": "Aconoidasida",
    "Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv)": "Mycobacteriales",
    "Neisseria gonorrhoeae (strain ATCC 700825 / FA 1090)": "Neisseriales",
    "Streptococcus pneumoniae (strain ATCC BAA-255 / R6)": "Lactobacillales",
    "Nocardia brasiliensis ATCC 700358": "Mycobacteriales",
    "Paracoccidioides lutzii (strain ATCC MYA-826 / Pb01)": "Eurotiomycetes",
    "Rattus norvegicus": "Vertebrata",
    "Shigella dysenteriae serotype 1 (strain Sd197)": "Enterobacteriales",
    "Haemophilus influenzae (strain ATCC 51907 / DSM 11121 / KW20 / Rd)": "Pasteurellales"
}
```
%% Cell type:code id:b35c269b-0a48-470e-8f5d-6dd02373826a tags:
``` python
annot.columns = ["query species"]
afdb = afdb.join(annot, on="query")
annot.columns = ["target species"]
afdb = afdb.join(annot, on="target")
afdb["target taxonomy"] = afdb["target species"].replace(taxonomy)
# reset annot columns
annot.columns = ["species"]
```
%% Cell type:code id:58daa91d-ec2b-47cb-b007-6621b50e5576 tags:
``` python
uniprot_to_eggnog.columns = ["query euk OG"]
afdb = afdb.join(uniprot_to_eggnog, on="query")
uniprot_to_eggnog.columns = ["target euk OG"]
afdb = afdb.join(uniprot_to_eggnog, on="target")
```
%% Cell type:code id:02448ebd-05d4-47a0-90a5-3f579e3a5fff tags:
``` python
query_OG_present = ~afdb["query euk OG"].isnull()
target_OG_present = ~afdb["target euk OG"].isnull()
OGs_available = query_OG_present & target_OG_present
```
%% Cell type:code id:ae403d3f-ed80-4b03-b408-c7f00cceb86b tags:
``` python
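def best_match(df, sort_by='bit', tiebreak='ali_length'):
    # rows with the highest score; ties are broken by the longest alignment
    have_max = df[sort_by] == np.max(df[sort_by])
    max_ali = df[have_max][tiebreak] == np.max(df[have_max][tiebreak])
    # index label of the first row that satisfies both criteria
    return df[have_max][max_ali].index.values[0]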
```
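To make the tie-breaking behaviour of `best_match` concrete, here is a small sketch on made-up data. The helper is repeated so the example runs on its own; the `toy` table and its values are hypothetical and chosen only to show a tie in bit score.

```python
import numpy as np
import pandas as pd


def best_match(df, sort_by="bit", tiebreak="ali_length"):
    # Same logic as the notebook's helper, repeated so this example is standalone.
    have_max = df[sort_by] == np.max(df[sort_by])
    max_ali = df[have_max][tiebreak] == np.max(df[have_max][tiebreak])
    return df[have_max][max_ali].index.values[0]


# Hypothetical hits: query P1 has two targets tied at bit score 80.
toy = pd.DataFrame({
    "query": ["P1", "P1", "P1", "P2"],
    "target": ["A", "B", "C", "D"],
    "bit": [50, 80, 80, 30],
    "ali_length": [100, 120, 150, 90],
})

# One best row label per query protein.
best = {query: best_match(group) for query, group in toy.groupby("query")}
best_targets = toy.loc[list(best.values()), "target"].tolist()
print(best_targets)
```

For P1, targets B and C share the top bit score, and C is selected because its alignment is longer.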
%% Cell type:code id:6bcc92ad-84f9-4b17-a886-fdcf306410a0 tags:
``` python
for species in np.unique(annot['species']):
    taxon = taxonomy[species]
    query_is_species = afdb["query species"] == species
    # only keep hits to targets outside the query's own taxonomic group
    target_is_diff_taxon = afdb["target taxonomy"] != taxon
    keep = query_is_species & target_is_diff_taxon & OGs_available
    if np.sum(keep) == 0:
        print(f'{species}\t{0}\t{0}\t0.00%')
        continue
    # best out-of-taxon hit per query protein
    best_indices = afdb[keep].groupby("query").apply(best_match)
    species_best = afdb.loc[best_indices]
    total_proteins = len(afdb[query_is_species]["query"].unique())
    have_og = len(species_best)
    # fraction of best hits that share the query's eukaryotic orthogroup
    og_agreement = np.sum(species_best["query euk OG"] == species_best["target euk OG"]) / have_og
    print(f'{species}\t{total_proteins}\t{have_og}\t{og_agreement * 100: .2f}%')
```
%% Output
Ajellomyces capsulatus (strain G186AR / H82 / ATCC MYA-2454 / RMSCC 2432) 0 0 0.00%
Arabidopsis thaliana 27357 19800 83.78%
Brugia malayi 0 0 0.00%
Caenorhabditis elegans 19599 11177 67.29%
Campylobacter jejuni subsp. jejuni serotype O:2 (strain ATCC 700819 / NCTC 11168) 0 0 0.00%
Candida albicans (strain SC5314 / ATCC MYA-2876) 5973 4118 84.41%
Cladophialophora carrionii 11169 7040 81.29%
Danio rerio 24626 13303 75.36%
Dictyostelium discoideum 12584 6312 63.96%
Dracunculus medinensis 0 0 0.00%
Drosophila melanogaster 13418 8828 79.69%
Enterococcus faecium 0 0 0.00%
Escherichia coli (strain K12) 0 0 0.00%
Fonsecaea pedrosoi CBS 271.37 0 0 0.00%
Glycine max 55759 32239 85.40%
Haemophilus influenzae (strain ATCC 51907 / DSM 11121 / KW20 / Rd) 0 0 0.00%
Helicobacter pylori (strain ATCC 700392 / 26695) 0 0 0.00%
Homo sapiens 20392 13838 74.30%
Klebsiella pneumoniae subsp. pneumoniae (strain HS11286) 0 0 0.00%
Leishmania infantum 7914 4423 56.14%
Madurella mycetomatis 0 0 0.00%
Methanocaldococcus jannaschii (strain ATCC 43067 / DSM 2661 / JAL-1 / JCM 10045 / NBRC 100440) 0 0 0.00%
Mus musculus 21507 15084 75.05%
Mycobacterium leprae (strain TN) 0 0 0.00%
Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) 0 0 0.00%
Mycobacterium ulcerans str. Harvey 0 0 0.00%
Neisseria gonorrhoeae (strain ATCC 700825 / FA 1090) 0 0 0.00%
Nocardia brasiliensis ATCC 700358 0 0 0.00%
Onchocerca volvulus 0 0 0.00%
Oryza sativa subsp. japonica 41794 20306 86.51%
Paracoccidioides lutzii (strain ATCC MYA-826 / Pb01) 8791 5086 84.19%
Plasmodium falciparum (isolate 3D7) 0 0 0.00%
Pseudomonas aeruginosa (strain ATCC 15692 / DSM 22644 / CIP 104116 / JCM 14847 / LMG 12228 / 1C / PRS 101 / PAO1) 0 0 0.00%
Rattus norvegicus 19246 13180 74.60%
Saccharomyces cerevisiae (strain ATCC 204508 / S288c) 6013 4347 83.41%
Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) 0 0 0.00%
Schistosoma mansoni 13845 1730 91.85%
Schizosaccharomyces pombe (strain 972 / ATCC 24843) 5104 3982 90.51%
Shigella dysenteriae serotype 1 (strain Sd197) 0 0 0.00%
Sporothrix schenckii (strain ATCC 58251 / de Perez 2211183) 8652 6349 78.41%
Staphylococcus aureus (strain NCTC 8325 / PS 47) 0 0 0.00%
Streptococcus pneumoniae (strain ATCC BAA-255 / R6) 0 0 0.00%
Strongyloides stercoralis 0 0 0.00%
Trichuris trichiura 0 0 0.00%
Trypanosoma brucei brucei (strain 927/4 GUTat10.1) 8476 4303 60.26%
Trypanosoma cruzi (strain CL Brener) 19026 8782 56.56%
Wuchereria bancrofti 0 0 0.00%
Zea mays 38818 18644 85.47%
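The per-query `groupby(...).apply(best_match)` step may be slow on a large hit table. One possible alternative, sketched here under assumptions, is to sort once and keep the first row per query. The `best_hits` helper and the `toy` table are hypothetical and not part of the notebook. When two rows tie on both bit score and alignment length, the row this version keeps is not guaranteed to match the one `best_match` returns.

```python
import pandas as pd


def best_hits(df, sort_by="bit", tiebreak="ali_length"):
    """Return one row per query: highest score, then longest alignment."""
    ordered = df.sort_values([sort_by, tiebreak], ascending=False)
    return ordered.drop_duplicates("query", keep="first")


# Hypothetical hits: query P1 has two targets tied at bit score 80.
toy = pd.DataFrame({
    "query": ["P1", "P1", "P1", "P2"],
    "target": ["A", "B", "C", "D"],
    "bit": [50, 80, 80, 30],
    "ali_length": [100, 120, 150, 90],
})

selected = best_hits(toy).sort_values("query")
print(selected[["query", "target"]])
```

The result keeps the original index labels, so it can be used in place of `afdb.loc[best_indices]` when computing the orthogroup agreement.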
@@ -150,7 +150,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
"version": "3.10.8"
}
},
"nbformat": 4,