Skip to content
Snippets Groups Projects
Commit 7f88d077 authored by Niko Papadopoulos's avatar Niko Papadopoulos
Browse files

added HMMER analysis, morpholog phylogenetic distribution, and additional info on table3

parent e71eca47
No related branches found
No related tags found
No related merge requests found
%% Cell type:code id:e4b053ea-b0c7-4884-9bae-fb9738ab86a2 tags:
``` python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
```
%% Cell type:markdown id:852d095d-0d68-4850-8485-ec7708b81ee0 tags:
I figured out a way to access the entire EggNOG database - see [here](https://twitter.com/galicae/status/1608117353711124480?s=20&t=IueyawfuyEwv2D7Nb65olQ). Really wish I could have figured this out about a year ago... anyway, here goes:
%% Cell type:code id:c7bb8a9a-b1b3-4b6d-925e-f41e249962d8 tags:
``` python
eggnog = pd.read_csv("/Users/npapadop/Documents/data/references/eggnog/eggnog_slim.csv", index_col=0)
eggnog["taxid"] = eggnog["name"].str.split(".").str[0]
eggnog["eggnog_id"] = eggnog["name"].str.split(".").str[1]
eggnog.drop(columns=["name"], inplace=True)
eggnog.reset_index(inplace=True, drop=True)
```
%% Cell type:code id:dcd4ff43-60a9-45fa-9493-0eedc4727d9d tags:
``` python
taxid_info = pd.read_csv("../data/e5.taxid_info.tsv", sep="\t", index_col="# Taxid")
```
%% Cell type:code id:f276d9ba-d500-4798-a961-1a81a0275a3f tags:
``` python
translation = pd.read_csv("../data/translation", sep="\t", header=None)
translation.columns = ["name", "accession", "type"]
translation["taxid"] = translation["name"].str.split(".").str[0]
translation["eggnog_id"] = translation["name"].str.split(".").str[1]
translation.drop(columns=["name"], inplace=True)
translation.reset_index(inplace=True, drop=True)
```
%% Cell type:code id:c426b885-99a4-4103-8ee8-42fd25dd78b6 tags:
``` python
def foldseek_top_hits(m8_loc):
foldseek = pd.read_csv(m8_loc, sep="\t", header=None)
foldseek.columns = ["query", "target", "seq. id.", "alignment length", "no. mismatches",
"no. gap open", "query start", "query end", "target start", "target end",
"e value", "bit score"]
foldseek["query"] = foldseek["query"].str.split("-").str[1]
foldseek["target"] = foldseek["target"].str.split("-").str[1]
in_species_hits = foldseek["target"].isin(foldseek["query"])
foldseek = foldseek[~in_species_hits]
top_hits = foldseek.sort_values("e value").drop_duplicates("query", keep="first")
return top_hits
```
%% Cell type:code id:dce10df8-05bb-40df-9ba6-f0d08405124a tags:
``` python
yeast = foldseek_top_hits("/Users/npapadop/Documents/data/foldseek/yeast.m8")
arabidopsis = foldseek_top_hits("/Users/npapadop/Documents/data/foldseek/arabidopsis.m8")
```
%% Cell type:code id:92bcf06b-3da2-4034-be58-3961fff4b857 tags:
``` python
def extract_unique_ids(df):
queries = df["query"].unique()
targets = df["target"].unique()
to_translate = np.concatenate((queries, targets))
return np.unique(to_translate)
```
%% Cell type:code id:cc6e23dc-2757-4ce5-bf07-bfaf2141939a tags:
``` python
sce_translate = extract_unique_ids(yeast)
ath_translate = extract_unique_ids(arabidopsis)
to_translate = np.unique(np.concatenate((sce_translate, ath_translate)))
```
%% Cell type:markdown id:beb147c5-6c5f-4447-ae38-644948baf204 tags:
The accession IDs are not unique; it may happen that we have collisions. To avoid unnecessary drama we will just remove all duplicates.
%% Cell type:code id:044b9d64-9d97-4f55-ac70-3fae5542c5db tags:
``` python
def map_AC_to_EggNOG(to_translate, translation, eggnog):
keep_accession = np.intersect1d(to_translate, translation["accession"])
eggnog_ids = translation.set_index("accession").loc[keep_accession]["eggnog_id"]
eggnog_ids.drop_duplicates(keep=False, inplace=True)
lexicon = pd.DataFrame(eggnog_ids).reset_index()
lexicon.set_index("eggnog_id", inplace=True)
keep_eggnog = np.intersect1d(eggnog["eggnog_id"], eggnog_ids)
annotation = eggnog.set_index("eggnog_id").loc[keep_eggnog]
taxonomy = annotation["taxid"].reset_index().drop_duplicates()
taxonomy.set_index("eggnog_id", inplace=True)
lexicon = lexicon.join(taxonomy)
lexicon = lexicon.reset_index()
lexicon.columns = ["eggnog", "accession", "taxid"]
return lexicon
```
%% Cell type:code id:b8804d2d-6099-4125-a182-b5955039f90a tags:
``` python
lexicon = map_AC_to_EggNOG(to_translate, translation, eggnog)
```
%% Cell type:code id:5b48ddd0-5fbe-4482-bec2-535f8f493b5a tags:
``` python
def max_tax_overlap(row, x="query lineage", y="target lineage"):
idx = 0
for idx, (i, j) in enumerate(zip(row[x], row[y])):
if i != j:
return row[x][idx - 1]
return row[x][idx]
```
%% Cell type:code id:21cab938-8fea-4afa-b65f-38e43886e221 tags:
``` python
def get_tax_id(top_hits, lexicon):
top_hits = top_hits.join(lexicon.set_index("accession")["taxid"], on="query")
top_hits.columns = ['query', 'target', 'seq. id.', 'alignment length', 'no. mismatches',
'no. gap open', 'query start', 'query end', 'target start',
'target end', 'e value', 'bit score', 'query_taxid']
top_hits = top_hits.join(lexicon.set_index("accession")["taxid"], on="target")
top_hits.columns = ['query', 'target', 'seq. id.', 'alignment length', 'no. mismatches',
'no. gap open', 'query start', 'query end', 'target start',
'target end', 'e value', 'bit score', 'query_taxid', 'target_taxid']
missing = top_hits["query_taxid"].isna() | top_hits["target_taxid"].isna()
equal = top_hits["query_taxid"] == top_hits["target_taxid"]
remove = missing | equal
top_hits["query_taxid"].isna().sum()
top_hits["target_taxid"].isna().sum()
top_hits[~remove]
top_hits_taxid = top_hits[~remove][["query_taxid", "target_taxid", "e value"]].copy()
top_hits_taxid["query_taxid"] = top_hits_taxid["query_taxid"].astype(np.int64)
top_hits_taxid["target_taxid"] = top_hits_taxid["target_taxid"].astype(np.int64)
top_hits_taxid = top_hits_taxid.join(taxid_info["Named Lineage"], on="query_taxid")
top_hits_taxid.columns = ["query", "target", "e value", "query lineage"]
top_hits_taxid = top_hits_taxid.join(taxid_info["Named Lineage"], on="target")
top_hits_taxid.columns = ["query", "target", "e value", "query lineage", "target lineage"]
top_hits_taxid["query lineage"] = top_hits_taxid["query lineage"].str.split(",")
top_hits_taxid["target lineage"] = top_hits_taxid["target lineage"].str.split(",")
top_hits_taxid["overlap"] = top_hits_taxid.apply(max_tax_overlap, axis=1)
return top_hits_taxid
```
%% Cell type:code id:1751cddf-ed02-4060-b307-49a50272aa85 tags:
``` python
yeast_tax_id = get_tax_id(yeast, lexicon)
arabidopsis_tax_id = get_tax_id(arabidopsis, lexicon)
```
%% Cell type:code id:67e88166-cf6a-498a-87b2-6a13f20f98bd tags:
``` python
yeast_tax_id["overlap"].value_counts()
```
%% Output
Saccharomycetales 2787
saccharomyceta 442
Eukaryota 206
Opisthokonta 189
Ascomycota 117
cellular organisms 10
Name: overlap, dtype: int64
%% Cell type:code id:a6bb37f4-850b-4a5b-a795-559ae06ebef8 tags:
``` python
arabidopsis_tax_id["overlap"].value_counts()
```
%% Output
rosids 15528
Mesangiospermae 2794
Eukaryota 250
cellular organisms 14
Name: overlap, dtype: int64
%% Cell type:markdown id:dac2f763-5bed-46de-8cc6-890cb9ef4d85 tags:
In the manuscript we state that MorF performs at the same level or better than blastp and EggNOG-mapper, annotation pipelines that use sequence similarity. The reviewers challenged us to expand the comparison to more sensitive sequence search options. For ease of comparison we chose to use emapper in profile mode (HMMER). Since we are interested in remote homology we decided to use the HMMs at the `Eukaryota` level of the EggNOG database.
%% Cell type:code id:fa3ab327-3034-49b2-b4dd-06400812de8f tags:
``` python
from datetime import datetime, timezone
import pytz
utc_dt = datetime.now(timezone.utc) # UTC time
dt = utc_dt.astimezone()
tz = pytz.timezone('Europe/Berlin')
berlin_now = datetime.now(tz)
print(f'{berlin_now:%Y-%m-%d %H:%M}')
```
%% Output
2023-01-09 16:55
%% Cell type:markdown id:17a5b145-106d-493b-a6b8-56ece838561b tags:
# Agreement between sequence profile searches and morphologs
We will compare the annotation produced by AlphaFold+FoldSeek against that produced by EggNOG-mapper in `hmmer` mode. If our idea worked well, we should see that in the overwhelming majority of cases the structural pipeline identifies the same annotation for each _Spongilla_ protein.
%% Cell type:code id:33bfb60a-09ad-4295-9c76-0b4c1e25f0e7 tags:
``` python
import glob
from os.path import exists
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from upsetplot import plot as upset
```
%% Cell type:markdown id:aebf5571-df8e-4a37-ade0-a9c66a6d4b74 tags:
Read the two tables:
%% Cell type:code id:95f5b02a-83e9-4eb6-a322-c15368890d2e tags:
``` python
morf = pd.read_parquet('../data/results/structure_annotation.parquet')
hmmer = pd.read_csv('../data/profile/slac_hmmer.emapper.annotations', sep='\t', skiprows=4, skipfooter=3, engine='python')
emapper = pd.read_csv('../data/results/Slacustris_eggnog.tsv', sep='\t')
```
%% Cell type:markdown id:e501c7b6-5201-4044-b709-80edf3fea48c tags:
Identify the bit score cut off and threshold the structure annotation:
%% Cell type:code id:4cee8058-943b-40ff-bf2a-e79fb918b3e4 tags:
``` python
bitscore_cut_off = np.exp(5)
keep = morf['bit score'] >= bitscore_cut_off
morf = morf[keep].copy()
```
%% Cell type:markdown id:c59bd41f-e827-4bbe-b742-93c0ac5b64fa tags:
extract protein IDs from the hmmer results so we can merge:
%% Cell type:code id:9e998c73-6e3d-4e71-a232-2f21701e52df tags:
``` python
hmmer["protein_id"] = hmmer["#query"].str.split(".").str[1].astype(int)
```
%% Cell type:markdown id:03e9553c-08df-4fac-afe1-8690e2a62f16 tags:
How many proteins are annotated by each approach?
%% Cell type:code id:343086a8-9344-454b-8628-a34fc5e5a649 tags:
``` python
morf.shape, hmmer.shape, emapper.shape
```
%% Output
((25232, 30), (28897, 22), (17990, 11))
%% Cell type:markdown id:793bf267-9baa-4fd3-b9f1-23f1ee2a6a76 tags:
How many proteins don't get a name?
%% Cell type:code id:01caaf09-160f-4ec2-bf0d-895551ad141c tags:
``` python
(morf['Preferred_name'] == '-').sum(), (hmmer['Preferred_name'] == '-').sum(), (emapper['Preferred_name'] == '-').sum()
```
%% Output
(5206, 11724, 5317)
%% Cell type:markdown id:74d30005-698d-470b-b48b-af8824b954f4 tags:
How many proteins don't get a description?
%% Cell type:code id:a84742df-0c9c-48fc-b716-bf02ca4fb0bd tags:
``` python
(morf['Description'] == '-').sum(), (hmmer['Description'] == '-').sum(), (emapper['Description'] == '-').sum()
```
%% Output
(38, 2592, 596)
%% Cell type:markdown id:7d04e462-be6c-4316-84d4-4185a2d8528c tags:
While `emapper-hmmer` annotates more proteins, it also annotates them at a less useful level (no name); effectively, MorF still annotates a larger part of the _Spongilla_ proteome.
# What is the level of detail given by each modality?
We will use the amount of orthogroups as a proxy for that. If annotation via hmmer is more vague we would expect to consistently find less orthogroups listed per protein.
This is an inherently flawed comparison, as MorF looks for the best morpholog, which is going to be a real protein rather than an orthologous group, as is the case for emapper. However, the argument could be made that if MorF agrees with emapper/emapper-hmmer on the eukaryote or root level, then the additional detail brought by using a best-hit approach might be beneficial.
%% Cell type:code id:4f6a0e0c-c31b-4c99-af3e-f3e5f7de97ed tags:
``` python
morf['#OGs'] = morf['eggNOG_OGs'].str.split(',').apply(len)
hmmer['#OGs'] = hmmer['eggNOG_OGs'].str.split(',').apply(len)
emapper['#OGs'] = emapper['eggNOG_OGs'].str.split(',').apply(len)
```
%% Cell type:code id:483f8d88-913c-493d-9ced-d1c5c523b8e3 tags:
``` python
fig, ax = plt.subplots()
ax.hist(morf['#OGs'], bins=20, label='MorF', alpha=0.5, density=True)
ax.hist(emapper['#OGs'], bins=20, label='emapper', alpha=0.5, density=True)
ax.hist(hmmer['#OGs'], bins=20, label='emapper-hmmer', alpha=0.5, density=True)
ax.legend();
```
%% Output
%% Cell type:markdown id:52950276-189a-4105-8b26-29f1ef9dc3b1 tags:
MorF's distribution is clearly shifted to the right. Let's quantify it:
%% Cell type:code id:e53585a7-b2d8-4ad7-90d4-e1baace690a2 tags:
``` python
summaries = {
'MorF': morf['#OGs'].describe(),
'emapper': emapper['#OGs'].describe(),
'emapper-hmmer': hmmer['#OGs'].describe(),
}
pd.DataFrame(summaries)
```
%% Output
MorF emapper emapper-hmmer
count 25232.000000 17990.000000 28897.000000
mean 8.283767 5.213897 5.350486
std 2.327989 2.077985 2.231996
min 2.000000 2.000000 2.000000
25% 7.000000 4.000000 4.000000
50% 8.000000 4.000000 5.000000
75% 10.000000 7.000000 7.000000
max 17.000000 13.000000 16.000000
%% Cell type:markdown id:86e04dcb-7680-4471-8a45-b0d68ad67b31 tags:
Interesting to see that the standard deviation is essentially the same in all distributions! MorF annotates, on average, 3 more OGs per protein. This is probably a product of how it works, namely by getting a best hit and assigning all its information to the query protein, but since standard `emapper` and `emapper-hmmer` are essentially the same distribution this may be more important.
# How much does each modality annotate?
We will visualize the different subsets of annotation overlap with an upset plot.
%% Cell type:code id:69d79788-2835-4d54-a97f-841877b1b090 tags:
``` python
total_proteins = 41943
```
%% Cell type:code id:bb1646f2-8a27-4bbc-9a2b-20e7ed45d18c tags:
``` python
sequence = np.intersect1d(emapper['protein_id'], hmmer['protein_id'])
in_all = np.intersect1d(sequence, morf['protein_id'])
hmmer_morf = np.intersect1d(morf['protein_id'], hmmer['protein_id'])
hmmer_morf = np.setdiff1d(hmmer_morf, emapper['protein_id'])
emapper_morf = np.intersect1d(morf['protein_id'], emapper['protein_id'])
emapper_morf = np.setdiff1d(emapper_morf, hmmer['protein_id'])
emapper_hmmer = np.intersect1d(hmmer['protein_id'], emapper['protein_id'])
emapper_hmmer = np.setdiff1d(emapper_hmmer, morf['protein_id'])
morf_only = np.setdiff1d(morf['protein_id'], hmmer['protein_id'])
morf_only = np.setdiff1d(morf_only, emapper['protein_id'])
hmmer_only = np.setdiff1d(hmmer['protein_id'], morf['protein_id'])
hmmer_only = np.setdiff1d(hmmer_only, emapper['protein_id'])
emapper_only = np.setdiff1d(emapper['protein_id'], morf['protein_id'])
emapper_only = np.setdiff1d(emapper_only, hmmer['protein_id'])
```
%% Cell type:code id:993e7652-aff9-4880-a54f-4fa22e3fc169 tags:
``` python
all_annotated = np.concatenate((morf['protein_id'], hmmer['protein_id'], emapper['protein_id']))
all_unique = np.unique(all_annotated)
```
%% Cell type:code id:85c67750-3125-412a-8179-71b0401858cb tags:
``` python
counts = [
total_proteins - len(all_unique),
len(morf_only),
len(hmmer_only),
len(hmmer_morf),
len(emapper_only),
len(emapper_morf),
len(emapper_hmmer),
len(in_all),
]
```
%% Cell type:code id:d3b9c02c-e388-4cfb-9255-d824f25a368a tags:
``` python
result = {
'emapper': [False, False, False, False, True, True, True, True],
'emapper-hmmer': [False, False, True, True, False, False, True, True],
'MorF': [False, True, False, True, False, True, False, True],
'counts': counts
}
overlap = pd.DataFrame(result)
overlap.set_index(['emapper', 'emapper-hmmer', 'MorF'], inplace=True)
```
%% Cell type:code id:b69d5175-0d57-475c-8f46-62be8265beab tags:
``` python
subplots = upset(overlap, sum_over='counts')
subplots['intersections'].set_title('#proteins annotated by each source');
```
%% Output
%% Cell type:markdown id:eea1454f-71bc-4fcf-8d46-737610063c73 tags:
This plot demonstrates the sensitivity of sequence profiles; there aren't many annotations that HMMER doesn't find :)
a sanity check:
%% Cell type:code id:82814d7f-d401-49fd-aedf-9b1f6702f150 tags:
``` python
overlap.sum()
```
%% Output
counts 41943
dtype: int64
%% Cell type:markdown id:1c5f1532-90b8-4552-92f5-9230c3c5b57e tags:
this makes sense. Now let's have a look at the IDs that overlap:
%% Cell type:code id:a412b367-b3b7-46ad-9f22-77dd3509521f tags:
``` python
sequence = np.intersect1d(emapper['protein_id'], hmmer['protein_id'])
keep = np.intersect1d(sequence, morf['protein_id'])
len(keep)
```
%% Output
16346
%% Cell type:markdown id:50341c58-e419-4762-baa4-0d895576b0eb tags:
We will compare the level of detail and the agreement between the three modalities. Let's start by subsetting the tables before merging:
%% Cell type:code id:85664620-0bb3-410b-bc7c-e75ad2a90c71 tags:
``` python
morf = morf.set_index("protein_id").loc[keep].copy()
emapper = emapper.set_index("protein_id").loc[keep].copy()
hmmer = hmmer.set_index("protein_id").loc[keep].copy()
```
%% Cell type:code id:add4fa47-8fbd-4a95-b522-1cf6bb0a0fd6 tags:
``` python
orthogroups = emapper[['eggNOG_OGs']].join(hmmer[['eggNOG_OGs']], lsuffix='_emapper', rsuffix='_hmmer').join(morf[['eggNOG_OGs']])
```
%% Cell type:code id:72071b54-9db9-4a50-a4d8-4828453e6115 tags:
``` python
orthogroups.columns = ['emapper', 'hmmer', 'morf']
```
%% Cell type:markdown id:20c96b85-54f7-453f-a286-ff95a3385c10 tags:
## EggNOG most specific OrthoGroup level
this is the ideal scenario: for how many cases do the annotation pipelines put a gene in the same group of orthologous genes? Extract the EggNOG OG information and compare. We will always examine both ways (source 1 included in source 2 or vice versa) and count both cases as positives.
%% Cell type:code id:859149f7-05d4-4241-99b6-1b8cc20eb558 tags:
``` python
def keep_last(row):
"""
A function to isolate the root EggNOG orthogroup.
Expects a comma-separated string where the root orthogroup
contains the word 'root'.
"""
x = np.array(row.split(','))
return x[-1]
```
%% Cell type:code id:45f4676c-6654-40e7-bee3-1cf4f61d36ec tags:
``` python
for source in ['emapper', 'hmmer', 'morf']:
orthogroups['most_specific_' + source] = orthogroups[source].apply(keep_last)
```
%% Cell type:markdown id:b5138fdc-d0f4-45f0-91ff-e6ac9634f084 tags:
Compare the root orthogroup lists of each gene between tables. Hitting at least one is considered a success.
%% Cell type:code id:66793098-a668-4e0a-8999-9e33a95a7e86 tags:
``` python
def contains_OG(row, source1='morf', source2='emapper', base='most_specific_'):
if row[base + source1] is None or row[base + source2] is None:
return None
s1_contains_s2 = row[base + source1] in row[source2]
s2_contains_s1 = row[base + source2] in row[source1]
return s1_contains_s2 | s2_contains_s1
```
%% Cell type:code id:6c7ecd7f-3531-4870-b078-62840bff6b9f tags:
``` python
for s1, s2 in [
('morf', 'hmmer'),
('morf', 'emapper'),
('hmmer', 'emapper')
]:
orthogroups['ms_' + s1 + '_' + s2] = orthogroups.apply(contains_OG, axis=1, args=(s1, s2))
```
%% Cell type:markdown id:286a5e92-d267-4100-890b-ad6b4ea2a11c tags:
## EggNOG eukaryote level OG
since for HMMER we used the eukaryote level OGs, we would expect the significant overlap to happen there:
%% Cell type:code id:3adc0728-ee33-46b7-8f05-a6d2ece5777f tags:
``` python
def keep_euk(row):
"""
A function to isolate the root EggNOG orthogroup.
Expects a comma-separated string where the root orthogroup
contains the word 'root'.
"""
x = np.array(row.split(','))
for og in x:
if "Eukaryota" in og:
return og
else:
return None
```
%% Cell type:code id:7a2f5c9a-fe22-4d4c-ab9f-b69d157c0f27 tags:
``` python
for source in ['emapper', 'hmmer', 'morf']:
orthogroups['euk_' + source] = orthogroups[source].apply(keep_euk)
```
%% Cell type:markdown id:4df3e0d5-adc5-4514-b8bc-24de39a01f84 tags:
See if the eukaryote orthogroup is shared between modalities.
%% Cell type:code id:4382f954-010b-486d-8d28-89cb5bae400b tags:
``` python
for s1, s2 in [
('morf', 'hmmer'),
('morf', 'emapper'),
('hmmer', 'emapper')
]:
orthogroups['euk_' + s1 + '_' + s2] = orthogroups.apply(contains_OG, axis=1, args=(s1, s2, 'euk_'))
```
%% Cell type:markdown id:d631a6e9-e7e2-4433-9c9e-707aecb6da86 tags:
## EggNOG root OrthoGroup level
this is the most pertinent level: for how many cases do the two annotation pipelines put a gene in the same group of orthologous genes? Extract the EggNOG OG information and compare. To facilitate the comparison we will only look at the root level, but we could extend this as much as we want (e.g. pick the most specific orthogroup for structure and ask if it is found in the sequence annotation).
Isolate the root orthogroup for each peptide. If multiple root orthogroups are present, keep all of them.
%% Cell type:code id:5d672a2f-adfa-46da-8548-0657836576a3 tags:
``` python
def keep_root(row):
"""
A function to isolate the root EggNOG orthogroup.
Expects a comma-separated string where the root orthogroup
contains the word 'root'.
"""
x = np.array(row.split(','))
keep = np.zeros(len(x), dtype=bool)
for i, og in enumerate(x):
keep[i] = 'root' in og
# print(x, keep)
return x[keep]
```
%% Cell type:code id:5a683094-4840-4ede-b1a6-ba21a0f49bde tags:
``` python
for source in ['emapper', 'hmmer', 'morf']:
orthogroups['root_' + source] = orthogroups[source].apply(keep_root)
```
%% Cell type:markdown id:c4248331-d51b-4209-a0e1-57bf02ad53ea tags:
Compare the root orthogroup lists of each gene between tables. Hitting at least one is considered a success.
%% Cell type:code id:f31398bf-eeda-4c49-9f35-e11acbe739d9 tags:
``` python
def is_contained_in(row, x='eggNOG_OGs_struct', y='eggNOG_OGs_seq'):
overlap = np.intersect1d(row[x], row[y])
return len(overlap) > 0
```
%% Cell type:code id:588ea19c-6661-4a82-bcda-08253f7c583d tags:
``` python
for s1, s2 in [
('morf', 'hmmer'),
('morf', 'emapper'),
('hmmer', 'emapper')
]:
orthogroups['root_' + s1 + '_' + s2] = orthogroups.apply(is_contained_in, axis=1, args=('root_' + s1, 'root_' + s2))
```
%% Cell type:markdown id:a9183c22-29ff-4956-96dc-16daaae97e6d tags:
# Overlap agreement
Here we will just print the results:
%% Cell type:code id:6eb5faba-729c-4745-b9a7-3c2586504e42 tags:
``` python
for s1, s2 in [
('morf', 'hmmer'),
('morf', 'emapper'),
('hmmer', 'emapper')
]:
specific_overlap = orthogroups['ms_' + s1 + '_' + s2].sum()
euk_overlap = orthogroups['euk_' + s1 + '_' + s2].sum()
root_overlap = np.sum(orthogroups['root_' + s1 + '_' + s2]).sum()
print(f'{s1} and {s2}:')
total = len(keep)
print(f'\t- most specific OG: {specific_overlap} / {total} ({specific_overlap / total * 100:.2f}%)')
total = np.sum(~(orthogroups['euk_' + s1].isnull() | orthogroups['euk_' + s2].isnull()))
print(f'\t- eukaryotic OG: {euk_overlap} / {total} ({euk_overlap / total * 100:.2f}%)')
total = len(keep)
print(f'\t- root OG: {root_overlap} / {total} ({root_overlap / total * 100:.2f}%)')
```
%% Output
morf and hmmer:
- most specific OG: 7959 / 16346 (48.69%)
- eukaryotic OG: 13038 / 15889 (82.06%)
- root OG: 14251 / 16346 (87.18%)
morf and emapper:
- most specific OG: 9280 / 16346 (56.77%)
- eukaryotic OG: 14030 / 15745 (89.11%)
- root OG: 14827 / 16346 (90.71%)
hmmer and emapper:
- most specific OG: 10909 / 16346 (66.74%)
- eukaryotic OG: 13828 / 16032 (86.25%)
- root OG: 14899 / 16346 (91.15%)
%% Cell type:markdown id:7db8fabf-58d7-4be1-86c8-eed0b20ec250 tags:
Where do the orthogroups come from? Do the methods differ in the level of detail they provide?
%% Cell type:code id:62496fa8-166a-460c-9354-c49bcfd37fa0 tags:
``` python
orthogroups['most_specific_hmmer'].str.split('|').str[-1].value_counts()[:10]
```
%% Output
Metazoa 6658
Bilateria 1613
Eukaryota 1140
Actinopterygii 913
Chordata 768
Arthropoda 519
Vertebrata 464
Opisthokonta 444
Poales 355
Lepidoptera 210
Name: most_specific_hmmer, dtype: int64
%% Cell type:code id:539bd579-25c5-46f6-b034-7141ad8b76f8 tags:
``` python
orthogroups['most_specific_emapper'].str.split('|').str[-1].value_counts()[:10]
```
%% Output
Metazoa 7692
Bilateria 1340
Actinopterygii 1146
Eukaryota 970
Chordata 705
Vertebrata 627
Opisthokonta 398
Rodentia 273
Mammalia 251
Arthropoda 233
Name: most_specific_emapper, dtype: int64
%% Cell type:code id:1e004899-37ae-4005-a59d-e64bc867a504 tags:
``` python
orthogroups['most_specific_morf'].str.split('|').str[-1].value_counts()[:10]
```
%% Output
Actinopterygii 3216
Rodentia 3160
Hominidae 2291
Vertebrata 749
Drosophilidae 668
Cetartiodactyla 646
Chromadorea 597
Bilateria 486
Poales 399
Rhabditida 389
Name: most_specific_morf, dtype: int64
%% Cell type:markdown id:adb46728-e071-439a-aaf7-d242ed3198d6 tags:
Standard emapper and `emapper-hmmer` have a very similar top 10, with many representatives from rather broad categories (eukaryotes, metazoa, bilateria). I was expecting this for HMMER, but not for emapper.
%% Cell type:markdown id:2ab498a2-bfa8-49c7-93ee-e5d37054a460 tags:
# Structure-sequence agreement in model species
In an effort to get a feeling for how often morphologs were orthologs, we searched with AlphaFoldDB against itself. In Table 3 of the manuscript we reported that proteins that had non-species morphologs (very) often were also orthologs. The reviewers pointed out that these orthologs might plausibly come from very closely related species, thus weakening our claim that structural similarity might detect functional similarity or orthology over longer evolutionary distances. Here we are revisiting this analysis and looking to exclude
......@@ -201,7 +201,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
"version": "3.10.8"
}
},
"nbformat": 4,
......
......@@ -663,7 +663,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
"version": "3.10.8"
}
},
"nbformat": 4,
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment