Skip to content
Snippets Groups Projects
Commit 3047dac4 authored by Niko Papadopoulos's avatar Niko Papadopoulos
Browse files

comparison of top N hits at certain score margin done; discovered bug with EC comparison

parent 9409104d
No related branches found
No related tags found
No related merge requests found
%% Cell type:code id:0b1b23b0-54ae-4233-a07e-bdcbd701eebb tags:
``` python
from datetime import datetime, timezone
import pytz
utc_dt = datetime.now(timezone.utc) # UTC time
dt = utc_dt.astimezone()
tz = pytz.timezone('Europe/Berlin')
berlin_now = datetime.now(tz)
print(f'{berlin_now:%Y-%m-%d %H:%M}')
```
%% Output
2022-12-20 15:22
2022-12-20 16:48
%% Cell type:markdown id:3dc6274a-98d3-42ad-966d-1c6ea2481d8e tags:
# Annotation of second best hit
The reviewers noted that in the MorF pipeline we used the top hit to transfer functional annotation for each _Spongilla_ protein. They wondered if there was agreement in the functional annotation within between the morphologs ("no meaningful difference in score") of each protein. While we offered anecdotal evidence that the top hits all were within the same EggNOG orthogroup, we didn't back this up with extra analysis. This was rightfully brought up, and we are setting out to rectify it here.
Unfortunately this analysis is rather annoying to set up and execute; it requires us to define "top hits" in a meaningful way, something that's not always straightforward. Second, it requires us to get the full orthogroup assignments for a (very) large number of proteins, and, to the best of our knowledge, there is no programmatic way to get that information from the EggNOG database. The best workaround we found so far is to identify the proteins we want functional information for, retrieve their sequences from UniProt, submit these sequences to emapper, retrieve the results, and keep the best hit per sequence, which is going to be the identity match of the sequence to itself.
This is a lamentably circuitous way of getting information that, in some way, is definitely included in the EggNOG databases; unfortunately, we cannot think of a better way of accessing it.
After going through this, it is finally time to assess how well MorF does!
%% Cell type:code id:080f8852-4f0f-4e2a-92c3-6d73dca6a07d tags:
``` python
import glob
from os.path import exists
from tqdm import tqdm
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from skimage import filters
```
%% Cell type:markdown id:672a2976-2891-4dca-89ae-b58f099b9687 tags:
Read the predictions and cutoffs:
%% Cell type:code id:e286b380-898e-43a1-be55-1f9af1158ffe tags:
%% Cell type:code id:cfa31b4f-7a70-42cd-851c-5efae95a60b4 tags:
``` python
filtered = pd.read_csv("../data/revision/filtered_afdb.tsv", sep="\t", index_col=0)
```
%% Cell type:code id:75e276fb-0fa6-454a-9c62-e4a920f5b2db tags:
``` python
cutoffs = pd.read_csv("../data/revision/cutoffs.tsv", sep="\t", index_col=0)
```
%% Cell type:markdown id:373734c5-1dc9-40c5-a704-72ea3fc3275c tags:
Read the annotations, keep the best hit per query, and concatenate into one big table:
%% Cell type:code id:d74f6742-80ec-4b3e-a9dd-0275ad3c0131 tags:
``` python
res = []
for f in glob.glob("../data/eggnog_fastas/*"):
annot_fragment = pd.read_csv(f, sep="\t", skiprows=4, skipfooter=3, engine="python")
annot_fragment["id"] = annot_fragment["#query"].str.split("|").str[1]
beyond_reasonable_doubt = annot_fragment[annot_fragment["evalue"] < 1e-100]
unique_ids = beyond_reasonable_doubt["id"].unique().shape
all_ids = beyond_reasonable_doubt["id"].shape
if unique_ids != all_ids:
beyond_reasonable_doubt = beyond_reasonable_doubt.sort_values("evalue").drop_duplicates("id")
beyond_reasonable_doubt.set_index("id", inplace=True)
res.append(beyond_reasonable_doubt.copy())
annot = pd.concat(res)
```
%% Cell type:code id:0c1196c7-7714-4f89-b6fb-650d7ae0874c tags:
``` python
filtered = filtered.join(annot, on="uniprot")
```
%% Cell type:code id:dcd6c60b-2da6-436f-928f-a396f9516700 tags:
``` python
informative_columns = ["query", "bit score", "uniprot", "eggNOG_OGs", "Preferred_name", "EC"]
slim = filtered[informative_columns]
```
%% Cell type:code id:aeab3f8a-7ad1-4707-af1a-39fe2aa2543c tags:
%% Cell type:code id:e233f1a1-2eed-4e90-8270-bb2cd2f35d7e tags:
``` python
test = slim[slim["query"] == 9999]
clean = ~test.isna().apply(any, axis=1)
test = test[clean].copy()
```
%% Cell type:code id:222aee68-c47d-46a4-8c37-3cc8e5d9be06 tags:
``` python
best_hit = test.sort_values("bit score", ascending=False).iloc[0]["EC"]
def string_cumsum(x):
lis = x.split(".")
res = [lis[0]]
total = lis[0]
for x in lis[1:]:
total = total + "." + x
res.append(total)
return np.array(res)
```
%% Cell type:code id:9a84ff50-3173-4993-93c3-c7819efa54d8 tags:
``` python
def compare_EC(x, target):
query = [np.array(ec.split(".")) for ec in x.split(",")]
target = np.array(target.split("."))
query = [string_cumsum(ec) for ec in x.split(",")]
target = [string_cumsum(ec) for ec in target.split(",")]
max_agreement = 0
for q in query:
tmp = np.sum(q == target)
if tmp > max_agreement:
max_agreement = tmp
for t in target:
tmp = np.sum(q == t)
if tmp > max_agreement:
max_agreement = tmp
return max_agreement
```
%% Cell type:code id:5aa2b245-a874-4892-8800-c4df841ba97d tags:
%% Cell type:code id:aeab3f8a-7ad1-4707-af1a-39fe2aa2543c tags:
``` python
def calc_EC_overlap(df):
df = df.sort_values("bit score", ascending=False)
best_hit = df.iloc[0]["EC"]
return df["EC"].apply(compare_EC, target=best_hit)
```
%% Cell type:code id:cc2ac792-c299-483c-a855-10e1b65c6ff0 tags:
``` python
def otsu_score(df):
# print(df["query"].iloc[0])
agreement = calc_EC_overlap(df)
# print(agreement, (np.sum(agreement) - 4), ((df.shape[0] - 1) * 4))
otsu_agreement = (np.sum(agreement) - 4) / ((df.shape[0] - 1))
return otsu_agreement
```
%% Cell type:code id:37bdba39-2219-48d6-aeaf-10b27ae16542 tags:
``` python
def top10p_score(df, cutoffs):
agreement = calc_EC_overlap(df)
query = df.iloc[0]["query"]
top10percent = df["bit score"] > cutoffs["top10%"].loc[query]
niko_agreement = (np.sum(agreement[top10percent]) - 4) / ((np.sum(top10percent) - 1))
return niko_agreement
```
%% Cell type:code id:6a3d5a50-63d0-45d1-80d2-90f74e53db9d tags:
``` python
# slim = filtered[informative_columns]
filtered["EC"].replace("-", None, inplace=True)
has_EC = ~filtered["EC"].isnull()
slim = filtered[has_EC][informative_columns]
is_not_singleton = slim["query"].duplicated(keep=False)
slim = slim[is_not_singleton]
```
%% Cell type:code id:2e0a4e94-b5df-4d17-9073-c7bf38c423a3 tags:
``` python
test["agreement"] = test.sort_values("bit score", ascending=False)["EC"].apply(compare_EC, target=best_hit)
filtered.shape, slim.shape
```
%% Cell type:code id:813c27be-d8f0-4dd2-bc4e-afa343339707 tags:
%% Output
((2115799, 35), (296755, 6))
%% Cell type:code id:d1835787-7889-4507-87ad-14b1c8c13ee9 tags:
``` python
otsu = slim.groupby("query").apply(otsu_score)
top10p = slim.groupby("query").apply(top10p_score, cutoffs=cutoffs)
```
%% Cell type:code id:f31b4422-40ab-4bff-82e8-d094980e5cae tags:
``` python
otsu_agreement = (test["agreement"].sum() - 4) / ((test.shape[0] - 1) * 4)
exclude = top10p.isna()
```
%% Cell type:code id:a1793b2e-7dde-4184-9d7a-fad5aeb9851e tags:
%% Cell type:code id:cdbc5d61-82f6-41e3-a729-ffdc90438485 tags:
``` python
niko = test[test["bit score"] > cutoffs["top10%"].loc[9999]]
niko_agreement = (niko["agreement"].sum() - 4) / ((niko.shape[0] - 1) * 4)
fig, ax = plt.subplots()
ax.scatter(otsu[~exclude], top10p[~exclude])
```
%% Cell type:code id:6b8ad085-204a-47a2-a678-66f7f9151991 tags:
%% Output
<matplotlib.collections.PathCollection at 0x32dc70eb0>
%% Cell type:code id:34140e06-b4ed-4783-b24c-a1041025efa0 tags:
``` python
fig, ax = plt.subplots()
ax.plot(np.sort(otsu[~exclude]), label="strict multi-Otsu")
ax.plot(np.sort(top10p[~exclude]), label="top 10% score range")
ax.set_title("Average EC overlap of top morphologs by cutoff")
ax.legend();
```
%% Output
%% Cell type:code id:4b5646f0-2e55-430b-b58d-d285c39029ef tags:
``` python
fig, ax = plt.subplots()
# ax.hist(otsu[~exclude], bins=50, label="strict multi-Otsu")
bla = top10p.copy()
bla[bla.isna()] = -1
ax.hist(bla, bins=50, label="top 10% score range")
# ax.legend()
ax.set_title("Average EC overlap of morphologs\nwithin 10% of bit score range");
```
%% Output
%% Cell type:code id:97a24f25-cea0-443a-8cf8-80e549a3e5c7 tags:
``` python
otsu_agreement, niko_agreement
np.sum(top10p[~exclude]>3.) / np.sum(~exclude)
```
%% Output
(0.9779411764705882, 1.0)
0.9688341478009375
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment