Skip to content
Snippets Groups Projects
Commit 2c08e415 authored by Niko Papadopoulos's avatar Niko Papadopoulos
Browse files

added statistical comparison of avg pLDDT distributions

parent bbda981a
No related branches found
No related tags found
No related merge requests found
%% Cell type:markdown id:51ce05c0-6933-42fd-a5c5-e5809e6858d9 tags:
The reviewers challenged us to revisit the statement that the AlphaFold prediction quality for _Spongilla_ was the same as for other species.
%% Cell type:code id:c3ea6175-d71f-4ae7-8934-52f818a80a0e tags:
``` python
from tqdm import tqdm
import glob
import numpy as np
import pandas as pd
```
%% Cell type:code id:570141f0-2fc4-4bb7-ad0d-7a9e76531c81 tags:
``` python
plddt = {}
for path in glob.glob("../data/alphafold_performance/*"):
basename = path.split("/")[-1]
species = basename.split(".")[0]
tmp = pd.read_csv(path, sep="\t", index_col=0)
plddt[species] = tmp["plddt"].T.values
```
%% Cell type:code id:89e50e57-402b-4d56-9203-95e5ca4a4065 tags:
``` python
genes = []
score = []
with np.load("../data/results/spongilla_plddt.npz") as spongilla:
for gene in spongilla.keys():
genes.append(gene)
score.append(np.mean(spongilla[gene]))
plddt["S_lacustris"] = np.array(score)
```
%% Cell type:markdown id:9637a995-be1d-467e-8073-4238e8ae080e tags:
To do goodness-of-fit tests it is important to know what sort of distribution we are analysing. We will use D'Agostino and Pearson's test to check if the pLDDT score distributions are normal. The null hypothesis is that they are from a normal distribution, so a small p-value means we can reject it.
%% Cell type:code id:46755534-f0c1-4c1c-8536-857fe54c1850 tags:
``` python
import scipy.stats as stats
from scipy.stats import normaltest
```
%% Cell type:code id:b38831fe-e45e-4e23-b91d-79cfaeae0476 tags:
``` python
print("Two-sided χ-squared probability for the hypothesis test (rounded to 10 decimals)")
for species, score in plddt.items():
stat, pval = normaltest(score)
print(f"{species}: {np.round_(pval, decimals=10)}")
```
%% Output
Two-sided χ-squared probability for the hypothesis test (rounded to 10 decimals)
A_thaliana: 0.0
M_musculus: 0.0
D_rerio: 0.0
S_cerevisiae: 0.0
H_sapiens: 0.0
D_discoideum: 0.0
C_elegans: 0.0
D_melanogaster: 0.0
S_lacustris: 0.0
%% Cell type:markdown id:0dddfa7d-df6b-4851-8a70-a1dfa4055a59 tags:
none of the distributions are normal; this means we are left with non-parametric distribution tests like the Kolmogorov-Smirnov.
%% Cell type:code id:69e57e28-6465-4856-9a95-f3f045611325 tags:
``` python
from scipy.stats import ks_2samp
```
%% Cell type:code id:8c41c67e-6e44-421e-bbae-b4238d2e9f48 tags:
``` python
species = plddt.keys()
result = np.zeros((len(species), len(species)))
```
%% Cell type:code id:acb7d8b8-d987-4fc5-b8e4-080f9738e4dc tags:
``` python
for i, s1 in enumerate(species):
for j, s2 in enumerate(species):
stat, pval = ks_2samp(plddt[s1], plddt[s2])
result[i, j] = pval
```
%% Cell type:code id:d4acd3c6-7b27-41df-bbc5-0e9ced74ec48 tags:
``` python
ks_result = pd.DataFrame(result, columns=species, index=species)
```
%% Cell type:code id:b40f5b10-1166-41cc-8a07-fab5e534c9b6 tags:
``` python
ks_result
```
%% Output
A_thaliana M_musculus D_rerio S_cerevisiae \
A_thaliana 1.000000e+00 6.934609e-13 9.114463e-18 7.262317e-19
M_musculus 6.934609e-13 1.000000e+00 2.070102e-08 5.863687e-16
D_rerio 9.114463e-18 2.070102e-08 1.000000e+00 4.857841e-19
S_cerevisiae 7.262317e-19 5.863687e-16 4.857841e-19 1.000000e+00
H_sapiens 5.750618e-12 5.731073e-34 8.699267e-45 6.808127e-22
D_discoideum 9.016068e-113 2.383184e-159 1.086386e-182 3.284647e-74
C_elegans 4.454955e-08 5.537467e-06 1.085280e-13 1.288430e-22
D_melanogaster 1.026131e-23 6.251385e-24 2.567091e-39 2.306440e-12
S_lacustris 1.648052e-254 1.329357e-215 4.784290e-287 4.903494e-81
H_sapiens D_discoideum C_elegans D_melanogaster \
A_thaliana 5.750618e-12 9.016068e-113 4.454955e-08 1.026131e-23
M_musculus 5.731073e-34 2.383184e-159 5.537467e-06 6.251385e-24
D_rerio 8.699267e-45 1.086386e-182 1.085280e-13 2.567091e-39
S_cerevisiae 6.808127e-22 3.284647e-74 1.288430e-22 2.306440e-12
H_sapiens 1.000000e+00 1.030890e-58 1.705286e-21 8.108620e-11
D_discoideum 1.030890e-58 1.000000e+00 2.050222e-114 4.987552e-62
C_elegans 1.705286e-21 2.050222e-114 1.000000e+00 1.466712e-09
D_melanogaster 8.108620e-11 4.987552e-62 1.466712e-09 1.000000e+00
S_lacustris 4.824450e-129 3.228598e-15 1.278533e-145 6.907128e-69
S_lacustris
A_thaliana 1.648052e-254
M_musculus 1.329357e-215
D_rerio 4.784290e-287
S_cerevisiae 4.903494e-81
H_sapiens 4.824450e-129
D_discoideum 3.228598e-15
C_elegans 1.278533e-145
D_melanogaster 6.907128e-69
S_lacustris 1.000000e+00
%% Cell type:markdown id:a5866286-45ce-45a2-a143-e7e9a5fd59c3 tags:
This tells us that all of the distributions have easy-to-differentiate ECDFs. Let's plot them:
%% Cell type:code id:2eece4e8-0b18-460f-adcf-3cf98fe2d56d tags:
``` python
from matplotlib import pyplot as plt
```
%% Cell type:code id:87696667-c0b2-4b74-9ddb-901ef4c371a6 tags:
``` python
fig, ax = plt.subplots()
for s in species:
x = np.sort(plddt[s])
y = np.arange(len(x))/float(len(x))
plt.plot(x, y, label=s)
ax.legend();
```
%% Output
%% Cell type:markdown id:2f919858-185e-4dd8-bf55-c32d8a945fca tags:
This is indeed the case. _Spongilla_ and _Dictyostelium_ differ even more from the other organisms, enough so that the naked eye can spot it. There is no overwhelming similarity in the profiles, however.
%% Cell type:code id:cd2f071e-f8d6-4398-a049-00b97c1abe38 tags:
``` python
from scipy.stats import kruskal
```
%% Cell type:code id:fb769fd9-d424-4d50-a3bc-42e265474574 tags:
``` python
result = np.zeros((len(species), len(species)))
```
%% Cell type:code id:9e9a0a70-ec9b-41c6-9172-54e7ccf639f9 tags:
``` python
for i, s1 in enumerate(species):
for j, s2 in enumerate(species):
stat, pval = kruskal(plddt[s1], plddt[s2])
result[i, j] = pval
```
%% Cell type:code id:c3d5068b-6f88-47f1-9ade-956b863aae55 tags:
``` python
kruskal_result = pd.DataFrame(result, columns=species, index=species)
```
%% Cell type:code id:7d7597b1-5a93-48a2-a9d9-1e296e5b3fbd tags:
``` python
kruskal_result
```
%% Output
A_thaliana M_musculus D_rerio S_cerevisiae \
A_thaliana 1.000000e+00 5.401098e-11 1.294740e-09 4.059266e-11
M_musculus 5.401098e-11 1.000000e+00 3.243668e-01 9.471571e-03
D_rerio 1.294740e-09 3.243668e-01 1.000000e+00 2.391935e-03
S_cerevisiae 4.059266e-11 9.471571e-03 2.391935e-03 1.000000e+00
H_sapiens 4.322773e-14 1.298086e-38 1.350664e-38 8.916050e-27
D_discoideum 3.897159e-126 1.058769e-163 2.799068e-176 2.290830e-97
C_elegans 4.104535e-01 1.906173e-07 9.242227e-06 3.716571e-09
D_melanogaster 2.307912e-03 1.179170e-14 6.695887e-14 5.815376e-15
S_lacustris 1.998652e-273 0.000000e+00 0.000000e+00 1.645357e-141
H_sapiens D_discoideum C_elegans D_melanogaster \
A_thaliana 4.322773e-14 3.897159e-126 4.104535e-01 2.307912e-03
M_musculus 1.298086e-38 1.058769e-163 1.906173e-07 1.179170e-14
D_rerio 1.350664e-38 2.799068e-176 9.242227e-06 6.695887e-14
S_cerevisiae 8.916050e-27 2.290830e-97 3.716571e-09 5.815376e-15
H_sapiens 1.000000e+00 3.002254e-61 3.791151e-14 1.497413e-03
D_discoideum 3.002254e-61 1.000000e+00 1.923321e-111 5.084089e-65
C_elegans 3.791151e-14 1.923321e-111 1.000000e+00 2.288044e-03
D_melanogaster 1.497413e-03 5.084089e-65 2.288044e-03 1.000000e+00
S_lacustris 2.972335e-129 7.809348e-03 5.246039e-227 3.730213e-131
S_lacustris
A_thaliana 1.998652e-273
M_musculus 0.000000e+00
D_rerio 0.000000e+00
S_cerevisiae 1.645357e-141
H_sapiens 2.972335e-129
D_discoideum 7.809348e-03
C_elegans 5.246039e-227
D_melanogaster 3.730213e-131
S_lacustris 1.000000e+00
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment