Skip to content
Snippets Groups Projects
Commit 8dd17e2e authored by Nikolaos Papadopoulos's avatar Nikolaos Papadopoulos
Browse files

temp saving of AFDB, SWP parquets now possible; minor file structure changes

parent 6a5ed0e3
No related branches found
No related tags found
No related merge requests found
Showing with 298 additions and 224 deletions
*.log
MANIFEST
build
dist
......@@ -33,6 +35,7 @@ venv*/
# various auxiliary files
analysis/figures/*
old_data/*
data/*
*.h5ad
*.tsv
......
This diff is collapsed.
%% Cell type:code id:a5e450d0-6b47-405d-b6f4-fdbb3edfb641 tags:
``` python
from datetime import datetime, timezone
import pytz
utc_dt = datetime.now(timezone.utc) # UTC time
dt = utc_dt.astimezone()
tz = pytz.timezone('Europe/Berlin')
berlin_now = datetime.now(tz)
print(f'{berlin_now:%Y-%m-%d %H:%M}')
```
%% Output
2022-07-15 10:36
%% Cell type:markdown id:3d79e6d8-8b76-4d9b-8621-20543ce83f54 tags:
# Expanding functional annotation of a non-model species using predicted protein structure similarity
***WARNING:*** The starting point of this analysis is a **predicted proteome** produced by the Trinity <i>de-novo</i> assembler. We produced multiple sequence alignments for each **isoform** with MMSeqs2 against the ColabFoldDB and UniRef30 (see the [ColabFold website](https://colabfold.mmseqs.com)); the MSAs were used as input for AlphaFold; the predicted protein structures were queried by FoldSeek against PDB, SwissProt, and AlphaFoldDB. The same sequences were queried by EggNOG. However, ultimately the analysis we conduct here is going to be on the **gene level**; we will therefore summarize results on the isoform level, either by concatenating them or by keeping the best-scoring isoform.
This notebook gathers the preparatory steps for the analysis. This is the following:
1. **Parse MSA files** and extract MSA size, query length, _Spongilla_ transcript name
2. **Parse AlphaFold predictions**
2a. extract average pLDDT score per prediction
2b. keep best-scoring isoform per gene
3. **Read and filter FoldSeek predictions** for each target database (PDB, SwissProt, AlphaFoldDB). Obtain annotation via EggNOG mapper, UniProt.
4. **Read and filter sequence-based annotation** for the _Spongilla_ proteome.
5. **Format structural-based annotation** Summarize in single table, similar to how sequence annotation would look like.
We are going to use
* `numpy` and `pandas` for I/O and data wrangling;
* `tqdm` for progress bars;
* `glob`, `os` for file navigation;
* `json` to parse some AlphaFold output;
* `urllib` to query the UniProt API;
* `UPIMAPI`, a tool to programmatically access UniProt;
* `fasta-splitter.pl`, a Perl script to parse FASTA files.
%% Cell type:code id:1eae5786-023a-45ab-ac4a-ccb3a1db5760 tags:
``` python
import json
import glob
import os
import urllib.parse
import urllib.request
from tqdm import tqdm
import numpy as np
import pandas as pd
```
%% Cell type:markdown id:ebbcdb49-737d-4ed9-b081-56056ecd6cf6 tags:
## 0. Setup
We are going to gather the input here; we need
* input MSAs for _Spongilla lacustris_ proteins (actually Transdecoder-predicted peptides). These were the input to AlphaFold.
* `foldseek` search results for the _Spongilla_ predicted structures against three target databases: AlphaFoldDB, PDB, and SwissProt.
* AlphaFold predicted structures; we obtain a list of structures (input to FoldSeek) as well as auxiliary files with e.g. prediction quality scores.
* input peptides
%% Cell type:code id:c58419bf-f9d3-40f6-ae23-a4bf978f2b35 tags:
``` python
# MSA location
msas = '../data/msa/'
# FOLDSEEK scores
fs_pdb = '../data/pdb_score.tsv'
fs_afdb = '../data/afdb_score.tsv'
fs_swp = '../data/swissprot_score.tsv'
# AlphaFold predictions
structure_list = '../data/named_models/'
metadata = '../data/named_metadata/'
# input fasta
all_peptides = '../data/Spongilla_lacustris_translated_proteome.fasta'
```
%% Cell type:markdown id:9a44be9c-3fdf-4c62-9e16-bea75e383d00 tags:
## 1. Sequence information
### 1.1 Input peptides
The input for the sequence annotation pipeline was predicted peptides, as generated from the assembled genome. Read the peptides and keep track of which sequences start with a methionine and end with a star (i.e. complete transcripts assembled by Trinity/TransDecoder).
Note that there will be multiple (predicted) isoforms per (predicted) gene. We produced one MSA per isoform.
%% Cell type:code id:40fd2798-4dba-48f6-bbd1-fc5758c84835 tags:
``` python
isoforms = []
proteins = []
sequences = []
with open(all_peptides, "r") as fasta_file:
current_protein = None
current_id = ""
for line in tqdm(fasta_file):
if line.startswith(">"):
current_id = line.strip()[1:]
isoform, protein = current_id.split(".")
isoforms.append(isoform[:-2])
proteins.append(protein)
if current_protein is not None:
sequences.append(current_protein)
current_protein = ""
else:
current_protein += line.strip()
sequences.append(current_protein)
```
%% Output
223492it [00:00, 1285673.28it/s]
%% Cell type:code id:2e2fafde-8653-4286-aeb4-cecb55804e4f tags:
``` python
fasta = pd.DataFrame({"isoform": isoforms, "protein": proteins, "seq": sequences})
fasta["has_start"] = fasta["seq"].str.startswith("M")
fasta["has_end"] = fasta["seq"].str[-1].str.startswith("*")
fasta.set_index("isoform", inplace=True)
fasta = fasta[["has_start", "has_end"]]
```
%% Cell type:markdown id:9ef9eeba-0082-4109-ad9b-33df3db9b129 tags:
### 1.2 Multiple sequence alignments
The peptides were aligned against UniRef30 and ColabFoldDB using MMSeqs2. Read the MSAs and extract the number of sequences in each as well as the sequence length and the _Spongilla_ transcript name.
%% Cell type:code id:cc2f89f4-1b24-4b1d-b1ef-b1c40ea99ff4 tags:
``` python
N = len(glob.glob(msas+"*.a3m"))
seq_id = [""] * N
no_seqs = [0] * N
seq_length = [0] * N
gene_name = [""] * N
for i, alignment in enumerate(tqdm(glob.glob(msas+"*.a3m"))):
try:
with open(alignment, "r") as f:
seq_id[i] = alignment.split("/")[-1].split(".")[0]
lines = f.readlines()
no_seqs[i] = (len(lines) - 2) // 2
seq_length[i] = len(lines[1].rstrip()[1:])
gene_name[i] = lines[0].rstrip()[1:]
# print(no_seqs, seq_length, gene_name)
except FileNotFoundError:
continue
```
%% Output
100%|██████████| 41945/41945 [01:22<00:00, 509.76it/s]
%% Cell type:code id:64a7ca10-f99c-402e-a9bf-42826933ae9e tags:
``` python
sequence_info = pd.DataFrame({"query": seq_id, "MSA size": no_seqs, "query length": seq_length, "gene name": gene_name})
sequence_info["protein_id"] = sequence_info["gene name"].str.split(".").str[1]
sequence_info["isoform"] = sequence_info["gene name"].str.split(".").str[0].str[:-2]
sequence_info["gene_id"] = sequence_info["gene name"].str.split(".").str[0].str.split("_").str[:2].str.join("_")
sequence_info.set_index("isoform", inplace=True)
```
%% Cell type:markdown id:46c7c3ce-0f76-4728-a188-5c04247d1eef tags:
include the start codon/end codon information from the raw sequences we parsed:
%% Cell type:code id:d57aa70c-9646-4dfb-80b8-2eb3791c32ba tags:
``` python
sequence_info
```
%% Output
query MSA size query length gene name \
isoform
c100000_g1_i1 0 2765 433 c100000_g1_i1_m.41809
c103531_g2_i2 10000 2082 152 c103531_g2_i2_m.66483
c103531_g3_i1 10001 1924 201 c103531_g3_i1_m.66482
c103531_g4_i1 10002 1959 215 c103531_g4_i1_m.66484
c103532_g1_i1 10003 203 288 c103532_g1_i1_m.66485
... ... ... ... ...
c103530_g1_i3 9998 9311 992 c103530_g1_i3_m.66470
c103531_g2_i1 9999 2103 221 c103531_g2_i1_m.66481
c100434_g1_i2 999 15780 124 c100434_g1_i2_m.44014
c100036_g1_i3 99 3 290 c100036_g1_i3_m.42037
c100002_g1_i1 9 1 112 c100002_g1_i1_m.41844
protein_id gene_id
isoform
c100000_g1_i1 41809 c100000_g1
c103531_g2_i2 66483 c103531_g2
c103531_g3_i1 66482 c103531_g3
c103531_g4_i1 66484 c103531_g4
c103532_g1_i1 66485 c103532_g1
... ... ...
c103530_g1_i3 66470 c103530_g1
c103531_g2_i1 66481 c103531_g2
c100434_g1_i2 44014 c100434_g1
c100036_g1_i3 42037 c100036_g1
c100002_g1_i1 41844 c100002_g1
[41945 rows x 6 columns]
%% Cell type:code id:5ae5dfb2-84a6-420f-8ac2-2c79d4f95e16 tags:
``` python
sequence_info = sequence_info.join(fasta)
# this column will hold NaNs later so convert it to Int64, which can deal with nulls.
sequence_info["protein_id"] = sequence_info["protein_id"].astype(int).astype("Int64")
```
%% Cell type:code id:c61347d2-f602-4f76-a656-70528242a4be tags:
``` python
sequence_info.to_csv("../data/sequence_info.csv")
# sequence_info = pd.read_csv("../data/sequence_info.csv")
```
%% Cell type:markdown id:1b716527-0af8-4945-b70f-a11302c11685 tags:
## 2. AlphaFold-predicted structures
We used the MSAs to predict protein structures using ColabFold. Here, we read the per-residue pLDDT score from the best predicted structure per peptide and average it over all residues; then we merge _Spongilla_ isoforms by keeping the best score per gene ID.
%% Cell type:code id:3831f7ff-07b6-42ae-8276-ab899ab6ffa0 tags:
``` python
alphafold.to_csv("../data/structure_predictions.csv")
```
%% Cell type:markdown id:86732d06-f95e-4583-9997-86bb9c2e0a52 tags:
Should take about 30-45mins to complete
%% Cell type:markdown id:9c6410a0-999a-41c5-af8c-3c30abbb1654 tags:
## 3. FoldSeek predictions
We searched against AlphaFoldDB (predicted), PDB (crystal), and SwissProt (predicted) protein structures to detect structural similarity to our predicted _Spongilla_ peptide structures using FoldSeek. Here we read the FoldSeek output. In order to obtain useful phylogenetic information as well as functional annotation, we will translate all structural hits to their UniProt IDs and then query the EggNOG database and UniProt for annotation.
### 3.1 Reading results
The FoldSeek results come in BLAST format. We will read them and convert the query (FoldSeek's internal unique ID for each input structure) from string to integer.
%% Cell type:code id:f79aa33d-50b9-4cf0-876a-001a63db761a tags:
``` python
!sbatch ../scripts/io_read_pdb.sh
!sbatch ../scripts/io_read_afdb.sh
!sbatch ../scripts/io_read_swp.sh
```
%% Cell type:markdown id:1c3f8ded-6be0-4749-ad85-da5d65be9be6 tags:
should take about 3-5h to complete.
%% Cell type:code id:563af56b-7a1b-4c3a-be9c-8c79fe684c01 tags:
``` python
plddt = np.load('../data/spongilla_plddt.npz')
sequence_info = pd.read_csv("../data/sequence_info.csv")
query_to_isoform = sequence_info[['query', 'isoform']].set_index('query').to_dict()['isoform']
```
%% Cell type:code id:9496151e-6ac4-40ed-a5c1-e3fd1e84fe05 tags:
``` python
pdb['aligned_plddt'] = get_aligned_plddt(pdb, plddt, query_to_isoform)
pdb.to_parquet('../data/pdb_tmp.parquet')
```
%% Output
1470483it [27:00, 907.18it/s]
%% Cell type:code id:85f5b721-880c-400a-a330-37a35503a740 tags:
``` python
print("done")
```
%% Cell type:markdown id:c2343e57-f527-4634-87a9-cbd2bc8aee5f tags:
For PDB the situation is more complicated. We will take the PDB IDs and translate them to UniProt accession numbers using the UniProt API. This will return an inflated list of IDs since some PDB entries contain complexes.
%% Cell type:code id:76ff0420-e724-4152-abbc-29d25bfd3f59 tags:
``` python
pdb = enrich_from_uniprot(pdb, "target", "uniprot", uniprot_from="PDB_ID", uniprot_to="ACC")
```
%% Cell type:code id:7262d562-3c2d-4a7a-9361-1bd197d30a7c tags:
``` python
unique_up_id = pd.concat([pdb['uniprot'].drop_duplicates(),
afdb['uniprot'].drop_duplicates(),
swp['uniprot'].drop_duplicates()])
unique_up_id.drop_duplicates(inplace=True)
```
%% Cell type:code id:df0fdae3-52e6-4e7a-87e8-c81401278b09 tags:
``` python
unique_up_id.to_csv('../data/unique_uniprot_ids.csv', header=False, index=False)
```
%% Cell type:code id:ec3015c3-316e-48fe-a08a-4fd4323cb9ea tags:
``` python
with open(f'../data/foldseek_unique_ids.txt', 'w') as tmp:
ids_as_str = unique_up_id.astype(str)[0]
tmp.write(','.join(ids_as_str))
```
%% Cell type:code id:4650ac1f-4d11-4590-b39f-2a16468d808e tags:
``` python
!python /g/arendt/npapadop/repos/UPIMAPI/upimapi.py -i ../data/foldseek_unique_ids.txt -o ../data/ --fasta --no-annotation
```
%% Cell type:code id:5b403921-8514-431a-a34d-f92fbf034ab8 tags:
``` python
%%bash
mkdir ../data/uniprot_fastas/ -p
perl ../scripts/fasta-splitter.pl --part-size 100000 ../data/uniprotinfo.fasta --nopad --measure count --out-dir ../data/uniprot_fastas/
```
%% Output
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "UTF-8",
LC_TERMINAL = "iTerm2",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
../data/uniprotinfo.fasta: 7206 sequences, 4638410 bp => dividing into 1 part of <= 100000 sequences
All done, 0 seconds elapsed
%% Cell type:markdown id:cc1e0086-8c3f-4881-938e-0270a9f9e53e tags:
These were manually submitted to emapper, and the results (the `out.emapper.annotations` tsv file of each run) downloaded and saved with an `eggnog` ending.
Let's process the retrieved emapper results:
%% Cell type:code id:22630619-3006-425f-bace-980499b61afd tags:
``` python
uniprot_eggnog = []
for file in glob.glob('../data/uniprot_fastas/*.eggnog'):
tmp = pd.read_csv(file, sep='\t', skiprows=4, skipfooter=3, engine='python')
uniprot_eggnog.append(tmp)
```
%% Cell type:markdown id:5b7287bb-7467-436b-8e9b-ff59e2d721b7 tags:
We can concatenate the files, remove unnecessary columns and convert string columns to categorical to save even more space, before saving as a parquet file.
%% Cell type:code id:65cd9d5e-fdac-4a14-94f7-787fdba690fa tags:
``` python
uniprot_annotation = pd.concat(uniprot_eggnog)
uniprot_annotation['uniprot'] = uniprot_annotation['#query'].str.split('|').str[1]
uniprot_annotation.reset_index(drop=True, inplace=True)
# remove unnecessary columns
dead_weight = ['#query', 'seed_ortholog', 'EC', 'KEGG_ko', 'KEGG_Pathway',
'KEGG_Module', 'KEGG_Reaction', 'KEGG_rclass', 'BRITE',
'KEGG_TC', 'CAZy', 'BiGG_Reaction']
uniprot_annotation.drop(dead_weight, axis=1, inplace=True)
# convert rest to categorical to save space
to_categorical = ['uniprot', 'eggNOG_OGs', 'max_annot_lvl', 'COG_category',
'Description', 'Preferred_name', 'GOs', 'PFAMs']
uniprot_annotation[to_categorical] = uniprot_annotation[to_categorical].astype("category")
# finally save in parquet format
uniprot_annotation.to_parquet('../data/uniprot_annotation.parquet')
```
%% Cell type:code id:500fd12f-5e1e-48ef-99c1-3db0424d7cb6 tags:
``` python
uniprot_annotation = pd.read_parquet('../data/uniprot_annotation.parquet')
```
%% Cell type:markdown id:8a2910c3-dbd5-4268-8eed-eef8c9796ca5 tags:
### 3.4 Annotating UniProt sequences - UniProt
We will also query UniProt to get useful gene names, protein names, functional annotation, and taxonomic information. This info is also included in Emapper, but here we get it in a more accessible form.
%% Cell type:code id:c7d97917-53c6-4b51-98a5-97a55100aa37 tags:
``` python
!python /g/arendt/npapadop/repos/UPIMAPI/upimapi.py -i ../data/foldseek_unique_ids.txt -o ../data/ --no-annotation
```
%% Cell type:markdown id:abcfc14c-c236-47c1-ad7a-1bba7244e46b tags:
The `csv` file really is huge though so we'll read it and convert to parquet format:
%% Cell type:code id:87957304-19a6-451d-aa82-9ece85cc90de tags:
``` python
foldseek_full = pd.read_csv('../data/uniprotinfo.tsv', sep='\t')
foldseek_full = foldseek_full[['Entry', 'Entry name', 'Gene names', 'Function [CC]', 'Taxonomic lineage (PHYLUM)']]
foldseek_full.columns = ['uniprot', 'Entry name', 'Gene names', 'Function [CC]', 'Taxonomic lineage (PHYLUM)']
foldseek_full.to_parquet('../data/fs_targets.parquet')
```
%% Cell type:code id:09465de2-c112-4de3-b487-6a27f8cf0504 tags:
``` python
foldseek_full = pd.read_parquet('../data/fs_targets.parquet')
```
%% Cell type:markdown id:86ebcf92-6b4f-4b4c-89dc-f3efc0dfb6b0 tags:
Add the UniProt and EggNOG annotation to the FoldSeek results:
%% Cell type:code id:019667f8-47b2-4959-abe8-38e5529b4e86 tags:
``` python
pdb = pdb.merge(foldseek_full, on='uniprot', how='left').merge(uniprot_annotation, on='uniprot', how='left')
afdb = afdb.merge(foldseek_full, on='uniprot', how='left').merge(uniprot_annotation, on='uniprot', how='left')
swp = swp.merge(foldseek_full, on='uniprot', how='left').merge(uniprot_annotation, on='uniprot', how='left')
```
%% Cell type:markdown id:bf6217eb-a982-425e-b68f-2ec57972e864 tags:
Convert non-numeric columns to categorical; many annotations will be repeated (explore by using `.value_counts()` on any of these columns), therefore we will gain a lot of space by converting away from `object`.
%% Cell type:code id:85f48253-2176-4b3b-ae78-e18dc8156d6f tags:
``` python
to_categorical = ['uniprot', 'eggNOG_OGs', 'max_annot_lvl', 'COG_category', 'target',
'Description', 'Preferred_name', 'GOs', 'PFAMs', 'Entry name',
'Gene names', 'Function [CC]', 'Taxonomic lineage (PHYLUM)']
pdb[to_categorical] = pdb[to_categorical].astype("category")
afdb[to_categorical] = afdb[to_categorical].astype("category")
swp[to_categorical] = swp[to_categorical].astype("category")
```
%% Cell type:markdown id:6769b2e4-6f33-4c88-9519-3426690c5339 tags:
## 4. Structural annotation summary
The structural annotation pipeline can be summarised by keeping the best-scoring annotation for each _Spongilla_ peptide. We will do this by first dropping unannotated entries from the target DB tables, and then keeping the best bit score per query peptide. Finally, we will concatenate the target DB tables and keep the best-scoring annotation for each query peptide. These tables are too big and unwieldy to save anyway.
%% Cell type:code id:fe4b9101-4312-440b-a437-1dd701e02129 tags:
``` python
def remove_unannotated(df):
no_desc = df['Description'] == "-"
no_func = df['Function [CC]'].isnull()
no_annot = no_desc & no_func
return df[~no_annot]
def best_bit_score(df, sort_by='bit score', tiebreak='alignment length'):
have_max = df[sort_by] == np.max(df[sort_by])
max_ali = df[have_max][tiebreak] == np.max(df[have_max][tiebreak])
return df[have_max][max_ali].index.values[0]
def keep_best_annotated(df):
df = remove_unannotated(df)
idx = df.groupby('query').apply(best_bit_score)
res = df.loc[idx].copy()
return res
def tag_origin(df, prefix):
df.columns = prefix + " " + df.columns
return df
```
%% Cell type:code id:9bbd5e11-77c9-4657-b0e8-d6d3e7055d3a tags:
``` python
pdb_best = keep_best_annotated(pdb)
pdb_best['origin'] = 'PDB'
afdb_best = keep_best_annotated(afdb)
afdb_best['origin'] = "AFDB"
swp_best = keep_best_annotated(swp)
swp_best['origin'] = 'SwissProt'
```
%% Cell type:markdown id:5c879807-6e82-405c-9bb5-33bcff5bdae6 tags:
Prune matrices to make them smaller:
%% Cell type:code id:542ca93a-9f73-45ca-8c76-48152dae9a58 tags:
``` python
dead_weight = ['no. mismatches', 'no. gap open', 'query start', 'query end',
'target start', 'target end', 'target']
pdb_best.drop(dead_weight, axis=1, inplace=True)
afdb_best.drop(dead_weight, axis=1, inplace=True)
swp_best.drop(dead_weight, axis=1, inplace=True)
```
%% Cell type:code id:7bb9638e-06d3-42a4-bb74-9220c0cfe886 tags:
``` python
pdb_best.to_parquet('../data/fs_best_pdb.parquet')
afdb_best.to_parquet('../data/fs_best_afdb.parquet')
swp_best.to_parquet('../data/fs_best_swp.parquet')
```
%% Cell type:code id:b52b7268-3247-4884-84cb-5fa3d452878a tags:
``` python
best = pd.concat([pdb_best, afdb_best, swp_best])
best = best.sort_values('corrected bit score', ascending=False).drop_duplicates(['query'])
best.set_index('query', inplace=True)
```
%% Cell type:code id:d85514be-1d56-4837-801d-a3bc24a8e889 tags:
``` python
sequence_info = pd.read_csv("../data/sequence_info.csv")
alphafold = pd.read_csv("../data/structure_predictions.csv", index_col=0)
```
%% Cell type:markdown id:454754cc-d44b-4f5d-bc0b-5c29a2ef4e44 tags:
Add the sequence information and the AlphaFold plddt score to complete the table, then save in parquet format.
%% Cell type:code id:c9e7e73d-1670-4cf4-b6b6-f4f9cebc928a tags:
``` python
structural_annotation = sequence_info.join(best).join(alphafold)
```
%% Cell type:code id:21d1d18c-5de1-43e8-83f3-04d4bf07eaa2 tags:
``` python
structural_annotation.to_parquet('../data/structure_annotation.parquet')
```
%% Cell type:markdown id:2a395280-bf0f-446d-9f48-2d870d022586 tags:
## 5. Sequence-based annotation
We used EggNOG mapper to annotate the _Spongilla_ predicted proteome. The resulting `out.emapper.annotations` tsv was downloaded from the website and is processed here to a more usable state.
%% Cell type:code id:c1bd80a1-f155-4330-8894-fcca687f48a5 tags:
``` python
eggnog = pd.read_csv('../data/MM_ipw26ffh.emapper.annotations.tsv', sep='\t', skiprows=4, skipfooter=3, engine='python')
eggnog["gene_id"] = eggnog["#query"].str.split("_").str[:2].str.join("_")
eggnog["protein_id"] = eggnog["#query"].str.split(".").str[1]
eggnog.drop(['#query', 'seed_ortholog', 'EC', 'KEGG_ko', 'KEGG_Pathway',
'KEGG_Module', 'KEGG_Reaction', 'KEGG_rclass', 'BRITE', 'KEGG_TC',
'CAZy', 'BiGG_Reaction'], axis=1, inplace=True)
```
%% Cell type:code id:6c7e43fe-2951-4104-b15b-129286601a4d tags:
``` python
eggnog.to_csv('../data/Slacustris_eggnog.tsv', sep='\t', index=False)
```
%% Cell type:code id:f46b7ad3-99c8-487b-a3c1-29aee10b9775 tags:
``` python
```
......
%% Cell type:code id:e951dd81-8a76-45fb-a00d-8de54697bf0f tags:
%% Cell type:code id:a121491b tags:
``` python
from datetime import datetime, timezone
import pytz
utc_dt = datetime.now(timezone.utc) # UTC time
dt = utc_dt.astimezone()
tz = pytz.timezone('Europe/Berlin')
berlin_now = datetime.now(tz)
print(f'{berlin_now:%Y-%m-%d %H:%M}')
```
%% Output
2022-06-08 15:01
2022-07-28 08:02
%% Cell type:markdown id:84d83807-5bdb-4fb7-8bcc-069c17f984a4 tags:
%% Cell type:markdown id:308f25e4 tags:
In this script I will prepare the metadata for deposition in [ModelArchive](https://www.modelarchive.org). I will use the lookup table produced by the CoFFE pipeline and concatenate all the possibly relevant information into a single column (description).
%% Cell type:code id:3bebc329-4921-4890-baac-5f3718a687d9 tags:
%% Cell type:code id:9eb7d317 tags:
``` python
import pandas as pd
```
%% Cell type:code id:38b0ec79-e498-4ac0-ae63-9b1ce7ff44a1 tags:
%% Cell type:code id:d7d72594 tags:
``` python
sequence = pd.read_csv('../data/Slacustris_eggnog.tsv', sep='\t')
structure = pd.read_parquet('../data/structure_annotation.parquet')
sequence = pd.read_csv('../old_data/Slacustris_eggnog.tsv', sep='\t')
structure = pd.read_parquet('../old_data/structure_annotation.parquet')
```
%% Cell type:code id:a51fa323-ff59-4acf-abaa-6ddddfc8de5c tags:
%% Output
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_136/1368120052.py in <module>
----> 1 sequence = pd.read_csv('../ol_data/Slacustris_eggnog.tsv', sep='\t')
2 structure = pd.read_parquet('../old_data/structure_annotation.parquet')
/opt/conda/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
584 kwds.update(kwds_defaults)
585
--> 586 return _read(filepath_or_buffer, kwds)
587
588
/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
480
481 # Create the parser.
--> 482 parser = TextFileReader(filepath_or_buffer, **kwds)
483
484 if chunksize or iterator:
/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
809 self.options["has_index_names"] = kwds["has_index_names"]
810
--> 811 self._engine = self._make_engine(self.engine)
812
813 def close(self):
/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py in _make_engine(self, engine)
1038 )
1039 # error: Too many arguments for "ParserBase"
-> 1040 return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
1041
1042 def _failover_to_python(self):
/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py in __init__(self, src, **kwds)
49
50 # open handles
---> 51 self._open_handles(src, kwds)
52 assert self.handles is not None
53
/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py in _open_handles(self, src, kwds)
220 Let the readers open IOHandles after they are done with their potential raises.
221 """
--> 222 self.handles = get_handle(
223 src,
224 "r",
/opt/conda/lib/python3.9/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
699 if ioargs.encoding and "b" not in ioargs.mode:
700 # Encoding
--> 701 handle = open(
702 handle,
703 ioargs.mode,
FileNotFoundError: [Errno 2] No such file or directory: '../ol_data/Slacustris_eggnog.tsv'
%% Cell type:code id:6532f2fc tags:
``` python
merged = structure.join(sequence, on='protein_id', lsuffix='_str', rsuffix='_seq').fillna('-')
```
%% Cell type:code id:fa874500-61c0-4410-ae55-b081e0912670 tags:
%% Cell type:code id:4a137f30 tags:
``` python
def generate_description(row):
desc = f"Emapper preferred name [EggNOG]: {row['Preferred_name_seq']}\n" + \
f"Emapper description [EggNOG]: {row['Description_seq']}\n" + \
desc = f"Preferred name [EggNOG]: {row['Preferred_name_seq']}\n" + \
f"Description [EggNOG]: {row['Description_seq']}\n" + \
f"CoFFE preferred name [EggNOG]: {row['Preferred_name_str']}\n" + \
f"CoFFE description [EggNOG]: {row['Description_str']}\n" + \
f"CoFFE UniProt function: {row['Function [CC]']}\n\n" + \
f"isoform ID: {row['isoform']}\n" + \
f"peptide length: {row['query length']}\n" + \
f"input MSA size [MMseqs2]: {row['MSA size']}\n" + \
f"gene ID: {row['gene_id_str']}\n" + \
f"protein ID: {row['protein_id_str']}\n" + \
f"mRNA has start codon [TransDecoder]: {row['has_start']}\n" + \
f"mRNA has stop codon [TransDecoder]: {row['has_end']}\n\n" + \
f"average pLDDT [ColabFold]: {row['plddt']}\n\n" + \
f"best structural hit [FoldSeek]: {row['target']}\n" + \
f"structural state %identity [FoldSeek]: {row['seq. id.']}\n" + \
f"bit score [FoldSeek]: {row['bit score']}\n\n" + \
f"UniProt ID of best structural hit (CoFFE): {row['uniprot']}\n" + \
f"CoFFE Emapper bit score [EggNOG]: {row['score_str']}\n" + \
f"CoFFE orthogroups [EggNOG]: {row['eggNOG_OGs_str']}\n" + \
f"CoFFE maximum annotation level [EggNOG]: {row['max_annot_lvl_str']}\n" + \
f"CoFFE GO terms [EggNOG]: {row['GOs_str']}\n" + \
# f"CoFFE GO terms [EggNOG]: {row['GOs_str']}\n" + \
f"CoFFE PFAM domains [EggNOG]: {row['PFAMs_str']}\n\n" + \
f"best sequence hit bit score [Emapper]: {row['score_seq']}\n" + \
f"Emapper orthogroups [EggNOG]: {row['eggNOG_OGs_seq']}\n" + \
f"Emapper maximum annotation level [EggNOG]: {row['max_annot_lvl_seq']}\n" + \
f"Emapper GO terms [EggNOG]: {row['GOs_seq']}\n" + \
f"Emapper PFAM domains [EggNOG]: {row['PFAMs_seq']}\n"
f"Orthogroups [EggNOG]: {row['eggNOG_OGs_seq']}\n" + \
f"Maximum annotation level [EggNOG]: {row['max_annot_lvl_seq']}\n" + \
# f"Emapper GO terms [EggNOG]: {row['GOs_seq']}\n" + \
f"PFAM domains [EggNOG]: {row['PFAMs_seq']}\n"
return desc
```
%% Cell type:markdown id:2afdd7a7-1131-415d-aaa6-337220672d4d tags:
%% Cell type:markdown id:6712bc16 tags:
Let's test it first:
%% Cell type:code id:63193db9-9f82-4126-80e3-ec76033ad1c6 tags:
%% Cell type:code id:935a770c tags:
``` python
print(generate_description(merged.loc[10001]))
```
%% Output
Emapper preferred name [EggNOG]: -
Emapper description [EggNOG]: -
CoFFE preferred name [EggNOG]: PLA2G15
CoFFE description [EggNOG]: Lecithin:cholesterol acyltransferase
CoFFE UniProt function: -
isoform ID: c103531_g3_i1
peptide length: 202
input MSA size [MMseqs2]: 1924.0
gene ID: c103531_g3
protein ID: 66482
mRNA has start codon [TransDecoder]: False
mRNA has stop codon [TransDecoder]: True
average pLDDT [ColabFold]: 88.31183168316831
best structural hit [FoldSeek]: AF-A0A077YYC2-F1
structural state %identity [FoldSeek]: 0.295
bit score [FoldSeek]: 659.0
UniProt ID of best structural hit (CoFFE): A0A077YYC2
CoFFE Emapper bit score [EggNOG]: 526.0
CoFFE orthogroups [EggNOG]: KOG2369@1|root,KOG3236@1|root,KOG2369@2759|Eukaryota,KOG3236@2759|Eukaryota,38DE3@33154|Opisthokonta,3BDIK@33208|Metazoa,3CTGA@33213|Bilateria,40D2W@6231|Nematoda
CoFFE maximum annotation level [EggNOG]: 33208|Metazoa
CoFFE GO terms [EggNOG]: GO:0000323,GO:0003674,GO:0003824,GO:0004620,GO:0004622,GO:0004623,GO:0005488,GO:0005543,GO:0005575,GO:0005576,GO:0005615,GO:0005622,GO:0005623,GO:0005634,GO:0005654,GO:0005737,GO:0005764,GO:0005773,GO:0006082,GO:0006629,GO:0006631,GO:0006643,GO:0006644,GO:0006650,GO:0006665,GO:0006672,GO:0006793,GO:0006796,GO:0006807,GO:0008150,GO:0008152,GO:0008289,GO:0008374,GO:0009056,GO:0009062,GO:0009395,GO:0009987,GO:0016042,GO:0016054,GO:0016298,GO:0016740,GO:0016746,GO:0016747,GO:0016787,GO:0016788,GO:0019637,GO:0019752,GO:0031974,GO:0031981,GO:0032787,GO:0034638,GO:0034641,GO:0043167,GO:0043168,GO:0043226,GO:0043227,GO:0043229,GO:0043231,GO:0043233,GO:0043436,GO:0043603,GO:0044237,GO:0044238,GO:0044242,GO:0044248,GO:0044255,GO:0044281,GO:0044282,GO:0044421,GO:0044422,GO:0044424,GO:0044428,GO:0044444,GO:0044446,GO:0044464,GO:0046337,GO:0046338,GO:0046395,GO:0046434,GO:0046470,GO:0046475,GO:0046486,GO:0046503,GO:0047499,GO:0052689,GO:0070013,GO:0071704,GO:0072329,GO:0097164,GO:1901564,GO:1901565,GO:1901575
CoFFE PFAM domains [EggNOG]: LCAT
best sequence hit bit score [Emapper]: -
Emapper orthogroups [EggNOG]: -
Emapper maximum annotation level [EggNOG]: -
Emapper GO terms [EggNOG]: -
Emapper PFAM domains [EggNOG]: -
%% Cell type:markdown id:8097273d-2713-4597-9c94-7feb6f6280f9 tags:
%% Cell type:markdown id:6c9e5b53 tags:
Looks good! Now let's apply it to our table:
%% Cell type:code id:43e594b0-0a9b-4e37-b21a-a8a6cd8eda3c tags:
%% Cell type:code id:1c5133d7 tags:
``` python
merged['description'] = merged.apply(generate_description, axis=1)
```
%% Cell type:markdown id:9c16526c-9a46-4abd-ac8f-25da62d23563 tags:
%% Cell type:markdown id:34028222 tags:
Subset the table, rename columns, and save:
%% Cell type:code id:97598d9d-34f8-4644-b1cd-a5be8d2c45dd tags:
%% Cell type:code id:fa278548 tags:
``` python
merged = merged[['isoform', 'description']]
merged.columns = ['title', 'description']
merged.reset_index(drop=True, inplace=True) # no need to keep track of the query ID
```
%% Cell type:code id:3463393a-74ff-4239-bbee-8f240dc1902f tags:
%% Cell type:code id:1f718df1 tags:
``` python
merged.to_csv('/g/arendt/npapadop/data/spongfold_publish/model_archive_metadata.csv')
```
......
# Read/write operations for result analysis
The directory contains all scripts to rerun the processing of ColabFold and Foldseek results. The
`python` files contain the code for processing; the corresponding `.sh` scripts are used to submit
to the SLURM cluster.
### io_parse_structures.py
Will go through the predicted best models and extract the pLDDT score for each residue. It saves the
per-residue-pLDDT for each protein in a dictionary (`npz` file) and the average pLDDT in a CSV
table.
### io_read_afdb.py
Processes the Foldseek predictions against AlphaFold DB. Will return a table that lists all
isoforms, their Foldseek scores, the average pLDDT of the aligned residues, and the UniProt ID of
the target protein.
### io_read_swp.py
Processes the Foldseek predictions against SwissProt. Will return a table that lists all isoforms,
their Foldseek scores, the average pLDDT of the aligned residues, and the UniProt ID of the target
protein.
### io_read_pdb.py
Processes the Foldseek predictions against PDB. Will return a table that lists all isoforms, their
Foldseek scores, the average pLDDT of the aligned residues, and the UniProt ID of the target
protein. The UniProt ID is fetched by querying UniProt (UniProt API).
\ No newline at end of file
......@@ -66,17 +66,9 @@ def enrich_from_uniprot(df, column_from, column_to, uniprot_from="PDB_ID", unipr
df_map = create_id_map(response, column_from, column_to)
return df.join(df_map, on=column_from)
# MSA location
msas = '/g/arendt/npapadop/repos/coffe/data/msa/'
# FOLDSEEK scores
fs_pdb = '/g/arendt/npapadop/repos/coffe/data/pdb_score.tsv'
fs_afdb = '/g/arendt/npapadop/repos/coffe/data/afdb_score.tsv'
fs_swp = '/g/arendt/npapadop/repos/coffe/data/swissprot_score.tsv'
# AlphaFold predictions
structure_list = '/g/arendt/npapadop/repos/coffe/data/named_models/'
metadata = '/g/arendt/npapadop/repos/coffe/data/named_metadata/'
# input fasta
all_peptides = '/g/arendt/npapadop/repos/coffe/data/Spongilla_lacustris_translated_proteome.fasta'
plddt = np.load('/g/arendt/npapadop/repos/coffe/data/spongilla_plddt.npz')
......@@ -88,4 +80,4 @@ afdb["query"] = afdb["query"].values.astype(int)
afdb['aligned_plddt'] = get_aligned_plddt(afdb, plddt, query_to_isoform)
afdb["uniprot"] = afdb["target"].str.split("-").str[1]
afdb_best.to_parquet('/g/arendt/npapadop/repos/coffe/data/afdb_tmp.parquet')
\ No newline at end of file
afdb.to_parquet('/g/arendt/npapadop/repos/coffe/data/afdb_tmp.parquet')
\ No newline at end of file
......@@ -2,12 +2,12 @@
#SBATCH -J coffe_read_afdb
#SBATCH -t 5:00:00
#SBATCH -c 1
#SBATCH --mem=16G
#SBATCH --mem=32G
#SBATCH -o /g/arendt/npapadop/cluster/io_read_afdb.out
#SBATCH -e /g/arendt/npapadop/cluster/io_read_afdb.err
module load Anaconda3
source ~/.bash_profile
conda activate /g/arendt/npapadop/repos/condas/milot
conda activate /g/arendt/npapadop/repos/condas/CoFFE
python /g/arendt/npapadop/repos/coffe/scripts/io_read_afdb.py
\ No newline at end of file
python /g/arendt/npapadop/repos/coffe/scripts/results_processing/io_read_afdb.py
\ No newline at end of file
import os
import json
from tqdm import tqdm
from urllib import parse, request
import numpy as np
import pandas as pd
......@@ -32,7 +33,7 @@ def get_aligned_plddt(df, plddt, name_dict):
def get_from_uniprot(df, column, uniprot_from="ACC", uniprot_to="EGGNOG_ID", request_size=40000):
ids = df[column][~df[column].isnull()]
url = 'https://www.uniprot.org/uploadlists/'
url = 'https://rest.uniprot.org/'
params = {
'from': uniprot_from,
......@@ -41,10 +42,10 @@ def get_from_uniprot(df, column, uniprot_from="ACC", uniprot_to="EGGNOG_ID", req
'query': " ".join(ids.unique())
}
data = urllib.parse.urlencode(params)
data = parse.urlencode(params)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as f:
req = request.Request(url, data)
with request.urlopen(req) as f:
response = f.read()
return response
......@@ -66,18 +67,9 @@ def enrich_from_uniprot(df, column_from, column_to, uniprot_from="PDB_ID", unipr
df_map = create_id_map(response, column_from, column_to)
return df.join(df_map, on=column_from)
# MSA location
msas = '/g/arendt/npapadop/repos/coffe/data/msa/'
# FOLDSEEK scores
fs_pdb = '/g/arendt/npapadop/repos/coffe/data/pdb_score.tsv'
fs_afdb = '/g/arendt/npapadop/repos/coffe/data/afdb_score.tsv'
fs_swp = '/g/arendt/npapadop/repos/coffe/data/swissprot_score.tsv'
# AlphaFold predictions
structure_list = '/g/arendt/npapadop/repos/coffe/data/named_models/'
metadata = '/g/arendt/npapadop/repos/coffe/data/named_metadata/'
# input fasta
all_peptides = '/g/arendt/npapadop/repos/coffe/data/Spongilla_lacustris_translated_proteome.fasta'
# per-residue pLDDT score
plddt = np.load('/g/arendt/npapadop/repos/coffe/data/spongilla_plddt.npz')
sequence_info = pd.read_csv("/g/arendt/npapadop/repos/coffe/data/sequence_info.csv")
......
......@@ -8,6 +8,6 @@
module load Anaconda3
source ~/.bash_profile
conda activate /g/arendt/npapadop/repos/condas/milot
conda activate /g/arendt/npapadop/repos/condas/CoFFE
python /g/arendt/npapadop/repos/coffe/scripts/io_read_pdb.py
\ No newline at end of file
python /g/arendt/npapadop/repos/coffe/scripts/results_processing/io_read_pdb.py
\ No newline at end of file
......@@ -88,4 +88,4 @@ swp["query"] = swp["query"].values.astype(int)
swp['aligned_plddt'] = get_aligned_plddt(swp, plddt, query_to_isoform)
swp["uniprot"] = swp["target"].str.split("-").str[1]
swp_best.to_parquet('/g/arendt/npapadop/repos/coffe/data/swp_tmp.parquet')
\ No newline at end of file
swp.to_parquet('/g/arendt/npapadop/repos/coffe/data/swp_tmp.parquet')
\ No newline at end of file
......@@ -8,6 +8,6 @@
module load Anaconda3
source ~/.bash_profile
conda activate /g/arendt/npapadop/repos/condas/milot
conda activate /g/arendt/npapadop/repos/condas/CoFFE
python /g/arendt/npapadop/repos/coffe/scripts/io_read_swp.py
\ No newline at end of file
python /g/arendt/npapadop/repos/coffe/scripts/results_processing/io_read_swp.py
\ No newline at end of file
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment