


Scripts usage
The scripts are listed here in the order in which they are meant to be run:

Get sequence databases and obtain multiple sequence alignments for Spongilla.


databases.sh: download and build indices for the sequence databases UniRef30 and ColabFold
EnvDB. Modified from ColabFold.

databases_pdb.sh: download the PDB sequence database and build its index.

align.sh: build multiple sequence alignments for the Spongilla proteome. A wrapper around
colabfold_search.
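
As a rough illustration of what a wrapper such as align.sh might reduce to, the sketch below assembles a colabfold_search call (input FASTA, database folder, output folder). The paths spongilla_proteome.fasta, ./databases and ./msas are placeholders, not the ones used in this repository, and the run helper is hypothetical. By default the command is only printed; set DRY_RUN=0 to execute it.

```shell
#!/usr/bin/env bash
# Sketch only: paths and the run helper are assumptions, not taken from align.sh.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build MSAs for every sequence in a FASTA file.
# $1: input FASTA, $2: folder holding the indexed databases, $3: output folder
build_msas() {
  run colabfold_search "$1" "$2" "$3"
}

build_msas spongilla_proteome.fasta ./databases ./msas
```

The database folder is expected to be the one populated by databases.sh and databases_pdb.sh.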


Predict structures for Spongilla


fasta-splitter.pl: written by Kirill Kryukov. A utility that partitions FASTA files into pieces.
Used here with --n-parts 32 to split the Spongilla proteome into 32 batches.

predict_structures.sh: a wrapper around colabfold_batch to submit jobs to the EMBL cluster.

submit_colab.sh: a simple for loop to handle all 32 batches.
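
The split-then-loop pattern described above can be sketched as follows. This is a minimal dry-run illustration, not the contents of submit_colab.sh: the input file name, the zero-padded batch label, and the way predict_structures.sh receives its batch are all assumptions. Only the --n-parts 32 option comes from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: file names and the batch argument are hypothetical.
DRY_RUN="${DRY_RUN:-1}"
N_PARTS=32

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Split the proteome into N_PARTS pieces.
split_proteome() {
  run perl fasta-splitter.pl --n-parts "$N_PARTS" "$1"
}

# Launch one prediction job per batch, labelled 01..32.
submit_all() {
  for i in $(seq 1 "$N_PARTS"); do
    part="$(printf '%02d' "$i")"
    run ./predict_structures.sh "$part"
  done
}

split_proteome spongilla_proteome.fasta
submit_all
```

Keeping the batch count in a single variable means the split and the submission loop cannot drift apart if the number of parts changes.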


Use FoldSeek to search against available structures


fs_query.sh: build a FoldSeek structural database from the predicted Spongilla structures.

fs_afdb.sh, fs_pdb.sh, fs_sp.sh: download the precomputed FoldSeek databases for AFDB, PDB, and
SwissProt, respectively. Kept as three separate scripts so that the downloads can run in parallel.

fs_search_afdb.sh, fs_search_pdb.sh, fs_search_swissprot.sh: search the Spongilla
structure database against AFDB, PDB, and SwissProt, respectively.
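
For orientation, the three FoldSeek steps (build the query database, fetch a target database, search) might look roughly like the dry-run sketch below. It uses the standard foldseek subcommands createdb, databases, search and convertalis. The directory and database names, the tmp folder, and the choice of Alphafold/UniProt as the AFDB variant are assumptions and may differ from what the fs_*.sh scripts actually use.

```shell
#!/usr/bin/env bash
# Sketch only: names, paths and the AFDB variant are assumptions.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1: folder of predicted structures, $2: name of the query database
build_query_db() {
  run foldseek createdb "$1" "$2"
}

# $1: FoldSeek database identifier, $2: local database name
fetch_target_db() {
  run foldseek databases "$1" "$2" tmp
}

# $1: query database, $2: target database, $3: result name
search_db() {
  run foldseek search "$1" "$2" "$3" tmp
  run foldseek convertalis "$1" "$2" "$3" "$3.m8"
}

build_query_db ./predicted_structures spongilla
fetch_target_db Alphafold/UniProt afdb
fetch_target_db PDB pdb
fetch_target_db Alphafold/Swiss-Prot swissprot
for target in afdb pdb swissprot; do
  search_db spongilla "$target" "spongilla_vs_${target}"
done
```

Each target is independent of the others, which is why the downloads and the searches can be submitted as separate jobs.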