Commit c188b947 authored by Martin Larralde's avatar Martin Larralde
Browse files

Fix rendering of iterative search example [ci skip]

parent 04ccd313
Pipeline #32794 skipped
%% Cell type:markdown id:4bf9e5a3-df0f-414b-b0ea-25f8ae4e88cd tags:
# Run an iterative search to build a HMM for rhodopsins
%% Cell type:markdown id:c0c7cc2b-06b4-41af-a0ff-6fdee1e77b98 tags:
This examples shows an example workflow for building an HMM from a seed sequence through an iterative search in a sequence database, similar to the JackHMMER methodology. In the second half it will also introduce extra capabilities of the Python wrapper allowing the dynamic modification of the search parameters.
The goal of this example is to build a HMM specific to [halorhodopsins](https://en.wikipedia.org/wiki/Halorhodopsin), a class of light-activated ion pumps found in Archaea. The difficulty of that task comes from their similarity to [bacteriorhodopsins](https://en.wikipedia.org/wiki/Bacteriorhodopsin), which have a comparable conserved structure, but different activity.
![Classes of rhodopsins](https://els-jbs-prod-cdn.jbs.elsevierhealth.com/cms/attachment/612410/4909711/gr1.jpg)
*Figure from* [Zhang et al. (2011)](https://www.cell.com/fulltext/S0092-8674(11)01502-9#gr1).
In a second time we will then build a broader HMM for any kind of rhodopsin, while introducing how to implement a custom hit selection function. This will allow for the automatic exclusion of false positives coming from other organisms, using an additional filter based on taxonomy.
<div class="alert alert-info">
References
* [Johnson, L.S., Eddy, S.R. and Portugaly, E. *Hidden Markov model speed heuristic and iterative HMM search procedure.* BMC Bioinformatics 11, 431 (2010). doi:10.1186/1471-2105-11-431](https://doi.org/10.1186/1471-2105-11-431).
* [Zhang, Feng, Vierock J., Yizhar O., Fenno L. E., Tsunoda S., Kianianmomeni A., Prigge M., et al. *The Microbial Opsin Family of Optogenetic Tools*. Cell 147, no. 7 (23 December 2011): 1446–57. doi:10.1016/j.cell.2011.12.004](https://doi.org/10.1016/j.cell.2011.12.004).
</div>
%% Cell type:code id:c059c2f0-67eb-4219-b02a-11825f0f3388 tags:
``` python
import pyhmmer
pyhmmer.__version__
```
%% Cell type:markdown id:2e6a030d-6673-496d-9bce-c6d1813e7b1a tags:
## Downloading sequence data
For this example workflow we will be using SwissProt, directly downloading the sequences in FASTA format from the EMBL-EBI data server.
%% Cell type:code id:702cee55-e68a-4a81-8432-7b60c030164f tags:
``` python
import gzip
import urllib.request
alphabet = pyhmmer.easel.Alphabet.amino()
url = "http://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz"
with urllib.request.urlopen(url) as response:
reader = gzip.open(response, "rb")
swissprot = list(pyhmmer.easel.SequenceFile(reader, digital=True, alphabet=alphabet))
```
%% Cell type:markdown id:e61c9fca-d80c-4075-ac83-be78c90b2067 tags:
The SwissProt sequences all contain the source organism and NCBI Taxonomy accession in their FASTA header, and we will need these pieces of information later. For now, we can build an index that maps a sequence name to its source organism, and another index for taxonomy:
%% Cell type:code id:594c1165-8b56-4f11-b06e-21270d29fb5e tags:
``` python
import re
taxonomy_by_name = {}
organism_by_name = {}
description_by_name = {}
for seq in swissprot:
match = re.search(b"(.*) OS=(.*) OX=(\d+)", seq.description)
taxonomy_by_name[seq.name] = int(match.group(3))
organism_by_name[seq.name] = match.group(2)
description_by_name[seq.name] = match.group(1)
```
%% Cell type:markdown id:585238b8-2eb3-4438-8eeb-49bcc8820f98 tags:
## Running a simple search
The sequence we will be using for initial similarity search is the [halorhodopsin](https://www.uniprot.org/uniprot/B0R2U4) from *Halobacterium salinarum ATCC 29341*. First let's extract it from SwissProt:
%% Cell type:code id:1cfcc4c9-8092-4f1a-a0ba-de6c77544209 tags:
``` python
hop = next(x for x in swissprot if b'B0R2U4' in x.name)
```
%% Cell type:markdown id:d0469e2e-0bc3-457e-9abe-24a2948f4366 tags:
Now let's see if we can find close homologs to this one protein without losing the specificity. For this, we first need to create a pipeline and to search for the sequence above:
%% Cell type:code id:6f8cf129-4159-4f1d-b4d0-197c44079afc tags:
``` python
pli = pyhmmer.plan7.Pipeline(alphabet)
hits = pli.search_seq(hop, swissprot)
```
%% Cell type:markdown id:0584dbba-c23a-411f-8741-dbd6e6dc99bd tags:
We can visualize the score distribution for the hits to see what we should consider as the inclusion threshold:
%% Cell type:code id:c66cf473-6b9e-40eb-8ce4-5918e36f394c tags:
``` python
import matplotlib.pyplot as plt
plt.plot([hit.score for hit in hits], "o-")
plt.xlabel("Hit rank")
plt.ylabel("Score (bits)")
plt.show()
```
%% Cell type:markdown id:f5fa632c-b8d0-437f-914d-cb70288d3feb tags:
From this graph we can identify three groups of protein in the hits:
- The two first hits, with very high scores, that are likely very similar orthologs
- A group of more remote orthologs, with a score above 300 bits
- A group of very remote orthologs, under the 300 bits score mark.
If we went to run an iterative search with the default `Pipeline` parameters, all of these hits would be included in the next iteration, and we may lose the specifity. Let's check the organisms included with a bitscore cutoff of 300:
%% Cell type:code id:21cbd587-8aa2-472b-b57a-3a350bb037b1 tags:
``` python
for hit in hits:
if hit.score >= 100.0:
print(f"{hit.name.decode():<20}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():34}\t{organism_by_name[hit.name].decode()}")
```
%% Cell type:markdown id:c4cfac3f-1210-42a2-b3e4-060837ba0d18 tags:
Looks like all of these are indeed Halobacteria, so we should be fine running an iterative search with these. However, two hits under the 300 bits are still halorhodopsins according to UniProt annotations, so to keep them included we should lower the inclusion threshold to 250 bits.
%% Cell type:markdown id:55018b7d-d352-4731-95cb-717da9db1198 tags:
## Running a conservative iterative search
Let's run an iterative search until it converges with a bitscore cutoff of 250, as we selected previously. For this we need to create a new pipeline, and use the `Pipeline.iterate_seq` method with our query and target sequences:
%% Cell type:code id:163f8997-be82-4c3d-8334-7fd433d21d25 tags:
``` python
pli = pyhmmer.plan7.Pipeline(alphabet, incT=250, incdomT=250)
search = pli.iterate_seq(hop, swissprot)
```
%% Cell type:markdown id:775279dd-ab55-4038-9384-e5391453f27c tags:
Now that we have an iterator, we can keep iterating until it converges (in this context, converging means that no new hits are found by the newly created HMM in comparison to the previous one):
%% Cell type:code id:bd2edf8f-d2e1-4502-96dd-e32d4ef3065c tags:
``` python
iterations = []
while not search.converged:
iteration = next(search)
iterations.append(iteration)
print(f"Iteration={iteration.iteration} Hits={len(iteration.hits):2} Included={iteration.hits.hits_included:2} Converged={search.converged}")
```
%% Cell type:markdown id:879297d8-bfed-45ec-a697-68cddbcee5b6 tags:
We can plot the score distribution for each iteration:
%% Cell type:code id:1cc8e48b-ca13-4440-b7c7-6b0d5bb5e361 tags:
``` python
import matplotlib.pyplot as plt
for iteration in iterations:
plt.plot([hit.score for hit in iteration.hits], "o-", label=f"Iteration {iteration.iteration}")
plt.legend()
plt.xlabel("Hit rank")
plt.ylabel("Score (bits)")
plt.show()
```
%% Cell type:markdown id:e032ed20-c629-4d31-bbc5-c27f12fb7228 tags:
With a high cutoff, the search converges quite quickly, but adds 1 new sequence to the ones that were included in the simple search. Let's see which sequences have been included:
%% Cell type:code id:2f3b461c-46af-4ab6-b6f3-c85f5be654a7 tags:
``` python
for hit in iterations[-1].hits:
if hit.is_included():
print(f"{hit.name.decode():<20}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():34}\t{organism_by_name[hit.name][:60].decode()}")
```
%% Cell type:markdown id:2bcc83f3-2de3-4d65-995a-9d469257e276 tags:
The last included hit, the Cruxhalorhodopsin-1 fragment, was definitely under inclusion threshold in the simple search. However, the subsequent iterations helped train a new HMM that was able to rescue it, so that it's now included in the alignment. Let's have a look at the excluded hits, to make sure we are not missing anything:
%% Cell type:code id:e913ba65-55fb-4f00-9163-8bb2aa5b0608 tags:
``` python
for hit in iterations[-1].hits:
if not hit.is_included() and hit.score > 50.0:
print(f"{hit.name.decode():<20}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():34}\t{organism_by_name[hit.name][:60].decode()}")
```
%% Cell type:markdown id:a2bb8f8f-cbfe-4fe4-827e-22384576c432 tags:
It looks like we are not missing any halorhodopsin in our hits, so looks like we managed to build a sensitive HMM.
%% Cell type:markdown id:74982f8a-cddc-414c-b795-f8eabbb20985 tags:
## Running an iterative search with custom selection
The previous iterative search helped us create a HMM only covering halorhodopsins. However, let's try to build a very broad HMM for rhodopsin search in Halobacteria. For this round, we do not care about functional specificity, but we care about taxonomy. In particular, looking at the hits from the last iteration, there are some fungal protein among the archeal ones (the [opsin](https://www.uniprot.org/uniprot/Q9UW81) from *Neurospora crassa*, for instance). To build a HMM specific to archeal proteins we cannot rely on a simple score cutoff, so let's instead build a dedicated function to select hits between each iteration based on taxonomy.
%% Cell type:markdown id:03f4ab7d-7e74-44b9-9250-5d653387b037 tags:
For handling the taxonomy we will be using [`taxopy`](https://github.com/apcamargo/taxopy), a Python library for reading NCBI Taxonomy files. Let's start by loading the files so that we have a taxonomy database at hand:
For handling the taxonomy we will be using [taxopy](https://github.com/apcamargo/taxopy), a Python library for reading NCBI Taxonomy files. Let's start by loading the files so that we have a taxonomy database at hand:
%% Cell type:code id:0de9ea6b-fde6-49c1-91ac-70bcf65d13d0 tags:
``` python
import taxopy
taxdb = taxopy.TaxDb(keep_files=False)
```
%% Cell type:markdown id:7a1670ad-f8c7-4cf4-9601-48c0adfecfc1 tags:
Now let's write a selection function. We still rely on the score inclusion criterion, but we also will want to exclude any hit from an organism that is not in the Halobacteria class.
%% Cell type:code id:ee932652-6d1a-449e-bce1-5a91dc267307 tags:
``` python
def select_halobacteria_hits(top_hits):
for hit in top_hits:
taxid = taxonomy_by_name[hit.name]
taxon = taxopy.Taxon(taxid, taxdb)
if hit.is_included() and taxon.rank_name_dictionary.get('class') != "Halobacteria":
hit.manually_drop()
```
%% Cell type:markdown id:1e4aa39c-9b5a-4839-b1e4-3ccd902be377 tags:
Now run the iterative search with a lower inclusion threshold, and the selection function we just made:
%% Cell type:code id:9dbed2b9-644d-47f1-abbd-c2f96957dd17 tags:
``` python
pli = pyhmmer.plan7.Pipeline(alphabet, incT=100, incdomT=100)
search = pli.iterate_seq(hop, swissprot, select_hits=select_halobacteria_hits)
```
%% Cell type:code id:8d16a70b-539e-475a-9e55-1ae2360cece4 tags:
``` python
iterations = []
while not search.converged:
iteration = next(search)
iterations.append(iteration)
worst_score = min(hit.score for hit in iteration.hits if hit.is_included())
print(f"Iteration={iteration.iteration} Hits={len(iteration.hits):3} Included={iteration.hits.hits_included:3} Converged={search.converged}")
```
%% Cell type:markdown id:8ce15cde-df16-4756-bccc-b79dac34682c tags:
The search converges in 4 iterations. Let's have a look at the score distribution:
%% Cell type:code id:98283f37-fefd-4ec4-8a52-d7be49b17da9 tags:
``` python
import matplotlib.pyplot as plt
for iteration in iterations:
plt.plot([hit.score for hit in iteration.hits], "o-", label=f"Iteration {iteration.iteration}")
plt.legend()
plt.xlabel("Hit rank")
plt.ylabel("Score (bits)")
plt.show()
```
%% Cell type:markdown id:d2d656c7-000b-47c5-af10-cb35fdd83f02 tags:
We can make sure that we are not excluding any rhodopsin from Halobacteria by having a look at the excluded hits from the last iteration:
%% Cell type:code id:8ace1941-c9ec-4c70-8244-832acaf65b4e tags:
``` python
for hit in iterations[-1].hits:
if not hit.is_included():
print(f"{hit.name.decode():<24}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():64}\t{organism_by_name[hit.name][:60].decode()}")
```
%% Cell type:markdown id:dfdc754d-9362-4847-ac07-99038fbf1515 tags:
Notice that although it is above inclusion threshold, the [opsin](https://www.uniprot.org/uniprot/Q9UW81) from *Neurospora crassa* stays excluded, as expected from our selection function. Regarding the included targets in the final hits, they should only contain rhodopsins from Halobacteria:
%% Cell type:code id:6b4c2c80-d91a-45ae-b174-ffd454618a52 tags:
``` python
for hit in iterations[-1].hits:
if hit.is_included():
print(f"{hit.name.decode():<24}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():34}\t{organism_by_name[hit.name].decode()}")
```
%% Cell type:markdown id:1f308930-2a37-4a56-b6e1-c82100ab24ba tags:
## Recording bitscore cutoffs
A useful feature of HMMER is the possibility to use HMM-specific bitscore cutoffs when searching sequences with a HMM, instead of relying on E-value cutoffs. Three cutoffs categories are available:
- *Noise cutoffs (NC)*: The least stringent of the 3 cutoffs, normally set as the highest score seen for negative sequences when gathering members of full alignments.
- *Gathering cutoffs (GA)*: More stringent than noise cutoffs, normally computed as the cutoffs used when gathering members of full alignments.
- *Trusted cutoffs (TC)*: The most stringent of the 3 cutoffs, normally set according to the lowest scores seen for true homologous sequences that were above the GA gathering thresholds, when gathering members of full alignments.
%% Cell type:code id:14c5f093-88c7-4522-87cc-0fe5f2980092 tags:
``` python
hmm = iterations[-1].hmm
```
%% Cell type:markdown id:3872770c-8d25-4e87-85ef-b11685dd4217 tags:
Let's first build noise cutoffs based on the negative hits: we can use the highest scoring excluded sequence.
%% Cell type:code id:c8b59774-2d6a-4e22-873f-941fcabbca70 tags:
``` python
from math import ceil
noise_score = max(hit.score for hit in iterations[-1].hits if not hit.is_included())
noise_domscore = max(hit.best_domain.score for hit in iterations[-1].hits if not hit.is_included())
hmm.cutoffs.noise = ceil(noise_score), ceil(noise_domscore)
```
%% Cell type:markdown id:0981a1ca-ae1b-451d-8e89-b941f30ef980 tags:
For gathering, we can use the lowest scoring included sequence:
%% Cell type:code id:faf1cdb0-4733-4049-b414-ad7812f4449b tags:
``` python
from math import floor
gathering_score = min(hit.score for hit in iterations[-1].hits if hit.is_included())
gathering_domscore = min(hit.best_domain.score for hit in iterations[-1].hits if hit.is_included())
hmm.cutoffs.gathering = floor(gathering_score), floor(gathering_domscore)
```
%% Cell type:markdown id:3f321904-0032-41d6-9bd8-3327054e5b8b tags:
For trusted, it's a little bit less obvious. If you look at the bitscore distribution from the last iteration, we can see 3 distinct group of hits, one in the 390-350 range, one in the 315-260 range, and one in the 235-200 range. We may want to set the trusted cutoffs so that the only the hits in the top range can be trusted.
%% Cell type:code id:692d7c07-4724-45ac-ae87-29b37a723b72 tags:
``` python
from math import floor
i = max(i for i, hit in enumerate(iterations[-1].hits) if hit.score > 350)
hmm.cutoffs.trusted = floor(iterations[-1].hits[i].score), floor(iterations[-1].hits[i].best_domain.score)
```
%% Cell type:markdown id:3c945521-7d06-42be-ae0b-f4d801ac3a26 tags:
Otherwise, for a more unsupervised approach, we could try to extract a cutoffs by selecting the score of the hits where the derivative is at its maximum:
%% Cell type:code id:275c7bec-4e36-4c5a-ad46-a059b3444db1 tags:
``` python
diff = [
iterations[-1].hits[i-1].score - iterations[-1].hits[i].score
for i in range(1, len(iterations[-1].hits))
if iterations[-1].hits[i-1].is_included()
]
```
%% Cell type:code id:343d6df6-4d15-4650-8548-e2322be81bc7 tags:
``` python
import matplotlib.pyplot as plt
plt.plot(diff)
plt.ylabel("dscore")
plt.xlabel("Hit rank")
plt.show()
```
%% Cell type:code id:abb56159-5b1e-42b3-9e8c-f155a3547fc8 tags:
``` python
from math import floor
argmax = max(range(len(diff)), key=diff.__getitem__)
hit = iterations[-1].hits[argmax]
hmm.cutoffs.trusted = floor(hit.score), floor(hit.best_domain.score)
```
%% Cell type:markdown id:608b6e8d-d333-4764-b2df-230a77971d97 tags:
## Saving the HMM
Now that we are done creating the HMM, we can save it for later use:
%% Cell type:code id:35ea77f9-2067-4ca8-907d-c2553913de6c tags:
``` python
hmm.name = b"Rhodopsin"
hmm.description = None
hmm.accession = None
```
%% Cell type:code id:7f6d917d-049d-4260-b390-75766b921fdb tags:
``` python
with open("rhodopsin.hmm", "wb") as dst_file:
hmm.write(dst_file)
```
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment