Commit 3fd18532 authored by Martin Larralde's avatar Martin Larralde
Browse files

Update examples to display alignments of results with `print`

parent a051ea1d
Pipeline #32791 passed with stage
in 5 minutes and 3 seconds
%% Cell type:markdown id: tags:
# Analyse the active site of an enzymatic domain
%% Cell type:markdown id: tags:
This example is adapted from the method used by [AntiSMASH](https://antismash.secondarymetabolites.org/#!/about) to annotate biosynthetic gene clusters. AntiSMASH uses profile HMMs to annotate enzymatic domains in protein sequences. By matching the amino acids in the alignment, it can then predict the product specificity of the enzyme.
In this notebook, we show how to reproduce this kind of analysis, using a PKSI Acyltransferase domain built by the AntiSMASH authors (the HMM in HMMER2 format can be downloaded from [their git repository](https://github.com/antismash/antismash/blob/master/antismash/modules/active_site_finder/data/PKSI-AT.hmm2)).
<div class="alert alert-info">
References
* [Del Vecchio, F., H. Petkovic, S. G. Kendrew, L. Low, B. Wilkinson, R. Lill, J. Cortes, B. A. Rudd, J. Staunton, and P. F. Leadlay. 2003. *Active-site residue, domain and module swaps in modular polyketide synthases.* J Ind. Microbiol Biotechnol 30:489-494.](https://pubmed.ncbi.nlm.nih.gov/12811585/)
* [Medema MH, Blin K, Cimermancic P, de Jager V, Zakrzewski P, Fischbach MA,
Weber T, Takano E, Breitling R. *antiSMASH: rapid identification, annotation and
analysis of secondary metabolite biosynthesis gene clusters in bacterial and
fungal genome sequences.* Nucleic Acids Res. 2011 Jul:W339-46](https://pubmed.ncbi.nlm.nih.gov/21672958/).
</div>
%% Cell type:code id: tags:
``` python
import pyhmmer
pyhmmer.__version__
```
%% Cell type:markdown id: tags:
## Loading the HMM
%% Cell type:markdown id: tags:
Loading a HMMER profile is done with the `pyhmmer.plan7.HMMFile` class, which provides an iterator over the HMMs in the file. Since we only use a single HMM, we can simply use `next` to get the first (and only) `pyhmmer.plan7.HMM`.
%% Cell type:code id: tags:
``` python
with pyhmmer.plan7.HMMFile("data/hmms/txt/PKSI-AT.hmm") as hmm_file:
hmm = next(hmm_file)
```
%% Cell type:markdown id: tags:
## Loading digitized sequences
%% Cell type:markdown id: tags:
Easel provides the code necessary to load sequences from files in common biological formats, such as GenBank or FASTA. These utilities are wrapped by the `pyhmmer.easel.SequenceFile`, which provides an iterator over the sequences in the file. Note that `SequenceFile` tries to guess the format by default, but you can force a particular format with the `format` keyword argument.
%% Cell type:code id: tags:
``` python
with pyhmmer.easel.SequenceFile("data/seqs/PKSI.faa", digital=True) as seq_file:
sequences = list(seq_file)
```
%% Cell type:markdown id: tags:
<div class="alert alert-info">
Note
The C interface of Easel allows storing a sequence in two different modes: in _text_ mode, where the sequence letters are represented as individual characters (e.g. "A" or "Y"), and _digital_ mode, where sequence letters are encoded as digits. To make Python programs clearer, and to allow static typecheck of the storage mode, we provide two separate classes, `TextSequence` and `DigitalSequence`, that represent a sequence stored in either of these modes.
</div>
`SequenceFile` yields sequences in text mode, but HMMER expects sequences in digital mode, so we must digitize them. This requires the sequence alphabet to be known, but we can just use the `Alphabet` instance stored in the `alphabet` attribute of `hmm`.
%% Cell type:markdown id: tags:
## Running a search pipeline
%% Cell type:markdown id: tags:
With the sequences and the HMM ready, we can finally run the search pipeline: it has to be initialized with an `Alphabet` instance, so that the Plan7 background model can be configured accordingly. Then, we run the pipeline in search mode, providing it one HMM, and several sequences. This method returns a `TopHits` instance that is already sorted and thresholded.
%% Cell type:code id: tags:
``` python
pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet)
hits = pipeline.search_hmm(hmm, sequences)
```
%% Cell type:markdown id: tags:
## Rendering the alignments
%% Cell type:markdown id: tags:
`Domain` instances store all the required information to report results in their `alignment` attribute. We can show the alignment between a HMM and a sequence
like `hmmsearch` would as follow (using the first domain of the first hit as an example):
%% Cell type:code id: tags:
``` python
ali = hits[0].domains[0].alignment
print(" "*3, ali.target_name.decode())
print("{:3}".format(ali.hmm_from), ali.hmm_sequence, "{:3}".format(ali.hmm_to))
print(" "*3, ali.identity_sequence)
print("{:3}".format(ali.target_from), ali.target_sequence, "{:3}".format(ali.target_to))
print(" "*3, ali.hmm_name.decode())
print(ali)
```
%% Cell type:markdown id: tags:
You may also want to see where the domains are located in the input sequence; using the [DNA feature viewer](https://edinburgh-genome-foundry.github.io/DnaFeaturesViewer/) developed by the [Edinburgh Genome Foundry](https://edinburgh-genome-foundry.github.io/), we can build a summary graph aligning the protein sequences to the same reference axis:
%% Cell type:code id: tags:
``` python
from dna_features_viewer import GraphicFeature, GraphicRecord
import matplotlib.pyplot as plt
# create an index so we can retrieve a Sequence from its name
seq_index = { seq.name:seq for seq in sequences }
fig, axes = plt.subplots(nrows=len(hits), figsize=(16, 6), sharex=True)
for ax, hit in zip(axes, hits):
# add one feature per domain
features = [
GraphicFeature(start=d.alignment.target_from-1, end=d.alignment.target_to)
for d in hit.domains
]
length = len(seq_index[hit.name])
desc = seq_index[hit.name].description.decode()
# render the feature records
record = GraphicRecord(sequence_length=length, features=features)
record.plot(ax=ax)
ax.set_title(desc)
# make sure everything fits in the final graph!
fig.tight_layout()
```
%% Cell type:markdown id: tags:
## Checking individual positions for catalytic activity
%% Cell type:markdown id: tags:
First let's define a function to iterate over an alignement; this will come in handy later. This function yields the position in the alignment (using the HMM coordinates) and the aligned amino acid, skipping over gaps in the HMM sequence.
%% Cell type:code id: tags:
``` python
def iter_target_match(alignment):
position = alignment.hmm_from
for hmm_letter, amino_acid in zip(alignment.hmm_sequence, alignment.target_sequence):
if hmm_letter != ".":
yield position, amino_acid
position += 1
```
%% Cell type:markdown id: tags:
Now, for the final step, we want to check for the specificity of the enzyme domains; Del Vecchio *et al.* have identified two amino acids in the acyltransferase domain that once muted will decide of the enzyme specificity for either malonyl-CoA or methylmalonyl-CoA:
![](active_site.png)
For this, we need to check the alignment produced by HMMER, and verify the residues of the catalytic site correspond to the ones expected by the authors. We use the function we defined previously, first to check the core amino acids are not muted, and then to check the specificity of the two remaining residues.
%% Cell type:code id: tags:
``` python
POSITIONS = [ 93, 94, 95, 120, 196, 198]
EXPECTED = ['G', 'H', 'S', 'R', 'A', 'H']
SPECIFICITY = [195, 197]
for hit in hits:
print("\nIn sequence {!r}:".format(hit.name.decode()))
for domain in hit.domains:
ali = domain.alignment
aligned = dict(iter_target_match(ali))
print("- Found PKSI-AT domain at positions {:4} to {:4}".format(ali.target_from, ali.target_to))
try:
signature = [ aligned[x] for x in POSITIONS ]
spec = [ aligned[x] for x in SPECIFICITY ]
except KeyError:
print(" -> Domain likely too short")
continue
if signature != EXPECTED:
print(" -> Substrate specificity unknown")
elif spec == ["H", "F"]:
print(" -> Malonyl-CoA specific")
elif spec == ["Y", "S"]:
print(" -> Methylmalonyl-CoA specific")
else:
print(" -> Neither malonyl-CoA nor methylmalonyl-CoA specific")
```
......
......@@ -259,7 +259,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
"version": "3.10.4"
}
},
"nbformat": 4,
......
%% Cell type:markdown id: tags:
# Build an HMM from a multiple sequence alignment
%% Cell type:code id: tags:
``` python
import pyhmmer
pyhmmer.__version__
```
%% Cell type:code id: tags:
``` python
alphabet = pyhmmer.easel.Alphabet.amino()
```
%% Cell type:markdown id: tags:
## Loading the alignment
%% Cell type:markdown id: tags:
A new HMM can be built from a single sequence, or from a multiple sequence alignment. Let's load an alignment in digital mode so that we can build our HMM:
%% Cell type:code id: tags:
``` python
with pyhmmer.easel.MSAFile("data/msa/LuxC.sto", digital=True, alphabet=alphabet) as msa_file:
msa = next(msa_file)
```
%% Cell type:markdown id: tags:
<div class="alert alert-info">
Note
In this example, we load a multiple sequence alignment from a file, but if your program produces alignment and you wish to make an HMM out of them, you can instantiate a `TextMSA` object yourself, e.g.:
```python
seq1 = pyhmmer.easel.TextSequence(name=b"seq1", sequence="WVPKQDFT")
seq2 = pyhmmer.easel.TextSequence(name=b"seq2", sequence="WL--PQGE")
msa = pyhmmer.easel.TextMSA(name=b"msa", sequences=[seq1, seq2])
```
Because we need a `DigitalMSA` to build the HMM, you will have to convert it first:
```python
msa_d = msa.digitize(alphabet)
```
</div>
%% Cell type:markdown id: tags:
## Building an HMM
Now that we have a multiple alignment loaded in memory, we can build a pHMM using a `pyhmmer.plan7.Builder`. This also requires a Plan7 background model to compute the transition probabilities.
%% Cell type:code id: tags:
``` python
builder = pyhmmer.plan7.Builder(alphabet)
background = pyhmmer.plan7.Background(alphabet)
hmm, _, _ = builder.build_msa(msa, background)
```
%% Cell type:markdown id: tags:
<div class="alert alert-warning">
Warning
HMMs must always have a name. The `Builder` will attempt to use the `MSA` name, but in the event your alignment file does not contain a name for the whole alignment (for instance, a `#=GF ID` line in a [Stockholm file](https://en.wikipedia.org/wiki/Stockholm_format)), this method will fail:
```python
>>> anonymous_msa = pyhmmer.easel.TextMSA(sequences=[seq1, seq2])
>>> builder.build_msa(anonymous_msa)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Could not build HMM: Unable to name the HMM.
```
</div>
%% Cell type:markdown id: tags:
We can have a look at the consensus sequence of the HMM with the `consensus` property:
%% Cell type:code id: tags:
``` python
hmm.consensus
```
%% Cell type:markdown id: tags:
## Saving the resulting HMM
%% Cell type:markdown id: tags:
Now that we have an HMM, we can save it to a file to avoid having to rebuild it every time. Using the `HMM.write` method lets us write the HMM in ASCII or binary format to an arbitrary file. The resulting file will also be compatible with the `hmmsearch` binary if you wish to use that one instead of pyHMMER.
%% Cell type:code id: tags:
``` python
with open("LuxC.hmm", "wb") as output_file:
hmm.write(output_file)
```
%% Cell type:markdown id: tags:
## Applying the HMM to a sequence database
%% Cell type:markdown id: tags:
Once a pHMM has been obtained, it can be applied to a sequence database with the `pyhmmer.plan7.Pipeline` object. Let's iterate over the protein sequences in a FASTA to see if our new HMM gets any hits:
%% Cell type:code id: tags:
``` python
with pyhmmer.easel.SequenceFile("data/seqs/LuxC.faa", digital=True, alphabet=alphabet) as seq_file:
sequences = list(seq_file)
pipeline = pyhmmer.plan7.Pipeline(alphabet, background=background)
hits = pipeline.search_hmm(query=hmm, sequences=sequences)
```
%% Cell type:markdown id: tags:
We can then query the `TopHits` object to access the domain hits in the sequences:
%% Cell type:code id: tags:
``` python
ali = hits[0].domains[0].alignment
print(" "*3, ali.target_name.decode())
print("{:3}".format(ali.hmm_from), ali.hmm_sequence[:80] + "...")
print(" "*3, ali.identity_sequence[:80] + "...")
print("{:3}".format(ali.target_from), ali.target_sequence[:80] + "...")
print(" "*3, ali.hmm_name.decode())
print(ali)
```
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment