Unverified Commit 7913ed64 authored by tmsincomb's avatar tmsincomb Committed by GitHub
Browse files

Use `bytes` instead of `str` for name in `TextSequence` example (#20)

parent 0f98ebc9
Pipeline #32782 passed with stage
in 5 minutes and 42 seconds
%% Cell type:markdown id: tags:
# Build an HMM from a multiple sequence alignment
%% Cell type:code id: tags:
``` python
import pyhmmer
pyhmmer.__version__
```
%% Cell type:code id: tags:
``` python
alphabet = pyhmmer.easel.Alphabet.amino()
```
%% Cell type:markdown id: tags:
## Loading the alignment
%% Cell type:markdown id: tags:
A new HMM can be built from a single sequence, or from a multiple sequence alignment. Let's load an alignment in digital mode so that we can build our HMM:
%% Cell type:code id: tags:
``` python
with pyhmmer.easel.MSAFile("data/msa/LuxC.sto", digital=True, alphabet=alphabet) as msa_file:
msa = next(msa_file)
```
%% Cell type:markdown id: tags:
<div class="alert alert-info">
Note
In this example, we load a multiple sequence alignment from a file, but if your program produces alignment and you wish to make an HMM out of them, you can instantiate a `TextMSA` object yourself, e.g.:
```python
seq1 = pyhmmer.easel.TextSequence(name="seq1", sequence="WVPKQDFT")
seq2 = pyhmmer.easel.TextSequence(name="seq2", sequence="WL--PQGE")
msa = pyhmmer.easel.TextMSA(name="msa", sequences=[seq1, seq2])
seq1 = pyhmmer.easel.TextSequence(name=b"seq1", sequence="WVPKQDFT")
seq2 = pyhmmer.easel.TextSequence(name=b"seq2", sequence="WL--PQGE")
msa = pyhmmer.easel.TextMSA(name=b"msa", sequences=[seq1, seq2])
```
Because we need a `DigitalMSA` to build the HMM, you will have to convert it first:
```python
msa_d = msa.digitize(alphabet)
```
</div>
%% Cell type:markdown id: tags:
## Building an HMM
Now that we have a multiple alignment loaded in memory, we can build a pHMM using a `pyhmmer.plan7.Builder`. This also requires a Plan7 background model to compute the transition probabilities.
%% Cell type:code id: tags:
``` python
builder = pyhmmer.plan7.Builder(alphabet)
background = pyhmmer.plan7.Background(alphabet)
hmm, _, _ = builder.build_msa(msa, background)
```
%% Cell type:markdown id: tags:
We can have a look at the consensus sequence of the HMM with the `consensus` property:
%% Cell type:code id: tags:
``` python
hmm.consensus
```
%% Cell type:markdown id: tags:
## Saving the resulting HMM
%% Cell type:markdown id: tags:
Now that we have an HMM, we can save it to a file to avoid having to rebuild it every time. Using the `HMM.write` method lets us write the HMM in ASCII or binary format to an arbitrary file. The resulting file will also be compatible with the `hmmsearch` binary if you wish to use that one instead of pyHMMER.
%% Cell type:code id: tags:
``` python
with open("LuxC.hmm", "wb") as output_file:
hmm.write(output_file)
```
%% Cell type:markdown id: tags:
## Applying the HMM to a sequence database
%% Cell type:markdown id: tags:
Once a pHMM has been obtained, it can be applied to a sequence database with the `pyhmmer.plan7.Pipeline` object. Let's iterate over the protein sequences in a FASTA to see if our new HMM gets any hits:
%% Cell type:code id: tags:
``` python
with pyhmmer.easel.SequenceFile("data/seqs/LuxC.faa", digital=True, alphabet=alphabet) as seq_file:
sequences = list(seq_file)
pipeline = pyhmmer.plan7.Pipeline(alphabet, background=background)
hits = pipeline.search_hmm(query=hmm, sequences=sequences)
```
%% Cell type:markdown id: tags:
We can then query the `TopHits` object to access the domain hits in the sequences:
%% Cell type:code id: tags:
``` python
ali = hits[0].domains[0].alignment
print(" "*3, ali.target_name.decode())
print("{:3}".format(ali.hmm_from), ali.hmm_sequence[:80] + "...")
print(" "*3, ali.identity_sequence[:80] + "...")
print("{:3}".format(ali.target_from), ali.target_sequence[:80] + "...")
print(" "*3, ali.hmm_name.decode())
```
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment