Fix use of deprecated `TopHits` properties in documentation

a13e32aa · Martin Larralde · 85c532e4 · a13e32aa
Commit a13e32aa authored 5 months ago by Martin Larralde
--- a/docs/examples/fetchmgs.ipynb
+++ b/docs/examples/fetchmgs.ipynb
@@ -160,7 +160,7 @@
   "source": [
    "results = []\n",
    "for hits in pyhmmer.hmmsearch(hmms, proteins, bit_cutoffs=\"trusted\"):\n",
-    "    cog = hits.query_name.decode()\n",
+    "    cog = hits.query.name.decode()\n",
    "    for hit in hits:\n",
    "        if hit.included:\n",
    "            results.append(Result(hit.name.decode(), cog, hit.score))"

 %% Cell type:markdown id:c9467c38 tags:

 # Fetch marker genes from a genome

 %% Cell type:markdown id:094d9734 tags:

 This example is adapted from the [fetchMGs](https://github.com/motu-tool/fetchMGs/) Perl script used in [mOTU](https://motu-tool.org/) to extract the 40 single-copy universal marker genes from a genome, annotating proteins to find the highest scoring proteins mapping to each of these marker genes.

 In this notebook, we show how to reproduce this kind of analysis, using `pyhmmer` instead of HMMER3 to perform the alignments and extract the bit scores.

 <div class="alert alert-info">

 References

 * [Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. *Toward automatic reconstruction of a highly resolved tree of life.* Science. 2006 Mar 3;311(5765):1283-7. Erratum in: Science. 2006 May 5;312(5774):697.](https://pubmed.ncbi.nlm.nih.gov/16513982/)
 * [Sorek R, Zhu Y, Creevey CJ, Francino MP, Bork P, Rubin EM. *Genome-wide experimental determination of barriers to horizontal gene transfer.* Science. 2007 Nov 30;318(5855):1449-52.](https://pubmed.ncbi.nlm.nih.gov/17947550/)

 </div>

 %% Cell type:code id:f21f9ece tags:

 ``` python
 import pyhmmer
 pyhmmer.__version__
 ```

 %% Cell type:markdown id:37db1b70 tags:

 ## Getting the cutoffs

 Each HMM has been calibrated and contains custom cutoffs, but they are not in Pfam format, so we need to use them externaly. Let's start by downloading the file with these cutoffs from the GitHub repository of `fetchMG`:

 %% Cell type:code id:77c04216 tags:

 ``` python
 import csv
 import io
 import urllib.request

 url = "https://github.com/motu-tool/FetchMGs/raw/1.3/fetchmgs/data/MG_BitScoreCutoffs.allhits.txt"

 cutoffs = {}
 with urllib.request.urlopen(url) as f:
    for line in csv.reader(io.TextIOWrapper(f), dialect="excel-tab"):
        if not line[0].startswith("#"):
            cutoffs[line[0]] = float(line[1])
 ```

 %% Cell type:markdown id:cc40f524 tags:

 ## Downloading the HMMs

 Since the HMMs for the universal marker genes are also hosted on the `fetchMG` GitHub repository, we can download them from there too. `pyhmmer.plan7.HMMFile` supports reading from a file-handle, so we can parse each HMM as we download it. We also use the occasion to update the bitscore cutoffs specific to each HMM that we just obtained in the previous step.

 %% Cell type:code id:cb1219de tags:

 ``` python
 import urllib.request
 import pyhmmer.plan7

 baseurl = "https://github.com/motu-tool/FetchMGs/raw/1.3/fetchmgs/data/{}.hmm"

 hmms = []
 for cog in cutoffs:
    with urllib.request.urlopen(baseurl.format(cog)) as f:
        with pyhmmer.plan7.HMMFile(f) as hmm_file:
            hmm = hmm_file.read()
        cutoff = cutoffs[hmm.name.decode()]
        hmm.cutoffs.trusted = (cutoff, cutoff)
        hmms.append(hmm)
 ```

 %% Cell type:markdown id:d4eff896 tags:

 ## Loading the sequences

 Now we need protein sequences to annotate. Let's use our set of protein sequences identified in the chromosome of [Anaerococcus provencensis](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=938293).

 %% Cell type:code id:d0e20139 tags:

 ``` python
 import pyhmmer.easel
 with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs_file:
    proteins = seqs_file.read_block()
 ```

 %% Cell type:markdown id:f1d021ae tags:

 ## Running the search pipeline

 With the proteins loaded, let's make a `namedtuple` that will contain the data we need to extract for each `hmmsearch` hit: the name of the query gene, the name of the marker gene which produced a hit, and the bitscore for the alignment.

 %% Cell type:code id:a5d53e11 tags:

 ``` python
 import collections
 Result = collections.namedtuple("Result", ["query", "cog", "bitscore"])
 ```

 %% Cell type:markdown id:f5baee6c tags:

 Now we can run the search pipeline: we annotate all the proteins will all the HMMs using the `pyhmmer.hmmsearch` function. Each HMM gives us a `TopHits` instance to process. Note that we run the search pipeline using the *trusted* bitscore cutoffs so that the cutoffs that we manually set on each HMMs are used to filter the results. We only keep hits that fall under the inclusion threshold for each HMM.

 %% Cell type:code id:00ae2763 tags:

 ``` python
 results = []
 for hits in pyhmmer.hmmsearch(hmms, proteins, bit_cutoffs="trusted"):
-    cog = hits.query_name.decode()
+    cog = hits.query.name.decode()
    for hit in hits:
        if hit.included:
            results.append(Result(hit.name.decode(), cog, hit.score))
 ```

 %% Cell type:markdown id:814642fa tags:

 ## Filtering results

 Now that we have all the hits that pass the bitscore thresholds, we can create a dictionary that maps each query protein to its highest scoring bitscore alignment, like in the original script. If a protein has alignments to two different marker genes with the same score, that query is ignored.

 %% Cell type:code id:d03a1ab3 tags:

 ``` python
 best_results = {}
 keep_query = set()
 for result in results:
    if result.query in best_results:
        previous_bitscore = best_results[result.query].bitscore
        if result.bitscore > previous_bitscore:
            best_results[result.query] = result
            keep_query.add(result.query)
        elif result.bitscore == previous_bitscore:
            if best_results[result.query].cog != hit.cog:
                keep_query.remove(result.query)
    else:
        best_results[result.query] = result
        keep_query.add(result.query)
 ```

 %% Cell type:markdown id:ee538a94 tags:

 Now we can get our final filtered results:

 %% Cell type:code id:85484ed1 tags:

 ``` python
 filtered_results = [best_results[k] for k in sorted(best_results) if k in keep_query]
 ```

 %% Cell type:markdown id:bd8701f5 tags:

 We can print them to see which gene maps to which marker gene, with the score for each alignment. Let's look at the top 10 results:

 %% Cell type:code id:b4d59cb9 tags:

 ``` python
 for result in filtered_results[:10]:
    print(result.query, "{:.1f}".format(result.bitscore), result.cog, sep="\t")
 ```

 %% Cell type:markdown id:74691d9c tags:

 All set! You can now use this list of identifiers to filter the initial protein array, to write proteins grouped by marker genes in a FASTA file, compute alignment against the [mOTUs database](https://zenodo.org/record/3364101), or just save it for future use.