Replace `t2pks.hmm` with `RREFam` in documentation [ci skip]

c5234e58 · Martin Larralde · 8cf9a2eb · c5234e58
Commit c5234e58 authored 11 months ago by Martin Larralde
--- a/docs/examples/performance_tips.ipynb
+++ b/docs/examples/performance_tips.ipynb
@@ -36,8 +36,8 @@
   "outputs": [],
   "source": [
    "import pyhmmer\n",
-    "%timeit list(pyhmmer.plan7.HMMFile(\"data/hmms/txt/t2pks.hmm\"))\n",
-    "%timeit list(pyhmmer.plan7.HMMFile(\"data/hmms/bin/t2pks.h3m\"))"
+    "%timeit list(pyhmmer.plan7.HMMFile(\"data/hmms/txt/RREFam.hmm\"))\n",
+    "%timeit list(pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\"))"
   ]
  },
  {
@@ -68,13 +68,13 @@
    "import time\n",
    "import pyhmmer\n",
    "\n",
-    "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/t2pks.h3m\") as hmms:\n",
+    "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\") as hmms:\n",
    "    with pyhmmer.easel.SequenceFile(\"data/seqs/938293.PRJEB85.HG003687.faa\", digital=True) as seqs:\n",
    "        t1 = time.time()\n",
    "        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))\n",
    "        print(f\"- hmmsearch found a total of {total} hits in {time.time() - t1:.3} seconds\")\n",
    "        \n",
-    "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/t2pks.h3m\") as hmms:\n",
+    "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\") as hmms:\n",
    "    with pyhmmer.easel.SequenceFile(\"data/seqs/938293.PRJEB85.HG003687.faa\", digital=True) as seqs:\n",
    "        t1 = time.time()\n",
    "        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmscan(seqs, hmms))\n",
@@ -119,7 +119,7 @@
   "source": [
    "# loading targets iteratively - slow, but no extra memory needed\n",
    "t1 = time.time()\n",
-    "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/t2pks.h3m\") as hmms:\n",
+    "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\") as hmms:\n",
    "    with pyhmmer.easel.SequenceFile(\"data/seqs/938293.PRJEB85.HG003687.faa\", digital=True) as seqs:\n",
    "        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))\n",
    "        print(f\"- hmmsearch without prefetching took {time.time() - t1:.3} seconds\")\n",
@@ -128,7 +128,7 @@
    "t1 = time.time()\n",
    "with pyhmmer.easel.SequenceFile(\"data/seqs/938293.PRJEB85.HG003687.faa\", digital=True) as seq_file:\n",
    "    seqs = seq_file.read_block()\n",
-    "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/t2pks.h3m\") as hmms:\n",
+    "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\") as hmms:\n",
    "    total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))\n",
    "    print(f\"- hmmsearch with prefetching took {time.time() - t1:.3} seconds\")"
   ]
@@ -184,7 +184,7 @@
    "print(f\"Available memory: {available_memory/1024:.1f} KiB\")\n",
    "print(f\"Database on-disk size: {target_size/1024:.1f} KiB\")\n",
    "\n",
-    "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/t2pks.h3m\") as hmm_file:\n",
+    "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\") as hmm_file:\n",
    "    with pyhmmer.easel.SequenceFile(\"data/seqs/938293.PRJEB85.HG003687.faa\", digital=True) as seq_file:\n",
    "        if target_size < available_memory * 0.1:\n",
    "            print(\"Pre-fetching targets into memory\")\n",
@@ -228,7 +228,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.10"
+   "version": "3.11.8"
  }
 },
 "nbformat": 4,

 %% Cell type:markdown id:a916d57f-eb73-454a-8135-642b0030444e tags:

 # Performance tips and tricks

 %% Cell type:markdown id:0cdb5bde-9415-42a4-ad31-12410e4c29da tags:

 HMMER is a very comprehensive tool, and PyHMMER mimicks most of the internals quite closely, so the API can be a little overwhelming. There is often more than one way to perform the same operation, with various level of configuration, so it can be hard to know how to make the most of it.

 %% Cell type:markdown id:4d74dbb5-9485-433f-b174-c7116e412b60 tags:

 ## Storing HMMs

 HMMs can be stored in two different formats, *text* and a *binary*, which are equivalent. The `hmmconvert` binary from HMMER can be used to convert from one to the other; in PyHMMER, you can load both formats with an `HMMFile`, and write either of the two with the `HMM.write` method. The text HMM files often receive the `.hmm` extension while the binary HMM files have the `.h3m` extension.

 The binary format is **platform-specific**: it depends on the endianness of the local platform, so for distributing HMM files, the text format should be prefered. The main advantage of the binary format, however, is that it is much smaller (often half the size of the text format) and much faster to load. We can quickly see the difference by trying to load the same HMMs: loading a binary file is at least 10x faster.

 %% Cell type:code id:6ae3572f-2d46-4957-a9cf-26eaed141fca tags:

 ``` python
 import pyhmmer
-%timeit list(pyhmmer.plan7.HMMFile("data/hmms/txt/t2pks.hmm"))
-%timeit list(pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m"))
+%timeit list(pyhmmer.plan7.HMMFile("data/hmms/txt/RREFam.hmm"))
+%timeit list(pyhmmer.plan7.HMMFile("data/hmms/bin/RREFam.h3m"))
 ```

 %% Cell type:markdown id:e6032858-3f1b-4e62-a019-ffde2119a904 tags:

 Using a binary HMM file is thus recommended when the HMMs are going to be read a lot, for instance when developping a pipeline that will have to load the HMMs every time it's invoked.

 %% Cell type:markdown id:ebf8877e-6db1-470d-8ea5-9e1e392f6caf tags:

 ## Search and scan

 The core of HMMER is a comparison pipeline that tries to match a single profile HMM to a single sequence. To turn this into a many-to-many comparison pipeline, the single comparison is run inside of two loops (for every query, for every target, compare the query to the target). HMMER refers to it as a "scan" when the inner loop iterates over profiles HMMs, and a "search" when the inner loop iterates over sequences. They are equivalent, and the raw scores computed by one or the others will be the same, with the exception of the E-value (more on that later):

 %% Cell type:code id:75bb0c36-2c87-4df3-abbe-ae76325185fd tags:

 ``` python
 import time
 import pyhmmer

-with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
+with pyhmmer.plan7.HMMFile("data/hmms/bin/RREFam.h3m") as hmms:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
        t1 = time.time()
        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
        print(f"- hmmsearch found a total of {total} hits in {time.time() - t1:.3} seconds")

-with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
+with pyhmmer.plan7.HMMFile("data/hmms/bin/RREFam.h3m") as hmms:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
        t1 = time.time()
        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmscan(seqs, hmms))
        print(f"- hmmscan found a total of {total} hits in {time.time() - t1:.3} seconds")
 ```

 %% Cell type:markdown id:a0631dd0-c893-4df9-9c2a-67e82b3ed446 tags:

 Scanning, however, is always slower than searching whenever there is more than one HMM to process. Reconfiguring the internal comparison pipeline for a new HMM is a costly operation, while reconfiguring a pipeline for a new sequence is almost free. During a "search", the reconfiguration is done inside the outer loop, but during a "scan" the reconfiguration is done inside the inner loop. To reduce the overhead, it's recommended to use "search" whenever possible, even if that means untangling the results later.

 Switching from a "search" to a "scan" means however changing what HMMER considers the database to be, which affects the `Z` parameter of the pipeline, and ultimately the E-value being computed for each alignment. There are two ways to take care of this:

 - Manually set the `Z` parameter to the size of what you database is: for instance, if you annotate with Pfam v35.0 which contains 19,632 HMMs, you can manually set the `Z` parameter to `19632` and use `hmmsearch` or `hmmscan` indifferently to produce the same results.
 - Use bitscores and p-values instead of E-values to ensure reproducibility of the results even if the size of your targets changes.

 %% Cell type:markdown id:9562b324-1bb1-482e-8381-462cd32dc078 tags:

 ## Pre-fetching the target database

 %% Cell type:markdown id:a47103f3-4f56-4df4-ab36-d5e7a409b54e tags:

 In the original HMMER implementation, the target database is always read iteratively from disk for every query. This means the computation can become I/O bound if the target database is not read fast enough, and adds some overhead since the targets must be read again for every new query. In PyHMMER, you have the choice of how you want your targets to be handled: `hmmsearch`, `hmmscan` and `phmmer` accept being given targets both inside a file (a `SequenceFile` in digital mode for `hmmsearch` and `phmmer`, a `HMMPressedFile` for `hmmscan`) or inside a block with all targets already loaded in memory (a `DigitalSequenceBlock` for `hmmsearch` and `phmmer`, an `OptimizedProfileBlock` for `hmmscan`).

 %% Cell type:code id:d99e5d4c-e336-4899-9faa-b674f79a0789 tags:

 ``` python
 # loading targets iteratively - slow, but no extra memory needed
 t1 = time.time()
-with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
+with pyhmmer.plan7.HMMFile("data/hmms/bin/RREFam.h3m") as hmms:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
        print(f"- hmmsearch without prefetching took {time.time() - t1:.3} seconds")

 # pre-fetching targets - fast, but needs the whole target database in memory
 t1 = time.time()
 with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
    seqs = seq_file.read_block()
-with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
+with pyhmmer.plan7.HMMFile("data/hmms/bin/RREFam.h3m") as hmms:
    total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
    print(f"- hmmsearch with prefetching took {time.time() - t1:.3} seconds")
 ```

 %% Cell type:markdown id:8a000d01-26f6-4bd1-9eed-fdcc2c88f20c tags:

 The difference is not huge for `t2pks`, which only contains 40 individual HMMs, but it scales with the total number of queries, and can make a lot of difference when searching target sequences with a large profile library such as Pfam. Therefore, you should try to load your whole sequence database into memory when possible to get actual speed improvements. When loaded into memory, an `OptimizedSequenceBlock` is about the size of a FASTA formatted sequence file, with some additional overhead when the file has plenty of smaller sequences (like a protein file, typically):

 %% Cell type:code id:665e19df-052d-4448-8516-ccabd3b95513 tags:

 ``` python
 import sys
 import os

 size = os.stat("data/seqs/PKSI.faa").st_size
 print(f"Size of FASTA file: {size / 1024:.1f} KiB")

 with pyhmmer.easel.SequenceFile("data/seqs/PKSI.faa", digital=True) as seq_file:
    seqs = seq_file.read_block()
    print(f"Size of block storage: {sys.getsizeof(seqs)/1024:.1f} KiB")
    print(f"Size of sequence storage: {sum(sys.getsizeof(seq) for seq in seqs)/1024:.1f} KiB")
 ```

 %% Cell type:markdown id:84eb4961-286b-44e7-be72-4ca109b0a110 tags:

 To make sure the target database can be loaded into memory, you can use `psutil` (which is a PyHMMER dependency anyway) to detect the amount of available or total memory, and only try to load the targets if they are not to big. For instance, with the following code, you would pre-fetch targets if they take less that a 10% of the total memory of the local machine: for a consumer laptop with 8GiB of RAM, it means never loading a file larger than 800MiB, which is already much larger than the size of a complete bacterial genome.

 %% Cell type:code id:270385fa-3649-4417-a778-5aab8725e141 tags:

 ``` python
 import os
 import sys
 import psutil

 available_memory = psutil.virtual_memory().available
 target_size = os.stat("data/seqs/938293.PRJEB85.HG003687.faa").st_size
 print(f"Available memory: {available_memory/1024:.1f} KiB")
 print(f"Database on-disk size: {target_size/1024:.1f} KiB")

-with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmm_file:
+with pyhmmer.plan7.HMMFile("data/hmms/bin/RREFam.h3m") as hmm_file:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
        if target_size < available_memory * 0.1:
            print("Pre-fetching targets into memory")
            targets = seq_file.read_block()
            print(f"Database in-memory size: {(sys.getsizeof(targets) + sum(sys.getsizeof(target) for target in targets))/1024:.1f} KiB")
        else:
            targets = seq_file
        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmm_file, targets))
        print(f"Found a total of {total} hits in target sequences")

 ```

 %% Cell type:markdown id:257da004-09b9-4ba2-adad-cf67e4d8e767 tags:

 <div class="alert alert-info">

 Note

 While as a rule of thumb a "search" should be prefered to reduce the profile reconfiguration overhead, there is one exception: when the target sequences can't fit in memory while the HMM queries could. For instance, to annotate 1,000,000 proteins with Pfam, it will be much more efficient to run a "scan" with the Pfam HMMs pre-loaded in memory (\~2 hours with 32 threads), than to run a "search" with the target sequences being read from the filesystem (\~9.5 hours with 32 threads). Having already the target sequences or HMMs in memory will always result in a massive speed gain provided the machine has enough RAM.

 </div>