HMMER is a very comprehensive tool, and PyHMMER mimicks most of the internals quite closely, so the API can be a little overwhelming. There is often more than one way to perform the same operation, with various level of configuration, so it can be hard to know how to make the most of it.
HMMs can be stored in two different formats, *text* and a *binary*, which are equivalent. The `hmmconvert` binary from HMMER can be used to convert from one to the other; in PyHMMER, you can load both formats with an `HMMFile`, and write either of the two with the `HMM.write` method. The text HMM files often receive the `.hmm` extension while the binary HMM files have the `.h3m` extension.
The binary format is **platform-specific**: it depends on the endianness of the local platform, so for distributing HMM files, the text format should be prefered. The main advantage of the binary format, however, is that it is much smaller (often half the size of the text format) and much faster to load. We can quickly see the difference by trying to load the same HMMs: loading a binary file is at least 10x faster.
Using a binary HMM file is thus recommended when the HMMs are going to be read a lot, for instance when developping a pipeline that will have to load the HMMs every time it's invoked.
The core of HMMER is a comparison pipeline that tries to match a single profile HMM to a single sequence. To turn this into a many-to-many comparison pipeline, the single comparison is run inside of two loops (for every query, for every target, compare the query to the target). HMMER refers to it as a "scan" when the inner loop iterates over profiles HMMs, and a "search" when the inner loop iterates over sequences. They are equivalent, and the raw scores computed by one or the others will be the same, with the exception of the E-value (more on that later):
Scanning, however, is always slower than searching whenever there is more than one HMM to process. Reconfiguring the internal comparison pipeline for a new HMM is a costly operation, while reconfiguring a pipeline for a new sequence is almost free. During a "search", the reconfiguration is done inside the outer loop, but during a "scan" the reconfiguration is done inside the inner loop. To reduce the overhead, it's recommended to use "search" whenever possible, even if that means untangling the results later.
Switching from a "search" to a "scan" means however changing what HMMER considers the database to be, which affects the `Z` parameter of the pipeline, and ultimately the E-value being computed for each alignment. There are two ways to take care of this:
- Manually set the `Z` parameter to the size of what you database is: for instance, if you annotate with Pfam v35.0 which contains 19,632 HMMs, you can manually set the `Z` parameter to `19632` and use `hmmsearch` or `hmmscan` indifferently to produce the same results.
- Use bitscores and p-values instead of E-values to ensure reproducibility of the results even if the size of your targets changes.
In the original HMMER implementation, the target database is always read iteratively from disk for every query. This means the computation can become I/O bound if the target database is not read fast enough, and adds some overhead since the targets must be read again for every new query. In PyHMMER, you have the choice of how you want your targets to be handled: `hmmsearch`, `hmmscan` and `phmmer` accept being given targets both inside a file (a `SequenceFile` in digital mode for `hmmsearch` and `phmmer`, a `HMMPressedFile` for `hmmscan`) or inside a block with all targets already loaded in memory (a `DigitalSequenceBlock` for `hmmsearch` and `phmmer`, an `OptimizedProfileBlock` for `hmmscan`).
The difference is not huge for `t2pks`, which only contains 40 individual HMMs, but it scales with the total number of queries, and can make a lot of difference when searching target sequences with a large profile library such as Pfam. Therefore, you should try to load your whole sequence database into memory when possible to get actual speed improvements. When loaded into memory, an `OptimizedSequenceBlock` is about the size of a FASTA formatted sequence file, with some additional overhead when the file has plenty of smaller sequences (like a protein file, typically):
To make sure the target database can be loaded into memory, you can use `psutil` (which is a PyHMMER dependency anyway) to detect the amount of available or total memory, and only try to load the targets if they are not to big. For instance, with the following code, you would pre-fetch targets if they take less that a 10% of the total memory of the local machine: for a consumer laptop with 8GiB of RAM, it means never loading a file larger than 800MiB, which is already much larger than the size of a complete bacterial genome.
While as a rule of thumb a "search" should be prefered to reduce the profile reconfiguration overhead, there is one exception: when the target sequences can't fit in memory while the HMM queries could. For instance, to annotate 1,000,000 proteins with Pfam, it will be much more efficient to run a "scan" with the Pfam HMMs pre-loaded in memory (\~2 hours with 32 threads), than to run a "search" with the target sequences being read from the filesystem (\~9.5 hours with 32 threads). Having already the target sequences or HMMs in memory will always result in a massive speed gain provided the machine has enough RAM.