Improve management of required database(s)
Currently, GECCO does a site install and unpacks the PFAM HMM to the data folder. However, it should be possible to make it position independent by managing the setup of PFAM ourselves. The possible solutions are the following:
Vendoring on PyPI
Gzip-compressed, the PFAM HMMs are only about 250MB, which is above the PyPI size limit, but this limit could be raised on request. With a raised limit, we could distribute the HMMs directly with the source and avoid worrying about the availability of PFAM.
Compression benchmarks:
- Brotli: 219MB
- Gzip: 253MB
- LZMA: 190MB
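Figures like these can be reproduced with a short script. The following is only a sketch: it streams a file through the standard library gzip and LZMA compressors and counts the output bytes. The function name `compressed_sizes` is hypothetical, the compression levels are assumptions rather than the settings used for the numbers above, and Brotli is left out because it needs a third-party package.

```python
import lzma
import zlib


def compressed_sizes(path, chunk_size=1 << 20):
    """Return the compressed size in bytes of `path` for each stdlib codec."""
    compressors = {
        # wbits=31 makes zlib emit a gzip header and trailer.
        "gzip": zlib.compressobj(level=9, wbits=31),
        # Default preset; higher presets need much more memory.
        "lzma": lzma.LZMACompressor(),
    }
    sizes = dict.fromkeys(compressors, 0)
    with open(path, "rb") as src:
        # Read in chunks so a large HMM file never has to fit in memory.
        while chunk := src.read(chunk_size):
            for name, compressor in compressors.items():
                sizes[name] += len(compressor.compress(chunk))
    # Flush whatever each compressor is still buffering.
    for name, compressor in compressors.items():
        sizes[name] += len(compressor.flush())
    return sizes
```

Calling `compressed_sizes("Pfam-A.hmm")` (the file name is an assumption) returns a dictionary mapping each codec name to a byte count.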
Installing in site package
During the install step, automatically download PFAM from the FTP server and place it somewhere in the source tree before installing. The database would then be removed together with GECCO on uninstall, without having to distribute it on PyPI. However, GECCO could no longer be released in wheel format, since wheels are installed without running any such step.
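As a rough sketch of the download part of this option, the helper below fetches a gzip-compressed file and unpacks it into a `data` folder inside the package directory. The function name, the `data` folder layout and the default file name are assumptions made for illustration, and the URL is left to the caller. In practice it would be called from a custom build command in `setup.py`, which is not shown here.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def download_into_source_tree(url, package_dir, filename="Pfam-A.hmm"):
    """Download a gzip-compressed file and unpack it under package_dir/data."""
    data_dir = Path(package_dir) / "data"  # hypothetical layout
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / filename
    with urllib.request.urlopen(url) as response:
        # Decompress on the fly so the archive is never stored on disk.
        with gzip.open(response, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

Because the file ends up inside the package directory, it is installed and uninstalled like any other package data file.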
Downloading in cache
Using a cache directory, PFAM could be downloaded if not already present. The advantage is that there is no PyPI size limit to worry about anymore. The drawback is that the database would stay on the filesystem after GECCO is uninstalled, because pip does not have any uninstall hook that could erase it.
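A minimal sketch of the cache approach could look like the following. The names `default_cache_dir` and `ensure_database` are hypothetical, and the cache location follows the XDG convention, which is an assumption that would need adjusting on other platforms. The file is downloaded to a temporary name first so that an interrupted transfer never leaves a truncated database in the cache.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def default_cache_dir(app_name="gecco"):
    """Pick a per-user cache directory following the XDG convention."""
    base = os.environ.get("XDG_CACHE_HOME") or Path.home() / ".cache"
    return Path(base) / app_name


def ensure_database(url, filename, cache_dir=None):
    """Return the cached path of `filename`, downloading it from `url` if missing."""
    cache_dir = Path(cache_dir) if cache_dir is not None else default_cache_dir()
    target = cache_dir / filename
    if target.exists():
        return target  # already cached, nothing to download
    cache_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        # Atomic rename: the final path only ever holds a complete file.
        os.replace(tmp_name, target)
    except BaseException:
        os.unlink(tmp_name)
        raise
    return target
```

Keeping the cache path in one function also gives users a single documented place to delete by hand, which partly offsets the missing uninstall hook.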