Improve z-shift kernel
The z-shift kernel currently does a sequential sum, and it has a data race when writing into the BLM array. Here are some options for resolving this:
- Use a parallel sum to add up the terms going into BLM
- Load the whole dlmkpM slice into local memory at once. This can only be done for relatively small LMAX (approximately < 30, depending on NSWORK).
- Slightly unpack the dlmkp array so that all k values are present in each slice rather than just K<=L
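
As a rough illustration of the first option, here is a minimal host-side sketch in Python of the tree-reduction pattern a parallel sum usually follows. It is not the kernel code: the name `tree_reduce` and the `scratch` list (standing in for a local-memory buffer) are hypothetical, and the inner loop emulates what the work-items would do concurrently between barriers.

```python
def tree_reduce(terms):
    """Sum `terms` in the order a work-group tree reduction would.

    Hypothetical helper for illustration only; `scratch` stands in for
    a local-memory buffer shared by the work-items.
    """
    scratch = list(terms)
    n = len(scratch)
    if n == 0:
        return 0.0

    # Pad to the next power of two so every step halves cleanly.
    size = 1
    while size < n:
        size *= 2
    scratch.extend([0.0] * (size - n))

    stride = size // 2
    while stride > 0:
        # On the device, each i would be a separate work-item, with a
        # barrier after this step. Slot i is written only by work-item i,
        # and the slots read (i + stride) are not written in this step,
        # so there is no write conflict.
        for i in range(stride):
            scratch[i] += scratch[i + stride]
        stride //= 2

    # Only one work-item would then write this total into BLM.
    return scratch[0]
```

With this pattern the final write into BLM is done once per output element, by a single work-item, which is what removes the race. Note that the summation order differs from the sequential sum, so floating-point results may differ in the last bits.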