Improve z-shift kernel
The z-shift kernel currently does a sequential sum, which has a data race when writing into the BLM array. Here are some options for resolving this:
- Use a parallel sum to add up the terms going into BLM
- Load the whole dlmkp M slice into local memory at once. This can only be done for relatively small LMAX (approx < 30, depending on NSWORK).
- Slightly unpack the dlmkp array so that all k values are present in each slice, rather than just K<=L
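
For the parallel sum option, a rough sketch of the access pattern might help the discussion. This is a host-side Python simulation of a pairwise tree reduction, not the kernel itself; `tree_reduce` and `terms` are hypothetical names, and it assumes the terms going into one BLM entry can be staged in a scratch buffer (local memory on the device). Each pass of the inner loop stands in for one set of work-items separated by a barrier, and within a pass every index is written by exactly one work-item, which is what removes the race.

```python
def tree_reduce(terms):
    """Sum `terms` by pairwise tree reduction (simulating one work-group)."""
    # Pad to the next power of two so every pass halves the active range.
    n = 1
    while n < len(terms):
        n *= 2
    scratch = list(terms) + [0.0] * (n - len(terms))

    stride = n // 2
    while stride > 0:
        # On the device, each i would be one work-item, with a barrier
        # after this loop. Writes go to [0, stride) and reads come from
        # [stride, 2*stride), so no two work-items touch the same slot.
        for i in range(stride):
            scratch[i] += scratch[i + stride]
        stride //= 2

    # A single work-item would then write this value to the BLM entry.
    return scratch[0]


# Hypothetical usage: the list holds the terms for a single BLM entry.
blm_entry = tree_reduce([1.0, 2.0, 3.0, 4.0, 5.0])
```

The reduction takes about log2(N) passes instead of N sequential additions, at the cost of a scratch buffer the size of the padded term count. Note that the summation order differs from the sequential version, so results may differ in the last floating-point bits.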