Improve z-shift kernel

The z-shift kernel currently performs a sequential sum and has a data race when writing into the BLM array. Here are some options for resolving this:

Use a parallel sum to add up the terms going into BLM.
Load the whole dlmkp M slice into local memory at once. This is only feasible for relatively small LMAX (roughly LMAX < 30, depending on NSWORK).
Slightly unpack the dlmkp array so that all K values are present in each slice, rather than just K <= L.
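
As a rough illustration of the first option, here is a minimal host-side sketch of a pairwise (tree) reduction, written in Python only to show the access pattern. The function name `tree_reduce`, the `scratch` buffer, and the idea that each inner-loop index corresponds to one work-item are assumptions for illustration; none of this comes from the existing kernel. In each pass, work-item `i` writes only `scratch[i]` and reads `scratch[i + stride]`, which no work-item writes in that pass, so the passes are free of write conflicts as long as a barrier separates them.

```python
def tree_reduce(terms):
    """Sum `terms` with a pairwise (tree) reduction.

    Emulates what a work-group could do in local memory: `scratch` stands in
    for a local-memory buffer and each index `i` of the inner loop stands in
    for one work-item (both are illustrative assumptions).
    """
    # Pad the buffer to the next power of two so every pass halves cleanly.
    n = 1
    while n < len(terms):
        n *= 2
    scratch = list(terms) + [0.0] * (n - len(terms))

    stride = n // 2
    while stride > 0:
        # On a device, all `i` in this loop would run concurrently, and a
        # work-group barrier would be needed before the next stride.
        for i in range(stride):
            scratch[i] += scratch[i + stride]
        stride //= 2

    # Only one work-item would then write this value into the output array.
    return scratch[0]


# Hypothetical usage: the terms that would be accumulated into one BLM entry.
blm_entry = tree_reduce([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because a single work-item writes the final value, the output array sees one write per entry instead of many competing ones. Note that the summation order differs from the sequential sum, so results may differ at the level of floating-point rounding.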
