Reorganize kernels to reduce busy waiting
Many of the array calculations are triangular or even pyramidal because of symmetries and selection rules. When an OpenCL work group spans one of the array dimensions, about half of the execution time for a triangular pattern (and more for a pyramidal one) is spent busy waiting: work items whose indices fall outside the triangle have nothing to compute, yet they still wait for the other work items in the group to finish.
It should in theory be possible to 'fold over' the top corner of a triangle to make a more rectangular execution pattern. This wouldn't work where there is a parallel scan calculation along the work group axis, but most of the kernels could benefit.
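
A minimal sketch of one possible fold, written as plain C so the index arithmetic can be checked on the host before it goes into a kernel. It assumes a lower triangle `0 <= j <= i < n` and pairs each short row `r` with the long row `n - 1 - r`, so the pair fills one rectangle row of `n + 1` cells. The names `tri_index`, `fold_rows`, `fold_to_triangle` and `fold_idle_cells` are hypothetical and do not come from the existing kernels, and the actual array layout may call for a different fold.

```c
#include <stdbool.h>

/* Hypothetical type: an element (i, j) of the lower triangle 0 <= j <= i < n. */
typedef struct { int i; int j; } tri_index;

/* Number of rows in the folded rectangle; each row has n + 1 cells. */
static int fold_rows(int n) { return (n + 1) / 2; }

/* Map cell (r, c) of the folded rectangle back to a triangle element.
 * Returns false for padding cells, which only occur when n is odd. */
static bool fold_to_triangle(int n, int r, int c, tri_index *out)
{
    if (c <= r) {                /* left part: short row r, r + 1 cells */
        out->i = r;
        out->j = c;
        return true;
    }
    int mirror = n - 1 - r;      /* long row folded onto short row r */
    if (mirror == r)             /* odd n: the middle row has no partner */
        return false;
    out->i = mirror;
    out->j = c - r - 1;          /* right part: n - r cells */
    return true;
}

/* Host-side check: returns the number of idle (padding) cells if every
 * triangle element is produced exactly once, or -1 otherwise. */
static int fold_idle_cells(int n)
{
    enum { MAX_N = 64 };
    unsigned char hits[MAX_N][MAX_N] = {{0}};
    int idle = 0;

    if (n < 1 || n > MAX_N)
        return -1;
    for (int r = 0; r < fold_rows(n); ++r) {
        for (int c = 0; c <= n; ++c) {
            tri_index t;
            if (!fold_to_triangle(n, r, c, &t)) {
                ++idle;
                continue;
            }
            if (t.i < 0 || t.i >= n || t.j < 0 || t.j > t.i)
                return -1;
            hits[t.i][t.j]++;
        }
    }
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j)
            if (hits[i][j] != 1)
                return -1;
    return idle;
}
```

With this mapping, an even `n` leaves no idle cells, and an odd `n` leaves `(n + 1) / 2` idle cells in the middle row, compared with `n * (n - 1) / 2` idle cells when the full `n` by `n` square is launched. In a kernel, `r` and `c` would presumably come from `get_global_id(0)` and `get_global_id(1)`, or from the group and local ids, depending on which axis the work group spans.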