the straightforward logic, both aspects could be improved at once by handling everything on the GPU. But the potential for auto-vectorization of the CPU code is too intriguing to ignore, so the GPU changes will have to wait for another time. Going down the
4x Code Performance with SIMD