which load 128 bits, or 4 floats, at a time. But how is that possible if the values need to be contiguous in memory, while our position and speed values are interleaved? If we look around, we'll find a hint in commands like `shuffle` and `unpack`. We again don't need to fully follow the
4x Code Performance with SIMD