and shuffle them around as needed for the actual math operations. Storing the values back to memory takes a bit more finagling, but the overhead is clearly worth it, since performance did improve. One of our machines supports larger SIMD registers. While we've seen SSE instructions