The first operation is a member function to transpose a matrix. It isn't immediately obvious how to optimize this with these additional registers, but after playing around with it for a while, I started to make some progress. To test the performance difference, I create two matrices in a test program - one using the SSE version of the function and one using the plain version. Then I perform the operation 10,000,000 times on one of them, taking a cycle count before and after the loop, and repeat the same loop for the other matrix and compare. That should be enough iterations to average out any OS activity or thread context switches that occur during the test. The processor's memory caching should also be effectively factored out of the test, since all of the data is created on the stack - both matrices reside in the L1 cache. (This won't be the case in real use of the matrices, but since both versions pay the same memory access penalty, it's alright to ignore it for now.)
On average, using std::swap to swap the required matrix entries (there are six swap operations) takes approximately 24 cycles, while my SSE version takes 20 cycles. That's a pretty good start: roughly 17% faster for very little effort. Of course, matrix transposing isn't the most common operation, so we'll see much better performance gains with other functions. I'll post about those next time...