¶Compiler intrinsics... again
You know that episode of The Simpsons where Bart reaches for the electrified cookie jar and goes "ow," and then just keeps doing it again and again? Yeah, I'm like that with compiler intrinsics.
Let's take a simple routine:
__m128i fold1(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(mask, _mm_srli_epi16(x, 1)),
                         _mm_and_si128(mask, x));
}
This is one step of a population count routine, which folds pairs of bits together into two-bit counts. (Yeah, I know this can be done better with subtraction, but popcount isn't the subject here.) Run this through VC10, and you get this:
movdqa xmm1,xmmword ptr [__xmm@0]
movdqa xmm2,xmm0
movdqa xmm0,xmm1
movdqa xmm3,xmm2
psrlw xmm3,1
pand xmm0,xmm3
pand xmm1,xmm2
paddw xmm0,xmm1
ret
Unnecessary moves blah blah blah... you've heard it here before. Then again, let's take a closer look. Why did the compiler emit the MOVDQA XMM3, XMM2 instruction? Hmm, it's because it did the shift next, but it still needed to keep "x" around for the second operation. And how about that PAND that follows? Well, it couldn't modify "mask," so it copied that too. Waaaiit a minute, it's just doing everything exactly the way I told it. That might be OK if x86 used three-operand instructions, but since x86 is two-operand, that kinda sucks. What if we rewrote the routine this way:
__m128i fold2(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(_mm_srli_epi16(x, 1), mask),
                         _mm_and_si128(mask, x));
}
movdqa xmm1,xmmword ptr [__xmm@0]
movdqa xmm2,xmm0
psrlw xmm0,1
pand xmm0,xmm1
pand xmm1,xmm2
paddw xmm0,xmm1
ret
Well, that looks a bit better. It appears that Visual C++ is unable to take advantage of the fact that the binary operations used here are commutative, which means that the efficiency of the code generated can differ significantly based on the order of the arguments even though the result is the same. The upside is that you can swap around arguments to get better code; the downside is that you're doing what the code generator should be doing. Interestingly, based on some experiments it looks like the code generator can do this for scalar operations, so something didn't get hooked up or extended to the intrinsics portion.
Anyway, if you've got extra moves showing up in the disassembly when using intrinsics, try shaking the expression tree a bit and see if some of the moves fall out.