A Visual C++ x64 code generation peculiarity
Take an SSE2 routine to do some filtering:
#include <emmintrin.h>
void filter(__m128i *dst,
            const __m128i *table,
            const unsigned char *indices,
            unsigned n)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    while(n--) {
        const __m128i *kernel = &table[*indices++ * 2];

        acc0 = _mm_add_epi16(acc1, kernel[0]);
        acc1 = kernel[1];

        *dst++ = acc0;
    }
}
This routine uses each index to look up a premultiplied kernel and adds it into a short output window (8 samples); the output stream runs at 4x the rate of the input stream. In a real routine the kernels would typically be a bit longer, but you might use something like this to simultaneously upsample a row of pixels or a block of audio and push it through a non-linear curve.
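In case the intrinsics obscure it, here is a scalar sketch of the same computation, assuming each __m128i holds eight 16-bit samples; the function name and the layout comments are mine, not part of the original routine:

/* Each index selects a pair of premultiplied 8-sample kernels: the first
   half is added into the pending output window and the second half is
   carried into the next iteration, mirroring acc0/acc1 above. */
void filter_scalar(short *dst,                   /* receives n*8 samples */
                   const short *table,           /* kernel pairs, 16 shorts each */
                   const unsigned char *indices,
                   unsigned n)
{
    short carry[8] = {0};

    while(n--) {
        const short *kernel = &table[*indices++ * 16];

        for(int i = 0; i < 8; ++i) {
            *dst++ = (short)(carry[i] + kernel[i]);   /* wrapping add, like paddw */
            carry[i] = kernel[8 + i];
        }
    }
}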
If we look at the output of the VC++ x86 compiler, the result is decent:
0020: 0F B6 01              movzx       eax,byte ptr [ecx]
0023: C1 E0 05              shl         eax,5
0026: 66 0F 6F C1           movdqa      xmm0,xmm1
002A: 66 0F FD 04 38        paddw       xmm0,xmmword ptr [eax+edi]
002F: 66 0F 6F 4C 38 10     movdqa      xmm1,xmmword ptr [eax+edi+10h]
0035: 66 0F 7F 02           movdqa      xmmword ptr [edx],xmm0
0039: 41                    inc         ecx
003A: 83 C2 10              add         edx,10h
003D: 4E                    dec         esi
003E: 75 E0                 jne         00000020
However, if we look at x64:
0010: 41 0F B6 00           movzx       eax,byte ptr [r8]
0014: 48 83 C1 10           add         rcx,10h
0018: 49 FF C0              inc         r8
001B: 03 C0                 add         eax,eax
001D: 48 63 D0              movsxd      rdx,eax
0020: 48 03 D2              add         rdx,rdx
0023: 41 FF C9              dec         r9d
0026: 66 41 0F FD 0C D2     paddw       xmm1,xmmword ptr [r10+rdx*8]
002C: 66 0F 6F C1           movdqa      xmm0,xmm1
0030: 66 41 0F 6F 4C D2 10  movdqa      xmm1,xmmword ptr [r10+rdx*8+10h]
0037: 66 0F 7F 41 F0        movdqa      xmmword ptr [rcx-10h],xmm0
003C: 75 D2                 jne         0010
It turns out that there are a couple of weirdnesses involved when the x64 compiler hits this code. The x86 compiler is able to fold the x2 from the indexing expression and the x16 from the 128-bit (__m128i) element size into a single x32, which is then converted into a left shift by 5 bits (shl). The x64 compiler is not, and ends up emitting x2 + x2 + x8. Why?
The clue as to what's going on is in the MOVSXD instruction, which is a sign extension instruction. According to the C/C++ standards, integral expressions involving values smaller than int are promoted to int, which in the case of Win32/Win64 is 32-bit. Therefore, the expression (*indices++ * 2) gives a signed 32-bit integer. For the x86 compiler, pointers are also 32-bit and so it just shrugs and uses the signed value. The x64 compiler has to deal with a conversion to a 64-bit pointer offset, however, and seems unable to recognize that an unsigned char multiplied by 2 will never be negative, so it emits sign extension code.
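The effect can be reproduced in isolation. This stripped-down lookup, which is only an illustration and not part of the original routine, has the same shape as the indexing expression above:

#include <emmintrin.h>

/* *indices is promoted to (signed) int before the multiply, so forming the
   64-bit address means widening a signed 32-bit value -- hence the movsxd --
   unless the compiler can prove the result is never negative. */
const __m128i *lookup(const __m128i *table, const unsigned char *indices)
{
    return &table[*indices * 2];
}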
Therefore, we should change the code to remove the intermediate signed type:
#include <emmintrin.h>
void filter(__m128i *dst,
            const __m128i *table,
            const unsigned char *indices,
            unsigned n)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    while(n--) {
        const __m128i *kernel = &table[*indices++ * 2U];

        acc0 = _mm_add_epi16(acc1, kernel[0]);
        acc1 = kernel[1];

        *dst++ = acc0;
    }
}
Now we are multiplying by an unsigned integer, so the result must be an unsigned int. The x64 compiler now generates the following:
0090: 4C 8B D2              mov         r10,rdx
0093: 66 0F EF C9           pxor        xmm1,xmm1
0097: 45 85 C9              test        r9d,r9d
009A: 74 30                 je          00CC
009C: 0F 1F 40 00           nop         dword ptr [rax]
00A0: 41 0F B6 00           movzx       eax,byte ptr [r8]
00A4: 48 83 C1 10           add         rcx,10h
00A8: 49 FF C0              inc         r8
00AB: 8D 14 00              lea         edx,[rax+rax]
00AE: 48 03 D2              add         rdx,rdx
00B1: 41 FF C9              dec         r9d
00B4: 66 41 0F FD 0C D2     paddw       xmm1,xmmword ptr [r10+rdx*8]
00BA: 66 0F 6F C1           movdqa      xmm0,xmm1
00BE: 66 41 0F 6F 4C D2 10  movdqa      xmm1,xmmword ptr [r10+rdx*8+10h]
00C5: 66 0F 7F 41 F0        movdqa      xmmword ptr [rcx-10h],xmm0
00CA: 75 D4                 jne         00A0
Better, but still not quite there. The x64 compiler no longer needs to sign extend the offset, and can instead take advantage of the implicit zero extension that x64 performs when working with 32-bit registers. (New x64 programmers are often confused by the compiler emitting MOV EAX,EAX instructions, which are not no-ops: they zero the high dword.) However, the compiler is still unable to fuse the additions together. A bit of experimentation with the kernel size multiplier reveals that the x64 compiler has an unusual attachment to the trick of doing an x2 add followed by an x8 scale in order to index 16-byte elements. In this particular case the two adds might conceivably be faster than a shift on some CPUs, but with larger multipliers the compiler generates a SHL followed by an ADD, which is never optimal.
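To make the arithmetic concrete, here is a minimal sketch, separate from the filter, of the two equivalent ways of forming the byte offset; the function names are illustrative, and the only point is the identity ((i + i) + (i + i)) * 8 == i * 32:

#include <emmintrin.h>

/* Single x32 scale, expressed as a shift by 5. */
const __m128i *kernel_shift(const __m128i *table, unsigned i)
{
    return (const __m128i *)((const char *)table + ((size_t)i << 5));
}

/* The x2 + x2 + x8 decomposition the x64 compiler falls back to. */
const __m128i *kernel_adds(const __m128i *table, unsigned i)
{
    size_t twice = (size_t)i + i;                                         /* x2 */
    return (const __m128i *)((const char *)table + (twice + twice) * 8);  /* x2, then x8 scale */
}

Therefore, let's take over the indexing entirely: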
#include <emmintrin.h>
void filter(__m128i *dst,
            const __m128i *table,
            const unsigned char *indices,
            unsigned n)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    while(n--) {
        const __m128i *kernel = (const __m128i *)
            ((const char *)table + (*indices++ * 32U));

        acc0 = _mm_add_epi16(acc1, kernel[0]);
        acc1 = kernel[1];

        *dst++ = acc0;
    }
}
Ugly? Definitely, but we're having to work around optimizer shortcomings here. Result:
0060: 41 0F B6 00           movzx       eax,byte ptr [r8]
0064: 48 83 C1 10           add         rcx,10h
0068: 49 FF C0              inc         r8
006B: C1 E0 05              shl         eax,5
006E: 41 FF C9              dec         r9d
0071: 66 0F FD 0C 10        paddw       xmm1,xmmword ptr [rax+rdx]
0076: 66 0F 6F C1           movdqa      xmm0,xmm1
007A: 66 0F 6F 4C 10 10     movdqa      xmm1,xmmword ptr [rax+rdx+10h]
0080: 66 0F 7F 41 F0        movdqa      xmmword ptr [rcx-10h],xmm0
0085: 75 D9                 jne         0060
That's better.
Conclusion: check your critical loops after porting to x64, even if they're using intrinsics and don't require fixes. There may be changes in code generation due to both architectural and compiler differences.