¶Optimizing for the Intel Atom CPU
I recently picked up a netbook with the Intel Atom CPU in it, and was pleasantly surprised by its performance. The Atom CPU is no rocket, but it does run at 1.6GHz and it wasn't too long ago that the fastest desktop CPUs were still well below 1GHz. Yeah, it's in-order... but so was the Pentium 120 that I had when I started writing VirtualDub, so big deal. Unsurprisingly, the old MPEG-1 files I used to test with still played just fine.
Now, I was a little bit more worried about Altirra, because its system requirements are higher and it has a strict real-time requirement. I was relieved to find out that it runs in real time on the Atom at around 20% of the CPU, but what was surprising was that one particular loop in the video subsystem was taking a tremendous amount of CPU time:
for(int i=0; i<w4; ++i) {
dst[0] = dst[1] = colorTable[priTable[src[0]]];
dst[2] = dst[3] = colorTable[priTable[src[1]]];
dst[4] = dst[5] = colorTable[priTable[src[2]]];
dst[6] = dst[7] = colorTable[priTable[src[3]]];
src += 4;
dst += 8;
}
What this loop does is translate from raw playfield and sprite data into 8-bit pixels, first going through a priority table and then a color table. The highest dot clock on the Atari is 7MHz (one-half color clock per pixel), but this handles the low-resolution modes which can only output at 3.5MHz, so each pixel is doubled up. This routine wasn't showing up hot on the systems I had tried previously, but on the Atom-based system it was #2 on the CodeAnalyst profile, right below the CPU core.
I hadn't done any Atom optimization before, so I dug around the usual sites for information. Everyone knows the Atom is an in-order core, so lots of branching and cache misses are bad news. However, the loop above is fairly well behaved because the priority table is small (256 bytes) and the color table is even smaller (23 bytes). Looking through the Intel optimization guide, however, this caught my eye:
12.3.2.2 Address Generation
The hardware optimizes the general case of instruction ready to execute must have data ready, and address generation precedes data being ready. If address generation encounters a dependency that needs data from another instruction, this dependency in address generation will incur a delay of 3 cycles.
This has dire consequences for any routine that does heavy table lookups. Address generation interlock (AGI) stalls are a consequence of CPU pipelining setups where address generation is performed by a separate stage ahead of the main execution stage; the benefit is that address generation can overlap execution instead of extending instruction time, but the downside is that a stall has to occur if the data isn't ready in time. In IA-32, this first became a problem in the 80486, where a one-clock stall occurred if you indexed using the result of the previous instruction. AGI stalls then became slightly more serious with the Pentium, where you then had to ensure that an instruction pair didn't generate an address from the result of the previous pair, usually by putting another pair of instructions between. The Atom has a much larger window of 3 cycles to cover, which is a lot harder when you only have eight GPRs.
But it gets worse.
The Pentium has two execution pipes that run in parallel, called the U and V pipes, both of which can execute one load or store instruction per cycle. The Atom too has two integer execution units and can execute a pair of instructions per clock at peak. However, unlike the Pentium, the Atom can only execute integer loads and stores in pipe 0. This means that not only does the Atom have a huge latency window to cover when doing table lookups, but it's also bottlenecked on only one of its two execution pipes. Yuck.
How bad is it? Well, let's look at the code that Visual C++ 2005 generated for that loop:
00000021: 0F B6 11 movzx edx,byte ptr [ecx]
00000024: 0F B6 14 32 movzx edx,byte ptr [edx+esi]
00000028: 8A 14 3A mov dl,byte ptr [edx+edi]
0000002B: 88 50 01 mov byte ptr [eax+1],dl
0000002E: 88 10 mov byte ptr [eax],dl
00000030: 0F B6 51 01 movzx edx,byte ptr [ecx+1]
00000034: 0F B6 14 32 movzx edx,byte ptr [edx+esi]
00000038: 8A 14 3A mov dl,byte ptr [edx+edi]
0000003B: 88 50 03 mov byte ptr [eax+3],dl
0000003E: 88 50 02 mov byte ptr [eax+2],dl
00000041: 0F B6 51 02 movzx edx,byte ptr [ecx+2]
00000045: 0F B6 14 32 movzx edx,byte ptr [edx+esi]
00000049: 8A 14 3A mov dl,byte ptr [edx+edi]
0000004C: 88 50 05 mov byte ptr [eax+5],dl
0000004F: 88 50 04 mov byte ptr [eax+4],dl
00000052: 0F B6 51 03 movzx edx,byte ptr [ecx+3]
00000056: 0F B6 14 32 movzx edx,byte ptr [edx+esi]
0000005A: 8A 14 3A mov dl,byte ptr [edx+edi]
0000005D: 88 50 07 mov byte ptr [eax+7],dl
00000060: 88 50 06 mov byte ptr [eax+6],dl
00000063: 83 C1 04 add ecx,4
00000066: 83 C0 08 add eax,8
00000069: 83 EB 01 sub ebx,1
0000006C: 75 B3 jne 00000021
Take the Atom's behavior with regard to AGI stalls and memory access pipe restrictions into account and it's not hard to see that this is very, very bad code to execute on the Atom. It's been my experience that Visual C++ tends to make little or no effort at interleaving instructions, which is rational when you consider the behavior of many PPro-derived architectures with regard to out-of-order execution, register renaming stalls, and avoiding register spills with the pathetically low register count. On the Atom, however, it leads to very poor performance in this case because nearly all of the code is serialized in one pipe and all dependent lookups are placed back-to-back for maximum stallage.
So what can we do? Well, time to dust off the old Pentium-era U/V pipe skills:
;eax Temp 0
;ebx Temp 1
;ecx Source
;edx **Unused
;esi Color table
;edi Priority table
;ebp Destination
;esp Counter
xloop:
add ebp, 4 ;ALU1movzx eax, [ecx+1] ;ALU0movzx ebx, [ecx] ;ALU0
add ecx, 2 ;ALU1movzx eax, [esi+eax] ;ALU0movzx ebx, [esi+ebx] ;ALU0mov al, [edi+eax] ;ALU0mov bl, [edi+ebx] ;ALU0
mov ah, al ;ALU0/1shl eax, 16 ;ALU0
mov bh, bl ;ALU0/1mov ax, bx ;ALU0/1mov [ebp-4], eax ;ALU0
sub esp, 2 ;ALU0/1jae xloop ;B
Ugly? Definitely. You might have noticed that I reused the stack pointer but left EDX free. That's because I found a way to open up that register and was trying to open up another register so that I could increase the parallelism from 2x to 4x to remove the remaining AGI stalls, but I couldn't find a way to do it. One beneficial side effect to the Atom's in-order architecture that I've leveraged here is that partial register stalls largely aren't a problem. With most modern x86 CPUs, it's become regular practice to avoid merging results in low/high byte registers such as AH/AL, because of various penalties associated with doing so. At a minimum you'd end up taking a couple of clocks of penalties, and on older P6 era CPUs it would stall the pipeline. It appears that the only partial register issue in the Atom is that you can't have simultaneously executing instructions targeting different parts of the same register. That means it's open season again on combining bytes into words for free.
Anyway, benchmarking the old routine against the new routine for 228 source pixels (456 output pixels) gives 2700 clocks for the old routine and 1644 clocks for the new routine, a ~40% improvement. The CodeAnalyst profile of Altirra shows similar improvement, so this is a real gain. Unfortunately, on the Core 2 it's the opposite story with the new routine being half as fast: 775 clocks vs. 1412 clocks. This leads me to believe that optimizing for Atom largely requires new code paths instead of merely tweaking existing ones, in order to avoid regressions on faster machines.
Is it worth optimizing for Atom? I'm not sure. Certainly there are significant gains to be made, but not all applications are suited for netbooks. An Atom-based computer would surely not be a first choice for HD video editing. Optimizing for an architecture like this in a compiler also requires a very aggressive code generator, and my experience in the Pentium era was that compilers really weren't up to the task. Current versions of Visual Studio definitely aren't; supposedly Intel C/C++ now has some support for Atom optimization, but I don't know how effective it is. There's also the question of how much multithreading can help cover for the execution delays on single-threaded code, although in some ways that feels like replacing a big problem with an even bigger problem.
For the meantime, though, it definitely seems like what's old is new again.