Software pixel shader emulation in Windows Presentation Foundation (WPF)
Windows Presentation Foundation (WPF) gained an interesting feature in .NET Framework 3.5 SP1: the ability to execute pixel shader effects in software via a just-in-time (JIT) compiler. Issues with introducing features in service packs aside, this is a cool addition, since it allows the same pixel shader code to run on the GPU and on the CPU with reasonable performance on the latter. It's certainly better than the old effects system, which only supported software mode and required you to write a custom routine in a separate DLL.
Of course, being a sort of graphics guy but not a .NET kind of person, I had to dig into the shader jitter...
Interface
You use a custom shader effect in WPF via the System.Windows.Media.Effects namespace: derive from ShaderEffect, point a PixelShader object at a precompiled pixel shader, and attach the resulting effect to a UI element. There are facilities for binding dependency properties to samplers and float constants.
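Here's a minimal sketch of that pattern, roughly following the usual ShaderEffect boilerplate; the effect name, the Strength constant, and the myeffect.ps resource are hypothetical, but RegisterPixelShaderSamplerProperty, PixelShaderConstantCallback, and UpdateShaderValue are the stock plumbing:

public class MyEffect : ShaderEffect
{
    // Hypothetical precompiled ps_2_0 shader packaged as an application resource.
    private static readonly PixelShader _shader = new PixelShader
    {
        UriSource = new Uri("pack://application:,,,/myeffect.ps")
    };

    public MyEffect()
    {
        PixelShader = _shader;
        UpdateShaderValue(InputProperty);
        UpdateShaderValue(StrengthProperty);
    }

    // Brush bound to sampler register s0.
    public static readonly DependencyProperty InputProperty =
        ShaderEffect.RegisterPixelShaderSamplerProperty("Input", typeof(MyEffect), 0);
    public Brush Input
    {
        get { return (Brush)GetValue(InputProperty); }
        set { SetValue(InputProperty, value); }
    }

    // Scalar bound to float constant register c0.
    public static readonly DependencyProperty StrengthProperty =
        DependencyProperty.Register("Strength", typeof(double), typeof(MyEffect),
            new UIPropertyMetadata(1.0, PixelShaderConstantCallback(0)));
    public double Strength
    {
        get { return (double)GetValue(StrengthProperty); }
        set { SetValue(StrengthProperty, value); }
    }
}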
You can't get to the vertex shader, so you can't take advantage of the hardware interpolators, and any precomputation has to be done in C#. Samplers can be switched between point and bilinear sampling, but there are no mipmaps and no support for wrapping.
WPF lets you choose among three modes for the pixel shader: Auto, HardwareOnly, and SoftwareOnly. When SoftwareOnly is used, WPF converts the pixel shader to SSE2-based code using just-in-time (JIT) compilation, which is what we're interested in here.
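Selecting the mode is just a property on the PixelShader object; for example, inside the hypothetical MyEffect above you could add one line to the constructor:

// Force the JIT-compiled software path regardless of GPU capability
// (Auto and HardwareOnly are the other two values).
_shader.ShaderRenderMode = ShaderRenderMode.SoftwareOnly;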
The documentation doesn't give a lot of direction or note gotchas and could use a lot of improvement, but I could say that about a lot of the .NET Framework docs. When I originally looked at the jitter I didn't have the docs at all and had to wing it via IntelliSense, but when I got access to them again I was surprised to find they still weren't a lot of help. There's much more useful information on Greg Schechter's blog about how to write custom ShaderEffects.
Validation
WPF does do some validation on the pixel shader and will reject many otherwise valid shaders, even if you're running in hardware-accelerated mode. First, your shader must use the vanilla ps_2_0 shader model – ps_1_1, ps_1_4, ps_2_a, ps_2_sw, and ps_3_0 will all be rejected. Second, attempting to use some features that aren't supported by WPF will also be caught and rejected, such as reading from color interpolators (v0).
What you can do, however, is cheat somewhat by compiling to the ps_2_a, ps_2_b, or ps_2_sw targets and then hacking the version token to ps_2_0 (0xFFFF0200). You won't get away with trying to use gradients or predication, but arbitrary swizzle does work. That makes sense, since arbitrary swizzles are easy to do in jitted code, and there isn't any special encoding in shader bytecode for extended swizzles vs. standard ps_2_0 swizzles. Doing this also allows you to exceed the standard ps_2_0 limits. I take no responsibility if you do this and your code breaks with a future WPF update, though.
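If you want to try it, patching the token is just overwriting the first DWORD of the compiled shader; a quick sketch, assuming a hypothetical effect_2a.ps compiled against ps_2_a:

using System.IO;

// The shader version token is the first little-endian DWORD of the bytecode;
// rewrite it to 0xFFFF0200 so WPF's validator sees a plain ps_2_0 shader.
byte[] code = File.ReadAllBytes("effect_2a.ps");
code[0] = 0x00;
code[1] = 0x02;
code[2] = 0xFF;
code[3] = 0xFF;
File.WriteAllBytes("effect_2a_patched.ps", code);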
Code generation
The jitter requires SSE2. It probably could have been implemented with SSE1+MMX, but the performance would likely have been mediocre, the fastest chips in that range being Athlon XPs. If you're experienced at writing vectorized image processing code you'll beat the jitter handily, but otherwise it doesn't do a bad job. I did all my experimentation on an SSE4.1-capable CPU but didn't see any instructions used beyond the SSE2 profile.
All pixel arithmetic is done in single precision using SSE. This may be a bit slower than fixed-point could be, but at least there are no precision or range surprises. One gotcha is that NaNs can appear, which you may not be used to if you have a shader model 2 level ATI card.
The jitter reorganizes shaders into structures of arrays (SOA) form and executes pixel shaders for four pixels in parallel. This means that a single SSE register holds one component of a register across four pixels. For instance, xmm0 might hold r0.x for pixels 0-3, and a dp3 instruction would look like this:
mulps xmm0, xmm3
mulps xmm1, xmm4
mulps xmm2, xmm5
addps xmm0, xmm1
addps xmm0, xmm2
SOA form avoids a lot of the swizzle traffic that would result from cross-component operations like a dot product, since SSE is poor at horizontal data movement and doesn't have free swizzles or write masks like pixel shaders do. The downsides are much greater register pressure, particularly due to constant bloating, and more complex code for operations that aren't naturally vectorizable, like table lookups. Pixel shader hardware does this too in a way, but the hardware works on 2x2 quads, whereas the jitter does 4x1. The hardware needs quads in order to compute mipmapping parameters and gradients, but the jitter never deals with mipmapped textures.
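For illustration, here's the same dp3 written as a scalar C# sketch, with each array standing in for one SSE register in the jitted code:

// Each array holds one component of a shader register for a 4x1 block of pixels,
// which is the role a single SSE register plays in the SOA-form jitted code.
float[] r0x = new float[4], r0y = new float[4], r0z = new float[4];
float[] r1x = new float[4], r1y = new float[4], r1z = new float[4];
float[] dst = new float[4];

// dp3 dst, r0, r1 across four pixels: three multiplies and two adds per pixel,
// with no horizontal shuffling required.
for (int i = 0; i < 4; i++)
    dst[i] = r0x[i] * r1x[i] + r0y[i] * r1y[i] + r0z[i] * r1z[i];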
Surprisingly, complex scalar operations are expanded inline: sincos turns into a series of muls and adds, and log is also emitted inline (although it is quite expensive). This is different from the Direct3D PSGP, which calls out to CRT transcendental functions instead when compiling vertex shaders.
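To give a feel for what "a series of muls and adds" means, here's a hedged C# sketch of the kind of inline polynomial expansion involved; the coefficients here are the ordinary Taylor ones, not necessarily the constants the jitter actually emits:

// Polynomial approximation of sin(x), assuming x has already been range-reduced
// to roughly [-pi/2, pi/2]; evaluated in Horner form as a chain of muls and adds.
static float SinApprox(float x)
{
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f + x2 * (1.0f / 120.0f - x2 * (1.0f / 5040.0f))));
}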
There isn't a lot of optimization done on the shaders. If you manage to get four rcps in a row, they'll all get coded even if they cancel out. Ordinarily this isn't too much of a problem, since the HLSL compiler will do a lot of optimization for you. It does mean the jitter misses some cases that only it could optimize, though, such as a vector multiply where two out of three components are multiplied by a constant zero. The jitter will strip dead stores and remove redundant moves.
Texture sampling is very slow, as the jitter generates several pages of machine code for every texld instruction. I'm not kidding about this – here's the generated code for just one pixel out of a 4x1 block:
lea ebx,[edi+1]
mov ecx,dword ptr [esp+70h]
mov edx,dword ptr [esp+74h]
movd xmm2,ecx
movaps xmmword ptr [esp+100h],xmm2
movd xmm3,edx
movaps xmmword ptr [esp+130h],xmm3
mov edx,dword ptr [esp+17Ch]
mov esi,dword ptr [esp+78h]
lea ecx,[eax+1]
movd xmm4,esi
movaps xmmword ptr [esp+160h],xmm4
shl edx,2
mov esi,dword ptr [esp+38h]
mov dword ptr [esp+190h],esi
mov esi,dword ptr [esp+178h]
imul eax,edx
imul ecx,edx
mov edx,dword ptr [esp+178h]
lea edx,[edx+eax]
lea esi,[esi+eax]
mov eax,dword ptr [esp+178h]
movd xmm5,dword ptr [edx+edi*4]
movd xmm6,dword ptr [esi+ebx*4]
mov esi,dword ptr [esp+178h]
lea esi,[esi+ecx]
lea eax,[eax+ecx]
movd xmm7,dword ptr [esi+edi*4]
movd xmm0,dword ptr [eax+ebx*4]
punpcklbw xmm5,xmm5
punpcklbw xmm6,xmm6
punpcklbw xmm7,xmm7
punpcklbw xmm0,xmm0
punpcklwd xmm5,xmm5
punpcklwd xmm6,xmm6
punpcklwd xmm7,xmm7
punpcklwd xmm0,xmm0
psrld xmm5,18h
psrld xmm6,18h
psrld xmm7,18h
psrld xmm0,18h
cvtdq2ps xmm5,xmm5
cvtdq2ps xmm6,xmm6
cvtdq2ps xmm7,xmm7
cvtdq2ps xmm0,xmm0
mov eax,dword ptr [esp+48h]
mov ebx,dword ptr [esp+58h]
mov ecx,dword ptr [esp+68h]
movd xmm1,dword ptr [esp+190h]
movd xmm4,eax
pshufd xmm1,xmm1,0
pshufd xmm4,xmm4,0
movd xmm3,ebx
movd xmm2,ecx
pshufd xmm3,xmm3,0
pshufd xmm2,xmm2,0
mulps xmm0,xmm1
mulps xmm7,xmm3
mulps xmm1,xmm6
movaps xmm6,xmmword ptr [esp+130h]
addps xmm7,xmm0
mulps xmm3,xmm5
movaps xmm5,xmmword ptr [esp+100h]
mulps xmm4,xmm7
movaps xmm7,xmmword ptr [esp+160h]
addps xmm3,xmm1
shufps xmm5,xmm5,93h
mulps xmm2,xmm3
shufps xmm6,xmm6,93h
addps xmm2,xmm4
movaps xmmword ptr [esp+80h],xmm2
shufps xmm7,xmm7,93h
mov edx,dword ptr [esp+80h]
mov esi,dword ptr [esp+84h]
movd xmm0,edx
movd xmm1,esi
mov esi,dword ptr [esp+17Ch]
addps xmm5,xmm0
movaps xmmword ptr [esp+0F0h],xmm5
addps xmm6,xmm1
movaps xmmword ptr [esp+120h],xmm6
mov edi,dword ptr [esp+88h]
mov eax,dword ptr [esp+14h]
movd xmm2,edi
mov ebx,dword ptr [esp+24h]
addps xmm7,xmm2
movaps xmmword ptr [esp+150h],xmm7
Now imagine that repeated four times for every texld instruction in your shader.
Needless to say, this bloats the generated code very quickly, and it's not unusual to see a compiled pixel shader exceed 4K. Have you read the SIGGRAPH paper on Larrabee, where they explained that texture sampling couldn't be done efficiently on the main core? Well, here's an example. Part of this is due to SSE2's poor support for expanding byte components into floats and all of the data conversions needed to get coordinates and subtexel offsets into the right places, but there are also some optimization issues in this specific implementation. The most glaring one is the use of DIVPS to divide the components by 255 after bilinear filtering – about twenty times slower than multiplying by the reciprocal. The generated code also branches at runtime on whether a sampler has bilinear filtering enabled, and computes a*(1-f) + b*f for linear interpolation when a + (b-a)*f would probably be faster. I didn't see any optimization for non-dependent texture fetches, so there's no major advantage to avoiding dependent reads – not that you could optimize them out much without access to the vertex shader anyway. One implication of all of this is that in some cases you may be better off using more ALU ops rather than a texture lookup, even though the texture lookup would be much faster on the GPU.
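To make those two nitpicks concrete, here's a hedged C# sketch of a bilinear fetch of a single 8-bit channel written the way I'd prefer: scale by 1/255 with a multiply instead of DIVPS, and lerp in the a + (b-a)*f form. The texel array layout here is hypothetical:

static float SampleBilinear(byte[] texels, int pitch, int x, int y, float fx, float fy)
{
    // Fetch the 2x2 neighborhood; fx/fy are the subtexel fractions in [0, 1).
    float p00 = texels[y * pitch + x];
    float p10 = texels[y * pitch + x + 1];
    float p01 = texels[(y + 1) * pitch + x];
    float p11 = texels[(y + 1) * pitch + x + 1];

    // a + (b - a)*f form: one multiply and two adds per interpolation.
    float top    = p00 + (p10 - p00) * fx;
    float bottom = p01 + (p11 - p01) * fx;
    float value  = top + (bottom - top) * fy;

    // Normalize to [0, 1] with a multiply by the reciprocal instead of a divide.
    return value * (1.0f / 255.0f);
}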
The output section is the other slow part of the generated code. As I said above, the code works on 4x1 blocks, so it has to handle the oddballs at the end. Unfortunately, it does so by storing the vector and then copying each pixel with scalar ops and a branch check, so it incurs store-forwarding stalls and is a bit slower than it could be:
movdqa xmmword ptr [esp-40h],xmm6
test esi,esi
mov edi,dword ptr [esp-40h]
mov dword ptr [edx],edi
je 047501A3
mov eax,dword ptr [esp-3Ch]
lea esi,[esi-1]
mov dword ptr [edx+4],eax
test esi,esi
je 047501B5
mov ebx,dword ptr [esp-38h]
lea esi,[esi-1]
mov dword ptr [edx+8],ebx
test esi,esi
je 047501C7
mov ecx,dword ptr [esp-34h]
lea esi,[esi-1]
mov dword ptr [edx+0Ch],ecx
I would have liked to see a straight 4x loop with a vector store followed by fixup code instead. The difference won't be noticeable for long shaders, but you might notice it on a very short one, such as if you're using the effect to generate an image instead of transforming one.
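What I had in mind is something like the following C# sketch – whole blocks get one vector-style store and only the tail takes the scalar path. RunShader4 and scanline are hypothetical stand-ins for the jitted shader body and the destination row:

static void FillScanline(uint[] scanline, int width, Action<int, uint[]> runShader4)
{
    // runShader4(x, block) stands in for the jitted shader body filling one 4x1 block.
    uint[] block = new uint[4];
    int x = 0;

    // Full blocks: one store of all four pixels (conceptually a single vector store).
    for (; x + 4 <= width; x += 4)
    {
        runShader4(x, block);
        Array.Copy(block, 0, scanline, x, 4);
    }

    // Fixup for the 1-3 leftover pixels at the end of the scanline.
    if (x < width)
    {
        runShader4(x, block);
        for (int i = 0; x + i < width; i++)
            scanline[x + i] = block[i];
    }
}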
Overall, the performance of the generated code is decent for ALU operations, but there's significant loop overhead and texture sampling is slow, so you want to avoid multipassing as much as possible. I will say that the overhead of the rest of WPF seems to be a much bigger problem than the jitted code; I did see some major slowdowns in the <10 fps range once I tried more complicated shaders, but a lot of the slowness was due to what looked like a slow alpha blend routine in wpfgfx_0300.dll and a lot of wasteful per-frame allocation, which caused the GC heap size to skyrocket. I don't care if I do have 2GB of memory – it's obnoxious for a simple app displaying an image to jump up to 1.5GB and start swapping things out just because I resized the window.
Bugs
Overall, the shader jitter in WPF is a lot more robust than I had expected.
One of the first mistakes I made was to use sampler s0 without binding anything to it. This works fine in hardware – probably by chance – but it drove me nuts when I was trying to test the software mode and couldn't figure out why no sampler was bound. The software engine also returns the wrong value for an unbound sampler, giving a dull red when the color components should be all zero.
The rsq instruction is supposed to have no more than 2^-22 error, but the jitter compiles it to an RSQRTPS instruction, which only guarantees about 2^-12. This means that expressions involving sqrt(), rsqrt(), and length() may be a bit sketchy in precision. The same goes for rcp, though I've heard modern CPUs actually compute it to much higher precision.
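For what it's worth, the standard fix is a single Newton-Raphson step after the estimate, which roughly doubles the number of correct bits; a hedged scalar sketch of the idea (the jitter doesn't do this):

// One Newton-Raphson refinement of a ~12-bit reciprocal square root estimate,
// bringing it close to full single precision; a vectorized version would apply
// the same multiply/subtract sequence to all four lanes.
static float RefineRsqrt(float x, float estimate)
{
    return estimate * (1.5f - 0.5f * x * estimate * estimate);
}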
There are some funny little issues in the generated code, such as this fragment from a texld instruction:
maxps xmm5,xmm4
maxps xmm6,xmm4
maxps xmm5,xmm4
maxps xmm6,xmm4
I couldn't think of why you'd need to do this, even with NaN behavior involved (minps/maxps are asymmetric with respect to specials). I initially suspected that this was a blown texture clamp and that two minps instructions were missing, but I couldn't get it to blow up with negative texture coordinates.
Conclusion
The WPF pixel shader jitter is actually fairly robust and performant, and should reliably support just about any shader that works in hardware mode. I believe it's currently the closest you can get to high-performance vectorized code purely from C#. My main complaint is that it's in WPF – this is technology that I would have liked to see in a core API somewhere in DirectX or Windows, rather than in the .NET Framework.