Software pixel shader emulation in Windows Presentation Foundation (WPF)
Windows Presentation Foundation (WPF) gained an interesting feature in .NET Framework 3.5 SP1: the ability to execute pixel shader effects in software via a just-in-time (JIT) compiler. Issues with introducing features in service packs aside, this is a cool addition, since it allows the same pixel shader code to run on the GPU and on the CPU with reasonable performance on the latter. It's certainly better than the old effects system, which only supported software mode and required you to write a custom routine in a separate DLL.
Of course, being a sort of graphics guy but not a .NET kind of person, I had to dig into the shader jitter...
Interface
You use a custom shader effect in WPF via the System.Windows.Media.Effects namespace: derive from ShaderEffect, point a PixelShader object at a precompiled pixel shader, and attach the resulting effect to a UI element. There are facilities for binding dependency properties to samplers and float constants.
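Here's a minimal sketch of that pattern, roughly following the usual ShaderEffect boilerplate; the effect name, the Strength constant, and the myeffect.ps resource are hypothetical, but RegisterPixelShaderSamplerProperty, PixelShaderConstantCallback, and UpdateShaderValue are the stock plumbing:

public class MyEffect : ShaderEffect
{
    // Hypothetical precompiled ps_2_0 shader packaged as an application resource.
    private static readonly PixelShader _shader = new PixelShader
    {
        UriSource = new Uri("pack://application:,,,/myeffect.ps")
    };

    public MyEffect()
    {
        PixelShader = _shader;
        UpdateShaderValue(InputProperty);
        UpdateShaderValue(StrengthProperty);
    }

    // Brush bound to sampler register s0.
    public static readonly DependencyProperty InputProperty =
        ShaderEffect.RegisterPixelShaderSamplerProperty("Input", typeof(MyEffect), 0);
    public Brush Input
    {
        get { return (Brush)GetValue(InputProperty); }
        set { SetValue(InputProperty, value); }
    }

    // Scalar bound to float constant register c0.
    public static readonly DependencyProperty StrengthProperty =
        DependencyProperty.Register("Strength", typeof(double), typeof(MyEffect),
            new UIPropertyMetadata(1.0, PixelShaderConstantCallback(0)));
    public double Strength
    {
        get { return (double)GetValue(StrengthProperty); }
        set { SetValue(StrengthProperty, value); }
    }
}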
You can't get to the vertex shader, so you can't take advantage of the hardware interpolators, and any precomputation has to be done in C#. Samplers can be switched between point and bilinear sampling, but there are no mipmaps and no support for wrapping.
WPF lets you choose among three modes for the pixel shader: Auto, HardwareOnly, and SoftwareOnly. When SoftwareOnly is used, WPF converts the pixel shader to SSE2-based code using just-in-time (JIT) compilation, which is what we're interested in here.
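Selecting the mode is just a property on the PixelShader object; for example, inside the hypothetical MyEffect above you could add one line to the constructor:

// Force the JIT-compiled software path regardless of GPU capability
// (Auto and HardwareOnly are the other two values).
_shader.ShaderRenderMode = ShaderRenderMode.SoftwareOnly;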
The documentation doesn't give a lot of direction or note gotchas and could use a lot of improvement, but I could say that about a lot of the .NET Framework docs. When I originally looked at the jitter I didn't have the docs at all and had to wing it via IntelliSense, but when I got access to them again I was surprised to find they still weren't a lot of help. There's much more useful information on Greg Schechter's blog about how to write custom ShaderEffects.
Validation
WPF does do some validation on the pixel shader and will reject many otherwise valid shaders, even if you're running in hardware-accelerated mode. First, your shader must use the vanilla ps_2_0 shader model – ps_1_1, ps_1_4, ps_2_a, ps_2_sw, and ps_3_0 will all be rejected. Second, attempting to use some features that aren't supported by WPF will also be caught and rejected, such as reading from color interpolators (v0).
What you can do, however, is cheat somewhat by compiling to the ps_2_a, ps_2_b, or ps_2_sw targets and then hacking the version token to ps_2_0 (0xFFFF0200). You won't get away with trying to use gradients or predication, but arbitrary swizzle does work. That makes sense, since arbitrary swizzles are easy to do in jitted code, and there isn't any special encoding in shader bytecode for extended swizzles vs. standard ps_2_0 swizzles. Doing this also allows you to exceed the standard ps_2_0 limits. I take no responsibility if you do this and your code breaks with a future WPF update, though.
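If you want to try it, patching the token is just overwriting the first DWORD of the compiled shader; a quick sketch, assuming a hypothetical effect_2a.ps compiled against ps_2_a:

using System.IO;

// The shader version token is the first little-endian DWORD of the bytecode;
// rewrite it to 0xFFFF0200 so WPF's validator sees a plain ps_2_0 shader.
byte[] code = File.ReadAllBytes("effect_2a.ps");
code[0] = 0x00;
code[1] = 0x02;
code[2] = 0xFF;
code[3] = 0xFF;
File.WriteAllBytes("effect_2a_patched.ps", code);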
Code generation
The jitter requires SSE2. It probably could have been implemented with SSE1+MMX, but the performance would likely have been mediocre, the fastest chips in that range being Athlon XPs. If you're experienced at writing vectorized image processing code you'll beat the jitter handily, but otherwise it doesn't do a bad job. I did all my experimentation on an SSE4.1-capable CPU but didn't see any instructions used beyond the SSE2 profile.
All pixel arithmetic is done in single precision using SSE. This may be a bit slower than fixed-point could be, but at least there are no precision or range surprises. One gotcha is that NaNs can appear, which you may not be used to if you have a shader model 2 level ATI card.
The jitter reorganizes shaders into structures of arrays (SOA) form and executes pixel shaders for four pixels in parallel. This means that a single SSE register holds one component of a register across four pixels. For instance, xmm0 might hold r0.x for pixels 0-3, and a dp3 instruction would look like this:
mulps xmm0, xmm3
mulps xmm1, xmm4
mulps xmm2, xmm5
addps xmm0, xmm1
addps xmm0, xmm2
SOA form avoids a lot of the swizzle traffic that would result from cross-component operations like a dot product, since SSE is poor at horizontal data movement and doesn't have free swizzles or write masks like pixel shaders do. The downsides are much greater register pressure, particularly due to constant bloating, and more complex code for operations that aren't naturally vectorizable, like table lookups. Pixel shader hardware does this too in a way, but the hardware works on 2x2 quads, whereas the jitter does 4x1. The hardware needs quads in order to compute mipmapping parameters and gradients, but the jitter never deals with mipmapped textures.
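For illustration, here's the same dp3 written as a scalar C# sketch, with each array standing in for one SSE register in the jitted code:

// Each array holds one component of a shader register for a 4x1 block of pixels,
// which is the role a single SSE register plays in the SOA-form jitted code.
float[] r0x = new float[4], r0y = new float[4], r0z = new float[4];
float[] r1x = new float[4], r1y = new float[4], r1z = new float[4];
float[] dst = new float[4];

// dp3 dst, r0, r1 across four pixels: three multiplies and two adds per pixel,
// with no horizontal shuffling required.
for (int i = 0; i < 4; i++)
    dst[i] = r0x[i] * r1x[i] + r0y[i] * r1y[i] + r0z[i] * r1z[i];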
Surprisingly, complex scalar operations are expanded inline: sincos turns into a series of muls and adds, and log is also emitted inline (although it is quite expensive). This is different from the Direct3D PSGP, which calls out to CRT transcendental functions instead when compiling vertex shaders.
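To give a feel for what "a series of muls and adds" means, here's a hedged C# sketch of the kind of inline polynomial expansion involved; the coefficients here are the ordinary Taylor ones, not necessarily the constants the jitter actually emits:

// Polynomial approximation of sin(x), assuming x has already been range-reduced
// to roughly [-pi/2, pi/2]; evaluated in Horner form as a chain of muls and adds.
static float SinApprox(float x)
{
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f + x2 * (1.0f / 120.0f - x2 * (1.0f / 5040.0f))));
}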
There isn't a lot of optimization done on the shaders. If you manage to get four rcps in a row, they'll all get coded even if they cancel out. Ordinarily this isn't too much of a problem, since the HLSL compiler will do a lot of optimization for you. It does mean the jitter misses some cases that only it could optimize, though, such as a vector multiply where two out of three components are multiplied by a constant zero. The jitter will strip dead stores and remove redundant moves.
Texture sampling is very slow, as the jitter generates several pages of machine code for every texld instruction. I'm not kidding about this – here's the generated code for just one pixel out of a 4x1 block:
lea ebx,[edi+1]
mov ecx,dword ptr [esp+70h]
mov edx,dword ptr [esp+74h]
movd xmm2,ecx
movaps xmmword ptr [esp+100h],xmm2
movd xmm3,edx
movaps xmmword ptr [esp+130h],xmm3
mov edx,dword ptr [esp+17Ch]
mov esi,dword ptr [esp+78h]
lea ecx,[eax+1]
movd xmm4,esi
movaps xmmword ptr [esp+160h],xmm4
shl edx,2
mov esi,dword ptr [esp+38h]
mov dword ptr [esp+190h],esi
mov esi,dword ptr [esp+178h]
imul eax,edx
imul ecx,edx
mov edx,dword ptr [esp+178h]
lea edx,[edx+eax]
lea esi,[esi+eax]
mov eax,dword ptr [esp+178h]
movd xmm5,dword ptr [edx+edi*4]
movd xmm6,dword ptr [esi+ebx*4]
mov esi,dword ptr [esp+178h]
lea esi,[esi+ecx]
lea eax,[eax+ecx]
movd xmm7,dword ptr [esi+edi*4]
movd xmm0,dword ptr [eax+ebx*4]
punpcklbw xmm5,xmm5
punpcklbw xmm6,xmm6
punpcklbw xmm7,xmm7
punpcklbw xmm0,xmm0
punpcklwd xmm5,xmm5
punpcklwd xmm6,xmm6
punpcklwd xmm7,xmm7
punpcklwd xmm0,xmm0
psrld xmm5,18h
psrld xmm6,18h
psrld xmm7,18h
psrld xmm0,18h
cvtdq2ps xmm5,xmm5
cvtdq2ps xmm6,xmm6
cvtdq2ps xmm7,xmm7
cvtdq2ps xmm0,xmm0
mov eax,dword ptr [esp+48h]
mov ebx,dword ptr [esp+58h]
mov ecx,dword ptr [esp+68h]
movd xmm1,dword ptr [esp+190h]
movd xmm4,eax
pshufd xmm1,xmm1,0
pshufd xmm4,xmm4,0
movd xmm3,ebx
movd xmm2,ecx
pshufd xmm3,xmm3,0
pshufd xmm2,xmm2,0
mulps xmm0,xmm1
mulps xmm7,xmm3
mulps xmm1,xmm6
movaps xmm6,xmmword ptr [esp+130h]
addps xmm7,xmm0
mulps xmm3,xmm5
movaps xmm5,xmmword ptr [esp+100h]
mulps xmm4,xmm7
movaps xmm7,xmmword ptr [esp+160h]
addps xmm3,xmm1
shufps xmm5,xmm5,93h
mulps xmm2,xmm3
shufps xmm6,xmm6,93h
addps xmm2,xmm4
movaps xmmword ptr [esp+80h],xmm2
shufps xmm7,xmm7,93h
mov edx,dword ptr [esp+80h]
mov esi,dword ptr [esp+84h]
movd xmm0,edx
movd xmm1,esi
mov esi,dword ptr [esp+17Ch]
addps xmm5,xmm0
movaps xmmword ptr [esp+0F0h],xmm5
addps xmm6,xmm1
movaps xmmword ptr [esp+120h],xmm6
mov edi,dword ptr [esp+88h]
mov eax,dword ptr [esp+14h]
movd xmm2,edi
mov ebx,dword ptr [esp+24h]
addps xmm7,xmm2
movaps xmmword ptr [esp+150h],xmm7
Now imagine that repeated four times for every texld instruction in your shader.
Needless to say, this bloats the generated code very quickly, and it's not unusual to see a compiled pixel shader exceed 4K. Have you read the SIGGRAPH paper on Larrabee, where they explained that texture sampling couldn't be done efficiently on the main core? Well, here's an example. Part of this is due to SSE2's poor support for expanding byte components into floats and all of the data conversions needed to get coordinates and subtexel offsets into the right places, but there are also some optimization issues in this specific implementation. The most glaring one is the use of DIVPS to divide the components by 255 after bilinear filtering – about twenty times slower than multiplying by the reciprocal. The generated code also branches at runtime on whether a sampler has bilinear filtering enabled, and computes a*(1-f) + b*f for linear interpolation when a + (b-a)*f would probably be faster. I didn't see any optimization for non-dependent texture fetches, so there's no major advantage to avoiding dependent reads – not that you could optimize them out much without access to the vertex shader anyway. One implication of all of this is that in some cases you may be better off using more ALU ops rather than a texture lookup, even though the texture lookup would be much faster on the GPU.
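To make those two nitpicks concrete, here's a hedged C# sketch of a bilinear fetch of a single 8-bit channel written the way I'd prefer: scale by 1/255 with a multiply instead of DIVPS, and lerp in the a + (b-a)*f form. The texel array layout here is hypothetical:

static float SampleBilinear(byte[] texels, int pitch, int x, int y, float fx, float fy)
{
    // Fetch the 2x2 neighborhood; fx/fy are the subtexel fractions in [0, 1).
    float p00 = texels[y * pitch + x];
    float p10 = texels[y * pitch + x + 1];
    float p01 = texels[(y + 1) * pitch + x];
    float p11 = texels[(y + 1) * pitch + x + 1];

    // a + (b - a)*f form: one multiply and two adds per interpolation.
    float top    = p00 + (p10 - p00) * fx;
    float bottom = p01 + (p11 - p01) * fx;
    float value  = top + (bottom - top) * fy;

    // Normalize to [0, 1] with a multiply by the reciprocal instead of a divide.
    return value * (1.0f / 255.0f);
}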
The output section is the other slow part of the generated code. As I said above, the code works on 4x1 blocks, so it has to handle the oddballs at the end. Unfortunately, it does so by storing the vector and then copying each pixel with scalar ops and a branch check, so it incurs store-forwarding stalls and is a bit slower than it could be:
movdqa xmmword ptr [esp-40h],xmm6
test esi,esi
mov edi,dword ptr [esp-40h]
mov dword ptr [edx],edi
je 047501A3
mov eax,dword ptr [esp-3Ch]
lea esi,[esi-1]
mov dword ptr [edx+4],eax
test esi,esi
je 047501B5
mov ebx,dword ptr [esp-38h]
lea esi,[esi-1]
mov dword ptr [edx+8],ebx
test esi,esi
je 047501C7
mov ecx,dword ptr [esp-34h]
lea esi,[esi-1]
mov dword ptr [edx+0Ch],ecx
I would have liked to see a straight 4x loop with a vector store followed by fixup code instead. The difference won't be noticeable for long shaders, but you might notice it on a very short one, such as if you're using the effect to generate an image instead of transforming one.
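What I had in mind is something like the following C# sketch – whole blocks get one vector-style store and only the tail takes the scalar path. RunShader4 and scanline are hypothetical stand-ins for the jitted shader body and the destination row:

static void FillScanline(uint[] scanline, int width, Action<int, uint[]> runShader4)
{
    // runShader4(x, block) stands in for the jitted shader body filling one 4x1 block.
    uint[] block = new uint[4];
    int x = 0;

    // Full blocks: one store of all four pixels (conceptually a single vector store).
    for (; x + 4 <= width; x += 4)
    {
        runShader4(x, block);
        Array.Copy(block, 0, scanline, x, 4);
    }

    // Fixup for the 1-3 leftover pixels at the end of the scanline.
    if (x < width)
    {
        runShader4(x, block);
        for (int i = 0; x + i < width; i++)
            scanline[x + i] = block[i];
    }
}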
Overall, the performance of the generated code is decent for ALU operations, but there's significant loop overhead and texture sampling is slow, so you want to avoid multipassing as much as possible. I will say that the overhead of the rest of WPF seems to be a much bigger problem than the jitted code; I did see some major slowdowns in the <10 fps range once I tried more complicated shaders, but a lot of the slowness was due to what looked like a slow alpha blend routine in wpfgfx_0300.dll and a lot of wasteful per-frame allocation, which caused the GC heap size to skyrocket. I don't care if I do have 2GB of memory – it's obnoxious for a simple app displaying an image to jump up to 1.5GB and start swapping things out just because I resized the window.
Bugs
Overall, the shader jitter in WPF is a lot more robust than I had expected.
One of the first mistakes I made was to use sampler s0 without binding anything to it. This works fine in hardware – probably by chance – but it drove me nuts when I was trying to test the software mode and couldn't figure out why no sampler was bound. The software engine also returns the wrong value for an unbound sampler, giving a dull red when the color components should be all zero.
The rsq instruction is supposed to have no more than 2^-22 error, but the jitter compiles it to an RSQRTPS instruction, which only guarantees about 2^-12. This means that expressions involving sqrt(), rsqrt(), and length() may be a bit sketchy in precision. The same goes for rcp, though I've heard modern CPUs actually compute it to much higher precision.
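For what it's worth, the standard fix is a single Newton-Raphson step after the estimate, which roughly doubles the number of correct bits; a hedged scalar sketch of the idea (the jitter doesn't do this):

// One Newton-Raphson refinement of a ~12-bit reciprocal square root estimate,
// bringing it close to full single precision; a vectorized version would apply
// the same multiply/subtract sequence to all four lanes.
static float RefineRsqrt(float x, float estimate)
{
    return estimate * (1.5f - 0.5f * x * estimate * estimate);
}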
There are some funny little issues in the generated code, such as this fragment from a texld instruction:
maxps xmm5,xmm4
maxps xmm6,xmm4
maxps xmm5,xmm4
maxps xmm6,xmm4
I couldn't think of why you'd need to do this, even with NaN behavior involved (minps/maxps are asymmetric with respect to specials). I initially suspected that this was a blown texture clamp and that two minps instructions were missing, but I couldn't get it to blow up with negative texture coordinates.
Conclusion
The WPF pixel shader jitter is actually fairly robust and performant, and should reliably support just about any shader that works in hardware mode. I believe it's currently the closest you can get to high-performance vectorized code purely from C#. My main complaint is that it's in WPF – this is technology that I would have liked to see in a core API somewhere in DirectX or Windows, rather than in the .NET Framework.