¶Good approximation, bad approximation
Numerical approximations are a bit of an art. There are frequently tradeoffs available between speed and accuracy, and knowing how much you can skimp on one to improve the other takes a lot of careful study.
Take the humble reciprocal operation, for instance.
The reciprocal, y = 1/x, is a fairly basic operation, but it's also an expensive one. It can easily be implemented in terms of a divide, but division is itself a very expensive operation — it can't be parallelized as easily as addition or multiplication and typically takes on the order of 10-20 times longer. There are algorithms to compute the reciprocal directly to varying levels of precision, and some CPUs provide acceleration for that. x86 CPUs are among them, providing the RCPSS and RCPPS opcodes to compute scalar and vectorized reciprocal approximations, respectively.
However, there is a gotcha.
For any approximation, the first question you should ask is how accurate it is. RCPSS and RCPPS are documented as having a relative error no greater than 1.5*2^-12, or approximately 12 bits of precision. That's fine, and good enough for a lot of purposes, especially with refinement. The second question you should ask is whether there are any special values involved that deserve special attention. I can think of five that ideally should be exact:
- 0
- -/+ 1.0
- -/+ infinity
RCPSS/RCPPS do handle zero and infinity correctly, which is good. Sadly, 1.0 is handled less gracefully, as it comes out as 0.999756 (3F7FF000), at least on Core 2 CPUs. Even worse, if you attempt to refine the result using Newton-Raphson iterations:
x' = x * (2 - x*c)
...the result converges to 0.99999994039535522 (3F7FFFFF), a value just barely off from 1.0 that in many cases it will be printed as 1.0, such as in the VC++ debugger. This leads to lots of fun tracking down why calculations are slewing away from unity when they shouldn't, only to discover that the innocuous little 1.0 in the corner is actually an impostor, and then having to slow down an otherwise speedy routine to handle this special case. Argh!
If I had to take a guess as to why Intel did this, it's probably to avoid the need to propagate carries from the mantissa to the exponent, because otherwise the top 12 bits of the mantissa can go through a lookup table and the exponent can produce the result exponent and top bit of new mantissa. It's still really annoying, though. I have to go fix the rcp instruction in my shader engine, for instance, because the DirectX docs explicitly require that rcp of 1.0 stays 1.0. Curiously, they don't mention -1.0. I guess it just goes to show how hard it is to specify a good approximation.