For gcc, the sqrt() function is already a compiler intrinsic. By default, gcc produces an SSE sqrtsd instruction when you call sqrt(). The fsqrt in your example is actually the old pre-SSE instruction. You can force gcc to produce it by turning off SSE (with the -mfpmath=387 option) but the SSE variant is probably faster.
The article you linked is 10 years old and intended for 32-bit Windows programmers. I don’t think it’s applicable to modern 64-bit CPUs and compilers.
That article is a full decade old. Compilers are much better now. Are you sure that sqrt14 is still better today? Playing with godbolt.org, it looks to me like gcc already generates something very similar.
There is a CPU instruction specifically designed to accelerate normalization:
RSQRTSS
It computes 1/sqrt(x) so that the divisions aren’t necessary, and it uses single-precision floats. I’d be very interested to see how much it can speed things up. I don’t know if there is an easy way to get gcc to produce this instruction.
So gcc slips in a couple extra instructions to try to improve upon the precision of RSQRTSS, but overall the code is pretty tight. Just make sure to compile with the -mfpmath=sse, -msse and -ffast-math options, and use gcc 5 or later.
godbolt.org is really fun for this kind of stuff. We can see that the compiler is very sensitive sometimes. Using 1.0f vs 1.0 generates different code with gcc, though not with clang:
According to e.g. the AMD Ryzen spec sheet, RSQRTSS has five times the throughput of SQRTSS, so you lose some precision but you gain a lot of speed. Also, the divisions become unnecessary, and the throughput of MULSS is at least three times the throughput of DIVSS.
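For anyone who wants to try this on godbolt.org, here is a minimal pair that differs only in the literal. It is a sketch with made-up function names, not the exact code from the experiment. The C++ type rules are what make the two functions different: `1.0` is a `double`, so the second version has to do its division in double precision.

```cpp
#include <cmath>

// Everything stays in single precision: std::sqrt(float) returns float
// and 1.0f is a float, so the whole expression is a float computation.
float inv_norm_float(float x)
{
    return 1.0f / std::sqrt(x);
}

// 1.0 is a double literal: the float square root is promoted to double,
// the division happens in double precision, and the result is narrowed
// back to float on return.
float inv_norm_double(float x)
{
    return 1.0 / std::sqrt(x);
}
```

Comparing the assembly of the two with and without `-ffast-math` should show how much the literal alone changes the generated code.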
So I’d be very interested to see what the real-world performance is, because in theory it’s pretty good.
So playing with godbolt more, clang seems to already generate rsqrtss with today’s VTK code (even though VTK uses division and not multiplication!), but I can only convince gcc to use it if you don’t return the norm from Normalize():
Just a note that this flag makes VTK ABI incompatible with any code passing floats and doubles across its API boundaries (and there are a lot of those). It does this by basically ignoring what C and C++ require floating-point types to do in various situations. Instead, maybe some of the things it does can be hinted at or handled in VTK code explicitly?
There is no double-precision version. According to this Stack Overflow discussion, the rsqrtss instruction by itself provides only 11 accurate bits, much less than even the 24-bit mantissa of a single-precision float.
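The two shapes being compared look roughly like this. This is a sketch with hypothetical function names, not the actual VTK source. The first version needs the square root itself for its return value, while the second only ever needs the reciprocal.

```cpp
#include <cmath>

// Shape 1: normalizes v and returns the original norm.
// The sqrt result itself is part of the function's output.
float normalize_returning_norm(float v[3])
{
    const float norm = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    if (norm != 0.0f)
    {
        for (int i = 0; i < 3; ++i)
        {
            v[i] /= norm;
        }
    }
    return norm;
}

// Shape 2: normalizes v and returns nothing.
// Only 1/sqrt(len2) is needed, which is the form rsqrtss computes.
void normalize_in_place(float v[3])
{
    const float len2 = v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
    if (len2 > 0.0f)
    {
        const float inv = 1.0f / std::sqrt(len2);
        for (int i = 0; i < 3; ++i)
        {
            v[i] *= inv;
        }
    }
}
```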