Better sqrt

Is someone able to translate the sqrt14 function into gcc assembly?

double inline __declspec (naked) __fastcall sqrt14(double n)
{
	_asm fld qword ptr [esp+4]
	_asm fsqrt
	_asm ret 8
}


For gcc, the sqrt() function is already a compiler intrinsic. By default, gcc produces an SSE sqrtsd instruction when you call sqrt(). The fsqrt in your example is actually the old pre-SSE instruction. You can force gcc to produce it by turning off SSE (with the -mfpmath=387 option) but the SSE variant is probably faster.
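
For example, something like this (a sketch; the exact output depends on flags, and `my_sqrt` is just an illustrative name):

```cpp
#include <cmath>

// With optimization enabled, gcc compiles this call into a single
// sqrtsd instruction (plus a libm fallback branch for negative inputs,
// unless -fno-math-errno is given).
double my_sqrt(double n)
{
    return std::sqrt(n);
}
```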

The article you linked is 10 years old and aimed at 32-bit Windows programmers; I don’t think it’s applicable to modern 64-bit CPUs and compilers.

That article is a full decade old, and compilers are much better now. Are you sure that sqrt14 is still better today? Playing with it, it looks to me like gcc already generates something very similar.

It was just an idea. I would like to replace vtkMath::Normalize because it is too slow for my use case on my old Core2Duo…

There is a CPU instruction specifically designed to accelerate normalization: RSQRTSS.

It computes 1/sqrt(x) so that the divisions aren’t necessary, and it uses single-precision floats. I’d be very interested to see how much it can speed things up. I don’t know if there is an easy way to get gcc to produce this instruction.
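
One candidate, if intrinsics count as easy, is the `_mm_rsqrt_ss` intrinsic from `<xmmintrin.h>`. A minimal sketch, assuming an x86 target (`approx_rsqrt` is just an illustrative name):

```cpp
#include <xmmintrin.h>  // SSE intrinsics, x86 only
#include <cmath>

// Compiles to a single RSQRTSS; the result is only accurate to
// roughly 12 bits, so a Newton-Raphson step may be needed afterwards.
float approx_rsqrt(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
```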

Actually, gcc automatically produces RSQRTSS if the -ffast-math option is used:

g++-10 -O3 -mfpmath=sse -ffast-math -S normalize.cxx -o normalize.s

code for normalize.cxx:

#include <cmath>

void normalize(float x[3])
{
  float f = 1.0f/std::sqrt(x[0]*x[0] + x[1]*x[1] + x[2]*x[2]);
  x[0] *= f;
  x[1] *= f;
  x[2] *= f;
}
produced assembly code:

	movss	(%rdi), %xmm4
	movss	4(%rdi), %xmm3
	movss	8(%rdi), %xmm2
	movaps	%xmm4, %xmm1
	movaps	%xmm3, %xmm0
	mulss	%xmm3, %xmm0
	mulss	%xmm4, %xmm1
	addss	%xmm0, %xmm1
	movaps	%xmm2, %xmm0
	mulss	%xmm2, %xmm0
	addss	%xmm1, %xmm0
	movaps	%xmm0, %xmm5
	rsqrtss	%xmm5, %xmm5  # here it is
	mulss	%xmm5, %xmm0
	mulss	%xmm5, %xmm0
	mulss	.LC1(%rip), %xmm5
	addss	.LC0(%rip), %xmm0
	mulss	%xmm5, %xmm0
	mulss	%xmm0, %xmm4
	mulss	%xmm0, %xmm3
	mulss	%xmm2, %xmm0
	movss	%xmm4, (%rdi)
	movss	%xmm3, 4(%rdi)
	movss	%xmm0, 8(%rdi)
	.align 4
.LC0:
	.long	3225419776	# -3.0f
	.align 4
.LC1:
	.long	3204448256	# -0.5f

So gcc slips in a couple extra instructions to try to improve upon the precision of RSQRTSS, but overall the code is pretty tight. Just make sure to compile with the -mfpmath=sse, -msse and -ffast-math options, and use gcc 5 or later.
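
The extra instructions are one Newton-Raphson step for 1/sqrt(x). Written out in plain C++, this is a sketch of what the mulss/addss sequence after rsqrtss computes, using the constants -3.0f and -0.5f from the end of the listing (`refine_rsqrt` is an illustrative name):

```cpp
#include <cmath>

// One Newton-Raphson step: given a rough estimate f of 1/sqrt(x),
// return a refined estimate. (x*f*f - 3) * (f * -0.5) is algebraically
// f * (1.5 - 0.5*x*f*f), the standard reciprocal-sqrt refinement.
float refine_rsqrt(float x, float f)
{
    return (x * f * f - 3.0f) * (f * -0.5f);
}
```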

godbolt is really fun for this kind of stuff; we can see that the compiler is very sensitive sometimes. Using 1.0f vs 1.0 generates different code with gcc, though not with clang:

It’s not immediately clear which of these is faster at run time, and RSQRTSS has less precision.
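
The difference boils down to promotion. Roughly (illustrative snippets, not the exact godbolt code):

```cpp
#include <cmath>

// With 1.0f the whole expression stays in single precision and is
// eligible for rsqrtss under -ffast-math.
float inv_norm_single(float s) { return 1.0f / std::sqrt(s); }

// With 1.0 the float result of std::sqrt is promoted, the division is
// done in double precision, and extra conversions are needed.
float inv_norm_double(float s) { return 1.0 / std::sqrt(s); }
```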


According to e.g. the AMD Ryzen spec sheet, RSQRTSS has five times the throughput of SQRTSS, so you lose some precision but gain a lot of speed. Also, the divisions become unnecessary, and the throughput of MULSS is at least three times that of DIVSS.

So I’d be very interested to see what the real-world performance is, because in theory it’s pretty good.

There’s no double precision variant, is there?

So, playing with godbolt more, clang seems to already generate rsqrtss with today’s VTK code (even though VTK uses division and not multiplication!), but I can only convince gcc to use it if the norm is not returned from Normalize():
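
For reference, the two shapes being compared look roughly like this (hypothetical stand-ins, not VTK’s actual code):

```cpp
#include <cmath>

// Division-based variant that returns the norm, like vtkMath::Normalize;
// gcc keeps a real sqrt + divisions here even with -ffast-math.
float normalize_returning_norm(float v[3])
{
  float norm = std::sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
  v[0] /= norm; v[1] /= norm; v[2] /= norm;
  return norm;
}

// Void variant: with -ffast-math gcc turns this into rsqrtss.
void normalize_void(float v[3])
{
  float f = 1.0f / std::sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
  v[0] *= f; v[1] *= f; v[2] *= f;
}
```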


Just a note that this flag makes VTK ABI-incompatible with any code passing floats and doubles across its API boundaries (which happens in a lot of places). It does this by basically ignoring what C and C++ require floating-point types to do in various situations. Instead, maybe some of the things it does can be hinted at/handled explicitly in VTK code?

A good question!

There is no double-precision version. According to this stackoverflow discussion the rsqrtss instruction by itself provides only 11 accurate bits, much less than even the 24-bit mantissa of a single-precision float.