AVX2 + TSX

bronxzv · Jun 1, 2012

kernelc said:
Based on what I read about HPC workloads, it seems that non-destructive operations were something to be desired.

This is a common misconception about FMA4. Look at the actual FMA3 code below: extra moves are generally not required, and the occasional extra moves are handled at the rename stage anyway on modern CPUs.

Code:

vfmsub231ps ymm9, ymm10, ymm14 ;470.23
vfmsub231ps ymm8, ymm15, ymm12 ;470.23
vmulps ymm4, ymm4, ymm6 ;471.7
vfmsub231ps ymm7, ymm11, ymm13 ;470.23
vfmadd231ps ymm14, ymm11, ymm5 ;471.7
vfmadd231ps ymm13, ymm12, ymm5 ;471.7
vfmadd213ps ymm5, ymm10, ymm15 ;471.7
vfmadd231ps ymm14, ymm8, ymm4 ;471.7
vfmadd231ps ymm13, ymm9, ymm4 ;471.7
vfmadd213ps ymm4, ymm7, ymm5 ;471.7
vmulps ymm9, ymm14, ymm14 ;471.7
vfmadd231ps ymm9, ymm13, ymm13 ;471.7
add r12d, 8 ;466.30
vfmadd231ps ymm9, ymm4, ymm4 ;471.7

FMA4 will just be more readable in assembly, with no real performance advantage.
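For readers who do not have the FMA3 mnemonics memorized, here is a minimal Python sketch of what the 132/213/231 suffixes in the listing above mean. The helper name `fma3` and the dict used as a register file are hypothetical and exist only for illustration, and scalar floats stand in for whole ymm vectors. The digits name the two multiplicands and then the addend, counting the destination as operand 1, and the destination is always overwritten.

```python
# Toy model of FMA3 operand ordering; not a real emulator.
# `fma3` and the dict-based register file are illustrative names only.

def fma3(regs, form, dst, src2, src3, subtract=False):
    """Emulate vfmadd<form>ps / vfmsub<form>ps dst, src2, src3 on a dict of floats.

    Operands are numbered 1 (dst), 2 (src2), 3 (src3). The form digits give
    the two multiplicands followed by the addend, e.g. "231" is op2*op3 + op1.
    The result always overwrites dst (the "destructive" part of FMA3).
    Caveat: Python rounds the product before the add; hardware FMA rounds once.
    """
    ops = {"1": regs[dst], "2": regs[src2], "3": regs[src3]}
    a, b, c = (ops[digit] for digit in form)
    regs[dst] = a * b - c if subtract else a * b + c
    return regs[dst]


regs = {"ymm4": 2.0, "ymm5": 3.0, "ymm7": 5.0, "ymm9": 1.0, "ymm13": 4.0}

# vfmadd231ps ymm9, ymm13, ymm13  ->  ymm9 = ymm13*ymm13 + ymm9 (accumulator overwritten)
fma3(regs, "231", "ymm9", "ymm13", "ymm13")

# vfmadd213ps ymm4, ymm7, ymm5    ->  ymm4 = ymm7*ymm4 + ymm5 (a multiplicand overwritten)
fma3(regs, "213", "ymm4", "ymm7", "ymm5")

print(regs["ymm9"], regs["ymm4"])
```

Because the compiler can pick whichever form overwrites an operand that is no longer needed, a separate copy is only required when all three inputs stay live, which is consistent with the listing containing no register-to-register moves.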

Edrick · Jun 1, 2012

bronxzv said:
Why would they do such a strange thing? FMA4 has no significant advantage in practice vs. FMA3. What would the impact be? A 0.1% speedup?

Pure marketing. No other reason in my opinion.

bronxzv · Jun 1, 2012

Edrick said:
Pure marketing. No other reason in my opinion.

You must have a different definition of "marketing" than I do, then. There are a lot of *useful* things they could add to the ISA to satisfy market needs (for example wider vectors, scatter instructions, predication masks, ...)

kernelc · Jun 1, 2012

Edrick said:
Huh? Just about every game uses FMA (since GPUs have supported FMA for quite a few generations now). Now game developers can offload some of the FMA tasks to the CPUs.

GPUs use MADD (multiply-add, similar to FMA but with the product rounded before the addition, so it accumulates more rounding error) in the shaders and TMUs because many operations combine a multiplication with an addition (even a simple RGB bilinear sample requires 16 muls + 12 adds).
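The rounding difference between an unfused multiply-add and a fused one can be shown with a small Python sketch. It uses IEEE doubles and exact rational arithmetic to imitate a single final rounding; the function names are made up for this example, and it does not model the behavior of any particular GPU.

```python
from fractions import Fraction

def madd_two_roundings(a, b, c):
    """Unfused multiply-add: the product is rounded to double, then the sum is rounded."""
    return a * b + c

def fma_one_rounding(a, b, c):
    """Fused multiply-add imitation: compute a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# (1 + 2^-27) * (1 - 2^-27) = 1 - 2^-54, which is not representable as a double.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

print(madd_two_roundings(a, b, c))  # product rounds to 1.0 first, so the small term is lost
print(fma_one_rounding(a, b, c))    # the small term survives the single rounding
```

The inputs are chosen so that the exact product falls halfway between two doubles; rounding it before the add discards exactly the part that the subtraction was supposed to expose.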

The floating-point math used inside a game engine (to generate vertex coordinates) is generally based on matrix multiplications, so it will not benefit much from FMA.

On the other hand, AI and generic code do not use the FPU at all (or very little), so FMA will, again, never be used.

In the end: at the moment, FMA is useful mainly in the HPC/workstation world (for example, Itanium got FMA from the very start). FMA will also be useful for stream data processing (e.g. video encoding/decoding), but the necessary compiler/application support has yet to materialize.

Regards.

kernelc · Jun 1, 2012

bronxzv said:
This is a common misconception about FMA4. Look at the actual FMA3 code below: extra moves are generally not required, and the occasional extra moves are handled at the rename stage anyway on modern CPUs.

I read that in articles written by HPC guys. However, it is entirely possible that they want FMA4 basically for easier-to-read assembly code.

Regards.

bronxzv · Jun 1, 2012

kernelc said:
I read that in articles written by HPC guys.

Yes, it's a very common myth. I discussed it here just after compiling my first FMA3 program:
http://www.realworldtech.com/forums/index.cfm?action=detail&id=125386&threadid=125386&roomid=2

With FMA3 support in Trinity, I suppose it's just a matter of recompiling some code to compare the real-world difference in performance (if any) between FMA3 and FMA4.

kernelc · Jun 1, 2012

bronxzv said:
Yes, it's a very common myth. I discussed it here just after compiling my first FMA3 program:
http://www.realworldtech.com/forums/index.cfm?action=detail&id=125386&threadid=125386&roomid=2

With FMA3 support in Trinity, I suppose it's just a matter of recompiling some code to compare the real-world difference in performance (if any) between FMA3 and FMA4.

Mmm... interesting. I missed this thread on RWT!

Thank you

bronxzv · Jun 1, 2012

kernelc said:
Mmm... interesting. I missed this thread on RWT!

Thank you

btw I'm sure you'll agree with this comment of mine

http://www.realworldtech.com/forums/index.cfm?action=detail&id=125479&threadid=125386&roomid=2

based on your deep analysis of the Bulldozer (lack of) cache bandwidth

kernelc · Jun 1, 2012

bronxzv said:
btw I'm sure you'll agree with this comment of mine

http://www.realworldtech.com/forums/index.cfm?action=detail&id=125479&threadid=125386&roomid=2

based on your deep analysis of the Bulldozer (lack of) cache bandwidth

Sure: Bulldozer L2 bandwidth is quite low, so a more compact instruction can be somewhat advantageous.

An FMA3 vs. FMA4 recompilation on Trinity could be quite interesting...

Thanks.

bronxzv · Jun 1, 2012

kernelc said:
The floating-point math used inside a game engine (to generate vertex coordinates) is generally based on matrix multiplications, so it will not benefit much from FMA.

A typical 3D transform is a 3x3 by 3x1 multiply plus a 3x1 vector add, i.e. a perfect fit for FMA (9 muls and 9 adds, perfectly balanced).

kernelc · Jun 1, 2012

bronxzv said:
A typical 3D transform is a 3x3 by 3x1 multiply plus a 3x1 vector add, i.e. a perfect fit for FMA (9 muls and 9 adds, perfectly balanced).

When I did some tests with OpenGL transformations, I saw many more MULs than ADDs.

Am I missing something?

EDIT: for my experiments, I used only fixed-pipeline functions such as glMultMatrix, glLoadMatrix and the like. URL: http://www.talisman.org/opengl-1.1/Reference/glMultMatrix.html
Maybe you are referring to vertex-shader style programming?

Thanks.

bronxzv · Jun 1, 2012

kernelc said:
When I did some tests with OpenGL transformations, I saw many more MULs than ADDs.

Am I missing something?

Thanks.

Difficult to say without seeing the code. If you work only with 4x4 matrices (homogeneous coordinates), it can lead to a somewhat less balanced add/mul mix in some cases AFAIK. Also, a 3x3 by 3x3 or 4x4 by 4x4 multiply has more muls than adds (27 muls and 18 adds for 3x3), and cross products are even less balanced (6 muls and 3 adds).

btw, did you measure the actual runtime use or the occurrences in the code? Generally the critical loops are the balanced cases (vertex and normal transforms), while the outer loops do the square matrix multiplies (scene graph traversal).

bronxzv · Jun 1, 2012

kernelc said:
When I did some tests with OpenGL transformations, I saw many more MULs than ADDs.

Am I missing something?

EDIT: for my experiments, I used only fixed-pipeline functions such as glMultMatrix, glLoadMatrix and the like. URL: http://www.talisman.org/opengl-1.1/Reference/glMultMatrix.html
Maybe you are referring to vertex-shader style programming?

Thanks.

Generally the hot spots will use

Code:

c[0] c[4]  c[8] c[12]   v[0]
c[1] c[5]  c[9] c[13] X v[1]
c[2] c[6] c[10] c[14]   v[2]
c[3] c[7] c[11] c[15]   v[3]

from your example, i.e. 16 muls and 12 adds, which is already a pretty good fit for FMA

or in most cases

Code:

c[0] c[4] c[8]      v[0]     c[12]
c[1] c[5] c[9]   X  v[1]  +  c[13]
c[2] c[6] c[10]     v[2]     c[14]

will be enough if you don't need w (which is useful mostly for perspective projection). This is the perfect add/mul balance I was referring to.
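The mul/add counts quoted in this thread can be checked with a short Python sketch. The function names are hypothetical, and the "fused" figure is only a rough estimate that assumes every add can be paired with a multiply; it says nothing about latency or scheduling.

```python
def matmul_ops(m, k, n, add_vector=False):
    """Return (muls, adds) for an (m x k) by (k x n) product.

    Each of the m*n outputs is a dot product of length k: k muls and k-1 adds.
    With add_vector=True, one extra add per output models the translation term.
    """
    muls = m * k * n
    adds = m * (k - 1) * n
    if add_vector:
        adds += m * n
    return muls, adds

def fused_estimate(muls, adds):
    """Rough instruction count if each add is fused with a multiply where possible."""
    return max(muls, adds)

cases = {
    "3x3 * 3x1 + translation": matmul_ops(3, 3, 1, add_vector=True),
    "4x4 * 4x1 (homogeneous)": matmul_ops(4, 4, 1),
    "3x3 * 3x3": matmul_ops(3, 3, 3),
    "4x4 * 4x4": matmul_ops(4, 4, 4),
}

for name, (muls, adds) in cases.items():
    print(f"{name}: {muls} muls, {adds} adds, ~{fused_estimate(muls, adds)} ops with FMA")
```

The 3x3 transform with translation is the only case here where muls and adds match exactly, so every operation can in principle be an FMA; the others leave a few plain multiplies over.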

kernelc · Jun 1, 2012

Thank you bronxzv

I profiled these things years ago... I should check the code / profiler settings.

Anyway, the most important thing is that FMA can be used for vertex transformations. This is surely a more common scenario than HPC.

Thank you.

bronxzv · Jun 1, 2012

kernelc said:
Thank you bronxzv

I profiled these things years ago... I should check the code / profiler settings.

Anyway, the most important thing is that FMA can be used for vertex transformations. This is surely a more common scenario than HPC.

Thank you.

Just one more example of a perfect fit for FMA is the evaluation of polynomials using Horner's scheme:

((((a5*x+a4)*x+a3)*x+a2)*x+a1)*x+a0
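
As a minimal sketch of why this maps so well onto FMA, the nested expression above can be written as a loop in which every step is exactly one multiply followed by one add. The function name `horner` and the highest-degree-first coefficient order are choices made for this example; plain Python arithmetic is used, so each step here rounds twice, whereas a hardware FMA would round once.

```python
def horner(coeffs, x):
    """Evaluate a polynomial at x; coeffs are ordered [a_n, ..., a_1, a_0]."""
    acc = coeffs[0]
    for a in coeffs[1:]:
        # One multiply-add per remaining coefficient: the natural FMA candidate.
        acc = acc * x + a
    return acc

# Degree-5 polynomial x^5 + 2x^4 + 3x^3 + 4x^2 + 5x + 6, i.e. a5..a0 = 1..6.
coeffs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(horner(coeffs, 2.0))  # five multiply-add steps in total
```

A degree-n polynomial takes n such steps, with no leftover multiplies or adds, but each step depends on the previous one, so the chain is bound by FMA latency unless several polynomials are evaluated in parallel.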
