AVX2 + TSX


bronxzv

Senior member
Jun 13, 2011
460
0
71
Based on what I read about HPC workloads, it seems that non-destructive operations were something to be desired.

This is a common misconception about FMA4. Look at the actual FMA3 code below: extra moves are generally not required, and the occasional extra move is handled at the rename stage anyway on modern CPUs.

Code:
vfmsub231ps ymm9, ymm10, ymm14 ;470.23
vfmsub231ps ymm8, ymm15, ymm12 ;470.23
vmulps ymm4, ymm4, ymm6 ;471.7
vfmsub231ps ymm7, ymm11, ymm13 ;470.23
vfmadd231ps ymm14, ymm11, ymm5 ;471.7
vfmadd231ps ymm13, ymm12, ymm5 ;471.7
vfmadd213ps ymm5, ymm10, ymm15 ;471.7
vfmadd231ps ymm14, ymm8, ymm4 ;471.7
vfmadd231ps ymm13, ymm9, ymm4 ;471.7
vfmadd213ps ymm4, ymm7, ymm5 ;471.7
vmulps ymm9, ymm14, ymm14 ;471.7
vfmadd231ps ymm9, ymm13, ymm13 ;471.7
add r12d, 8 ;466.30
vfmadd231ps ymm9, ymm4, ymm4 ;471.7

FMA4 would just be more readable in assembly, with no real performance advantage.
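To make the point about destructive operands concrete, here is a toy Python model of the three FMA3 operand orderings (132, 213, 231). The register dictionary and the function names are made up for illustration, and plain Python arithmetic is not actually fused; the sketch only models the data flow. Whichever of the three inputs is dead afterwards can be picked as the destination, which is why an extra move is rarely needed.

```python
# Toy model of the FMA3 operand orderings. The destination is always the
# first operand; the digits say which operands are multiplied and added:
#   132: op1 = op1 * op3 + op2
#   213: op1 = op2 * op1 + op3
#   231: op1 = op2 * op3 + op1
# "regs" is a hypothetical register file: a dict from name to value.

def vfmadd132(regs, d, s2, s3):
    regs[d] = regs[d] * regs[s3] + regs[s2]

def vfmadd213(regs, d, s2, s3):
    regs[d] = regs[s2] * regs[d] + regs[s3]

def vfmadd231(regs, d, s2, s3):
    regs[d] = regs[s2] * regs[s3] + regs[d]

def demo():
    """Compute a*b + c three times, overwriting a different input each time."""
    results = {}

    regs = {"a": 2.0, "b": 3.0, "c": 4.0}
    vfmadd213(regs, "a", "b", "c")      # a = b*a + c, overwrites a
    results["a"] = regs["a"]

    regs = {"a": 2.0, "b": 3.0, "c": 4.0}
    vfmadd132(regs, "b", "c", "a")      # b = b*a + c, overwrites b
    results["b"] = regs["b"]

    regs = {"a": 2.0, "b": 3.0, "c": 4.0}
    vfmadd231(regs, "c", "a", "b")      # c = a*b + c, overwrites c
    results["c"] = regs["c"]

    return results
```

All three calls produce the same value; only the register that gets clobbered differs.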
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Pure marketing. No other reason in my opinion.

You must have a different definition of "marketing" than I do, then. There are a lot of *useful* things they can add to the ISA to satisfy market needs (for example wider vectors, scatter instructions, predication masks, ...)
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
Huh? Just about every game uses FMA (since GPUs have supported FMA for quite a few generations now). Now game developers can offload some of the FMA tasks to the CPUs.

GPUs use MADD (multiply-add, similar to FMA, but the intermediate product is rounded before the add) in the shaders and TMUs because many operations include multiplication + addition (even a simple RGB bilinear sample requires 16 muls + 12 adds).
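A small sketch of the rounding difference, in Python doubles rather than the single precision a GPU would typically use. The function names and the test values are my own, and the "fused" result is emulated with exact rational arithmetic rounded once, not computed with a hardware FMA. How a particular GPU's MADD treats the intermediate product is not something this sketch claims to model.

```python
from fractions import Fraction

def mad_unfused(a, b, c):
    """Multiply, round to double, then add and round again (two roundings)."""
    return a * b + c

def fma_reference(a, b, c):
    """Exact a*b + c in rational arithmetic, rounded to double only once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Values chosen so the exact product is 1 - 2**-60, which is not a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = mad_unfused(a, b, c)    # product rounds to 1.0, so the sum is 0.0
fused = fma_reference(a, b, c)    # keeps the small term: -2**-60
```

The unfused version loses the low-order term entirely, while the single-rounding version keeps it.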

The floating-point math used inside game engines (to generate vertex coordinates) is generally based on matrix multiplications, so it will not benefit much from FMA.

On the other hand, AI and generic code do not use the FPU at all (or very little), so FMA will, again, never be used there.

In the end: at the moment, FMA is useful mainly in the HPC/workstation world (for example, Itanium had FMA from the very start). FMA will also be useful for stream data processing (e.g. video encoding/decoding), but the necessary compiler/application support has yet to materialize.

Regards.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
This is a common misconception about FMA4. Look at the actual FMA3 code below: extra moves are generally not required, and the occasional extra move is handled at the rename stage anyway on modern CPUs.

I read that in articles written by HPC people. However, it is entirely possible that they want FMA4 mainly for easier-to-write assembly code.

Regards.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
The floating-point math used inside game engines (to generate vertex coordinates) is generally based on matrix multiplications, so it will not benefit much from FMA.

A typical 3D transform is a 3x3 by 3x1 multiply + a 3x1 vector add, i.e. a perfect fit for FMA (9 muls and 9 adds, perfectly balanced).
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
A typical 3D transform is a 3x3 by 3x1 multiply + a 3x1 vector add, i.e. a perfect fit for FMA (9 muls and 9 adds, perfectly balanced).

When I did some tests with OpenGL transformations, I saw many more MULs than ADDs.

Am I missing something?

EDIT: for my experiments, I used only fixed-pipeline functions such as glMultMatrix, glLoadMatrix and the like. URL: http://www.talisman.org/opengl-1.1/Reference/glMultMatrix.html
Maybe you are referring to vertex-shader style programming?

Thanks.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
When I did some tests with OpenGL transformations, I saw many more MULs than ADDs.

Am I missing something?

Thanks.

Difficult to say without seeing the code. If you work only with 4x4 matrices (homogeneous coordinates), it can lead to a somewhat less balanced add/mul mix in some cases, AFAIK. Also, a 3x3 by 3x3 or 4x4 by 4x4 multiply has more muls than adds (27 muls and 18 adds for 3x3), and cross products are even less balanced (6 muls and 3 adds).

Btw, did you measure the actual runtime use or the occurrences in the code? Generally the critical loops are the balanced cases (vertex and normal transforms), while the square matrix multiplies sit in outer loops (scene graph traversal).
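The counts quoted above follow from the textbook matrix product: every output element needs k multiplies and k - 1 adds. A quick sketch to reproduce them (the function names are mine, and this assumes the straightforward algorithm with no algebraic shortcuts):

```python
def matmul_op_counts(n, k, m):
    """Scalar (muls, adds) for an (n x k) by (k x m) textbook matrix product."""
    muls = n * k * m
    adds = n * (k - 1) * m
    return muls, adds

def affine_op_counts():
    """3x3 by 3x1 multiply followed by a 3x1 vector add."""
    muls, adds = matmul_op_counts(3, 3, 1)
    return muls, adds + 3          # three more adds for the translation

def cross_product_op_counts():
    """Each of the 3 components is a difference of two products."""
    return 6, 3

counts = {
    "3x3 * 3x1 + 3x1": affine_op_counts(),          # balanced case
    "4x4 * 4x1": matmul_op_counts(4, 4, 1),
    "3x3 * 3x3": matmul_op_counts(3, 3, 3),
    "4x4 * 4x4": matmul_op_counts(4, 4, 4),
    "cross product": cross_product_op_counts(),
}
```

Only the transform with the extra vector add comes out with equal numbers of muls and adds; the pure products always have more muls.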
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
When I did some tests with OpenGL transformations, I saw many more MULs than ADDs.

Am I missing something?

EDIT: for my experiments, I used only fixed-pipeline functions such as glMultMatrix, glLoadMatrix and the like. URL: http://www.talisman.org/opengl-1.1/Reference/glMultMatrix.html
Maybe you are referring to vertex-shader style programming?

Thanks.

Generally the hot spots will use

Code:
c[0] c[4]  c[8] c[12]   v[0]
c[1] c[5]  c[9] c[13] X v[1]
c[2] c[6] c[10] c[14]   v[2]
c[3] c[7] c[11] c[15]   v[3]

from your example, i.e. 16 muls and 12 adds, which is already a pretty good fit for FMA

or in most cases

Code:
c[0] c[4] c[8]      v[0]     c[12]
c[1] c[5] c[9]   X  v[1]  +  c[13]
c[2] c[6] c[10]     v[2]     c[14]

will be enough if you don't need w (which is useful mostly for perspective projection). This is the perfect add/mul balance I was referring to.
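Here is a rough Python sketch of both variants, assuming the OpenGL column-major layout in which the translation sits in c[12], c[13], c[14]. The `OpCounter` class and the function names are invented for illustration, and its `fma` method is just `a * b + c` in Python (not truly fused); the point is only to show how the multiply-adds chain and how many of each operation are issued.

```python
class OpCounter:
    """Hypothetical helper that counts how many FMAs and plain muls are issued."""
    def __init__(self):
        self.fmas = 0
        self.muls = 0

    def fma(self, a, b, c):
        self.fmas += 1
        return a * b + c          # stands in for one fused multiply-add

    def mul(self, a, b):
        self.muls += 1
        return a * b

def transform_affine(ops, c, v):
    """3x3 part of column-major c times v, plus the translation c[12..14]."""
    out = []
    for row in range(3):
        acc = c[12 + row]                        # start from the translation
        for col in range(3):
            acc = ops.fma(c[4 * col + row], v[col], acc)
        out.append(acc)
    return out

def transform_full(ops, c, v):
    """Full 4x4 column-major c times the 4-component v."""
    out = []
    for row in range(4):
        acc = ops.mul(c[row], v[0])              # first product has nothing to add to
        for col in range(1, 4):
            acc = ops.fma(c[4 * col + row], v[col], acc)
        out.append(acc)
    return out
```

With this structure the affine case is 9 FMAs and nothing else, while the full 4x4 case is 12 FMAs plus 4 plain multiplies, matching the 16 muls and 12 adds counted above.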
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
Thank you bronxzv

I profiled these things years ago... I should check my code / profiler settings.

Anyway, the most important thing is that FMA can be used for vertex transformations. This is surely a more common scenario than HPC.

Thank you.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Thank you bronxzv

I profiled these things years ago... I should check my code / profiler settings.

Anyway, the most important thing is that FMA can be used for vertex transformations. This is surely a more common scenario than HPC.

Thank you.

One more example of a perfect fit for FMA is the evaluation of polynomials using Horner's scheme:

((((a5*x+a4)*x+a3)*x+a2)*x+a1)*x+a0
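A minimal sketch of the nested form above in Python. The function name and the highest-degree-first coefficient order are my own choices. Each loop iteration is one multiply followed by one add, which is exactly the shape a single FMA covers, so a degree-5 polynomial takes 5 of them (Python itself does not fuse the two operations).

```python
def horner(coeffs, x):
    """Evaluate a polynomial; coeffs are highest degree first: [a5, ..., a0]."""
    acc = coeffs[0]
    for a in coeffs[1:]:
        acc = acc * x + a     # one multiply-add per coefficient: a natural FMA
    return acc

def horner_step_count(coeffs):
    """Number of multiply-add steps: one per coefficient after the first."""
    return len(coeffs) - 1

# x**5 + 2*x**4 + 3*x**3 + 4*x**2 + 5*x + 6 evaluated at x = 2
value = horner([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2.0)
```

Each step depends on the previous one, so the chain is latency-bound for a single polynomial; evaluating several values of x side by side in vector lanes is one way to keep the units busy.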
 