Discussion Rudi_Float_Bench v0.02a

Page 4
Jul 27, 2020
25,371
17,604
146


ArrogantHair's 285K: 86 Mops / thread

My 12700K: ~65 Mops / thread

In roughly 3 years, Intel made a leap of about 30% in per-thread float throughput (86 / 65 ≈ 1.32).

Of course, this is going to be a smaller gap if compared with a liquid cooled 12700K.
 

Schmide

Diamond Member
Mar 7, 2002
5,699
948
126
The reason it is able to utilize SMT so well is that it isn't stressing the parallelism of the processor. It is one long chain of adds, each dependent on the previous value. You could add a parallel chain, which would allow the instructions to queue together.

Quick code-up, not tested. Note this is hard-coded to a maximum of 4 parallel chains. I don't think you could get more than that. The max any current-generation processor can do is probably 2. It would produce the same basic loop until called with parallel > 1.

Edit: See code below for consistency
 

MS_AT

Senior member
Jul 15, 2024
699
1,418
96
Nice! Something to play with
Well, if you read my post from March:
a loop-carried dependency will not allow hitting optimal performance, since every iteration depends on the result of the previous iteration. Since most CPUs are able to execute 2 FADDs per cycle, the second unit will not be used in parallel.
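
To make the dependency argument concrete, here is a minimal scalar sketch (not the benchmark's code; the function names and the `TwoSums` struct are made up for illustration). In the first function every add waits on the previous result; in the second, the two accumulators never read each other, so a core with two FADD units could in principle keep both busy.

```cpp
// Single chain: each add needs the result of the add before it.
float add_one_chain(float a, int iterations) {
    float b = 0.0f;
    for (int i = 0; i < iterations; ++i)
        b += a;              // depends on the previous value of b
    return b;
}

// Hypothetical helper type, only used to return both accumulators.
struct TwoSums {
    float b0;
    float b1;
};

// Two chains: b0 and b1 are independent of each other inside the loop,
// so the two adds of one iteration do not have to wait on one another.
TwoSums add_two_chains(float a0, float a1, int iterations) {
    float b0 = 0.0f;
    float b1 = 0.0f;
    for (int i = 0; i < iterations; ++i) {
        b0 += a0;            // chain 0
        b1 += a1;            // chain 1
    }
    return {b0, b1};
}
```

Both versions do the same arithmetic per chain; only the number of independent chains differs. Whether the second one actually runs faster depends on the compiler output and the core, which is what the benchmark would have to show.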

The optimal unroll ratio is the number of execution units × the latency of the operation, provided you have enough architectural registers. For Zen 5 it will be 6 512b adds. For Zen 4, 3 512b adds but 6 256b adds. The easiest way to write that would be inline assembly if you ditch MSVC for clang; then you can get rid of the volatile. Not necessary, though.
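
The rule of thumb above is simple enough to write down as arithmetic. This is only a sketch: `chains_needed` is a hypothetical helper, and the per-core figures (a 3-cycle add latency, two 512b adds per cycle on Zen 5, one 512b or two 256b adds per cycle on Zen 4) are assumptions picked so the result reproduces the counts quoted in the post. They are not measured here.

```cpp
// Independent chains needed to keep every adder busy:
// (adds that can start per cycle) x (cycles until an add's result is ready).
constexpr unsigned chains_needed(unsigned adds_per_cycle, unsigned latency_cycles) {
    return adds_per_cycle * latency_cycles;
}

// Assumed latency, chosen to match the numbers in the post.
constexpr unsigned kAssumedFaddLatency = 3;

// Assumed throughputs per vector width, again chosen to match the post.
static_assert(chains_needed(2, kAssumedFaddLatency) == 6, "Zen 5, 512b adds");
static_assert(chains_needed(1, kAssumedFaddLatency) == 3, "Zen 4, 512b adds");
static_assert(chains_needed(2, kAssumedFaddLatency) == 6, "Zen 4, 256b adds");
```

With fewer chains than this, some add slots stay empty while results are still in flight; with more, you only spend extra registers.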
 

Schmide

Diamond Member
Mar 7, 2002
5,699
948
126
well if you read my post from march


The optimal unroll ratio is the number of execution units × the latency of the operation, provided you have enough architectural registers. For Zen 5 it will be 6 512b adds. For Zen 4, 3 512b adds but 6 256b adds. The easiest way to write that would be inline assembly if you ditch MSVC for clang; then you can get rid of the volatile. Not necessary, though.

Edit: Apparently clang, at least, needs for loops and a #define to use #pragma unroll.

Edit: After reading it over, it turns out #pragma unroll was making clang over-optimize it.

Edit: made initialization of the volatile automatic.

Edit: now returns number of operations performed

Edit: Final version, I promise. Made it a template, so now it can be compiled with any number of parallels (but not Apple Parallels). See main for the calling convention.

This version works the same for both clang and msvc

So https://godbolt.org/z/Kj33jEvMz

Code:
#include <immintrin.h>

// Runs MAX_PARALLEL independent chains of 512-bit float adds for subLoop
// iterations and returns the number of vector adds performed.
template<unsigned int MAX_PARALLEL = 6>
int rudiparallel(int subLoop = 256)
{
   if (subLoop <= 0)
      return 0;
   __m512 a[MAX_PARALLEL];          // per-chain addend
   __m512 b[MAX_PARALLEL];          // per-chain accumulator
   volatile float va[MAX_PARALLEL]; // volatile: reads and writes cannot be optimized away
   int maxp = 0;
   do {
      va[maxp] = static_cast<float>(maxp + 1);
      a[maxp] = _mm512_set1_ps(va[maxp]);
      b[maxp] = _mm512_set1_ps(0.0f);
   } while (++maxp < MAX_PARALLEL);
   int subLooper = subLoop;
   do {
      // One add per chain; the chains do not depend on each other.
      maxp = 0;
      do {
         b[maxp] = _mm512_add_ps(b[maxp], a[maxp]);
      } while (++maxp < MAX_PARALLEL);
   } while (--subLooper);
   // Store one lane of each result to the volatile so the adds stay live.
   maxp = 0;
   do {
      va[maxp] = *reinterpret_cast<float *>(&b[maxp]);
   } while (++maxp < MAX_PARALLEL);
   return subLoop * MAX_PARALLEL;
}

int main()
{
   int numOps = rudiparallel<3>(); // numOps = 256 * 3
   return rudiparallel(); // returns 256 * 6 ops
}

Inner loop, clang

Code:
.LBB0_3:
        vaddps  zmm11, zmm1, zmm11
        vaddps  zmm10, zmm2, zmm10
        vaddps  zmm9, zmm3, zmm9
        vaddps  zmm7, zmm5, zmm7
        vaddps  zmm4, zmm6, zmm4
        vaddps  zmm0, zmm8, zmm0
        vaddps  zmm11, zmm1, zmm11
        vaddps  zmm10, zmm2, zmm10
        vaddps  zmm9, zmm3, zmm9
        vaddps  zmm7, zmm5, zmm7
        vaddps  zmm4, zmm6, zmm4
        vaddps  zmm0, zmm8, zmm0
        add     eax, 2
        jne     .LBB0_3

Inner loop, MSVC

Code:
$LL7@rudiparall:
        vaddps  zmm5, zmm5, zmm11
        vaddps  zmm4, zmm4, zmm10
        vaddps  zmm3, zmm3, zmm9
        vaddps  zmm2, zmm2, zmm8
        vaddps  zmm1, zmm1, zmm7
        vaddps  zmm0, zmm0, zmm6
        sub     r8d, 1
        jne     SHORT $LL7@rudiparall

So MSVC should be fine with a forced 6-parallel loop.

gcc needs -O3 to unroll the same as MSVC.

Code:
.L2:
        vaddps  zmm0, zmm0, zmm11
        vaddps  zmm5, zmm5, zmm10
        vaddps  zmm4, zmm4, zmm9
        vaddps  zmm3, zmm3, zmm8
        vaddps  zmm2, zmm2, zmm7
        vaddps  zmm1, zmm1, zmm6
        sub     eax, 1
        jne     .L2
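
Since rudiparallel returns the number of operations performed, that count can be turned into a figure comparable to the Mops numbers quoted earlier in the thread. A minimal sketch, assuming Mops means millions of operations per second; `to_mops` and `measure_mops` are hypothetical helpers, not part of Rudi_Float_Bench.

```cpp
#include <chrono>

// Converts an operation count and an elapsed time into millions of ops per second.
double to_mops(long long ops, double seconds) {
    return static_cast<double>(ops) / seconds / 1e6;
}

// Times a single call of `work` (any callable returning the number of
// operations it performed) and reports the throughput in Mops.
// A run too short for the clock to resolve yields a meaningless result,
// so the workload should be long enough to dominate timer overhead.
template <typename Work>
double measure_mops(Work work) {
    const auto start = std::chrono::steady_clock::now();
    const long long ops = work();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return to_mops(ops, elapsed.count());
}
```

On a machine with AVX-512, something like `measure_mops([] { return rudiparallel<6>(1 << 20); })` would time the 6-chain version. Note that this counts 512-bit vector adds, not individual float adds.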
 