Discussion Rudi_Float_Bench v0.02a

ArrogantHair · Apr 19, 2025

285K fully tuned p57E51 R41 N34 D38 8600MHz 24GBx2

igor_kavinski · Apr 21, 2025

Yes, I used 7-Max because I was so embarrassed by my previous score that was lower than that of a 5950X!

Det0x · May 6, 2025

Gave this benchmark a spin

Workload seems very light, cores hardly warming up while running, even the AVX512 version

igor_kavinski · May 6, 2025

Wow. 16% faster than my AVX-512 score!

Sgraffite · May 6, 2025

Ryzen AI 9 HX 370

igor_kavinski · May 14, 2025

@adroc_thurston Wanna hear the harsh truth from you.

How's my synthetic AVX-512 benchmark?

adroc_thurston · May 14, 2025

igor_kavinski said:
How's my synthetic AVX-512 benchmark?

I ain't a SIMD guy.
You can always crawl onto xitter or 3.5 discords Mysticial is in and ask him to torture you for giggles.

igor_kavinski · May 26, 2025

Not sure why using 7-Max improves this benchmark's score.

igor_kavinski · May 27, 2025

ArrogantHair's 285K: 86 Mops / thread

My 12700K: ~65 Mops / thread

In roughly 3 years, Intel made a leap of 30% float IPC.

Of course, this is going to be a smaller gap if compared with a liquid cooled 12700K.
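For reference, the arithmetic behind that figure can be sketched as below. The helper name `percent_gain` is made up for illustration, and the inputs are just the two per-thread scores quoted above (the 12700K figure is itself approximate).

```cpp
// Relative gain of `newer` over `older`, in percent.
// Hypothetical helper, not part of Rudi_Float_Bench.
constexpr double percent_gain(double older, double newer) {
    return (newer / older - 1.0) * 100.0;
}

// Scores quoted in the post, in Mops per thread.
constexpr double k12700K = 65.0;  // approximate ("~65")
constexpr double k285K   = 86.0;
constexpr double kGain   = percent_gain(k12700K, k285K);  // a little over 32
```

Note that this is a gain in per-thread throughput. It only equals an IPC gain if both chips ran at the same clock, which the posts do not state.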

igor_kavinski · May 31, 2025

ASUS ROG Z13 Flow Strix Halo scores from an overclock.net user:

On Battery

Max power mode

igor_kavinski · Jun 14, 2025

9955HX3D score from @leoneazzurro

With 7-Max:

igor_kavinski · Jun 21, 2025

Det0x said:
Kinda strange benchmark, I'm seeing almost 100% thread scaling on 16-core Zen 5 with SMT enabled/disabled

As noted in this post: http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=threads/rudi_float_bench-v0-02a.2628323/post-41412252

SMT is leading to almost double the score.

Can someone with 7950X run both benchmarks with and without SMT so we can see the scaling for Zen 4?

Schmide · Jun 21, 2025

The reason it is able to utilize SMT so well is that it isn't stressing the instruction-level parallelism of the processor. It is one long chain of adds, each dependent on the previous value. You could add a parallel chain that would allow the instructions to queue together.

Quick code up. Not tested. Note this is hard coded to max parallel at 4 chains. I don't think you could get more than that. Max any current generation processor can do is probably 2. It would produce the same basic loop until called with parallel > 1.

Edit: See code below for consistency
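To make the dependency-chain point concrete, here is a minimal scalar sketch (no AVX-512 needed). The function names are invented for illustration and this is not the benchmark's actual loop. An optimizing compiler may also fold or vectorize loops this simple, which is why the real code further down goes through a volatile.

```cpp
#include <array>
#include <cstddef>

// One long dependent chain: every add needs the result of the previous
// one, so only one add from this chain can be in flight at a time.
float single_chain(float step, std::size_t iterations) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < iterations; ++i) {
        acc += step;  // depends on the acc from the previous iteration
    }
    return acc;
}

// Several independent chains: the adds inside the inner loop do not
// depend on each other, so the core is free to overlap them.
template <std::size_t Chains>
std::array<float, Chains> parallel_chains(float step, std::size_t iterations) {
    std::array<float, Chains> acc{};  // one accumulator per chain, zeroed
    for (std::size_t i = 0; i < iterations; ++i) {
        for (std::size_t c = 0; c < Chains; ++c) {
            acc[c] += step;  // chain c only depends on its own history
        }
    }
    return acc;
}
```

With a single chain, a second hardware thread on the same core can fill the idle add slots, which would be consistent with the near-double SMT scores reported above.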

igor_kavinski · Jun 21, 2025

Nice! Something to play with

MS_AT · Jun 21, 2025

igor_kavinski said:
Nice! Something to play with

Well, if you read my post from March:

MS_AT said:
The loop-carried dependency will not allow hitting optimal performance, since every iteration depends on the result of the previous iteration. Since most CPUs are able to execute 2 fadds per cycle, the second unit will not be used in parallel.

The optimal unroll ratio is number of execution units x latency of the operation, provided you have enough architectural registers. For Zen5 it will be 6 512b adds. For Zen4 3 512b adds but 6 256b adds. Though the easiest way to write that would be in inline assembly if you ditch msvc for clang, then you can get rid of the volatile. Not necessary though.
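The rule of thumb above can be written down as a one-liner. In this sketch the unit counts and the 3-cycle latency are assumptions, picked only because they reproduce the 6 / 3 / 6 figures quoted in the post. They are not measured values, and `chains_needed` is a hypothetical name.

```cpp
// Independent add chains needed to keep every add unit busy:
// number of execution units x latency of the operation (in cycles).
constexpr unsigned chains_needed(unsigned add_units, unsigned latency_cycles) {
    return add_units * latency_cycles;
}

// ASSUMPTION: 3-cycle fadd latency, inferred from the quoted numbers.
constexpr unsigned kAssumedLatency = 3;

// ASSUMPTION: Zen 5 issues two 512b adds per cycle.
constexpr unsigned kZen5_512b = chains_needed(2, kAssumedLatency);  // 6
// ASSUMPTION: Zen 4 sustains two 256b adds per cycle, but effectively
// only one 512b add per cycle.
constexpr unsigned kZen4_512b = chains_needed(1, kAssumedLatency);  // 3
constexpr unsigned kZen4_256b = chains_needed(2, kAssumedLatency);  // 6
```

Any of these counts fits comfortably in the 32 vector registers available with AVX-512, so the "enough architectural registers" condition is not the limiting factor here.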

ArrogantHair · Jun 21, 2025

igor_kavinski said:
Can someone with 7950X run both benchmarks with and without SMT so we can see the scaling for Zen 4?

7950X

With SMT

ArrogantHair · Jun 21, 2025

igor_kavinski said:
Can someone with 7950X run both benchmarks with and without SMT so we can see the scaling for Zen 4?

7950X

Without SMT

Schmide · Jun 21, 2025

MS_AT said:
Well, if you read my post from March

The optimal unroll ratio is number of execution units x latency of the operation, provided you have enough architectural registers. For Zen5 it will be 6 512b adds. For Zen4 3 512b adds but 6 256b adds. Though the easiest way to write that would be in inline assembly if you ditch msvc for clang, then you can get rid of the volatile. Not necessary though.

Edit: Apparently clang, at least, needs for loops and a #define for #pragma unroll

Edit: After reading the output, #pragma unroll was making clang over-optimize it

Edit: made initialization of the volatile automatic.

Edit: now returns number of operations performed

Edit: final version. I promise. Made it a template so now it can be compiled with any number of parallels. (but not apple parallels) See main for calling convention

This version works the same for both clang and msvc

So https://godbolt.org/z/Kj33jEvMz

Code:

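#include <immintrin.h>

// Runs MAX_PARALLEL independent chains of 512-bit float adds for subLoop
// iterations and returns the number of vector adds performed.
template<unsigned int MAX_PARALLEL = 6>
int rudiparallel(int subLoop = 256)
{
   if (subLoop <= 0)
      return 0;
   __m512 a[MAX_PARALLEL];          // per-chain addend
   __m512 b[MAX_PARALLEL];          // per-chain accumulator
   volatile float va[MAX_PARALLEL]; // volatile keeps the compiler from folding the adds away
   int maxp = 0;
   // Seed each chain: addend = chain index + 1 (read back through the volatile), accumulator = 0.
   do {
      va[maxp] = static_cast<float>(maxp + 1);
      a[maxp] = _mm512_set1_ps(va[maxp]);
      b[maxp] = _mm512_set1_ps(0.0f);
   } while (++maxp < MAX_PARALLEL);
   int subLooper = subLoop;
   // Timed part: one add per chain per iteration; the chains are independent of each other.
   do {
      maxp = 0;
      do {
         b[maxp] = _mm512_add_ps(b[maxp], a[maxp]);
      } while (++maxp < MAX_PARALLEL);
   } while (--subLooper);
   maxp = 0;
   // Store the lowest lane of each accumulator to the volatile so the results count as used.
   do {
      va[maxp] = *reinterpret_cast<float *>(&b[maxp]);
   } while (++maxp < MAX_PARALLEL);
   return subLoop * MAX_PARALLEL;
}
int main()
{
    int numOps = rudiparallel<3>(); // numOps = 256 * 3
    return rudiparallel(); // returns 256 * 6 ops
}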

Inner loop clang

Code:

.LBB0_3:
        vaddps  zmm11, zmm1, zmm11
        vaddps  zmm10, zmm2, zmm10
        vaddps  zmm9, zmm3, zmm9
        vaddps  zmm7, zmm5, zmm7
        vaddps  zmm4, zmm6, zmm4
        vaddps  zmm0, zmm8, zmm0
        vaddps  zmm11, zmm1, zmm11
        vaddps  zmm10, zmm2, zmm10
        vaddps  zmm9, zmm3, zmm9
        vaddps  zmm7, zmm5, zmm7
        vaddps  zmm4, zmm6, zmm4
        vaddps  zmm0, zmm8, zmm0
        add     eax, 2
        jne     .LBB0_3

Inner loop msvc

Code:

$LL7@rudiparall:
        vaddps  zmm5, zmm5, zmm11
        vaddps  zmm4, zmm4, zmm10
        vaddps  zmm3, zmm3, zmm9
        vaddps  zmm2, zmm2, zmm8
        vaddps  zmm1, zmm1, zmm7
        vaddps  zmm0, zmm0, zmm6
        sub     r8d, 1
        jne     SHORT $LL7@rudiparall

So msvc should be fine with a forced 6 parallel loop

gcc needs -O3 to unroll the same as msvc

Code:

.L2:
        vaddps  zmm0, zmm0, zmm11
        vaddps  zmm5, zmm5, zmm10
        vaddps  zmm4, zmm4, zmm9
        vaddps  zmm3, zmm3, zmm8
        vaddps  zmm2, zmm2, zmm7
        vaddps  zmm1, zmm1, zmm6
        sub     eax, 1
        jne     .L2
