Only if it loads 1c/1t pairs
> Doesn't it?

It doesn't, it loads all threads on one socket and only then moves on, and kinda stops scaling after 160t altogether.
> It would be a combination in that case, right?

No, it'll load all 128t on one socket then move on, in 1c/2t increments.
> Explain the benefit of scheduling tasks this way?

It's a single-task, silly-parallel rendering workload, aka the thing you do NOT want to spill out of the socket.
Please just quote your estimate for socket-level SIR2017 bumps for Turin over Genoa (both 96c and 128c) and be done with it.
> I'm asking about the how and why of thread scheduling inside the socket.

It consumes all 128t available on the socket before spilling to the next P.
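The placement order being described (fill each core's two threads, walk every core on the socket, and only spill to the second socket once all 128t are consumed) can be sketched as a toy generator. `placement_order` and its parameters are made-up names for illustration, not a real scheduler API:

```python
def placement_order(sockets=2, cores_per_socket=64, threads_per_core=2):
    """Yield (socket, core, thread) slots in the order described above:
    1c/2t increments across one socket, then spill to the next."""
    for s in range(sockets):
        for c in range(cores_per_socket):
            for t in range(threads_per_core):
                yield (s, c, t)

order = list(placement_order())
print(order[:3])   # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
print(order[128])  # (1, 0, 0): the 129th thread spills to socket 1
```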
> Have your skin in the game.

Why would I have one?
> But it is to the second.

Cinememe doesn't scale to 2p in any relevant way, shape or form.
> Cinememe doesn't scale to 2p in any relevant way, shape or form.

To clarify, SIR = SpecInt Rate?
Just quote your SIR bumps in %.
> To clarify, SIR = SpecInt Rate?

Yes, socket SIR score is how vendors love guiding their things. You see it everywhere: Intel, AMD, ARM, whatever. I.e. Turin is %redacted% faster in SIR, EMR is low teens over SPR, GNR is ~2x EMR, and so on and so forth. Not the be-all-end-all metric, but a useful proxy.

> Yes, socket SIR score is how vendors love guiding their things.

Gotcha.
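Taking the guidance above literally, the quoted ratios can be chained. The 1.12 multiplier standing in for "low teens" is my assumption, not a vendor figure:

```python
# Chaining the quoted socket-level SIR guidance.
spr = 1.00        # SPR as the baseline
emr = spr * 1.12  # "EMR is low teens over SPR" (1.12 is an assumed stand-in)
gnr = emr * 2.0   # "GNR is ~2x EMR"
print(f"GNR is ~{gnr / spr:.2f}x SPR at the socket level")
```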
> Oh, and for similar cores... so 64 cores gets 615 and 60 cores of SR get 495. So at the closest we can bench, it's 25% faster for 7% more Genoa cores.

That's not a perfect apples-to-apples comparison. You'd want to compare single-socket configurations with single-socket, or dual-socket with dual-socket; a dual-socket setup always scores less than 2x a single socket.
[Attachment 86497: benchmark screenshot]
> That's not a perfect apples-to-apples comparison.

Best that I could find. From experience with mine, I am sure the SR chips are far inferior. This and the benchmarks I have posted elsewhere show that as well.
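For what it's worth, normalizing the quoted scores per core (615 across 64 Genoa cores vs 495 across 60 SR cores) separates the headline gap from the per-core gap:

```python
genoa_score, genoa_cores = 615, 64  # figures quoted above
sr_score, sr_cores = 495, 60
total_ratio = genoa_score / sr_score
per_core_ratio = (genoa_score / genoa_cores) / (sr_score / sr_cores)
print(f"total: +{total_ratio - 1:.0%}, per core: +{per_core_ratio - 1:.0%}")
# total: +24%, per core: +16%
```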
So, if the slides are to be believed (and they look quite authentic), the core ends up being much more similar to Alder Lake than I anticipated. But still noticeably fatter.
The biggest unknown for me is how they plan to feed the beast. There are no mentions of any decoder changes; surely that would be an absurd bottleneck if left unchanged?
- The same 12-way 48KB L1 cache as Golden Cove (hopefully without the latency penalty)
- 8-wide dispatch (+2 vs Alder Lake and Zen 4)
- 6 ALUs (+1 vs Alder Lake, +2 vs Zen 4)
- 4 loads / 2 stores per cycle (vs 3/2 for Golden Cove, 2/1 for Zen 4)
  - If I'm reading this right, these are 512-bit (64-byte)? That's a massive uplift from Zen 4 if true (4x the throughput in ideal AVX-512 scenarios)
Anyway, looking forward to comparisons with the Arrow Lake core. In the end, they could end up pretty similar in width, so it would all come down to execution.
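A quick sanity check on the "4x" figure, using the per-cycle load counts from the list above. The 256b (32B) load width for Zen 4 is my assumption based on its 2x256b data paths:

```python
# Peak L1D load bandwidth per cycle, bytes.
zen5 = 4 * 64  # 4 loads/cycle x 512b (64B), per the slide numbers above
zen4 = 2 * 32  # 2 loads/cycle x 256b (32B), assumed for Zen 4
print(zen5 / zen4)  # 4.0
```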
> The same 12-way 48KB L1 cache as Golden Cove (hopefully without the latency penalty)

From the slide, it says 4-cycle L1D (compared to 5 cycles for GLC). So no regression even though they doubled the LD/ST width.
> Does not seem terribly bloated

Wait until you see the ROB and PRF sizes.
> Looking at this again, vs Z4

Do you think you can do us a favor and prepare this data into a table which compares various architectures (e.g. Zen 3, Zen 4, Zen 5, GLC)? Please and thank you!
Does not seem terribly bloated; it would indeed seem akin to the Z2 -> Z3 evolution. The unknowns, however, do seem like the kind of big-ticket items. I think the zero-bubble conditional branch could be tied to the "2 basic block fetch".
- +2 rename/dispatch
- +2 ALUs
- +1 LD/cycle
- 512b FP width
- 64B LD/ST queues
- 48K L1D
- OOO structures increased
- Usual generational architectural improvements scattered around
- New BP with larger BTBs -> zero-bubble conditional branches; sounds like the patent I listed before, where a second BP scans the other conditional branch
- Decode width unknown; doubtful it goes beyond 6-wide, if they increase it at all.
- uop cache unknown
- "2 basic block fetch" --> Does this mean 2x fetch and decode blocks akin to Tremont?
However, a major departure from the Zen 3/4 series is the return to Zen 2-style unified schedulers for INT and FP. Would be interesting to see the latencies with Zen 5.

Low Power core
- Probably the low-power core option is not having 512b FP pipes or 64B LD/ST queues (they mentioned FP 512 variants, which would mean 512b pipes and data structures are not standard across all cores)
- Denser node/efficiency optimized libs as usual
- Cache reduction as usual
- If the 2x basic block fetch is akin to what I described, they could clock gate the second fetch block aggressively for mobile
> Wait until you see the ROB and PRF sizes.

If they implemented something like what is described below in their patents, it won't be as bloated as other designs.
> Do you think you can do us a favor and prepare this data into a table which compares various architectures (e.g. Zen 3, Zen 4, Zen 5, GLC)? Please and thank you!

I have a table from Zen 1 to Zen 4, but Zen 5 has too many unknowns, so I'm not posting it at the moment.