Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 217

biostud

Lifer
Feb 27, 2003
18,317
4,842
136
Yeah, after all, X570 didn't have any follow-up (except a silent revision) when Zen 3 launched. I think at that point motherboard manufacturers probably realised that devaluing perfectly capable motherboards by releasing the same ones under a different name was a bad idea. And now that AMD requires every AM5 motherboard to have USB BIOS flashback, this is a complete non-issue.
For AM4 we had X370, X470 and X570, so they could launch one with even more connectors, but I doubt it. Maybe an X870 with PCIe 6.0 with Zen 6.
 

qmech

Member
Jan 29, 2022
82
179
66
I don't get why people think GB6 MT not scaling perfectly is a problem. I don't know what it is doing, and maybe it is being stupid, but in the real world most problems aren't like SPEC or CB's MT, where the way the work is split allows perfect scaling until you hit system limits like memory bandwidth or cache-coherence traffic.

With regards to multi-core, GeekBench have explicitly stated that their design goal was to mimic multi-core scaling in an undisclosed set of client applications, and that those apps exhibited limited-to-negative scaling past 4 cores (GB is a little annoying in that they insist on using the word "core" interchangeably with "thread" much too often). The technical way in which they achieved this was two-fold:

#1. They switched from "discrete tasks" to "shared tasks". This means all threads now work on the same task until that task is complete before proceeding to the next task. Compared to the previous method, this increases the benefit of caches closer to the core, eliminates most of the downsides to shared caches, and also absolutely trashes high core-count scaling, especially if multiple groups of cores (e.g. AMD's core complexes) are used. It should be noted that this switch appears to have been done on all subtasks, even those where the "real world" would never see a "shared task" approach (see the sketch after #2 below).

#2. They changed the weights of each subtask to achieve the scaling that was observed in their aforementioned (undisclosed) set of client applications. There are no public weights, since this would imply considering one subtest more important than another, so instead they modified the workload for each test as a proxy. Obviously, this means they cannot have the same workload for the single-core and multi-core version of the same subtest (and indeed they don't).
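
To make the two models concrete, here is a minimal sketch of the difference (my own illustration in Python, emphatically not Geekbench's actual code; the kernel and the sizes are placeholders, only the structure matters):

Code:
# "Separate task" (GB5-style): every worker gets its own independent copy of the
# workload, so there is no coordination and the total work grows with worker count.
# "Shared task" (GB6-style): all workers cooperate on one fixed task, split into
# chunks, with an implicit barrier before the next task can start.
from concurrent.futures import ProcessPoolExecutor

def kernel(chunk):
    # stand-in for a subtest kernel (compress/blur/compile a chunk of data)
    return sum(i * i for i in chunk)

def separate_task(n_workers, task_size):
    # n_workers independent copies of the same task
    with ProcessPoolExecutor(n_workers) as pool:
        return list(pool.map(kernel, [range(task_size)] * n_workers))

def shared_task(n_workers, task_size):
    # one task of fixed size, split across workers; pool.map only returns once
    # every worker has finished its slice
    chunks = [range(i, task_size, n_workers) for i in range(n_workers)]
    with ProcessPoolExecutor(n_workers) as pool:
        return sum(pool.map(kernel, chunks))

if __name__ == "__main__":
    print(separate_task(4, 100_000))
    print(shared_task(4, 100_000))

In the shared model, adding workers shrinks each worker's slice and adds coordination overhead, which is why scaling flattens out at high core counts and across core complexes.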

As a side note, this weight-by-proxy setup is a fairly fragile way of setting up a benchmark if your goal is to have it keep tracking a fixed set of client applications, and it, unsurprisingly, promptly broke, forcing them to issue updates.

In any event, if you want to know about the internals of GB6, they have published some details. If you do look at those, it is very easy to criticize the subtests, but since these are all chosen simply as a proxy for an entirely different set of applications, it does not really matter much what the individual subtests actually are.

For those that are more interested in what GB6 actually *measures*, it is quite easy to take a look in their database and isolate a single line of similar processors and watch how they scale. I have not performed a rigorous analysis of Geekbench 6 and the results in their database are very noisy, so this approach obviously comes with some caveats. It would be relatively easy to perform some consistent tests across a variety of simple platforms, varying as few parameters as possible at a time, but I have not seen such an analysis.

If you looked at AMD's Dragon Range of mobile processors, you would notice that for the single-core score, the only variable with a significant correlation is boost frequency. Base frequency (range [2.5; 4.0] GHz) and L3 cache (range [32; 128] MB) do not appear to affect the score much at all. For multi-core it is a bit harder to separate the variables, but core count does appear to be the biggest factor, along with base frequency - although the latter is likely a proxy for max power. Again, L3 cache does not appear to be a major contributor.
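
If someone wanted to make that eyeballing slightly more systematic, a rough sketch of the workflow could look like the following. The CSV and its column names (model, boost_ghz, base_ghz, l3_mb, cores, gb6_st, gb6_mt) are purely hypothetical, something you would have to assemble by hand from the Geekbench browser, and any correlations are only as good as that noisy data:

Code:
import pandas as pd

# Hypothetical hand-assembled export from the Geekbench browser; the column
# names are my own invention, not an official Geekbench format.
df = pd.read_csv("dragon_range_gb6.csv")

# Collapse the noisy individual submissions to a median per CPU model first.
per_model = df.groupby("model").median(numeric_only=True)

# Correlate each spec with the ST and MT scores.
specs = ["boost_ghz", "base_ghz", "l3_mb", "cores"]
print(per_model[specs + ["gb6_st"]].corr()["gb6_st"])
print(per_model[specs + ["gb6_mt"]].corr()["gb6_mt"])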

Overall, judging by the actual scores, GB6 seems to care very little for L3 cache. It is somewhat counter-intuitive that a big 3D v-cache has almost no effect on either single-core or multi-core, despite the rather huge effect it has on many workloads (including games). The relatively small effect of L3 cache, but large effect of bandwidth to memory, appears to reflect a mix of fairly small datasets and very large datasets, with very little in between. This limit, particularly when coupled with the "shared task" approach that mitigates downsides to shared caches, really skews the results quite heavily in favor of Apple-style caches, as opposed to AMD's (or even Intel's P-cores).


Inserting my opinion on this, specifically as it pertains to multi-core performance, I have trouble finding anything in my daily usage that reflects GB6's vision of what multi-core is. When I have dozens of tabs open, all chugging away with their BS javascript and moronic h264 video ads, that type of workload is not reflected by GB6 scores. When I apply a filter to a photo or encode a small video to send to my family, that also is not reflected by GB6 scores. If I play a game (I rarely find time to), analyze a chess position (blitz chess I do have time for), do something that stresses the computer for more than a few *seconds* (minutes at best), or compile something in the background while continuing to actually use my computer, then those things *too* are not reflected by GB6 scores.

So, yes, the next time I open a single Word document or a single web page in a browser with no other active tabs, then I will be sure to think just how well GB6 reflects Real World multi-core scaling.
 

eek2121

Platinum Member
Aug 2, 2005
2,974
4,110
136
Thought it was GB 5 that tracks pretty well with SPEC? Maybe GB 6 ST does too, but I doubt GB 6 MT does. IIRC there was a paper on it, but idk if that was just a fever dream or what lol

I feel like the pattern is much more complicated than just "AMD does worse on CB" and more to do with the specific arch implementation. Bcuz Zen 5 is such a large change, I have no reason to believe that pattern would hold.
GB5 tracks closely with SPEC under optimal conditions (read: you know how to properly run benchmarks)

GB6 does not. GB6 is attempting to measure performance from the perspective of a user, which may be different from absolute performance, depending on the user. They aren’t there yet.

People used to hate on GB5 because it didn’t fully load your system and because of the crypto score.

You don’t need to bog down your system with a power virus to know how fast it can potentially be. You are not measuring your cooling performance, but the overall processing performance.

And the crypto score? It was limited to 5% of the overall score, so even if your chip scored absolute garbage but had amazing crypto, it wouldn’t save you. Some people can’t math, however, so due to complaints they removed it.
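
A quick back-of-the-envelope illustration of why the 5% cap made the complaints overblown. If I remember the GB5 documentation correctly, the CPU score was a weighted mean of the subsection scores with roughly these weights, so treat them as an assumption:

Code:
# Rough illustration only; weights are from memory of the GB5 docs
# (Integer 65%, Floating Point 30%, Crypto 5%) and may not be exact.
WEIGHTS = {"integer": 0.65, "float": 0.30, "crypto": 0.05}

def gb5_overall(subsections):
    return sum(WEIGHTS[name] * score for name, score in subsections.items())

baseline = {"integer": 1000, "float": 1000, "crypto": 1000}
outlier  = {"integer": 1000, "float": 1000, "crypto": 3000}  # 3x the crypto score

print(gb5_overall(baseline))  # 1000.0
print(gb5_overall(outlier))   # 1100.0 -> tripling crypto moves the total only 10%
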
They can easily fix the GB6 score problem by introducing a third MT Max score where everything is embarrassingly parallel.
I agree, I thought about this after they launched it.
The same can be said in reverse: we've been fooling people for years into believing that high core-count CPUs are twice or three times as fast in consumer MT workloads.

Was that not the same kind of lie, only upside down?
Depends on the workload. I routinely run workloads on my 7950X that run 2x faster than they would on an 8-core part. My GPU also executes embarrassingly parallel tasks on a day-to-day basis.

Just because SOME workloads don’t scale, doesn’t mean none do.

The absolute performance of a chip is good to know. GB6 is inadequate for that purpose.

Average end user performance is good to know as well. GB6 is one of the few benchmarks trying to measure this. I give them a C- for the effort.

I am currently in the design phase of an open source benchmark that is more similar to GB5. Once something comes of it, I will post links here, though it may be a while since I am quite busy.
 

Hail The Brain Slug

Diamond Member
Oct 10, 2005
3,187
1,555
136
Does anybody know what upgrades or differences the B750/X770 motherboards will bring with Zen 5?
I do not believe there will be a new chipset or new boards - maybe some 600-series refresh boards only. There's no new IOD with new IO to justify a new board design at all. I believe this will follow the same pattern as B550/X570 with Zen 3, which shared the same IOD as Zen 2: only a few refresh boards.
 

Hans Gruber

Platinum Member
Dec 23, 2006
2,176
1,128
136
The only reason I ask about new motherboard chipsets: when considering upgrading to AM5, I want to make sure that if I get a B650/X670 board, it will not be replaced by a new chipset when Zen 5 is released. I am keeping my eyes peeled for Zen 4 bundles in the next few months.
 

LightningZ71

Golden Member
Mar 10, 2017
1,645
1,929
136
I don't know of anything the B650/X670 combo is missing so desperately that it would be worthy of a major revision. However, I DO expect a platform revision for Zen 6. There is supposedly a big design-philosophy change coming in the whole package, and I suspect it will require a change in socket characteristics.
 

DaaQ

Golden Member
Dec 8, 2018
1,336
957
136
qmech said:
With regards to multi-core, GeekBench have explicitly stated that their design goal was to mimic multi-core scaling in an undisclosed set of client applications, and that those apps exhibited limited-to-negative scaling past 4 cores...
Correct me if I'm wrong, but wasn't Geekbench initially a smartphone benchmark?
I seem to remember it from the Galaxy Nexus days. Which would be Nexus 5.
 

qmech

Member
Jan 29, 2022
82
179
66
Correct me if I'm wrong, but wasn't Geekbench initially a smartphone benchmark?
I seem to remember it from the Galaxy Nexus days. Which would be Nexus 5.
If I remember my history correctly, GB was originally written by a guy who thought his PowerPC Mac was slow and wrote his own benchmark to prove it, with Mac and Windows the first two OSes.

It certainly does seem like a smartphone-focused benchmark these days, though.
 

Mopetar

Diamond Member
Jan 31, 2011
7,961
6,312
136
If I remember my history correctly, GB was originally written by a guy who thought his PowerPC Mac was slow and wrote his own benchmark to prove it, with Mac and Windows the first two OSes.

It certainly does seem like a smartphone-focused benchmark these days, though.

That's what the author of Geekbench has said. Ars did an interview with him around the time GB6 launched and he said as much.

ArsTechnica said:
"I just switched over to the Mac back in about 2002," Poole told Ars. "So I was getting used to that ecosystem. And then the [Power Mac] G5 came out and I thought, oh, this looks really cool. I went out, bought one of the new G5s, and it felt slower than my previous Mac. And I thought, well, this is really strange; what's going on. ... So, you know, I grabbed what [benchmarks] I could download and ran them and got really confused, because what the benchmarks were saying wasn't jiving with my experience.

"So I actually went and I reverse-engineered one of the popular benchmarks and found that the tests were, for lack of a better word, terrible," said Poole. "They weren't really testing anything substantial, you know, doing really simple arithmetic operations on really small amounts of data, not really testing anything. And so I thought, how hard can it be to write a benchmark? Maybe I should write my own."
 
Reactions: DaaQ and Exist50

moinmoin

Diamond Member
Jun 1, 2017
4,988
7,758
136
The same can be said in reverse: we've been fooling people for years into believing that high core-count CPUs are twice or three times as fast in consumer MT workloads.

Was that not the same kind of lie, only upside down?
Only if you consider the use case of PCs to be running a single application at a time, mobile-OS style. But even with ST-heavy or Geekbench 6-style "MT" applications you can easily fill plenty more cores as soon as you run multiple instances of them at once - something a modern browser may already be doing by running every tab in a separate process.

So calling it a lie to me would reek of gaslighting with the possible agenda of turning PCs into mobile style platforms.

In general, benchmarks tend to be too optimistic, since they basically require clean-room tests, whereas more realistic use cases have plenty of applications and services running concurrently. Even there, MT (but not GB6's "MT") is worth more, as it's a much better indicator of whether the system will be capable of running everything smoothly. Percentiles in game benchmarks partly reflect that. There have also been a few attempts at "noisy" benchmarks, where existing benchmarks are run alongside constant but stable background activity from other processes (I wish these were easier to find; I think Computerbase.de did one once, based on an investigation of micro-stutters or some such).
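
For what it's worth, a crude version of such a "noisy" run is easy to script yourself: keep some steady synthetic load running in the background, run the benchmark of interest, and compare against a clean run. A rough sketch (the benchmark command is a placeholder, not any real tool's CLI):

Code:
import multiprocessing as mp
import subprocess
import time

def background_load(stop):
    # steady busy-loop; start several of these to simulate "other stuff running"
    while not stop.is_set():
        sum(i * i for i in range(10_000))

def run_with_noise(cmd, noise_workers=2):
    stop = mp.Event()
    workers = [mp.Process(target=background_load, args=(stop,)) for _ in range(noise_workers)]
    for w in workers:
        w.start()
    try:
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True)   # the benchmark under test
        return time.perf_counter() - t0
    finally:
        stop.set()
        for w in workers:
            w.join()

if __name__ == "__main__":
    # placeholder command; substitute whatever benchmark you actually run
    print(run_with_noise(["./my_benchmark", "--run"]))

Of course this says nothing about which background mix counts as "realistic", which is probably why nobody has standardized such a test.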
 

randomhero

Member
Apr 28, 2020
183
249
116
If these diagrams are true, then there is a new IO die for Turin as well, since there is support for CXL 2.0 and 6000 MT/s DDR5. The CCDs in standard Turin are losing some bandwidth. The Turin Dense CCDs are using some dual-ring setup, so L3 latency is going up, but for cloud it should not matter much.
The edge platform (Siena replacement) will have beastly PPW (perf/watt) or PPC (perf/cost) when it comes out.
But only if the diagrams are true.
 

yuri69

Senior member
Jul 16, 2013
400
650
136
IF (Infinity Fabric) gen 3 is the same as Genoa's. MI300 got gen 4. Meh.

Solving the problem of 16c per L3 for Zen 5c was surely interesting, although Zen 6c reportedly comes with 32c per L3.

Anyway, now we need *the numbers*.
 
Reactions: lightmanek

BorisTheBlade82

Senior member
May 1, 2020
667
1,022
136
If these diagrams are true, then there is a new IO die for Turin as well, since there is support for CXL 2.0 and 6000 MT/s DDR5. The CCDs in standard Turin are losing some bandwidth. The Turin Dense CCDs are using some dual-ring setup, so L3 latency is going up, but for cloud it should not matter much.
The edge platform (Siena replacement) will have beastly PPW (perf/watt) or PPC (perf/cost) when it comes out.
But only if the diagrams are true.
Yes, this must be a new die as the current one has only 12 GMI links while they now need 16. But so far nothing points to each of them being narrower.

Regarding RAM bandwidth: a 25% gain is indeed a bit behind the curve. The question is whether this is a bottleneck for typical workloads in the addressed markets.
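
For context, the ~25% figure falls straight out of the DDR5 speed bump if the channel count stays at 12 like Genoa, which is my assumption here; the diagrams may say otherwise:

Code:
# Peak theoretical bandwidth, assuming 12 channels (64-bit each) as on Genoa.
def peak_bw_gbs(channels, mt_per_s, bytes_per_transfer=8):
    return channels * mt_per_s * bytes_per_transfer / 1000  # GB/s

genoa = peak_bw_gbs(12, 4800)   # 460.8 GB/s
turin = peak_bw_gbs(12, 6000)   # 576.0 GB/s
print(genoa, turin, turin / genoa - 1)  # -> 0.25, i.e. the quoted +25%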
 

MadRat

Lifer
Oct 14, 1999
11,915
258
126
Same as in 1995, it seems like AMD still needs to figure out how to run memory asynchronously to the bus. Intel's patent lock on the breakthrough should have expired by now. Surely there is a solution to take better advantage of these non-synchronized memory speeds.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,821
3,308
136
Same as in 1995, it seems like AMD still needs to figure out how to run memory asynchronously to the bus. Intel's patent lock on the breakthrough should have expired by now. Surely there is a solution to take better advantage of these non-synchronized memory speeds.
But you can, it's just terrible for performance because of obvious reasons...
 
Reactions: Thunder 57

moinmoin

Diamond Member
Jun 1, 2017
4,988
7,758
136
Same as in 1995, it seems like AMD still needs to figure out how to run memory asynchronously to the bus.
It actually already does for mobile parts, to save on power consumption, depending on whether the CPU (preferring low latency) or the iGPU (preferring high bandwidth) is requesting data.
 
Reactions: MadRat

PJVol

Senior member
May 25, 2020
572
496
136
It actually already does for mobile parts, to save on power consumption, depending on whether the CPU (preferring low latency) or the iGPU (preferring high bandwidth) is requesting data.
It has been able to since Zen 2, although going further upstream from the UMC, in-sync or 1:2 modes are required.
There are also the UMC and DDR PHY clock domains between the CCM and DRAM.
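
To spell out the clock domains being referred to, here is a small illustration using DDR5-6000 on a Zen 4 desktop part as the example. The FCLK value is just a commonly used one, not a spec, so treat the exact numbers as assumptions:

Code:
DDR_MT_S = 6000
MCLK = DDR_MT_S / 2      # 3000 MHz memory clock (DDR: two transfers per clock)

UCLK_IN_SYNC = MCLK      # UCLK:MCLK = 1:1 ("in-sync"), lower latency
UCLK_HALF = MCLK / 2     # UCLK:MCLK = 1:2, easier to reach very high data rates

FCLK = 2000              # fabric clock, running asynchronously on Zen 4

for mode, uclk in (("1:1", UCLK_IN_SYNC), ("1:2", UCLK_HALF)):
    print(f"DDR5-{DDR_MT_S}: MCLK={MCLK:.0f} MHz, UCLK({mode})={uclk:.0f} MHz, FCLK={FCLK} MHz")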
 
Reactions: moinmoin

Thunder 57

Platinum Member
Aug 19, 2007
2,753
3,970
136
Can we please stop with the accusations that Geekbench 6 is useless and non-transparent because it changed how it was approaching multi-threading compared to v5?

I'm not going to read 49 pages, why don't you point out the relevant context?

I'll just edit this right away. I don't see why focusing on "shared task" vs "separate task" should be such a big deal. Isn't most MT software using "separate task" methods? Where is this coming from anyway, and who are you responding to?
 
Last edited:
Reactions: Mopetar and Thibsie

coercitiv

Diamond Member
Jan 24, 2014
6,279
12,295
136
I'm not going to read 49 pages, why don't you point out the relevant context?
You only need to read page 2, which contains the summary, and then click on Multithreading, for example. It's a well-structured document.

Here's the text:
Geekbench 6 uses a “shared task” model for multi-threading, rather than the “separate task” model used in earlier versions of Geekbench. The “shared task” approach better models how most applications use multiple cores.

The "separate task" approach used in Geekbench 5 parallelizes workloads by treating each thread as separate. Each thread processes a separate independent task. This approach scales well as there is very little thread-to-thread communication, and the available work scales with the number of threads. For example, a four-core system will have four copies, while a 64-core system will have 64 copies.

The "shared task" approach parallelizes workloads by having each thread process part of a larger shared task. Given the increased inter-thread communication required to coordinate the work between threads, this approach may not scale as well as the "separate task" approach.

And here's a shorter version in case 3 paragraphs is too much, emphasis mine:
Geekbench 6 uses a “shared task” model for multi-threading, where each thread works on a part of a bigger task. This mimics how most applications use multiple cores.
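
The "may not scale as well" part is easy to put rough numbers on with a toy model. This is not Geekbench's model, just Amdahl's law with an arbitrary 10% serial/coordination fraction, to show how quickly a shared task stops rewarding extra cores:

Code:
def amdahl_speedup(n_cores, serial_fraction):
    return 1 / (serial_fraction + (1 - serial_fraction) / n_cores)

for n in (2, 4, 8, 16, 32):
    print(n, round(amdahl_speedup(n, 0.10), 2))
# 2 -> 1.82, 4 -> 3.08, 8 -> 4.71, 16 -> 6.4, 32 -> 7.8: diminishing returns,
# whereas fully independent copies ("separate task") would keep scaling ~linearly.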
 