Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


Ajay

Lifer
Jan 8, 2001
15,783
7,995
136
And what does gaming have to do with desktop? Yes, it's a PART of desktop, but there are others. Some want office machines, some want mini workstations, some participate in DC or other things that require a lot of cores. Why do you think gaming owns DIY? Consoles? Handhelds? Gaming is all over, but it doesn't own DIY.
A bit confused here about how your DC software works. For high-performance servers (high load) it's always been about ST × cores for net performance. That's the mistake some RISC vendors made that ceded the high-performance crown in servers to Intel. Does DC software run differently? (F@H didn't, except for Big Units, where a minimum number of cores had to be used.)
 
Reactions: Joe NYC

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,670
14,676
136
A bit confused here about how your DC software works. For high-performance servers (high load) it's always been about ST × cores for net performance. That's the mistake some RISC vendors made that ceded the high-performance crown in servers to Intel. Does DC software run differently? (F@H didn't, except for Big Units, where a minimum number of cores had to be used.)
Let's put it this way. My Genoa farm kills everything for performance, especially when AVX-512 is used, which the recent PG race proved. My 4 Genoa + 6 7950X + 2 7763 Milan + 2 7V12 Rome beat everything except 1,620 older Xeon cores. Yes, it's ST × cores, and Genoa has both. But on the flip side, the 64-core 9554 Genoa runs at 3.5 GHz fully loaded... 64 cores of Genoa in one chip = 2.7 7950Xs (16 cores each), so chasing ST works, but who wants almost 3 times as many boxes? A 64-core 9554 at 3.5 GHz fully loaded is a good compromise.
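For the curious, here is the arithmetic that ratio implies, as a rough sketch (the 2.7x figure and the core counts are Markfw's own numbers from the post above; nothing here is independently measured):

```python
# Back-of-the-envelope from the figures in the post above:
# one 64-core EPYC 9554 ~ 2.7x a 16-core Ryzen 7950X box
# on this AVX-512-heavy DC workload.
genoa_cores, ryzen_cores = 64, 16
chip_ratio = 2.7  # whole-chip throughput ratio claimed above

# Implied per-core comparison (Genoa core vs. 7950X core):
per_core = chip_ratio * ryzen_cores / genoa_cores  # ~0.68
print(f"1 Genoa box ~ {chip_ratio} Ryzen boxes; "
      f"each 3.5 GHz Genoa core ~ {per_core:.2f}x a 7950X core")
```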
 
Last edited:

Ajay

Lifer
Jan 8, 2001
15,783
7,995
136
But on the flip side, the 64-core 9554 Genoa runs at 3.5 GHz fully loaded... 64 cores of Genoa in one chip = 2.7 7950Xs (16 cores each), so chasing ST works, but who wants almost 3 times as many boxes? A 64-core 9554 at 3.5 GHz fully loaded is a good compromise.
Thanks, I understand your point now. Wish I had the dosh to get back into DC. Electricity rates in NH have gone through the roof in recent years. Coal and nuclear plant shutdowns, along with increased NG prices, are killing us (on top of hardware costs).
 
Reactions: Joe NYC

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,670
14,676
136
Thanks, I understand your point now. Wish I had the dosh to get back into DC. Electricity rates in NH have gone through the roof in recent years. Coal and nuclear plant shutdowns, along with increased NG prices, are killing us (on top of hardware costs).
It's not cheap for me either, even with hydro and wind as the sources (there might be others too). That month cost me $1,000. But efficiency is also king: one 320 W Genoa almost equals three 142 W 7950Xs (all my 7950Xs are set to Eco mode).
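Taking the wattage figures here at face value (320 W for the 9554 under load, 142 W per Eco-mode 7950X, and the ~2.7x throughput ratio from the earlier post), the efficiency gap works out roughly as follows; this is a sketch of the posters' numbers, not a measurement:

```python
# Rough perf/W comparison from the numbers quoted in this exchange.
genoa_watts = 320        # EPYC 9554 under full DC load (per the post)
ryzen_watts = 142        # 7950X in Eco mode (per the post)
throughput_ratio = 2.7   # one 9554 ~ 2.7 Eco-mode 7950X boxes

ryzen_total = throughput_ratio * ryzen_watts  # ~383 W for equal work
print(f"Equal throughput: ~{ryzen_total:.0f} W of Ryzen vs "
      f"{genoa_watts} W of Genoa ({ryzen_total / genoa_watts:.2f}x)")
```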
 

StefanR5R

Elite Member
Dec 10, 2016
5,633
8,107
136
Below in the spoiler is some off-topic talk. (The last from me in this thread, I promise.) I look forward to Zen 5 discussion.
Fixed it for you.
Your "fix" does not make sense.
There is virtually no limit to the number of cores which you can buy or rent *right now*. In the big picture, your costs scale merely linearly with the desired core count, give or take. It's very affordable today to go orders of magnitude beyond 16c/32t. In contrast, there is a definite, absolute limit to the single-core performance you can buy or rent right now. No matter how much you are ready to pay, you can't break that ceiling. Yet at the same time, there are countless workloads, not least among those typically executed on so-called desktop computers, at which (1) application performance is governed by single-thread performance _and_ (2) user experience would benefit from higher application performance.

I am beyond certain that many, if not most, engineering loads need a combination of both. Engineering workloads are vast and diverse.
We have a large number of AMD EPYC 9374F clusters, and we chose this specific SKU due to its high boost clock within the 4th-gen family. Sometimes I wish it had more cores, but the number of ST passes in our loads is high enough that we are currently looking to migrate to 2P 9374F clusters instead of 1P 64C clusters.
It is my experience too with engineering applications that a mixture of ST _and_ MT computing is a common case, with both portions taking up notable fractions of the overall run time. Sometimes the presence of intense ST computing can be chalked up to the usual lack of optimization, because software product management sets other priorities. Other times it really is because the particular sub-problem is technically very difficult to parallelize.
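That ST/MT mixture is Amdahl's law in action, and it is why a high-boost part like the 9374F can beat a higher-core-count SKU on such jobs. A minimal sketch (the 20% serial fraction is a made-up illustration, not a figure for any workload mentioned here):

```python
# Amdahl's law: speedup on N cores when a fraction s of the runtime
# is serial (single-threaded) and the rest parallelizes perfectly.
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# Hypothetical job spending 20% of its time in a hard-to-parallelize
# ST phase: core count saturates quickly, ST speed keeps paying off.
for n in (8, 32, 64, 128):
    print(f"{n:4d} cores -> {amdahl_speedup(0.2, n):.2f}x (ceiling 5x)")
```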

A bit confused here about how your DC software works. For high-performance servers (high load) it's always been about ST × cores for net performance. That's the mistake some RISC vendors made that ceded the high-performance crown in servers to Intel. Does DC software run differently? (F@H didn't, except for Big Units, where a minimum number of cores had to be used.)
In the bigger picture, Distributed Computing is all about embarrassingly parallel computing. It entirely depends on the ability to divide a huge computing task into a very large number of small tasks which can be solved almost independently of each other, with very little communication happening between the work-distribution/result-collection server and the compute clients, and no communication at all between compute clients. The clients can be small or big, slow or fast; they can be online 24/7 or down for much of the day or week. From the POV of the science project, the sum of the performances of all these clients makes up the project performance.
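A minimal sketch of that pattern (the work-unit function is a hypothetical stand-in; real DC clients such as BOINC add checkpointing, validation, and server communication on top):

```python
# Embarrassingly parallel computing in miniature: a big job is split
# into independent work units; workers never talk to each other, and
# only results flow back to the collector.
from concurrent.futures import ProcessPoolExecutor

def crunch(start: int) -> int:
    # Hypothetical stand-in for a science kernel (e.g. one work unit).
    return sum(i * i for i in range(start, start + 100_000))

if __name__ == "__main__":
    work_units = range(0, 3_200_000, 100_000)  # 32 independent tasks
    with ProcessPoolExecutor() as pool:        # sized to local cores
        results = list(pool.map(crunch, work_units))
    # The project's performance is simply the sum of all clients' rates.
    print(f"collected {len(results)} results")
```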

Given this, an individual contributor who wants to donate more computer time can choose between deploying few big or more small clients, and between having few clients active much of the time or more clients but at part-time.

The picture becomes more differentiated when we look at those individual tasks on the compute clients; they differ between Distributed Computing projects. Most typically they are
  • single-threaded tasks which have modest resource requirements per task, such that e.g. a 16c/32t CPU on dual-channel memory has no problem running 32 of these tasks in parallel,
  • single-threaded tasks which offload most of their computation onto a GPU. (A cheap consumer GPU works fine for that.)
But there are also some Distributed Computing projects which hand out
  • multi-threaded tasks which easily scale to a modest number of threads, such that one would typically run one such task at once on a typical desktop CPU or a small number of such tasks concurrently on typical server CPUs,
  • single- or multithreaded tasks which have one or another special resource requirement, such that they are difficult to scale to the particular hardware which a Distributed Computing contributor has got at hand.
    For example, there have been meteorology tasks out there which created so much result data that a somewhat faster and wider computer could easily compute these tasks faster than it could then upload the results to the project server over a common home/small-office internet link. (Incidentally, the result collection server of this project broke down several times under the sheer rate of data returns from all clients combined. Project management and the service provider had underestimated the demand on the collection server.)
    As another example of such less common task types, the tasks of the latest Distributed Computing competition which Markfw mentioned required 30 MB of CPU cache per task instance, otherwise they would be slowed down a lot by heavy memory accesses. (A quick sizing sketch follows this list.)
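Here is that sizing sketch (the 30 MB per instance is from the post above; the per-CCD L3 sizes are the public specs of the CPUs discussed in this thread):

```python
# How many 30 MB tasks stay cache-resident? Zen 4 has 32 MB of L3
# per CCD, and L3 is not shared across CCDs, so count per CCD.
task_cache_mb = 30
l3_per_ccd_mb = 32
ccds = {"Ryzen 7950X": 2, "EPYC 9554": 8}

for cpu, n in ccds.items():
    resident = n * (l3_per_ccd_mb // task_cache_mb)
    print(f"{cpu}: ~{resident} cache-resident task(s) across {n} CCDs")
# Tasks beyond that spill to DRAM and slow down heavily, which is
# one reason the many-CCD Genoa parts shone in that competition.
```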
 

H433x0n

Senior member
Mar 15, 2023
933
1,032
96
Lol, it's old stuff for projections.
It’s no more than 9 months old.
1t IPC isn't N-copy SIR IPC.
It doesn’t state that anywhere. The numbers it provides for Zen 3 & Zen 4 are the proper IPC. Why would they give the legit values for previous arch but sandbag Zen 5 with a Turin-D core in a suboptimal scenario for an internal document? That doesn’t make any sense.

Also, you said it's an old, invalid document anyway? So it's not representative of Turin-D anymore?
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,818
4,152
96
It’s no more than 9 months old.
A bit older.
It doesn’t state that anywhere
All server IPC projections for all vendors tend to be N-copy for both IPC and perf total.
Why would they give the legit values for previous arch
Those are specifically server numbers.
Zen 4 is +14% SIR over +13% client 1T due to higher SMT yield (which they disclosed at this year's Hot Chips).
That doesn’t make any sense.
Yes it does when you know Turin SIR perf numbers.
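For readers unfamiliar with the jargon: N-copy ("rate") throughput runs one independent copy of the benchmark per thread, so SMT contributes, while 1T runs a single copy. A toy illustration of the difference (made-up integer kernel, not SPEC; absolute numbers will vary by machine):

```python
# Toy contrast between 1-copy and N-copy ("rate") throughput.
import os, time
from concurrent.futures import ProcessPoolExecutor

def kernel(_: int) -> int:
    # Stand-in integer workload (not a real SPEC component).
    return sum(i % 7 for i in range(2_000_000))

def copies_per_second(n_copies: int) -> float:
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=n_copies) as pool:
        list(pool.map(kernel, range(n_copies)))
    return n_copies / (time.perf_counter() - start)  # crude, incl. startup

if __name__ == "__main__":
    n = os.cpu_count() or 1
    print(f"1-copy: {copies_per_second(1):.2f}/s; "
          f"{n}-copy: {copies_per_second(n):.2f}/s "
          f"(SMT yield shows up only in the N-copy figure)")
```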
 
Reactions: Tlh97 and Joe NYC

Saylick

Diamond Member
Sep 10, 2012
3,272
6,755
136
Might as well show the slides MLID revealed so that we're all on the same page regarding what's discussed:



Edit: Had some time to look at the slides more closely, and here are my thoughts.

1) That timeline slide... The years seem off. It's almost the end of 2023 and Zen 5 isn't even out yet, which suggests it's either a fake slide, a timeline that isn't to scale, or out of date. Seeing as Covid delayed things, if it's an out-of-date slide, it probably dates to sometime before Covid.
2) Regarding the Zen 5 block diagram, here's Zen 4's for reference.

Zen 4 has 4 ALUs vs 6 for Zen 5.
Zen 4 has 6-op dispatch vs 8 for Zen 5.
Zen 4 has 3 loads, 2 stores vs 4 loads, 2 stores for Zen 5.
Zen 4 has 32 KiB L1D cache vs 48 KiB for Zen 5.
 
Last edited:

adroc_thurston

Platinum Member
Jul 2, 2023
2,818
4,152
96
Zen 4 has 4 ALUs vs 6 for Zen 5.
Zen 4 has 6-op dispatch vs 8 for Zen 5.
Zen 4 has 3 loads, 2 stores vs 4 loads, 2 stores for Zen 5.
Zen 4 has 32 KiB L1D cache vs 48 KiB for Zen 5.
Yes, Zen5 looks for the most part similar to Nuvia Phoenix or Apple Firestorm/Avalanche/Everest 'cept the non-baby mode FPU.
Which is...
I mean I've said that a thousand times over before.
 
Reactions: Tlh97 and inf64

Gideon

Golden Member
Nov 27, 2007
1,688
3,844
136
So, if the slides are to be believed (and they look quite authentic), the core ends up being much more similar to Alder Lake than I anticipated. But still noticeably fatter.


  • The same 12-way 48 KB L1 cache as Golden Cove (hopefully without the latency penalty)
  • 8-wide dispatch (+2 vs Alder Lake and Zen 4)
  • 6 ALUs (+1 vs Alder Lake, +2 vs Zen 4)
  • 4 loads / 2 stores per cycle (vs 3/2 for Golden Cove, 2/1 for Zen 4)
    • If I'm reading this right, these are 512-bit (64-byte)? That's a massive uplift from Zen 4 if true (4x the throughput in ideal AVX-512 scenarios; rough math after this post)
The biggest unknown for me is how they plan to feed the beast. There are no mentions of any decoder changes; surely that would be an absurd bottleneck if left unchanged?


Anyway, looking forward to comparisons with the Arrow Lake core. In the end, they could end up pretty similar in width, so it would all come down to execution.
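The 4x estimate in the list above works out as follows, assuming the leaked figures are per-cycle L1D vector widths (Zen 4's numbers are documented; Zen 5's are the slide's claim, not confirmed silicon behavior):

```python
# Peak L1D load bandwidth per cycle, taking the leak at face value.
zen4_loads, zen4_bits = 2, 256   # two 256-bit loads/cycle (Zen 4)
zen5_loads, zen5_bits = 4, 512   # four 512-bit loads/cycle (per leak)

zen4_bytes = zen4_loads * zen4_bits // 8   # 64 B/cycle
zen5_bytes = zen5_loads * zen5_bits // 8   # 256 B/cycle
print(f"Zen 4: {zen4_bytes} B/cycle; Zen 5 (leak): {zen5_bytes} B/cycle "
      f"-> {zen5_bytes // zen4_bytes}x ideal AVX-512 load throughput")
```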
 

Gideon

Golden Member
Nov 27, 2007
1,688
3,844
136
Zen 4 has 3 loads, 2 stores vs 4 loads, 2 stores for Zen 5.
Hmm, have I misunderstood it? I always thought it's two 256-bit loads and one 256-bit store:


This is likely because AMD didn’t implement wider buses to the L1 data cache. Zen 4’s L1D can handle two 256-bit loads and one 256-bit store per cycle, which means that vector load/store bandwidth remains unchanged from Zen 2. The Gigabyte leak suggested alignment changed to 512-bit, but that clearly doesn’t apply for stores.
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,818
4,152
96

Henry swagger

Senior member
Feb 9, 2022
397
255
106
This infraction is for trolling AMD discussions again. Restrain your overzealous support for Intel to the appropriate discussions.
So, if the slides are to be believed (and they look quite authentic), the core ends up being much more similar to Alder Lake than I anticipated. But still noticeably fatter.


  • The same 12-way 48 KB L1 cache as Golden Cove (hopefully without the latency penalty)
  • 8-wide dispatch (+2 vs Alder Lake and Zen 4)
  • 6 ALUs (+1 vs Alder Lake, +2 vs Zen 4)
  • 4 loads / 2 stores per cycle (vs 3/2 for Golden Cove, 2/1 for Zen 4)
    • If I'm reading this right, these are 512-bit (64-byte)? That's a massive uplift from Zen 4 if true (4x the throughput in ideal AVX-512 scenarios)
The biggest unknown for me is how they plan to feed the beast. There are no mentions of any decoder changes; surely that would be an absurd bottleneck if left unchanged?


Anyway, looking forward to comparisons with the Arrow Lake core. In the end, they could end up pretty similar in width, so it would all come down to execution.
Zen 5 can compete if they can hit 6.2 GHz... or Arrow Lake wins easily.
 