Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


adroc_thurston

Platinum Member
Jul 2, 2023
2,814
4,133
96
RaptorLake has no inter-chiplet latency problem and the RAM controller is on the same chip
ye
Thanks to 2MB L2 instead of 1.25MB, RaptorLake gains approximately +4-5% higher IPC.
In select* workloads.
Zen has a RAM controller on a separate IOD
d2d penalty is tiny.
so it needs larger L3 to compensate
It needs larger L3 to compensate for a skinnier core with less reordering capacity.
and this is mainly why it benefits from large L3 + VCache in games.
it's a skinnier core.
Moreover, communication of one CCD with the neighboring CCD is via IOD.
Not that relevant.
Zen 6 is expected to introduce a single CCD with 16 cores and a shared 64MB L3.
WRONG
 

deasd

Senior member
Dec 31, 2013
528
806
136


So we now have:

9800X, 8 cores, 170 W TDP
Clock regression, ~100 MHz
IPC, ~10% compared to Zen4 <NEW>

OMG. I strongly recommend Mike Clark don't wake up anytime soon and keep sleeping until Zen6.
 

StefanR5R

Elite Member
Dec 10, 2016
5,633
8,107
136
The following is ONLY from a Distributed Computing perspective. This means 2 things: computing power and efficiency. We use all cores and most of the time SMP.

Zen 1 : way better than Bulldozer in all respects, and if I remember correctly cheaper and more efficient than the Intel counterparts.
Zen 2 : small improvements in performance, about the same efficiency as Zen 1.
Zen 3 : MUCH better performance AND efficiency compared to Zen 2. The larger L3 cache made a big difference in some apps.
Zen 4 : MUCH better performance than Zen 3, but about the same efficiency. But in apps that use AVX-512, nothing could touch the performance. For PrimeGrid, we had to disable SMT and pin cores to a CCX for maximum performance, but when we do, nothing that Intel has comes close.
No, something must have been mixed up there.
Zen 1 -> Zen 2: circa double the FP throughput per core, circa double the throughput/Watt
Zen 2 -> Zen 3: some throughput increase but barely any throughput/Watt increase in most cases, big benefit to special multithreaded workloads which have larger than 16 MB cache footprint
Zen 3 -> Zen 4: notably higher throughput and throughput/Watt, additional performance increase in vectorized FP workloads
in various Distributed Computing applications. (These are highly parallel, almost entirely compute-bound, power-limited workloads with FP focus. One could conclude that the manufacturing node updates are all that counts in this set of workloads. But really, microarchitecture updates <edit: and SOC updates> and node updates go hand in hand as they enable and leverage each other.)

[I don't have Zen 1/ Naples (but Broadwell-EP which has got similar throughput/Watt), nor do I have Zen 3 myself. I do have Zen 2/ Rome and Zen 4/ Genoa in machines which are configured to same core counts and similar power budgets. My conclusions relative to Zen 1 and Zen 3 rely on what I have seen from others' computers.]

Zen 5 in Distributed Computing? I trust that AMD carves out a decent perf/W update once again, despite only a minor manufacturing node update. But how much? Various hints earlier in this thread sounded promising to me. So far, though, 1T and/or iso-clock and/or integer performance characteristics have been more of a focus in this thread than nT iso-power FP.

For PrimeGrid, we had to disable SMT and pin cores to a CCX for maximum performance,
Actually SMT does measurably improve throughput in PrimeGrid on Zen 4, desktop and server, and does improve perf/W slightly. In contrast, on Zen 2 and Zen 3, SMT usage in PrimeGrid provides no or sometimes a small host throughput advantage but always reduces perf/W. (PrimeGrid is vectorized FP with large cache footprint, but not too large on Zen 3 and 4 if the user gives hints to the OS's process scheduler. Zen 2's cache is too small in many but not all of PrimeGrid's currently active projects.)
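To make the "hints to the OS's process scheduler" part concrete, here is a minimal sketch of pinning a process to the cores of one CCX on Linux. The helper names `ccx_cpu_ids` and `pin_to_ccx` are made up for illustration, and the CPU numbering (physical core c is logical CPU c, its SMT sibling is c + total_cores) is only an assumption; check the real layout with lscpu before relying on it.

```python
import os


def ccx_cpu_ids(ccx_index, cores_per_ccx=8, total_cores=16, use_smt=False):
    """Return the logical CPU ids of one CCX under an ASSUMED numbering.

    Assumption (hypothetical, verify on the real host): physical core c is
    logical CPU c, and its SMT sibling is logical CPU c + total_cores.
    """
    first = ccx_index * cores_per_ccx
    if first < 0 or first + cores_per_ccx > total_cores:
        raise ValueError("ccx_index out of range for this core count")
    cpus = list(range(first, first + cores_per_ccx))
    if use_smt:
        # Also include the SMT sibling of every core in this CCX.
        cpus += [c + total_cores for c in cpus]
    return set(cpus)


def pin_to_ccx(pid, ccx_index, **layout):
    """Restrict a process to one CCX; returns the CPU set that was requested.

    os.sched_setaffinity is only available on Linux, so this is a no-op
    elsewhere. CPUs that the process is not allowed to use are skipped.
    """
    wanted = ccx_cpu_ids(ccx_index, **layout)
    if hasattr(os, "sched_setaffinity"):
        usable = wanted & os.sched_getaffinity(pid)
        if usable:
            os.sched_setaffinity(pid, usable)
    return wanted
```

Calling `pin_to_ccx(0, 1)` would pin the calling process (pid 0 means "self") to the second CCX under the assumed layout. A BOINC setup would more likely do the same thing from the outside, for example with taskset, but the idea is identical: keep one task's threads within a single L3 domain.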
 

StefanR5R

Elite Member
Dec 10, 2016
5,633
8,107
136
Zen 5 in Distributed Computing? I trust that AMD carves out a decent perf/W update once again, despite only a minor manufacturing node update. But how much? Various hints earlier in this thread sounded promising to me. So far, though, 1T and/or iso-clock and/or integer performance characteristics have been more of a focus in this thread than nT iso-power FP.
PS, in the context of this specific application scenario, I expect that the top-end Zen 5 desktop SKUs with 230 W PPT limit (170 W TDP) make considerably more efficient use of this power budget, compared to their direct predecessors. Which would make them more attractive to somebody like myself who is interested in perf/Watt and in perf/host.
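For anyone who wants to put numbers on "perf/Watt and perf/host", a minimal sketch of the arithmetic follows. The function names are made up, and every figure in the usage note is a placeholder, not a measurement of any real CPU.

```python
def perf_per_watt(results_per_day, avg_power_watts):
    """Throughput per watt for one host (results/day per W)."""
    if avg_power_watts <= 0:
        raise ValueError("average power must be positive")
    return results_per_day / avg_power_watts


def relative_gain(new_value, old_value):
    """Fractional improvement of new over old, e.g. 0.25 means +25%."""
    if old_value <= 0:
        raise ValueError("old value must be positive")
    return new_value / old_value - 1.0
```

With placeholder inputs, two hosts that both run at a 230 W package limit and complete 100 and 130 results per day would compare as `relative_gain(perf_per_watt(130, 230), perf_per_watt(100, 230))`, which is about 0.30. At iso-power the perf/W gain and the perf/host gain are the same number, which is why the iso-power comparison is the convenient one here.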
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,814
4,133
96
Separate cache pools waste most of cache capacity to duplication
Ughh not quite.
And AMD did make it clear that unified 32MB cache pool of Zen3 is responsible for most part of game speed up.
Not even.
That's mainly for a different reason - that is, eliminating the inter-CCX penalty that made Zen2 suffer
Not even that, Zen3 was a major improvement in other ways, most notably branch prediction.
 

naukkis

Senior member
Jun 5, 2002
726
610
136
Ughh not quite.

Not even.

Not even that, Zen3 was a major improvement in other ways, most notably branch prediction.

Zen3 doubled the L3 cache available to a single thread. That's always a massive uplift for cache-sensitive applications, and it gave a 100% IPC uplift in some places. Direct AMD quote from the link posted before: "It also transitioned to a new "unified complex" design that brought 8 cores and 32MB of L3 cache into a single group of resources. This dramatically reduced core-to-core and core-to-cache latencies by making every element of the die a next-door neighbor with minimal communication time. Latency-sensitive tasks like PC gaming especially benefited from this change, as tasks now have direct access to twice as much L3 cache versus "Zen 2.""
 
Reactions: igor_kavinski

Gideon

Golden Member
Nov 27, 2007
1,687
3,837
136
I have noticed that there is less contribution from the technically inclined posters (including myself), so a separation of threads and focus may help to keep likeminded forum members engaged.
Sorta OT but interesting stuff is happening at Chips & Cheese:

Among other things:
The first project we have been working on is a new microbenchmark framework. This new framework will hopefully allow for standardization between different tests to keep things as consistent as possible. In the short term, this will also allow folks other than the Chips and Cheese team to add to the Chips and Cheese test suite.


In the long term, we hope that this framework will allow for more tests to be written than the current Chips and Cheese team could ever write on its own, along with diversifying test authors.

So it seems that more high-quality tech content is on the way, at least.

BTW This surprised me:
As for the members of the board:
...
  • George J. Cozma: President and Chairman
  • Dr. Ian Cutress: Vice President
  • Ryan Mull: Treasurer
  • ...
 

AMDK11

Senior member
Jul 15, 2019
314
206
116
Kinda wish C&C did more profile testing on games where Z3 is miles ahead of Z2. I think people have assumed it is the cache when it could more so be something core specific.
Core architecture changes + unified L3 cache as a whole architecture. I don't know how you can still think that L3 was completely irrelevant to IPC.

F.P.S. ≠ I.P.C.
Not true. If core A 4GHz + VCache compared to core B 4GHz without VCache allows you to get +15% more FPS, this is an increase in the IPC of the processor.

I don't know how you can deny facts and logic. Massacre
 
Reactions: spursindonesia

StefanR5R

Elite Member
Dec 10, 2016
5,633
8,107
136
"Instructions Per Cycle" means instructions per cycle. Is that so hard to memorize?

Edit, as an example, when one processor spins on a lock for 0.2 ms, and the other for 0.3 ms, which of the two processors got the higher Instructions Per Cycle count?
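One way to work through that question, as a rough sketch: IPC is just retired instructions divided by elapsed cycles, so a spinning core can post a perfectly healthy IPC while doing nothing useful. The 4 GHz clock and the spin-loop IPC of 2.0 below are assumptions picked for illustration, not properties of any real processor.

```python
def ipc(instructions_retired, cycles):
    """Instructions Per Cycle: retired instructions divided by cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions_retired / cycles


def spin_wait(wait_seconds, clock_hz=4.0e9, spin_loop_ipc=2.0):
    """Model a core busy-waiting on a lock.

    Returns (instructions_retired, cycles). Both default parameters are
    ASSUMED values for illustration only.
    """
    cycles = wait_seconds * clock_hz
    instructions = cycles * spin_loop_ipc
    return instructions, cycles


# Processor A spins for 0.2 ms, processor B for 0.3 ms.
instr_a, cycles_a = spin_wait(0.2e-3)
instr_b, cycles_b = spin_wait(0.3e-3)
ipc_a = ipc(instr_a, cycles_a)
ipc_b = ipc(instr_b, cycles_b)
```

Under these assumptions `ipc_a` and `ipc_b` come out identical, and B even retires more instructions in total, yet A is the one that gets back to useful work sooner. The counter alone does not tell you which processor finished the frame faster.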
 

AMDK11

Senior member
Jul 15, 2019
314
206
116
"Instructions Per Cycle" means instructions per cycle. Is that so hard to memorize?

Edit, as an example, when one processor spins on a lock for 0.2 ms, and the other for 0.3 ms, which of the two processors got the higher Instructions Per Cycle count?
A processor with a 0.2 ms latch can process more instructions per cycle than a processor with a 0.3 ms latch.

Less downtime means the core can accept more instructions (data) and process more of them at the same time.

This is an increase in the number of instructions processed per cycle, or IPC.

The goal of next-generation architecture designs is to reduce latency (core downtime/empty cycles) and increase throughput to maximize core saturation with data (instructions).

And it doesn't matter whether it's L0, L1, L2, L3, or even L4 cache, the cache is always part of the architecture and is designed to allow the core architecture to achieve higher IPC.
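The argument above can be put into a simple stall model, sketched here under stated assumptions: the core retires `core_ipc` instructions per cycle while it has data, and every cache miss adds a fixed, non-overlapped penalty. Real cores overlap misses (memory-level parallelism), so this overstates the effect; all names and numbers are hypothetical.

```python
def effective_ipc(instructions, core_ipc, misses_per_kilo_instr,
                  miss_penalty_cycles):
    """Effective IPC for a run that mixes busy cycles and miss stalls.

    Simplified model (an assumption, not how a real core behaves): stalls
    never overlap with each other or with useful work.
    """
    busy_cycles = instructions / core_ipc
    misses = instructions / 1000.0 * misses_per_kilo_instr
    stall_cycles = misses * miss_penalty_cycles
    return instructions / (busy_cycles + stall_cycles)


# Same core, same clock, same instruction stream; only the miss rate differs
# (placeholder values standing in for "smaller cache" and "larger cache").
small_cache = effective_ipc(1e9, core_ipc=4.0, misses_per_kilo_instr=5.0,
                            miss_penalty_cycles=200)
large_cache = effective_ipc(1e9, core_ipc=4.0, misses_per_kilo_instr=2.0,
                            miss_penalty_cycles=200)
```

In this model the larger cache raises effective IPC because fewer cycles are spent waiting, which is the case being argued here. It only holds when the instruction count is the same in both runs; a workload that spins or polls while waiting changes its instruction count, and then IPC and frame rate stop tracking each other.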

Will you continue to try to distort reality?
 

StefanR5R

Elite Member
Dec 10, 2016
5,633
8,107
136
Will you continue to try to distort reality?
If you need to know: I can't continue something which I never started. Just take what I wrote and avoid adding meaning which is not in there.

BTW, the wording in the quote could come across as an attempt at an insult at first glance, but since we are in a technical forum, I am sure this was not the intention.
 

AMDK11

Senior member
Jul 15, 2019
314
206
116
I think that the topic of the impact of cache and delays on IPC has been developed so much that it is obvious. Let's go ahead and continue the topic of Zen5.
 

Wolverine2349

Senior member
Oct 9, 2022
205
80
61
That's mainly for a different reason - that is, eliminating the inter-CCX penalty that made Zen2 suffer

Yes good point. And yet it stopped at 8 cores on a single CCX/CCD with Zen 3 and no further improvement with Zen 4 and 5 and probably beyond as well. Beyond 8 cores there is still the inter CCX/CCD penalty that makes CPUs suffer.

When, if ever, will there be more than 8 cores on a single CCX/CCD? Zen 4 no, Zen 5 also no, and even Zen 6 appears to be a no, though that is still too far out.

Given how much this matters for games, it would be nice to see. True, no more than 8 cores are needed for gaming for now, though a few games benefit marginally from more than 8 cores, and that number may increase over time to where it matters even more. I hope it does not for any games unless we get a CPU with more than 8 cores on a CCD very soon.

I do not want dual CCD for the sake of the rare games that can schedule around the latency penalty. Best to have one size that fits all gaming scenarios, as for many games dual CCDs mean a bad latency penalty.
 