Question Zen 6 Speculation Thread

Page 109 - AnandTech forums

yuri69

Senior member
Jul 16, 2013
657
1,172
136
Which also says that performance of Zen 4 without V-Cache was not compelling enough to pay the premium.

It also forced AMD to offer deep discounts.
One has to be reasonable. The performance improvement would have had to be insane (think Zen 5 with 40% IPC) to be compelling at that premium.

As already mentioned, the competition was there - Intel's Raptor Lake was fresh from the oven and pretty good vs Zen 4.
 

Win2012R2

Senior member
Dec 5, 2024
936
884
96
X3D production is inherently serial between the bit that makes the chip and the bit that puts the cache on the chip.
GPU production is inherently serial because you need to get the chip first, then it gets shipped to be placed onto boards, memory gets added, etc. - all extra steps required, per GPU.

Yes, there is a lot more "capacity" to do this better-understood, simpler stuff, and it certainly costs the $15-20 that is allegedly the ballpark for X3D integration. Low capacity? Order more! It can't be expensive if it only costs that little per chip.

You don't need to make 80s or 90s for the other one to be available

Yes you do, if you want a successful commercial launch, which is what ALL of THIS is ABOUT - not the nerdy take: "this chip is ready to use before the extra X3D step, therefore it's a delay, can't have that!".
 

StefanR5R

Elite Member
Dec 10, 2016
6,514
10,149
136
Shared L2 (Apple and Qualcomm have shown you can do this with a manageable latency penalty: ~5ns vs ~2.5ns for 16x the size, original source)
Hmm, isn't 5 twice that of 2.5? The graphs which you linked are informative. But even more informative would be simulations of a selection of different workloads, with different cache hit rates and different cache latencies.
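A back-of-the-envelope version of such a simulation is the classic average memory access time (AMAT) model. The ~2.5 ns and ~5 ns latencies come from the quote above; the hit rates and the miss penalty below are purely illustrative assumptions, not measured figures:

```python
# Rough AMAT comparison: small private L2 (~2.5 ns) vs. a 16x larger
# shared L2 (~5 ns). Hit rates and miss penalty are made-up assumptions.

def amat(l2_latency_ns, l2_hit_rate, miss_penalty_ns):
    """Average memory access time for accesses that reach L2."""
    return l2_latency_ns + (1 - l2_hit_rate) * miss_penalty_ns

MISS_PENALTY_NS = 80.0  # assumed cost of going to L3/DRAM on an L2 miss

# Workloads differ in how much the 16x capacity improves the hit rate:
# (hit rate in the small L2, hit rate in the big L2)
workloads = {
    "small footprint":  (0.90, 0.92),
    "medium footprint": (0.60, 0.85),
    "large footprint":  (0.30, 0.75),
}

for name, (hr_small, hr_big) in workloads.items():
    t_small = amat(2.5, hr_small, MISS_PENALTY_NS)
    t_big = amat(5.0, hr_big, MISS_PENALTY_NS)
    print(f"{name}: private {t_small:.1f} ns vs shared {t_big:.1f} ns")
```

With these assumed numbers, the small-footprint workload loses from the doubled L2 latency, while the large-footprint one gains far more from the higher hit rate - which is exactly why the answer is workload-dependent.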

[Shared L2…]
Bad for server CPUs, such as Zen ones.
Shared L2 caches in server processors are so bad that IBM went ahead and designed a whole mainframe processor generation (Telum) that shared all the L2 caches among the cores in one dynamic virtual L3 pool.
This paints too simplistic a picture of what they actually did. First, they still have L2 and L3 (and even L4), just that L3 and L4 are now "virtual" by way of sharing SRAM with L2 dynamically. Still, L3 (let alone L4) latency is higher than L2 latency — typically much higher. Second, and this is what your argument neglects, there is a bunch of QoS not-so-secret sauce in place to keep the extent of sharing in check, IOW to maintain a prioritized reserve of low-latency L2 cache for each active core. If a core is not completely idle, it does have L2 cache which other cores can't displace. So in this regard, these mainframe CPUs don't really differ from what @Win2012R2 claimed about server CPUs.

Meanwhile, Xeon's shared L3 cache has become so unwieldy, i.e. cache latency has become so nonuniform across the entirety of the L3 cache, that they now offer the option of dividing the SoC into NUMA domains (clustering modes), such that groups of cores preferably access the nearest subsets of the overall L3 cache. Oh, and this option is on by default. Turns out, making shared caches bigger gets tricky if you also want to keep them fast.
 

moinmoin

Diamond Member
Jun 1, 2017
5,217
8,399
136
GPU production is inherently serial because you need to get chip first, then get shipped to be placed into boards, add memory etc - all extra steps required, per GPU.
Are you trying to make sure everybody ends up thinking you are indeed making bad-faith arguments?

Because the difference is very clear: GPU production is bog-standard stuff any ODM can and will do. V-Cache packaging, on the other hand, is seriously bottlenecked by being a high-end proprietary packaging technique only a few companies can do at select packaging factories.
 
Reactions: Joe NYC and inquiss

LightningZ71

Platinum Member
Mar 10, 2017
2,237
2,741
136
Hmm, isn't 5 twice that of 2.5? The graphs which you linked are informative. But even more informative would be simulations of a selection of different workloads, with different cache hit rates and different cache latencies.

[Shared L2…]


This paints too simplistic a picture of what they actually did. First, they still have L2 and L3 (and even L4), just that L3 and L4 are now "virtual" by way of sharing SRAM with L2 dynamically. Still, L3 (let alone L4) latency is higher than L2 latency — typically much higher. Second, and this is what your argument neglects, there is a bunch of QoS not-so-secret sauce in place to keep the extent of sharing in check, IOW to maintain a prioritized reserve of low-latency L2 cache for each active core. If a core is not completely idle, it does have L2 cache which other cores can't displace. So in this regard, these mainframe CPUs don't really differ from what @Win2012R2 claimed about server CPUs.

Meanwhile, Xeon's shared L3 cache has become so unwieldy, i.e. cache latency has become so nonuniform across the entirety of the L3 cache, that they now offer the option of dividing the SoC into NUMA domains (clustering modes), such that groups of cores preferably access the nearest subsets of the overall L3 cache. Oh, and this option is on by default. Turns out, making shared caches bigger gets tricky if you also want to keep them fast.
I never said that it was perfect, but they have a VERY effective strategy for sharing it that hides a lot of the problems and also gives some big advantages for things that have high locality in a smaller footprint.
 

Win2012R2

Senior member
Dec 5, 2024
936
884
96
The GPU production is bog standard stuff any ODM can and will do. V-cache package on the other hand is seriously bottlenecked by being a high end proprietary packaging technique only few companies can do at select packaging factories.
If it costs 15-20 bucks per X3D packaging (as claimed here, and retail pricing suggests it can't be far off), then it is comparable (cost-wise) with what is done for GPUs.

"seriously bottlenecked", "high end", "proprietary" - big words for when actual costs are that low.

Scale up those factories then - place long-term big orders, there - solved it for you. It can't be expensive if they only charge 15-20 per CPU, which adds at least 100 to the sale price.
 

StefanR5R

Elite Member
Dec 10, 2016
6,514
10,149
136
[the future of private versus shared caches, with the example of Telum]
I never said that it was perfect, but they have a VERY effective strategy for sharing it that hides a lot of the problems and also gives some big advantages for things that have high locality in a smaller footprint.
True. Sharing of SRAM is so important for them that they do it even across packages, across sockets, and across drawers. Evidently they can stomach the huge amount of metadata broadcasting and data movement across their entire system which this caching regime entails. It is a four-level cache hierarchy, applied in a quite special field of computing.

Anyway; I am rephrasing the arguments so far:
@Gideon in #2,669 — SRAM has got an area cost. Wring the best functionality out of a given SRAM size. Turn the level 2 cache from private to shared.
@Win2012R2 in #2,677/#2,682 — Private caches are a feature, not a bug, for the sake of a semblance of determinism in an environment with potentially noisy neighbors.
@LightningZ71 in #2,681 — Witness IBM Telum's sharing.
@StefanR5R in #2,703 — There is still private L2 cache left in Telum.

But yes, Telum is an example of making more out of a given SRAM size: They did not drop a cache level from their hierarchy, but they brought dynamics into which cache line is serving at which cache level.

So… what's the future? Will x86 CPU makers drop a cache level in client CPUs? Will ARM vendors add a cache level in client CPUs?
  • On the one hand, SRAM area scaling has been lagging behind core logic area scaling.
  • On the other hand, core logic clock speed keeps rising (not as fast as some might wish, but still) whereas DRAM latency is basically stuck where it was already many years ago, IOW, the latency gap between cores and main memory keeps widening.
  • In addition, as SRAM caches grow larger in size and topologically, and even begin to get spread over different chiplets, cache latency becomes increasingly nonuniform, from fast near cache to slow remote cache, differently for individual cores.
Or will x86 or ARM CPU makers take a page out of IBM's book, or in Dr. Ian Cutress' words, did IBM preview the future of CPU caches? Well, there are the costs which I mentioned, but maybe they eventually become affordable for server, or even client, if limited to on-chip or on-package dynamic sharing (costs in terms of transistors, routing, and operating power). Plus the monetary costs of R&D, whether licensing the technology or working around IBM's patented IP.
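The widening core-to-DRAM gap in the second bullet above is easy to put into numbers: DRAM latency that stays flat in nanoseconds costs more and more core cycles as clocks rise. The ~70 ns figure below is an illustrative assumption, not a measurement of any particular platform:

```python
# A DRAM access whose latency is constant in wall-clock time stalls the
# core for more cycles the faster the core is clocked. ~70 ns is an
# assumed, illustrative load-to-use DRAM latency.

DRAM_LATENCY_NS = 70.0

for clock_ghz in (2.0, 3.0, 4.0, 5.0, 6.0):
    cycles = DRAM_LATENCY_NS * clock_ghz  # ns * GHz = core cycles
    print(f"{clock_ghz:.1f} GHz core: a DRAM access stalls ~{cycles:.0f} cycles")
```

At 2 GHz that assumed miss costs ~140 cycles; at 6 GHz, ~420 - which is why each generation has more cycles to hide, and more incentive for bigger (or cleverer) caches.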

Edit,
this all strayed off the topic of the thread. As for Zen 6: Strix Halo is said to give a preview of Zen 6 client. Strix Halo's cache setup is unchanged versus dual-CCD Ryzens, except that there is a MALL cache now which could be used for the CPU cores but is, by default, used by the GPU only. As for Zen 6 server, I wonder if and what it might inherit from MI300.
 
Last edited:
Reactions: Joe NYC

OneEng2

Senior member
Sep 19, 2022
612
850
106
Which also says that performance of Zen 4 without V-Cache was not compelling enough to pay the premium.

It also forced AMD to offer deep discounts.
Yes it did. I wonder if AMD will make a bridge platform to DDR6 in the future as a result of that experience?
Raptor Lake was better at that time as well and people on LGA1700 had a good upgrade path
It was better at some things, but at the same time, AMD took a devastating lead in DC. Hard to fault the strategy IMO.
 
Reactions: Joe NYC

511

Platinum Member
Jul 12, 2024
2,399
2,111
106
It was better at some things, but at the same time, AMD took a devastating lead in DC. Hard to fault the strategy IMO.
No doubt with Zen 4, but now the lead is way smaller with GNR in DC - though client took a hit.
 

Gideon

Platinum Member
Nov 27, 2007
2,012
4,986
136
Or will x86 or ARM CPU makers take a page out of IBM's book, or in Dr. Ian Cutress' words, did IBM preview the future of CPU caches? Well, there are the costs which I mentioned, but maybe they eventually become affordable for server or client even, if limited to on-chip or on-package dynamic sharing (costs in terms of transistors, routing, and operating power). Plus monetary costs of R&D, both if licensing the technology or if working around IBM's patented IP.

Dr Ian Cutress said:
In the Q&A following the session, Dr. Christian Jacobi (Chief Architect of Z) said that the system is designed to keep track of data on a cache miss, uses broadcasts, and memory state bits are tracked for broadcasts to external chips. These go across the whole system, and when data arrives it makes sure it can be used and confirms that all other copies are invalidated before working on the data. In the slack channel as part of the event, he also stated that lots of cycle counting goes on!

Overall complexity is also part of the equation. I'm quite sure the boost and sleep behavior of Zen cores is vastly more complicated than on Telum, thus cycle counting is also a more complex endeavor. Actually, do IBM's chips even change frequency at all beyond some "no load at all" power states? I'm not sure they need to, considering the systems they're running in (business mainframes doing transactions 24/7, etc.).

Anyway, in the software world, added complexity eventually really slows you down, even if the code itself is sound. I'm talking about the kind of complexity where some simple changes in the hardest modules take you a week to implement when it would be half a day elsewhere. I'm quite sure it's the same in hardware design. Thus, the idea might be great, but would take too long to design or could slow down future iterations too much.
 

OneEng2

Senior member
Sep 19, 2022
612
850
106
No doubt with Zen 4 but now the lead is way less with GNR in DC but the client took a hit.
GNR is still significantly behind Turin. It just isn't hilariously behind any more. It is not accurate to say that a 20-40% lead is not "significant". It is just that Intel was behind by >100% for quite some time so suddenly "20-40%" seems like they have "caught up".

Additionally, I heard the argument that multi-socket in GNR had a bug and therefore those benchmarks were not valid. We are closing in on a year since the GNR release and those benchmarks still appear valid. Just like Intel said ARL just needed an update and performance would be greatly improved.

No, I still believe that while GNR was a big step up for Intel, they are still way behind the curve in DC. Intel desperately needs CWF to best not just Turin, but AMD's Zen 6-based EPYC as well, or it will continue to devastate Intel's bottom line.
Does this have any relevance to the thread topic (Zen 6 speculation)?
Certainly. We are all speculating on what features and capabilities will be in Zen 6 and how well these features and capabilities will meet market demand compared to Intel's features and capabilities.

The history of both companies' processor lineups seems particularly relevant in this context, don't you think?
 
Reactions: Joe NYC

511

Platinum Member
Jul 12, 2024
2,399
2,111
106
GNR is still significantly behind Turin. It just isn't hilariously behind any more. It is not accurate to say that a 20-40% lead is not "significant". It is just that Intel was behind by >100% for quite some time so suddenly "20-40%" seems like they have "caught up".
It was 20% in the Phoronix suite; in SPECint they are pretty even, not counting Turin D of course, not to mention Phoronix forgot to review the HW accelerators in GNR.
I am expecting Zen 6 to be 192C/384T and Zen 6c to be 256C/512T; for the servers we have a new socket as well.
Additionally, I heard the argument that multi-socket in GNR had a bug and therefore those benchmarks were not valid. We are closing in on a year since the GNR release and those benchmarks still appear valid. Just like Intel said ARL just needed an update and performance would be greatly improved.
Phoronix never re-benched those CPUs after the initial review, did he?
No, I still believe that while GNR was a big step up for Intel, they are still way behind the curve in DC. Intel desperately needs CWF to best not just Turin, but AMD's Zen 6-based EPYC as well, or it will continue to devastate Intel's bottom line.
It is not going to happen; CWF is not going to beat Zen 6. But Zen 6 is closer to the DMR launch than it is to Clearwater Forest. Best guess is DMR/Zen 6 will launch in Q3 '26, like last year.
 

DrMrLordX

Lifer
Apr 27, 2000
22,596
12,484
136
Intel desperately needs CWF to best not just Turin, but AMD's Zen 6-based EPYC as well, or it will continue to devastate Intel's bottom line.
Clearwater Forest is supposedly the last product of its kind, and it also won't be a direct competitor to anything EPYC except for dense processors (currently Turin-d). Prediction: it'll probably match Turin-d in performance at slightly lower power. It will not challenge anything Zen 6.
 
Reactions: Joe NYC and 511

OneEng2

Senior member
Sep 19, 2022
612
850
106
It was 20% in the Phoronix suite; in SPECint they are pretty even, not counting Turin D of course, not to mention Phoronix forgot to review the HW accelerators in GNR.
I am expecting Zen 6 to be 192C/384T and Zen 6c to be 256C/512T; for the servers we have a new socket as well.
Agree on all points. If you are expecting EPYC Zen6c to be 256c/512t, what are you expecting CWF to be?
Phoronix never rebenched those CPUs after the initial review did he?
Not that I am aware. In fact, I can't find any more recent reviews for GNR or Turin.
It is not going to happen; CWF is not going to beat Zen 6. But Zen 6 is closer to the DMR launch than it is to Clearwater Forest. Best guess is DMR/Zen 6 will launch in Q3 '26, like last year.
Yea, well they are saying H1 2026 for CWF .... which generally means end of May 2026 in my book. Best info I can find on EPYC Venice is "2026" which I am guessing means end of Q4. Also, based on the fact that the first wafers seem to be EPYC and not desktop, either desktop is later, or desktop is N3P. I have heard little to no info on Diamond Rapids.
Clearwater Forest is supposedly the last product of its kind, and it also won't be a direct competitor to anything EPYC except for dense processors (currently Turin-d). Prediction: it'll probably match Turin-d in performance at slightly lower power. It will not challenge anything Zen 6.
Possibly. How many cores do you figure CWF will have?

Also, is Diamond Rapids N3B/N3E/18A? Intel 3?
 
Reactions: 511

DrMrLordX

Lifer
Apr 27, 2000
22,596
12,484
136
Possibly. How many cores do you figure CWF will have?

I haven't been following it, though if Intel has the same plans as they did for Sierra Forest, they're going to want 144-288c parts. Whether they'll get there or not is another story.

Also, is Diamond Rapids N3B/N3E/18A? Intel 3?
It's supposed to be 18a. Whether or not Intel can release it on 18a in an acceptable state remains to be seen (it may need to wait for 18ap).
 

511

Platinum Member
Jul 12, 2024
2,399
2,111
106
Yea, well they are saying H1 2026 for CWF .... which generally means end of May 2026 in my book. Best info I can find on EPYC Venice is "2026" which I am guessing means end of Q4. Also, based on the fact that the first wafers seem to be EPYC and not desktop, either desktop is later, or desktop is N3P. I have heard little to no info on Diamond Rapids.
Venice is N2 - AMD made a PR with it for desktop.
Possibly. How many cores do you figure CWF will have?
288C/288T
Also, is Diamond Rapids N3B/N3E/18A? Intel 3?
DMR is 18AP
 
Last edited: