Question Zen 6 Speculation Thread

Page 109 - AnandTech forums

yuri69

Senior member
Jul 16, 2013
657
1,172
136
Which also says that performance of Zen 4 without V-Cache was not compelling enough to pay the premium.

It also forced AMD to offer deep discounts.
One has to be reasonable. The performance improvement would have had to be insane (think Zen 5 with 40% IPC) to be compelling at that premium.

As already mentioned, the competition was there - Intel's Raptor Lake was fresh from the oven and pretty good vs Zen 4.
 

Win2012R2

Senior member
Dec 5, 2024
936
884
96
X3D production is inherently serial between the bit that makes the chip and the bit that puts the cache on the chip.
GPU production is inherently serial because you need to get the chip first, then it gets shipped to be placed onto boards, memory gets added, etc. - all extra steps required, per GPU.

Yes, there is a lot more "capacity" to do this better-understood, simpler stuff, and it certainly costs the $15-20 that is allegedly the ballpark for X3D integration. Low capacity? Order more! It can't be expensive if it only costs that little per chip.

You don't need to make 80s or 90s for the other one to be available

Yes you do, if you want a successful commercial launch, which is what ALL of THIS is ABOUT - not the nerdy take: "this chip is ready to use before the extra X3D step, therefore it's a delay, can't have that!".
 

StefanR5R

Elite Member
Dec 10, 2016
6,514
10,149
136
Shared L2 (Apple and Qualcomm have shown you can do this with a manageable latency penalty: ~5ns vs ~2.5ns for 16x the size, original source)
Hmm, isn't 5 twice that of 2.5? The graphs which you linked are informative. But even more informative would be simulations of a selection of different workloads, with different cache hit rates and different cache latencies.
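A back-of-the-envelope version of such a simulation is the classic average memory access time (AMAT) model. The ~2.5 ns and ~5 ns latencies come from the quote above; the hit rates and the miss penalty below are purely illustrative assumptions, not measured figures:

```python
# Rough AMAT comparison: small private L2 (~2.5 ns) vs. a 16x larger
# shared L2 (~5 ns). Hit rates and miss penalty are made-up assumptions.

def amat(l2_latency_ns, l2_hit_rate, miss_penalty_ns):
    """Average memory access time for accesses that reach L2."""
    return l2_latency_ns + (1 - l2_hit_rate) * miss_penalty_ns

MISS_PENALTY_NS = 80.0  # assumed cost of going to L3/DRAM on an L2 miss

# Workloads differ in how much the 16x capacity improves the hit rate:
# (hit rate in the small L2, hit rate in the big L2)
workloads = {
    "small footprint":  (0.90, 0.92),
    "medium footprint": (0.60, 0.85),
    "large footprint":  (0.30, 0.75),
}

for name, (hr_small, hr_big) in workloads.items():
    t_small = amat(2.5, hr_small, MISS_PENALTY_NS)
    t_big = amat(5.0, hr_big, MISS_PENALTY_NS)
    print(f"{name}: private {t_small:.1f} ns vs shared {t_big:.1f} ns")
```

With these assumed numbers, the small-footprint workload loses from the doubled L2 latency, while the large-footprint one gains far more from the higher hit rate - which is exactly why the answer is workload-dependent.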

[Shared L2…]
Bad for server CPUs, such as Zen ones.
Shared L2 caches in server processors are so bad that IBM went ahead and designed a whole mainframe processor generation (Telum) that shared all the L2 caches among the cores in one dynamic virtual L3 pool.
This paints too simplistic a picture of what they actually did. First, they still have L2 and L3 (and even L4), just that L3 and L4 are now "virtual" by way of sharing SRAM with L2 dynamically. Still, L3 (let alone L4) latency is higher than L2 latency — typically much higher. Second, and this is what your argument neglects, there is a bunch of QoS not-so-secret sauce in place to keep the extent of sharing in check, IOW to maintain a prioritized reserve of low-latency L2 cache for each active core. If a core is not completely idle, it does have L2 cache which other cores can't displace. So in this regard, these mainframe CPUs don't really differ from what @Win2012R2 claimed about server CPUs.

Meanwhile, Xeon's shared L3 cache has become so unwieldy, i.e. cache latency has become so nonuniform across the entirety of the L3 cache, that they now offer the option of dividing the SoC into NUMA domains (clustering modes), such that groups of cores preferably access the nearest subsets of the overall L3 cache. Oh, and this option is on by default. Turns out, making shared caches bigger gets tricky if you also want to keep them fast.
 

moinmoin

Diamond Member
Jun 1, 2017
5,217
8,399
136
GPU production is inherently serial because you need to get chip first, then get shipped to be placed into boards, add memory etc - all extra steps required, per GPU.
Are you trying to make sure everybody ends up thinking you are indeed making bad-faith arguments?

Because the difference is very clear: GPU production is bog-standard stuff any ODM can and will do. V-Cache packaging, on the other hand, is seriously bottlenecked by being a high-end proprietary packaging technique only a few companies can do at select packaging factories.
 
Reactions: Joe NYC and inquiss

LightningZ71

Platinum Member
Mar 10, 2017
2,237
2,741
136
Hmm, isn't 5 twice that of 2.5? The graphs which you linked are informative. But even more informative would be simulations of a selection of different workloads, with different cache hit rates and different cache latencies.

[Shared L2…]


This paints too simplistic a picture of what they actually did. First, they still have L2 and L3 (and even L4), just that L3 and L4 are now "virtual" by way of sharing SRAM with L2 dynamically. Still, L3 (let alone L4) latency is higher than L2 latency — typically much higher. Second, and this is what your argument neglects, there is a bunch of QoS not-so-secret sauce in place to keep the extent of sharing in check, IOW to maintain a prioritized reserve of low-latency L2 cache for each active core. If a core is not completely idle, it does have L2 cache which other cores can't displace. So in this regard, these mainframe CPUs don't really differ from what @Win2012R2 claimed about server CPUs.

Meanwhile, Xeon's shared L3 cache has become so unwieldy, i.e. cache latency has become so nonuniform across the entirety of the L3 cache, that they now offer the option of dividing the SoC into NUMA domains (clustering modes), such that groups of cores preferably access the nearest subsets of the overall L3 cache. Oh, and this option is on by default. Turns out, making shared caches bigger gets tricky if you also want to keep them fast.
I never said that it was perfect, but they have a VERY effective strategy for sharing it that hides a lot of the problems and also gives some big advantages for things that have high locality in a smaller footprint.
 

Win2012R2

Senior member
Dec 5, 2024
936
884
96
The GPU production is bog standard stuff any ODM can and will do. V-cache package on the other hand is seriously bottlenecked by being a high end proprietary packaging technique only few companies can do at select packaging factories.
If it costs 15-20 bucks per X3D packaging (as claimed here, and retail pricing suggests it can't be far off), then it is comparable (cost-wise) with what is done for GPUs.

"seriously bottlenecked", "high end", "proprietary" - big words for when actual costs are that low.

Scale up those factories then - place long-term big orders, there - solved it for you. It can't be expensive if they only charge 15-20 per CPU, which adds at least 100 to the sale price.
 

StefanR5R

Elite Member
Dec 10, 2016
6,514
10,149
136
[the future of private versus shared caches, with the example of Telum]
I never said that it was perfect, but they have a VERY effective strategy for sharing it that hides a lot of the problems and also gives some big advantages for things that have high locality in a smaller footprint.
True. Sharing of SRAM is so important for them that they do it even across packages, across sockets, and across drawers. Evidently they can stomach the huge amount of metadata broadcasting and data movement across their entire system which this caching regime entails. It is a four-level cache hierarchy, applied in a quite special field of computing.

Anyway; I am rephrasing the arguments so far:
@Gideon in #2,669 — SRAM has got an area cost. Wring the best functionality out of a given SRAM size. Turn the level 2 cache from private to shared.
@Win2012R2 in #2,677/#2,682 — Private caches are a feature, not a bug, for the sake of a semblance of determinism in an environment with potentially noisy neighbors.
@LightningZ71 in #2,681 — Witness IBM Telum's sharing.
@StefanR5R in #2,703 — There is still private L2 cache left in Telum.

But yes, Telum is an example of making more out of a given SRAM size: They did not drop a cache level from their hierarchy, but they brought dynamics into which cache line is serving at which cache level.

So… what's the future? Will x86 CPU makers drop a cache level in client CPUs? Will ARM vendors add a cache level in client CPUs?
  • On the one hand, SRAM area scaling has been lagging behind core logic area scaling.
  • On the other hand, core logic clock speed keeps rising (not as fast as some might wish, but still) whereas DRAM latency is basically stuck where it was already many years ago, IOW, the latency gap between cores and main memory keeps widening.
  • In addition, as SRAM caches grow larger in size and topologically, and even begin to get spread over different chiplets, cache latency becomes increasingly nonuniform, from fast near cache to slow remote cache, differently for individual cores.
Or will x86 or ARM CPU makers take a page out of IBM's book, or in Dr. Ian Cutress' words, did IBM preview the future of CPU caches? Well, there are the costs which I mentioned, but maybe they eventually become affordable for server, or even client, if limited to on-chip or on-package dynamic sharing (costs in terms of transistors, routing, and operating power). Plus the monetary costs of R&D, whether licensing the technology or working around IBM's patented IP.
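The widening core-to-DRAM gap in the second bullet above is easy to put into numbers: DRAM latency that stays flat in nanoseconds costs more and more core cycles as clocks rise. The ~70 ns figure below is an illustrative assumption, not a measurement of any particular platform:

```python
# A DRAM access whose latency is constant in wall-clock time stalls the
# core for more cycles the faster the core is clocked. ~70 ns is an
# assumed, illustrative load-to-use DRAM latency.

DRAM_LATENCY_NS = 70.0

for clock_ghz in (2.0, 3.0, 4.0, 5.0, 6.0):
    cycles = DRAM_LATENCY_NS * clock_ghz  # ns * GHz = core cycles
    print(f"{clock_ghz:.1f} GHz core: a DRAM access stalls ~{cycles:.0f} cycles")
```

At 2 GHz that assumed miss costs ~140 cycles; at 6 GHz, ~420 - which is why each generation has more cycles to hide, and more incentive for bigger (or cleverer) caches.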

Edit,
this all strayed off the topic of the thread. As for Zen 6: Strix Halo is said to give a preview of Zen 6 client. Strix Halo's cache setup is unchanged versus dual-CCD Ryzens, except that there is a MALL cache now which could be used for the CPU cores but is, by default, used by the GPU only. As for Zen 6 server, I wonder if and what it might inherit from MI300.
 
Last edited:
Reactions: Joe NYC

OneEng2

Senior member
Sep 19, 2022
612
850
106
Which also says that performance of Zen 4 without V-Cache was not compelling enough to pay the premium.

It also forced AMD to offer deep discounts.
Yes it did. I wonder if AMD will make a bridge platform to DDR6 in the future as a result of that experience?
Raptor Lake was better at that time as well and people on LGA1700 had a good upgrade path
It was better at some things, but at the same time, AMD took a devastating lead in DC. Hard to fault the strategy IMO.
 
Reactions: Joe NYC

511

Platinum Member
Jul 12, 2024
2,399
2,111
106
It was better at some things, but at the same time, AMD took a devastating lead in DC. Hard to fault the strategy IMO.
No doubt with Zen 4, but now the lead is way smaller with GNR in DC - though client took a hit.
 

Gideon

Platinum Member
Nov 27, 2007
2,012
4,986
136
Or will x86 or ARM CPU makers take a page out of IBM's book, or in Dr. Ian Cutress' words, did IBM preview the future of CPU caches? Well, there are the costs which I mentioned, but maybe they eventually become affordable for server or client even, if limited to on-chip or on-package dynamic sharing (costs in terms of transistors, routing, and operating power). Plus monetary costs of R&D, both if licensing the technology or if working around IBM's patented IP.

Dr Ian Cutress said:
In the Q&A following the session, Dr. Christian Jacobi (Chief Architect of Z) said that the system is designed to keep track of data on a cache miss, uses broadcasts, and memory state bits are tracked for broadcasts to external chips. These go across the whole system, and when data arrives it makes sure it can be used and confirms that all other copies are invalidated before working on the data. In the slack channel as part of the event, he also stated that lots of cycle counting goes on!

Overall complexity is also part of the equation. I'm quite sure the boost and sleep behavior of Zen cores is vastly more complicated than on Telum, thus cycle counting is also a more complex endeavor. Actually, do IBM's chips even change frequency at all beyond some "no load at all" power states? I'm not sure they need to, considering the systems they're running in (business mainframes doing transactions 24/7, etc.).

Anyway, in the software world, added complexity eventually really slows you down, even if the code itself is sound. I'm talking about the kind of complexity where some simple changes in the hardest modules take you a week to implement when it would be half a day elsewhere. I'm quite sure it's the same in hardware design. Thus, the idea might be great, but would take too long to design or could slow down future iterations too much.
 

OneEng2

Senior member
Sep 19, 2022
612
850
106
No doubt with Zen 4 but now the lead is way less with GNR in DC but the client took a hit.
GNR is still significantly behind Turin. It just isn't hilariously behind any more. It is not accurate to say that a 20-40% lead is not "significant". It is just that Intel was behind by >100% for quite some time so suddenly "20-40%" seems like they have "caught up".

Additionally, I heard the argument that multi-socket in GNR had a bug and therefore those benchmarks were not valid. We are closing in on a year since the GNR release and those benchmarks still appear valid. Just like Intel said ARL just needed an update and performance would be greatly improved.

No, I still believe that while GNR was a big step up for Intel, they are still way behind the curve in DC. Intel desperately needs CWF to best not just Turin, but AMD's Zen 6-based EPYC as well, or it will continue to devastate Intel's bottom line.
Does this have any relevance to the thread topic (Zen 6 speculation)?
Certainly. We are all speculating on what features and capabilities will be in Zen 6 and how well these features and capabilities will meet market demand compared to Intel's features and capabilities.

The history of both companies' processor lineups seems particularly relevant in this context, don't you think?
 
Reactions: Joe NYC

511

Platinum Member
Jul 12, 2024
2,399
2,111
106
GNR is still significantly behind Turin. It just isn't hilariously behind any more. It is not accurate to say that a 20-40% lead is not "significant". It is just that Intel was behind by >100% for quite some time so suddenly "20-40%" seems like they have "caught up".
It was 20% in the Phoronix suite; in SPECint they are pretty even, not counting Turin D of course, not to mention Phoronix forgot to review the HW accelerators in GNR.
I am expecting Zen 6 to be 192C/384T and Zen 6c to be 256C/512T; for the servers we have a new socket as well.
Additionally, I heard the argument that multi-socket in GNR had a bug and therefore those benchmarks were not valid. We are closing in on a year since the GNR release and those benchmarks still appear valid. Just like Intel said ARL just needed an update and performance would be greatly improved.
Phoronix never re-benched those CPUs after the initial review, did he?
No, I still believe that while GNR was a big step up for Intel, they are still way behind the curve in DC. Intel desperately needs CWF to best not just Turin, but AMD's Zen 6-based EPYC as well, or it will continue to devastate Intel's bottom line.
It is not going to happen; CWF is not going to beat Zen 6. But Zen 6 is closer to the DMR launch than it is to Clearwater Forest. Best guess is DMR/Zen 6 will launch in Q3 '26, like last year.
 

DrMrLordX

Lifer
Apr 27, 2000
22,596
12,484
136
Intel desperately needs CWF to best not just Turin, but AMD's Zen 6-based EPYC as well, or it will continue to devastate Intel's bottom line.
Clearwater Forest is supposedly the last product of its kind, and it also won't be a direct competitor to anything EPYC except for dense processors (currently Turin-d). Prediction: it'll probably match Turin-d in performance at slightly lower power. It will not challenge anything Zen 6.
 
Reactions: Joe NYC and 511

OneEng2

Senior member
Sep 19, 2022
612
850
106
It was 20% in the Phoronix suite; in SPECint they are pretty even, not counting Turin D of course, not to mention Phoronix forgot to review the HW accelerators in GNR.
I am expecting Zen 6 to be 192C/384T and Zen 6c to be 256C/512T; for the servers we have a new socket as well.
Agree on all points. If you are expecting EPYC Zen6c to be 256c/512t, what are you expecting CWF to be?
Phoronix never rebenched those CPUs after the initial review did he?
Not that I am aware. In fact, I can't find any more recent reviews for GNR or Turin.
It is not going to happen; CWF is not going to beat Zen 6. But Zen 6 is closer to the DMR launch than it is to Clearwater Forest. Best guess is DMR/Zen 6 will launch in Q3 '26, like last year.
Yea, well they are saying H1 2026 for CWF .... which generally means end of May 2026 in my book. Best info I can find on EPYC Venice is "2026" which I am guessing means end of Q4. Also, based on the fact that the first wafers seem to be EPYC and not desktop, either desktop is later, or desktop is N3P. I have heard little to no info on Diamond Rapids.
Clearwater Forest is supposedly the last product of its kind, and it also won't be a direct competitor to anything EPYC except for dense processors (currently Turin-d). Prediction: it'll probably match Turin-d in performance at slightly lower power. It will not challenge anything Zen 6.
Possibly. How many cores do you figure CWF will have?

Also, is Diamond Rapids N3B/N3E/18A? Intel 3?
 
Reactions: 511

DrMrLordX

Lifer
Apr 27, 2000
22,596
12,484
136
Possibly. How many cores do you figure CWF will have?

I haven't been following it, though if Intel has the same plans as they did for Sierra Forest, they're going to want 144-288c parts. Whether they'll get there or not is another story.

Also, is Diamond Rapids N3B/N3E/18A? Intel 3?
It's supposed to be 18a. Whether or not Intel can release it on 18a in an acceptable state remains to be seen (it may need to wait for 18ap).
 

511

Platinum Member
Jul 12, 2024
2,399
2,111
106
Yea, well they are saying H1 2026 for CWF .... which generally means end of May 2026 in my book. Best info I can find on EPYC Venice is "2026" which I am guessing means end of Q4. Also, based on the fact that the first wafers seem to be EPYC and not desktop, either desktop is later, or desktop is N3P. I have heard little to no info on Diamond Rapids.
Venice is N2 - AMD made a PR with it for desktop.
Possibly. How many cores do you figure CWF will have?
288C/288T
Also, is Diamond Rapids N3B/N3E/18A? Intel 3?
DMR is 18AP
 
Last edited: