Question Zen 6 Speculation Thread

igor_kavinski · Jan 2, 2025

adroc_thurston said:
Intel's not gonna be asleep at the wheel forever. Unified Core is coming, after all.

2027/2028? Wake me up when that happens, please.

poke01 · Jan 2, 2025

adroc_thurston said:
Intel's not gonna be asleep at the wheel forever. Unified Core is coming, after all.

sure in 2028....

adroc_thurston said:
But at least ARM self-immolated Neoverse since any X5 derivative will have beyond prohibitive costs for server applications due to gigabloat.

stock ARM designs can't even get their phone/mobile cores to clock past 3.8GHz on N3E...
Only custom ARM cores are good

adroc_thurston · Jan 2, 2025

poke01 said:
sure in 2028....

Maybe 2029, even.
But point still stands.

poke01 said:
stock ARM designs can't even get their phone/mobile cores to clock past 3.8GHz on N3E...

The 1t isn't even far behind, but man the area just stinks.
Really idk how they're gonna ship any X5 derivative in server and expect the hyperscalers to just chew it up.

reaperrr3 · Jan 2, 2025

Hulk said:
Great post. I am going to be the contrarian here. Over 6GHz is the equivalent of the sub 2 hour marathon. It can be done but only with cheating. Either wind aided for the runners or with resulting catastrophic degradation for the silicon. As always, I'd love to see it. But I'm not thinking it will happen in this decade with ambient cooling.

You make it sound like 6 GHz is a hard process node wall and factors like pipeline length etc. don't make much of a difference.

Before Zen4 got released, some (including myself) thought that going much beyond 5 GHz on a TSMC process that wasn't as tailor-made to a design's needs as Intel 7 was for Alder/Raptor would be incredibly difficult. Yet Zen4 got to 5.7 GHz anyway. I remember I was actually surprised about Zen4 bringing such a clock bump.

If they aim for it, I'm fairly sure AMD will be able to get desktop Zen6 to ~6.2 GHz or higher turbo without too much trouble.
N3P improvements alone would probably already be sufficient for that.

Joe NYC · Jan 2, 2025

adroc_thurston said:
Maybe 2029, even.
But point still stands.

Intel will be a different company if they keep falling behind in competitiveness for the next 5 years.

basix · Jan 2, 2025

reaperrr3 said:
Core improvements:
- Int scheduler entries for ALU/AGU upgraded from 88/56 to 96/64 or whereabouts
- Int PRF upgraded from 240 to at least 288, perhaps even 336 entries (336 would mean 56 per ALU, like Zen4 had)
- ROB upgraded from 448 to at least 512 entries
- smaller other upgrades throughout the core, including in the FPU area
- return of some optimizations that accelerate some ops (or no-ops via NOPS fusion) substantially
- at least 300, but more likely 500-600 and maybe even 700-800 MHz turbo clock uplift (not all-core, but at least for some of them), thanks to N3P + smart usage of 2-2 and 3-2 fin transistors where it's worth it

Uncore improvements:
- although no increase in L3 per core, cache/bandwidth-sensitive heterogeneous workloads (aka not all threads equally heavy) will benefit from the 50% larger L3 per CCD
- less cross-CCD context switch penalties due to bigger CCDs + faster chiplet interconnect
- bandwidth improvements from faster chiplet connection + faster memory support

Wouldn't surprise me if INT-heavy workloads - and therefore many client and semi-professional workloads - would see a bigger effective IPC uplift on Zen6 than what Zen5 gave us.
And then the turbo clock bumps and more cores on top of that.

I highly agree with your speculation. I would increase Int PRF and ROB a little more to achieve better latency hiding and to address one of the seemingly weak parts of Zen 5 (e.g. ARM, Apple and Intel cores are much wider there), although the updated IOD etc. will probably alleviate some of Zen 5's limits already.

Zen 5's dual-decoder design and the very fat FPU sparked another idea:
- Static core partitioning = "Little Cores"?
- So Zen 6 exists with basically three operation modes: ST (single-thread), SMT (dynamic dual-thread), 2T (static dual-thread)
- You could still do a "dense" implementation from there, but I suspect that N3 FinFlex (and its N2 successor called NanoFlex) will close the gap between frequency and area optimized designs to some extent

If AMD could somehow pull that off it would be very interesting to see. For some (or many?) workloads it might be more effective regarding chip area and power to just statically split the core in two. I think many server applications and web services would be fine with such a 2T operation.

Much of it is there already - or at least it looks like that to me:
- Dual-Decoder
- Big FPU with double-pumping (see Zen 4 and Zen 5 mobile parts)
- Much widened core in general (compared to Zen 4)

Depending on how much wider e.g. ROB and Int PRF get, you could nearly fit two full Zen 4 cores into that. So one Zen 6 thread in 2T mode could be close to a full Zen 4 core in ST mode. That is not weak.
The biggest drawback of the static 2T approach would be that L1/L2 Caches might need to get split into two as well (because private). And I do not know how the branch predictor would handle that 2T partitioning (same as in SMT mode or statically partitioned as well?).
The static 2T mode should only apply to server parts, which could feature bigger cache sizes than the client design. Or maybe smaller caches do not matter, because not that relevant for those applications?

Anyways, static core partitioning looks interesting to me regarding maximizing area efficiency and core density. In my opinion not interesting for client (SMT operation makes more sense), but a halved FPU for client and a "6 + 6 dense" CCD would increase PPA already.
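
As a rough illustration of what a static 2T split would leave each thread with, here is a small arithmetic sketch. It uses only the structure sizes speculated earlier in this thread (none of them are confirmed Zen 6 figures), takes the upper 336-entry guess for the Int PRF, and assumes the simplest possible policy of halving every structure. The names `static_2t_share` and `prf_per_alu` are made up for this example.

```python
# Toy arithmetic only: sizes are the speculated figures quoted in this
# thread, not confirmed specifications.
ZEN5 = {"rob": 448, "int_prf": 240, "alu_sched": 88, "agu_sched": 56}
ZEN6_GUESS = {"rob": 512, "int_prf": 336, "alu_sched": 96, "agu_sched": 64}

# Assumption: 6 ALUs, as implied by "336 would mean 56 per ALU" (336 / 56 = 6).
ALU_COUNT = 6


def static_2t_share(core: dict) -> dict:
    """Split every structure evenly between two statically partitioned threads."""
    return {name: entries // 2 for name, entries in core.items()}


def prf_per_alu(core: dict, alus: int = ALU_COUNT) -> float:
    """Integer PRF entries available per ALU."""
    return core["int_prf"] / alus


if __name__ == "__main__":
    for label, core in (("Zen 5", ZEN5), ("Zen 6 (speculated)", ZEN6_GUESS)):
        print(label, "per thread in static 2T:", static_2t_share(core))
        print(label, "int PRF entries per ALU:", prf_per_alu(core))
```

A real implementation would not necessarily halve everything; some structures could stay shared or be split unevenly, so treat the per-thread numbers as a lower-effort baseline rather than a prediction.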

soresu · Jan 2, 2025

basix said:
Static core partitioning = "Little Cores"?

Oooooof, tell me you don't remember Bulldozer without telling me 🤣

Joe NYC · Jan 2, 2025

adroc_thurston said:
Yeah they should start with not regressing on freq gen on gen in mobile.

Since Zen 6 is likely N3P, that alone should help with frequency. And an increase in the potential transistor budget can help with IPC.

So, we might see something closer to Zen 3 -> Zen 4, with gains both in frequency and IPC.

Joe NYC · Jan 2, 2025

Hulk said:
If Zen 6 improvements were increase cores to 24, 5% ST IPC gain, and some efficiency gains due to a node shrink I'd be quite pleased.

If L3 on the 12-core die goes to 1.5x and V-Cache also goes to 1.5x of their prior 8-core counterparts, it will be a monster, making the 2-CCD, 24-core part even more of a niche.

MS_AT · Jan 2, 2025

Tuna-Fish said:
Register renaming allows the compiler to just keep a live set, and offload finding ILP to the CPU. Modern compilers are absolutely assuming renaming and hundreds of registers. Before renaming was common, compilers were designed to try to extract ILP in ways that increased register pressure, through aggressive unrolling and interleaving and the like.

But it does not let you keep more values in registers than there are architectural registers. The compiler will spill regardless of the size of the register file if your working set is too big. And this is where APX is useful. I am not sure if I am using the right words to convey the message.

And aggressive unrolling is still used to this day, if for different reasons. Actually, clang was for some time too aggressive with unrolling on Zen 3, which led to a perf regression when tuning for Zen 3.
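
To make the spilling point concrete, here is a deliberately simplified sketch. It assumes a made-up `spilled_values` helper and ignores everything a real register allocator does (live ranges, reserved registers such as the stack pointer, rematerialization). The only facts it relies on are that x86-64 has 16 architectural general-purpose registers and that APX extends this to 32.

```python
# Simplified model: real register allocation is far more involved, and not
# all architectural registers are freely allocatable in practice.
X86_64_GPRS = 16  # architectural general-purpose registers in x86-64
APX_GPRS = 32     # APX extends the general-purpose register set to 32


def spilled_values(live_values: int, arch_regs: int) -> int:
    """Values that must live in memory because no register name is free.

    The physical register file size does not appear here on purpose:
    renaming cannot give the compiler more register names to allocate.
    """
    return max(0, live_values - arch_regs)


if __name__ == "__main__":
    live = 24  # hypothetical number of simultaneously live values in a loop
    print("x86-64 spills:", spilled_values(live, X86_64_GPRS))
    print("APX spills:   ", spilled_values(live, APX_GPRS))
```

In this toy case the same loop spills eight values with 16 registers and none with 32, no matter how many physical registers the core has behind the renamer.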

Win2012R2 · Jan 2, 2025

MS_AT said:
And aggressive unrolling is still used to this day

Been optimising some stuff last month and using this also worked nicely on my Zen 4. It was a pretty tight loop too, where everything would be cached very nicely.

basix · Jan 3, 2025

soresu said:
Oooooof, tell me you don't remember Bulldozer without telling me 🤣

Bulldozer had many other flaws, like e.g. poor caches. Zen is solid from the ground up.

And the static partitioning is only meant for server parts and as one of three operation modes, where you want to provide many little (web) services to the customer. So nothing for HPC, high frequency/ST parts and client. General purpose servers could be a fit for static 2T mode, but that depends on the respective workload.
If you had read my post completely, you might have noticed what I am repeating here.

adroc_thurston · Jan 3, 2025

soresu said:
Oooooof, tell me you don't remember Bulldozer without telling me 🤣

You can statically partition Zen1 (maybe later too). Oracle had Naples with cores statically sawed in half per SMT thread.
It's a favela security thing, dawg.

basix · Jan 3, 2025

Correct, security should get an improvement from that as well (if implemented correctly). In the end it depends on what customers request.

I see static 2T mode also as a product to counter ARM parts and Intel's Sierra Forest etc. with as little R&D effort from AMD as possible. No new core design, no new physical design.

adroc_thurston · Jan 3, 2025

basix said:
Correct, security should get an improvement from that as well (if implemented correctly). In the end it depends on what customers request. I see static 2T mode also as a product to counter ARM parts and Intel's Sierra Forest etc.

They'd probably need to dupe some parts of ld/st setup for proper isolation but yeah.

StefanR5R · Jan 3, 2025

basix said:
No new core design, no new physical design.

Compared to the existing partially static, partially dynamic resource sharing SMT design, it would be...

basix said:
In the end it depends on what customers request.

...a product in which performance per socket and performance per Watt are sacrificed in favor of performance determinism. Hard to tell if the sacrifices would turn out small enough to be still competitive with the likes of Sierra Forest.

basix · Jan 3, 2025

StefanR5R said:
Compared to the existing partially static, partially dynamic resource sharing SMT design, it would be...

Zen 6 is a new core, so it is not a different design because it will be new anyways. Yes, some updates might be required if starting from Zen 5. But if I can do dynamic resource sharing, I should be able to do static "no sharing" with rather little effort. Dynamic sharing is more complex than just giving each thread half of the resources.

StefanR5R said:
...a product in which performance per socket and performance per Watt are sacrificed in favor of performance determinism. Hard to tell if the sacrifices would turn out small enough to be still competitive with the likes of Sierra Forest.

I do not know that either. But let's imagine a hypothetical 256C/512T Zen 6 CPU which can be turned into a 512C CPU. Each core has roughly Zen 4 performance, including AVX512. Its main competitor will be Clearwater Forest (with unknown performance and core counts).
I believe that such a 512C CPU would look very decent in its market environment. If customers are happy with 256C/512T parts so be it, then AMD will leave SMT as is.

MS_AT · Jan 3, 2025

basix said:
Zen 6 is a new core, so it is not a different design because it will be new anyways. Yes, some updates might be required if starting from Zen 5. But if I can do dynamic resource sharing, I should be able to do static "no sharing" with rather little effort. Dynamic sharing is more complex than just giving each thread half of the resources.

I do not know that either. But let's imagine a hypothetical 256C/512T Zen 6 CPU which can be turned into a 512C CPU. Each core has roughly Zen 4 performance, including AVX512. Its main competitor will be Clearwater Forest (with unknown performance and core counts).
I believe that such a 512C CPU would look very decent in its market environment. If customers are happy with 256C/512T parts so be it, then AMD will leave SMT as is.

I think the main questions for deciding whether this is a viable idea are how much static partitioning would cost in silicon as an optional feature, and whether the additional validation would be cheaper than simply designing and validating a purpose-built core.

After all, Intel and ARM are using purpose-built cores to address this market niche. And if we can learn anything from Apple, it is that structures built for a single purpose are better than jack-of-all-trades ones.

When it comes to the Zen6 wishlist, it is surprising that nobody wants them to alleviate actual bottlenecks in Zen5 besides the too-small int reg file.

I mean, what use are bigger OoO structures if the core is idling most of the time waiting for branch prediction results (something that sounds trivial in theory, like getting a return address from the return address stack, is twice as slow as on Intel) or for code fetches? They should come up with ways to make the fetch work smarter, like Apple is doing, to ensure that if they fetch something they are making the most out of it. Maybe easier said than done but I trust they have clever people working there so I hope they will be able to improve this going forward.

adroc_thurston · Jan 3, 2025

StefanR5R said:
...a product in which performance per socket and performance per Watt are sacrificed in favor of performance determinism

You're not sacrificing anything but a wee bit of area.

StefanR5R · Jan 3, 2025

I compared the current SMT, in which some resources are competitively shared, with the proposed, more simply managed SMT in which all resources¹ are shared fifty-fifty.

Competitive sharing should tend to yield higher utilization.

________
¹) Edit: core resources, that is. Other QoS policies may be employed in the uncore part of the SoC. But this goes for alternatives such as Sierra Forest too.

igor_kavinski · Jan 3, 2025

StefanR5R said:
Competitive sharing should tend to yield higher utilization.

How did you arrive at that conclusion? And when you say compare, did you test some workload on an Epyc CPU using some BIOS option to alter the partitioning of core resources?

igor_kavinski · Jan 3, 2025

MS_AT said:
Maybe easier said than done but I trust they have clever people working there so I hope they will be able to improve this going forward

I don't know. The clever people left and formed their own company?

Leading Intel engineers found RISC-V company

An Intel team was supposed to develop a completely new CPU architecture. The project now appears to have been scrapped.

www.heise.de

The lead is a woman.

"Hell hath no fury like a woman scorned"

Wouldn't it be ironic if actual competition to Zen 6 came from a RISC-V design...

StefanR5R · Jan 3, 2025

StefanR5R said:
Competitive sharing should tend to yield higher utilization.

igor_kavinski said:
How did you arrive at that conclusion? And when you say compare, did you test some workload on an Epyc CPU using some BIOS option to alter the partitioning of core resources?

I am speaking hypothetically. For now, only AMD have the means to test this (in simulators at least, if not in actual silicon)... unless there is already a custom product out there like the one mentioned in #1,313.

Unfortunately I can't find a chart right now which enumerates which resources are partitioned statically vs. dynamically in AMD's current SMT implementation. However, we do know that Zen 5's frontend is pretty much 50:50 shared between threads whereas the sharing of backend resources is more dynamic (but not entirely dynamic either).

If there are two random threads, one might create more register pressure than the other. One might be integer heavy, the other more floating point heavy. One might use IMUL units a lot if it could, the other might be heavy on the AGUs. Et cetera. It matters whether you try to give each thread as much as you can, or if you give it at most half of each type of resources.
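
A toy model of that difference, purely as a hypothetical sketch: it treats one resource (say, a pool of register file entries) as a simple counter, gives each thread a fixed demand, and ignores arbitration cost, fairness, starvation and the isolation benefits of a static split. The function names and the demand numbers are invented for illustration and do not describe AMD's actual SMT implementation.

```python
def static_utilization(demands, capacity):
    """Each thread may use at most an equal, fixed share of the resource."""
    share = capacity / len(demands)
    used = sum(min(d, share) for d in demands)
    return used / capacity


def competitive_utilization(demands, capacity):
    """Threads draw from one pool; entries one thread leaves idle are usable by the other."""
    used = min(sum(demands), capacity)
    return used / capacity


if __name__ == "__main__":
    capacity = 240          # hypothetical pool size
    unbalanced = (200, 60)  # one thread with high pressure, one with low
    balanced = (120, 120)   # both threads want exactly half

    for name, demands in (("unbalanced", unbalanced), ("balanced", balanced)):
        print(name,
              "static:", static_utilization(demands, capacity),
              "competitive:", competitive_utilization(demands, capacity))
```

With the unbalanced pair, the static split leaves a quarter of the pool idle while the hungry thread is capped at its half; with the balanced pair, both policies look identical. How often real thread pairs are unbalanced enough to matter is exactly the open question.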

StefanR5R · Jan 3, 2025

igor_kavinski said:
[...] clever people left [Intel] and formed their own company?

https://www.heise.de/en/news/Leading-Intel-engineers-found-RISC-V-company-9847956.html

The "rentable units" which are mentioned in this article are a different example of dynamic resource sharing, with the goal of optimum resource utilization. At first glance, it looks to me like some sort of hardware replacement of (or hardware assistance to) the operating system's thread scheduler. It's otherwise a bit off-topic here as it is an Intel patent.

igor_kavinski said:
Wouldn't it be ironic if actual competition to Zen 6 came from a RISC-V design...

Ignoring the low overlap of x86 and RISC-V ecosystems for a moment — while these engineers might want to follow up on ideas which they came up with while working for Intel but Intel put on the back burner, they can do so only to the extent to which they can come up with ideas to work around the now Intel-owned patents (which they themselves might have helped to conceive...).

Doug S · Jan 3, 2025

igor_kavinski said:
I don't know. The clever people left and formed their own company?

Leading Intel engineers found RISC-V company

An Intel team was supposed to develop a completely new CPU architecture. The project now appears to have been scrapped.

www.heise.de

The lead is a woman.

"Hell hath no fury like a woman scorned"

Wouldn't it be ironic if actual competition to Zen 6 came from a RISC-V design...

If the concepts in Royal Core have any merit, it is probably best to try to prove them out with RISC-V which allows them the freedom to alter the ISA if necessary to make those concepts work - and avoids a potential licensing mess like the recent Qualcomm/ARM spat.
