Question Zen 6 Speculation Thread

Page 53 - AnandTech community

poke01

Diamond Member
Mar 8, 2022
Intel's not gonna be asleep at the wheel forever. Unified Core is coming, after all.
Sure, in 2028...
But at least ARM self-immolated Neoverse since any X5 derivative will have beyond prohibitive costs for server applications due to gigabloat.
Stock ARM designs can't even get their phone/mobile cores to clock past 3.8 GHz on N3E...
Only custom ARM cores are good.
 

reaperrr3

Member
May 31, 2024
Great post. I am going to be the contrarian here. Over 6 GHz is the equivalent of the sub-2-hour marathon. It can be done, but only with cheating: either wind aid for the runners or resulting catastrophic degradation for the silicon. As always, I'd love to see it, but I don't think it will happen this decade with ambient cooling.
You make it sound like 6 GHz is a hard process-node wall and factors like pipeline length don't make much of a difference.

Before Zen 4 was released, some (including myself) thought that going much beyond 5 GHz on a TSMC process that wasn't as tailor-made to a design's needs as Intel 7 was for Alder/Raptor Lake would be incredibly difficult. Yet Zen 4 got to 5.7 GHz anyway. I remember being surprised that Zen 4 brought such a clock bump.

If they aim for it, I'm fairly sure AMD will be able to get desktop Zen 6 to a ~6.2 GHz or higher turbo without too much trouble.
N3P improvements alone would probably be sufficient for that.
 

basix

Member
Oct 4, 2024
Core improvements:
- Int scheduler entries for ALU/AGU upgraded from 88/56 to 96/64 or thereabouts
- Int PRF upgraded from 240 to at least 288, perhaps even 336 entries (336 would mean 56 per ALU, like Zen 4 had)
- ROB upgraded from 448 to at least 512 entries
- smaller upgrades elsewhere in the core, including in the FPU area
- return of some optimizations that substantially accelerate certain ops (or no-ops, via NOP fusion)
- at least 300, but more likely 500-600 and maybe even 700-800 MHz turbo clock uplift (not all-core, but at least for some cores), thanks to N3P + smart use of 2-2 and 3-2 fin transistors where it's worth it
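A quick back-of-the-envelope check of the numbers in that list, as a sketch only. It assumes Zen 6 keeps Zen 5's 6 integer ALUs (which is what the "336 = 56 per ALU" figure implies) and treats 5.7 GHz, the Zen 4 top turbo mentioned earlier in the thread, as the current desktop baseline. The Zen 6 values are the speculated ones above, not confirmed specs.

```python
# Back-of-the-envelope check of the speculated Zen 6 integer-core numbers.
# Assumptions (not official specs): 6 integer ALUs, as on Zen 5, and a
# 5.7 GHz top desktop turbo as the baseline for the clock uplift.
ZEN5_ALUS = 6
ZEN5_INT_PRF = 240
BASE_TURBO_GHZ = 5.7


def prf_per_alu(entries: int, alus: int = ZEN5_ALUS) -> float:
    """Integer physical registers available per ALU."""
    return entries / alus


# Current and speculated integer PRF sizes.
for entries in (ZEN5_INT_PRF, 288, 336):
    print(f"{entries} int PRF entries -> {prf_per_alu(entries):.0f} per ALU")

# Speculated turbo uplift range (MHz) applied to the baseline.
for uplift_mhz in (300, 500, 600, 700, 800):
    print(f"+{uplift_mhz} MHz -> {BASE_TURBO_GHZ + uplift_mhz / 1000:.1f} GHz")
```

At 6 ALUs, Zen 5's 240 entries work out to 40 per ALU, which is why 336 is framed as getting back to Zen 4's ratio; the clock range spans roughly 6.0 to 6.5 GHz.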

Uncore improvements:
- although there is no increase in L3 per core, cache/bandwidth-sensitive heterogeneous workloads (i.e. not all threads equally heavy) will benefit from the 50% larger L3 per CCD
- fewer cross-CCD context-switch penalties due to bigger CCDs + faster chiplet interconnect
- bandwidth improvements from the faster chiplet connection + faster memory support

Wouldn't surprise me if INT-heavy workloads - and therefore many client and semi-professional workloads - saw a bigger effective IPC uplift on Zen 6 than what Zen 5 gave us.
And then the turbo clock bumps and more cores on top of that.

I highly agree with your speculation. I would increase Int PRF and ROB a little more to achieve better latency hiding; that is one of the seemingly weak parts of Zen 5 (ARM, Apple and Intel cores are much wider there), although the updated IOD etc. will probably already alleviate some of Zen 5's limits.

Zen 5's dual-decoder design and the very fat FPU sparked another idea:
- Static core partitioning = "Little Cores"?
- So Zen 6 exists with basically three operation modes: ST (single-thread), SMT (dynamic dual-thread), 2T (static dual-thread)
- You could still do a "dense" implementation from there, but I suspect that N3 FinFlex (and its N2 successor called NanoFlex) will close the gap between frequency and area optimized designs to some extent

If AMD could somehow pull that off it would be very interesting to see. For some (or many?) workloads it might be more effective regarding chip area and power to just statically split the core in two. I think many server applications and web services would be fine with such a 2T operation.

Much of it is there already - or at least it looks like that to me:
- Dual-Decoder
- Big FPU with double-pumping (see Zen 4 and Zen 5 mobile parts)
- Much widened core in general (compared to Zen 4)

Depending on how much wider e.g. ROB and Int PRF get, you could nearly fit two full Zen 4 cores into that. So one Zen 6 thread in 2T mode could be close to a full Zen 4 core in ST mode. That is not weak.
The biggest drawback of the static 2T approach would be that the L1/L2 caches might need to be split in two as well (because they are private). And I do not know how the branch predictor would handle the 2T partitioning (same as in SMT mode, or statically partitioned as well?).
The static 2T mode should only apply to server parts, which could feature bigger caches than the client design. Or maybe smaller caches do not matter, because they are not that relevant for those applications?

Anyway, static core partitioning looks interesting to me for maximizing area efficiency and core density. In my opinion it is not interesting for client (SMT operation makes more sense there), but a halved FPU for client and a "6 + 6 dense" CCD would already improve PPA.
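To make the "give each thread half of every structure" idea concrete, here is a small sketch that halves per-core structure sizes for static 2T mode. The Zen 6 sizes are the speculated lower bounds from earlier in the thread (512-entry ROB, 288-entry int PRF) and are assumptions; the Zen 4 reference values (320-entry ROB, 224-entry int PRF, consistent with the "56 per ALU" figure) are the commonly cited ones. `static_2t_share` is a hypothetical helper, not anything AMD has described.

```python
# Sketch of static 2T partitioning: each thread gets a fixed half of
# every core structure. Zen 6 sizes are speculated lower bounds from the
# thread; Zen 4 sizes are the commonly cited reference figures.
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreResources:
    rob: int      # reorder buffer entries
    int_prf: int  # integer physical register file entries


def static_2t_share(core: CoreResources) -> CoreResources:
    """Hypothetical helper: split every structure fifty-fifty between two threads."""
    return CoreResources(rob=core.rob // 2, int_prf=core.int_prf // 2)


ZEN6_SPECULATED = CoreResources(rob=512, int_prf=288)
ZEN4_CORE = CoreResources(rob=320, int_prf=224)

per_thread = static_2t_share(ZEN6_SPECULATED)
print(per_thread)  # CoreResources(rob=256, int_prf=144)
print(f"ROB vs Zen 4:     {per_thread.rob / ZEN4_CORE.rob:.0%}")
print(f"Int PRF vs Zen 4: {per_thread.int_prf / ZEN4_CORE.int_prf:.0%}")
```

With the lower-bound numbers, each 2T thread gets about 80% of a Zen 4 core's ROB and about 64% of its integer PRF, which shows how much the "close to a full Zen 4 core" case depends on how much wider those structures actually get.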
 

Joe NYC

Diamond Member
Jun 26, 2021
If Zen 6's improvements were an increase to 24 cores, a 5% ST IPC gain, and some efficiency gains from a node shrink, I'd be quite pleased.

If L3 on the 12-core die goes to 1.5x and V-Cache also goes to 1.5x of its prior 8-core counterpart, it will be a monster, making the 2-CCD, 24-core part even more of a niche.
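The cache arithmetic behind that, as a sketch: it assumes today's 8-core CCD with 32 MB of L3 (4 MB per core) and a 64 MB stacked V-Cache die, and that both scale by 1.5x on a 12-core CCD as speculated. None of the Zen 6 numbers are confirmed.

```python
# Speculated per-CCD L3 capacity if a 12-core CCD scales everything by 1.5x.
# Assumptions: current CCD = 8 cores, 32 MB L3, 64 MB stacked V-Cache.
CORES_OLD, L3_OLD_MB, VCACHE_OLD_MB = 8, 32, 64
CORES_NEW = 12
SCALE = CORES_NEW / CORES_OLD  # 1.5x

l3_new = L3_OLD_MB * SCALE
vcache_new = VCACHE_OLD_MB * SCALE

print(f"L3 per core:    {L3_OLD_MB / CORES_OLD} MB -> {l3_new / CORES_NEW} MB")
print(f"L3 per CCD:     {L3_OLD_MB} MB -> {l3_new:.0f} MB")
print(f"L3 per X3D CCD: {L3_OLD_MB + VCACHE_OLD_MB} MB -> {l3_new + vcache_new:.0f} MB")
print(f"Dual-CCD core count: {2 * CORES_NEW}")
```

Under these assumptions L3 per core stays at 4 MB, while a single X3D CCD would go from 96 MB to 144 MB of L3.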
 

MS_AT

Senior member
Jul 15, 2024
Register renaming allows the compiler to just keep a live set, and offload finding ILP to the CPU. Modern compilers are absolutely assuming renaming and hundreds of registers. Before renaming was common, compilers were designed to try to extract ILP in ways that increased register pressure, through aggressive unrolling and interleaving and the like.
But it does not let you keep more values in registers than there are architectural registers. The compiler will spill regardless of the size of the register file if your working set is too big. And this is where APX is useful. I am not sure if I am using the right words to convey the message.

And aggressive unrolling is still used to this day, if for different reasons; clang was actually too aggressive with unrolling on Zen 3 for some time, which led to perf regressions when tuning for Zen 3.
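A minimal illustration of the kind of unrolling being discussed, written in Python only for readability (the interpreter gets none of the ILP benefit; in C or C++ this is roughly the transformation an optimizing compiler may apply). The unrolled version keeps four independent partial sums, so consecutive additions no longer depend on each other. Each accumulator is a live value that needs its own architectural register in compiled code, which is where register pressure and spills come in.

```python
# Loop unrolling with independent accumulators (illustration only).
# Integers are used so reordering the additions cannot change the result;
# with floating point, reassociation can alter rounding.


def sum_rolled(xs: list[int]) -> int:
    total = 0
    for x in xs:  # every add depends on the previous one
        total += x
    return total


def sum_unrolled4(xs: list[int]) -> int:
    # Four independent dependency chains; each accumulator would occupy
    # its own architectural register in compiled code.
    a0 = a1 = a2 = a3 = 0
    n = len(xs) - len(xs) % 4
    for i in range(0, n, 4):
        a0 += xs[i]
        a1 += xs[i + 1]
        a2 += xs[i + 2]
        a3 += xs[i + 3]
    # Remainder loop for lengths not divisible by 4.
    for x in xs[n:]:
        a0 += x
    return a0 + a1 + a2 + a3


data = list(range(1, 11))
print(sum_rolled(data), sum_unrolled4(data))  # 55 55
```

Unrolling further means more live accumulators; once they exceed the architectural registers (16 GPRs on x86-64 without APX), the compiler has to spill no matter how large the physical register file is.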
 

Win2012R2

Senior member
Dec 5, 2024
And aggressive unrolling is still used to this day
I was optimising some stuff last month and using this also worked nicely on my Zen 4; it was a pretty tight loop too, where everything would be cached very nicely.
 

basix

Member
Oct 4, 2024
Oooooof, tell me you don't remember Bulldozer without telling me 🤣
Bulldozer had many other flaws, like e.g. poor caches. Zen is solid from the ground up.

And the static partitioning is only meant for server parts, as one of three operation modes, where you want to provide many little (web) services to the customer. So it is nothing for HPC, high-frequency/ST parts or client. General-purpose servers could be a fit for static 2T mode, but that depends on the respective workload.
If you had read my post completely, you might have noticed what I am repeating here.
 

basix

Member
Oct 4, 2024
Correct, security should get an improvement from that as well (if implemented correctly). In the end it depends on what customers request.

I see static 2T mode also as a product to counter ARM parts and Intel's Sierra Forest etc. with as little R&D effort from AMD as possible. No new core design, no new physical design.
 

adroc_thurston

Diamond Member
Jul 2, 2023
Correct, security should get an improvement from that as well (if implemented correctly). In the end it depends on what customers request. I see static 2T mode also as a product to counter ARM parts and Intel's Sierra Forest etc.
They'd probably need to dupe some parts of ld/st setup for proper isolation but yeah.
 

StefanR5R

Elite Member
Dec 10, 2016
No new core design, no new physical design.
Compared to the existing partially static, partially dynamic resource sharing SMT design, it would be...
In the end it depends on what customers request.
...a product in which performance per socket and performance per Watt are sacrificed in favor of performance determinism. Hard to tell if the sacrifices would turn out small enough to still be competitive with the likes of Sierra Forest.
 

basix

Member
Oct 4, 2024
Compared to the existing partially static, partially dynamic resource sharing SMT design, it would be...
Zen 6 is a new core, so it is not a different design, because it will be new anyway. Yes, some updates might be required if starting from Zen 5. But if I can do dynamic resource sharing, I should be able to do static "no sharing" with rather little effort. Dynamic sharing is more complex than just giving each thread half of the resources.

...a product in which performance per socket and performance per Watt are sacrificed in favor of performance determinism. Hard to tell if the sacrifices would turn out small enough to still be competitive with the likes of Sierra Forest.
I do not know that either. But let's imagine a hypothetical 256C/512T Zen 6 CPU which can be turned into a 512C CPU. Each core has roughly Zen 4 performance, including AVX-512. Its main competitor will be Clearwater Forest (with unknown performance and core counts).
I believe that such a 512C CPU would look very decent in its market environment. If customers are happy with 256C/512T parts, so be it; then AMD will leave SMT as is.
 

MS_AT

Senior member
Jul 15, 2024
Zen 6 is a new core, so it is not a different design, because it will be new anyway. Yes, some updates might be required if starting from Zen 5. But if I can do dynamic resource sharing, I should be able to do static "no sharing" with rather little effort. Dynamic sharing is more complex than just giving each thread half of the resources.


I do not know that either. But let's imagine a hypothetical 256C/512T Zen 6 CPU which can be turned into a 512C CPU. Each core has roughly Zen 4 performance, including AVX-512. Its main competitor will be Clearwater Forest (with unknown performance and core counts).
I believe that such a 512C CPU would look very decent in its market environment. If customers are happy with 256C/512T parts, so be it; then AMD will leave SMT as is.
I think the main points in deciding whether this is a viable idea are how much an optional static-partitioning feature would cost in silicon, and whether the additional validation would be cheaper than simply designing and validating a purpose-built core.

After all, Intel and ARM are using purpose-built cores to address this market niche. And if we can learn anything from Apple, it is that single-purpose structures are better than jack-of-all-trades ones.

When it comes to the Zen 6 wishlist, it is surprising that nobody wants them to alleviate actual bottlenecks in Zen 5 besides the too-small int register file.

I mean, what use are bigger OoO structures if the core is idling most of the time, waiting for branch prediction results (something that sounds trivial in theory, like getting the return address from the return address stack, is twice as slow as on Intel) or for code fetches? They should come up with ways to make fetch work smarter, like Apple does, to ensure that whatever they fetch, they make the most of it. Maybe easier said than done, but I trust they have clever people working there, so I hope they will be able to improve this going forward.
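For readers unfamiliar with the return address stack mentioned above: it is a small hardware stack the branch predictor uses to guess where a `ret` will jump, pushed on every call and popped on every return. Below is a simplified model, assuming a fixed-depth circular buffer that overwrites its oldest entry on overflow; the depth and the class name are illustrative, real designs differ in size and in how they recover from mispredictions, and the model says nothing about the latency difference discussed above.

```python
# Simplified model of a branch predictor's return address stack (RAS).
# Assumption: fixed-depth circular buffer; on overflow the oldest entry
# is overwritten, so deep call chains lose their outermost return targets.
from typing import Optional


class ReturnAddressStack:
    def __init__(self, depth: int = 32) -> None:
        self.entries = [0] * depth
        self.top = 0    # index of the next free slot (mod depth)
        self.count = 0  # number of valid entries, at most depth

    def on_call(self, return_addr: int) -> None:
        """Push the address of the instruction after the call."""
        self.entries[self.top] = return_addr
        self.top = (self.top + 1) % len(self.entries)
        self.count = min(self.count + 1, len(self.entries))

    def predict_return(self) -> Optional[int]:
        """Pop the predicted return target, or None if the stack is empty."""
        if self.count == 0:
            return None  # no prediction; another predictor must handle it
        self.top = (self.top - 1) % len(self.entries)
        self.count -= 1
        return self.entries[self.top]


ras = ReturnAddressStack(depth=2)
for addr in (0x100, 0x200, 0x300):  # third call overwrites 0x100
    ras.on_call(addr)
print(hex(ras.predict_return()), hex(ras.predict_return()), ras.predict_return())
# 0x300 0x200 None
```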
 

StefanR5R

Elite Member
Dec 10, 2016
I compared the current SMT, in which some resources are competitively shared, with the proposed simpler managed SMT in which all resources¹ are shared fifty-fifty.

Competitive sharing should tend to yield higher utilization.

________
¹) Edit: core resources, that is. Other QoS policies may be employed in the uncore part of the SoC. But this goes for alternatives such as Sierra Forest too.
 
Jul 27, 2020
Maybe easier said than done, but I trust they have clever people working there, so I hope they will be able to improve this going forward
I don't know. The clever people left and formed their own company?


The lead is a woman.

"Hell hath no fury like a woman scorned"

Wouldn't it be ironic if actual competition to Zen 6 came from a RISC-V design...
 

StefanR5R

Elite Member
Dec 10, 2016
Competitive sharing should tend to yield higher utilization.
How did you arrive at that conclusion? And when you say compare, did you test some workload on an Epyc CPU using some BIOS option to alter the partitioning of core resources?
I am speaking hypothetically. For now, only AMD have the means to test this (in simulators at least, if not in actual silicon)... unless there is already a custom product out there like the one mentioned in #1,313.

Unfortunately I can't find a chart right now which enumerates which resources are partitioned statically vs. dynamically in AMD's current SMT implementation. However, we do know that Zen 5's frontend is pretty much 50:50 shared between threads whereas the sharing of backend resources is more dynamic (but not entirely dynamic either).

If there are two random threads, one might create more register pressure than the other. One might be integer heavy, the other more floating point heavy. One might use IMUL units a lot if it could, the other might be heavy on the AGUs. Et cetera. It matters whether you try to give each thread as much as you can, or if you give it at most half of each type of resources.
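A toy model of that argument, using a pool of 8 units per resource type and made-up per-thread demands (purely illustrative numbers, not measurements of any real core). Under static partitioning each thread is capped at half of every resource type; under competitive sharing a thread may use whatever its sibling leaves idle. The greedy policy in `competitive_alloc` is just one simple choice for illustration.

```python
# Toy model: static 50/50 partitioning vs competitive sharing of core
# resources between two SMT threads. Capacities and demands are made up.

CAPACITY = {"int_regs": 8, "fp_regs": 8, "agu": 8, "imul": 8}

# Hypothetical demands: thread A is integer/IMUL heavy, thread B FP/AGU heavy.
DEMAND_A = {"int_regs": 6, "fp_regs": 1, "agu": 2, "imul": 6}
DEMAND_B = {"int_regs": 2, "fp_regs": 7, "agu": 6, "imul": 1}


def static_alloc(cap: dict[str, int], da: dict[str, int], db: dict[str, int]) -> dict[str, int]:
    """Each thread gets at most half of each resource type."""
    return {r: min(da[r], cap[r] // 2) + min(db[r], cap[r] // 2) for r in cap}


def competitive_alloc(cap: dict[str, int], da: dict[str, int], db: dict[str, int]) -> dict[str, int]:
    """Greedy sharing: thread A takes what it needs, B gets what is left."""
    used = {}
    for r in cap:
        a = min(da[r], cap[r])
        b = min(db[r], cap[r] - a)
        used[r] = a + b
    return used


def utilization(used: dict[str, int], cap: dict[str, int]) -> float:
    return sum(used.values()) / sum(cap.values())


static_u = utilization(static_alloc(CAPACITY, DEMAND_A, DEMAND_B), CAPACITY)
shared_u = utilization(competitive_alloc(CAPACITY, DEMAND_A, DEMAND_B), CAPACITY)
print(f"static 50/50: {static_u:.1%}, competitive: {shared_u:.1%}")
```

With these invented demands, static partitioning keeps about 69% of the pool busy versus about 97% with greedy sharing. The flip side is the determinism point made earlier: under static partitioning a thread's share no longer depends on what its sibling is doing. Real SMT implementations typically also cap per-thread occupancy so one thread cannot starve the other.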
 

StefanR5R

Elite Member
Dec 10, 2016
The "rentable units" which are mentioned in this article are a different example of dynamic resource sharing, with the goal of optimum resource utilization. At first glance, it looks to me like some sort of hardware replacement for (or hardware assistance to) the operating system's thread scheduler. Otherwise it's a bit off-topic here, as it is an Intel patent.
Wouldn't it be ironic if actual competition to Zen 6 came from a RISC-V design...
Ignoring the low overlap of x86 and RISC-V ecosystems for a moment — while these engineers might want to follow up on ideas which they came up with while working for Intel but Intel put on the back burner, they can do so only to the extent to which they can come up with ideas to work around the now Intel-owned patents (which they themselves might have helped to conceive...).
 

Doug S

Diamond Member
Feb 8, 2020
I don't know. The clever people left and formed their own company?


The lead is a woman.

"Hell hath no fury like a woman scorned"

Wouldn't it be ironic if actual competition to Zen 6 came from a RISC-V design...


If the concepts in Royal Core have any merit, it is probably best to try to prove them out with RISC-V which allows them the freedom to alter the ISA if necessary to make those concepts work - and avoids a potential licensing mess like the recent Qualcomm/ARM spat.
 