Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
803
1,383
136
Except for details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!
 
Last edited:
Reactions: richardllewis_01

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Does it really make sense to fork the big core line for a "true big core" and a "slightly less big core"?

I mean, trading the ST for MT has traditionally been a niche - look at UltraSPARC T-line, Bulldozer, etc. Will it really take off now, in the blooming cloud era?

TBH the Zen 4 core is starting to look less exciting. It's apparently large even at 5nm (hence Zen 4D). It looks like a simple evolution of Zen 3 (doubled FPU, doubled L2). It's going to be short-lived (~12 months till Zen 5). And there's the Osborne effect (Zen 5 being a major redesign).
Things like the UltraSparc T-series are extremely skewed towards throughput at the expense of single-thread performance. I don't think they will be sacrificing much single-thread performance here, at least not for integer workloads. They may have lower floating-point throughput, but that isn't important for many servers. They need to have a lower-power core for mobile, high-density servers, and consoles, so making at least 2 variants isn't a bad idea.

I could definitely use ridiculous core counts for compile jobs, but those jobs like a lot of cache too, so hopefully it has stacked L3 if it is significantly smaller on the main die. I don't need much of any floating point on the CPU; most heavy FP operations are on the GPU for the software I deal with. They need to have a lot of hardware acceleration for mobile chips since that will get the best power consumption, but that will likely be on a different die or APU variant.

The server and mobile variants tend to stick around for a while longer than the common desktop chips. The Milan X3D will likely be around for quite a while; the server market just moves more slowly. Also, I expect a lot of companies' schedules to have slipped due to COVID-related supply issues and such. I don't know if I believe Zen 5 will be out that quickly.
 

insertcarehere

Senior member
Jan 17, 2013
639
607
136
Not surprised at the rumblings of Zen4D. AMD has to know that they are spending significant die area to make the Zen cores clock at 5 GHz, which is not needed in server or mobile applications. It'd also not make sense to stack all their SKUs with 3D cache if there are significant use cases where the die area (i.e. cost) would be better spent on more cores.
 

Harry_Wild

Senior member
Dec 14, 2012
838
152
106


Posts in CPUs must have your own original written commentary in them. This one does not. This is your friendly mod warning to add your own written commentary in your posts.

AT Moderator ElFenix
 
Last edited by a moderator:

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
I mean, trading the ST for MT has traditionally been a niche - look at ..., Bulldozer, etc.
There is no indication or design point that had Bulldozer sacrificing ST for MT. The performance loss is instead mostly attributable to server/HPC optimizations: a big L2 (higher latency), a big front-end (no room for an op or trace cache), a big FPU (with longer latency to support server/HPC execution-unit demands), etc.

In fact, before CMT was selected for Bulldozer, the prior CMT implementation that was supposed to launch with K8 was significantly smaller and faster. It had a single retire unit and a single LSU for two cores, each of which had two 64-bit ALUs (FU0/1), two 64-bit integer MMX units (MM0/1), two 64-bit FPU SSE units (FP0/1), one load AGU (LDA), one store AGU (STA), and, in the shrunk version, a store-data unit (STD). It had significantly fewer OoO resources than Bulldozer, since CMT is meant to be implemented at low power, whereas full CMP and big-core SMT increase power complexity.

Going more in-depth in the confusion.

Bulldozer's architecture was originally referred to as a "Compute Core", with scalable partitioned execution resources as a key feature. What if this implied a capability similar to IBM's CLA/CLB architecture, with one SMT4+4 core or two SMT4 cores, where Bulldozer can be either one combined core or two clustered cores? The combined core was the original intent for production models of Bulldozer, with linearly scaling power draw from the partitioned slices.


Various modes mentioned in 2007:
-> instruction dispatch module can dispatch integer instruction operations from the threads T0 and T1 to either of the integer execution units 212 and 214 depending on thread priority, loading, forward progress requirements, and the like.
-> instruction dispatch module can be configured to dispatch integer instruction operations associated with the thread to both integer execution units 212 and 214 based on a predefined or opportunistic dispatch scheme.
-> The integer execution units 212 and 214 can be used to implement a run ahead scheme whereby the instruction dispatch module dispatches memory-access operations (e.g., load operations and store operations) to one integer execution unit while dispatching non-memory-access operations to the other integer execution unit.
-> Another example of a collaborative use of the integer execution units 202 and 204 is for an eager execution scheme whereby both results of a branch in an instruction sequence can be individually pursued by each integer instruction unit.
-> As yet another example, the integer execution units 212 and 214 can be used collaboratively to implement a reliable execution scheme for a single thread. In this instance, the same integer instruction operation is dispatched to both integer execution units 212 and 214 for execution and the results are compared by, for example, the thread retirement modules 226 of each integer execution unit.

32nm Bulldozer only had the one mode: one thread per partition. The 45nm Bulldozer design, on the other hand, had the collaborative modes, which is why they claimed the "highest performing single-threaded (+ multi-threaded) compute core in history" in 2007.
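As a software analogy (only an analogy, since the patent describes a hardware mechanism), the reliable-execution scheme quoted above amounts to issuing the same operation to both integer units and comparing the results at retirement. A minimal Python sketch, with `op` standing in for an arbitrary integer instruction:

```python
def op(x: int) -> int:
    """Stand-in for a single integer instruction operation."""
    return x * x + 1

def reliable_exec(x: int) -> int:
    """Dispatch the same operation to 'both integer execution units'
    and compare the results at retirement, as in the patent's
    reliable-execution scheme; a mismatch would flag a hardware fault."""
    r0 = op(x)  # result from integer execution unit 212
    r1 = op(x)  # result from integer execution unit 214
    if r0 != r1:
        raise RuntimeError("retirement-time mismatch: unreliable execution")
    return r0

print(reliable_exec(6))  # 37
```

In hardware the two results come from physically separate execution units, so a disagreement reveals a transient fault; in this software model they can only agree.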
 
Last edited:

yuri69

Senior member
Jul 16, 2013
401
653
136
There is no indication or design point that had Bulldozer sacrificing ST for MT.
Let me quote HotChips 22's "Bulldozer - A new approach to multithreaded compute performance" (emphasis mine):
Throughput advantages for multi-threaded workloads without significant loss on serial single-threaded workload components.

The ST loss was not significant but was present.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Makes me wonder if AMD will use the reserved bit for AVX1024 in a later design;

I doubt we will see AVX1024 any time soon. We are more likely to see further encroachment of AVX512_256bit-mode-style instructions, like we see in ADL, than a widening of the FPR to 1024 bits.
That ship has already sailed to the port with a name that starts with capital N.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
I doubt we will see AVX1024 any time soon. We are more likely to see further encroachment of AVX512_256bit-mode-style instructions, like we see in ADL, than a widening of the FPR to 1024 bits.
That ship has already sailed to the port with a name that starts with capital N.
If Zen4's AVX512 implementation takes the 256-bit P0/P1/P2's reflection and makes it the high bits for AVX512, thus achieving full-speed 512-bit like Zen2 did for 256-bit, then there is a case for Zen6 to have full-speed AVX1024.

Given the trend:
Zen = 4 x 128-bit (also, relative to Bulldozer-Excavator&Greyhound-Husky this is true native 128-bit, where 64-bit lo+64-bit hi was used in Fam10h&Fam12h&Fam15h)
Zen2 = 8 x 128-bit (4*128b for lo-bit(1-128) and 4*128b for hi-bit(129-256) to get full-speed 4 x 256-bit) || doesn't allow full access to second set of units like Zen3.
Zen3 = 6 x 256-bit (true native 256-bit) || equivalent to 12 x 128-bit
--- unknown waters ---
Zen4 = 6 x 256-bit (3*256b for lo-bit(1-256) and 3*256b for hi-bit(257-512) to get full-speed 3 x 512-bit) || equivalent to 12 x 128-bit
Zen5 = 4 x 512-bit (true native 512-bit) || equivalent to 16 x 128-bit
Zen6 = 4 x 512-bit (2 * 512b for lo-bit(1-512) and 2 * 512b for hi-bit(513-1024) to get full-speed 2 x 1024-bit) || equivalent to 16 x 128-bit
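The lo-half/hi-half pattern in the list above, where one wide operation is issued as two half-width passes, is what "double pumping" means. A toy Python model of the idea (conceptual only, not the actual hardware datapath):

```python
# Toy model of "double pumping": one 512-bit vector add issued as two
# 256-bit half-width passes, analogous to how Zen 2 ran 256-bit AVX on
# 128-bit units.
LANES = 8                    # 8 x 64-bit lanes = 512 bits
MASK64 = (1 << 64) - 1       # each lane wraps at 64 bits

def add512_double_pumped(a: list[int], b: list[int]) -> list[int]:
    assert len(a) == len(b) == LANES
    r = [0] * LANES
    for base in (0, 4):                  # pass 1: lo 256 bits; pass 2: hi 256 bits
        for i in range(base, base + 4):  # one 256-bit pass covers 4 lanes
            r[i] = (a[i] + b[i]) & MASK64
    return r

assert add512_double_pumped([1] * 8, [2] * 8) == [3] * 8
```

The hardware sees one 512-bit instruction; it simply occupies the half-width units for two cycles instead of one, which is why the scheme costs throughput rather than correctness.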
how likely is it that zen4d core is used for the eventual Dali replacement for lower end mobile SKU's?
We have to suffer through Mendocino before we see any future designs in the value APU market.
AMD Mendocino | MDN-A0 Zen 2 | socket FT6 CPUID 0x8A0F00

However, if there is a Zen3 Dense used in FT7, then there is a likely case for Zen4 Dense to be used as well.
 
Last edited:

Hitman928

Diamond Member
Apr 15, 2012
5,423
8,333
136

Mopetar

Diamond Member
Jan 31, 2011
7,961
6,312
136
It's not particularly strange for anyone that had no previous implementation in place to seemingly catch up or even leapfrog ahead. There are African countries that have better and easier mobile payment systems in place than the US does because they have the benefit of not having to worry about the various incremental steps along the way.

This isn't quite the coup that AMD64 was, and while AVX-512 isn't really necessary outside of a few specific workloads, I can't say it hurts to have the option for those who might want or need it. But I do admit that I'll get a bit of a grin if AMD rolls out a more efficient implementation than Intel's (much like their foray into SMT with Zen), and at the resulting blow to the collective ego of some Intel diehards.
 

LightningZ71

Golden Member
Mar 10, 2017
1,645
1,929
136
We have to suffer through Mendocino before we see any future designs in the value APU market.
AMD Mendocino | MDN-A0 Zen 2 | socket FT6 CPUID 0x8A0F00

However, if there is a Zen3 Dense used in FT7, then there is a likely case for Zen4 Dense to be used as well.

I thought there was supposed to be a Monet in there somewhere on the value spectrum. GF12, quad, RDNA2, Zen3...
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
I thought there was supposed to be a Monet in there somewhere on the value spectrum. GF12, quad, RDNA2, Zen3...
The value proposition node at GlobalFoundries is 12FDX, not 12LP+.
22FDX = 1x cost
12LP+ = ~1.6x cost [2019 node]
12FDX = ~1.2x cost [2022 node]
per mm squared

So, I doubt "4c Zen3" plus "a couple RDNA2 WGPs" on 12LP+ is aimed at value.
 

Hitman928

Diamond Member
Apr 15, 2012
5,423
8,333
136
The value proposition node at GlobalFoundries is 12FDX, not 12LP+.
22FDX = 1x cost
12LP+ = ~1.6x cost [2019 node]
12FDX = ~1.2x cost [2022 node]
per mm squared

So, I doubt "4c Zen3" plus "a couple RDNA2 WGPs" on 12LP+ is aimed at value.

12LP+ is much cheaper than TSMC 7 nm; that's the correct comparison. 12FDX is a very different process and has had a lot of hiccups in its bring-up. I'm not expecting AMD, or really any major CPU maker, to bring any kind of high-performance (relative to the overall digital world) design to 12FDX.
 

Gideon

Golden Member
Nov 27, 2007
1,688
3,844
136
Well, Zen 5 still sounds like a Zen core - scalable and balanced, to be used from top to bottom with minor changes. AMD still has to be able to cater to the 15W ultrabook market even in 2023/2024. There are two sane possibilities - either AMD forks the Zen core into a separate "mobile-optimized branch", or Zen 5 can't be such a big monster.

There is no such direct relation between core size and target market. Apple's big cores are huge 8-wide monsters but also go into smartphones. They will use more die space and have lower frequency (offset by higher IPC) but can still be every bit as power efficient.
 
Reactions: Tlh97 and uzzi38

DisEnchantment

Golden Member
Mar 3, 2017
1,626
5,910
136
AVX512 instructions supported by AMD Zen, compared to the Intel architecture families.

View attachment 52315

View attachment 52316


I hope they won't implement single-op 512-bit FMA pipes. Not worth it; just take more cycles on an improved Zen 3 256-bit pipe, on one if not both of the existing FMA pipes.
Increase the FP register file for renaming, or possibly add another 256-bit FP pipe.
Genoa will have enough cores to beat SPR in AVX512 throughput even if it takes more cycles.
 

Ajay

Lifer
Jan 8, 2001
15,783
7,995
136
If anything the booming cloud era is probably why it exists. Upping private L2 at the cost of public L3 definitely seems like an optimisation for server workloads if you ask me.

I would think virtual machines that split the cores on the CCD would benefit from a larger private cache. IIRC, modern CPUs allow cache partitioning of the L3$, so I'm not sure how much gain is to be had. Small executables (like scripts) would definitely benefit from a larger L2 cache.
 

eek2121

Platinum Member
Aug 2, 2005
2,989
4,135
136
If Zen4's AVX512 implementation takes the 256-bit P0/P1/P2's reflection and makes it the high bits for AVX512, thus achieving full-speed 512-bit like Zen2 did for 256-bit, then there is a case for Zen6 to have full-speed AVX1024.

Given the trend:
Zen = 4 x 128-bit (also, relative to Bulldozer-Excavator&Greyhound-Husky this is true native 128-bit, where 64-bit lo+64-bit hi was used in Fam10h&Fam12h&Fam15h)
Zen2 = 8 x 128-bit (4*128b for lo-bit(1-128) and 4*128b for hi-bit(129-256) to get full-speed 4 x 256-bit) || doesn't allow full access to second set of units like Zen3.
Zen3 = 6 x 256-bit (true native 256-bit) || equivalent to 12 x 128-bit
--- unknown waters ---
Zen4 = 6 x 256-bit (3*256b for lo-bit(1-256) and 3*256b for hi-bit(257-512) to get full-speed 3 x 512-bit) || equivalent to 12 x 128-bit
Zen5 = 4 x 512-bit (true native 512-bit) || equivalent to 16 x 128-bit
Zen6 = 4 x 512-bit (2 * 512b for lo-bit(1-512) and 2 * 512b for hi-bit(513-1024) to get full-speed 2 x 1024-bit) || equivalent to 16 x 128-bit
We have to suffer through Mendocino before we see any future designs in the value APU market.
AMD Mendocino | MDN-A0 Zen 2 | socket FT6 CPUID 0x8A0F00

However, if there is a Zen3 Dense used in FT7, then there is a likely case for Zen4 Dense to be used as well.

Nahhh.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,626
5,910
136
I would think virtual machines that split the cores on the CCD would benefit from a larger private cache. IIRC, modern CPUs allow cache partitioning of the L3$, so I'm not sure how much gain is to be had. Small executables (like scripts) would definitely benefit from a larger L2 cache.
It is called Cluster On Die.
For AMD it is not straightforward, which is why AMD has not supported COD until now.
They split mainly by NPS (Nodes Per Socket), with interleaving done correspondingly; for EPYC it goes to NPS8, and you can guess that is because they create one NUMA node for each CCD. This is optimal because there is exactly one Home Agent per NUMA node.
AMD is adding COD support in the upcoming DF, which you can see in the MCA patches. These strongly suggest they can split the CCD, because there are multiple fabric instances running on each CCD.
Either there are multiple CCXs, or they can now perform COD per CCX, but that is not likely.
The two-CCX, 16-core CCD suggested by @uzzi38 aligns with the EDAC patches and is basically just a trip back down Zen 2 memory lane.
Zen 2 also has no COD support in Linux.
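The NPS-to-NUMA mapping being discussed can be illustrated with a toy model. The even CCD-to-node assignment below is a simplification for illustration; the real mapping is decided by firmware and the IO-die topology:

```python
def numa_node_of_ccd(ccd: int, nps: int, total_ccds: int = 8) -> int:
    """Toy mapping of a CCD index to a NUMA node for a given
    Nodes-Per-Socket (NPS) setting, assuming CCDs are spread
    evenly across the nodes."""
    if nps not in (1, 2, 4, 8):
        raise ValueError("NPS is typically 1, 2, 4, or (per the post) 8")
    ccds_per_node = total_ccds // nps
    return ccd // ccds_per_node

# NPS4 on an 8-CCD part: two CCDs per node.
assert [numa_node_of_ccd(c, 4) for c in range(8)] == [0, 0, 1, 1, 2, 2, 3, 3]
# NPS8: one NUMA node per CCD, as described above.
assert [numa_node_of_ccd(c, 8) for c in range(8)] == list(range(8))
```

Note that the L3-as-NUMA-node option mentioned later in the thread is a separate knob: it partitions by CPU die (CCX) rather than by IO-die quadrant, so it does not change memory interleaving.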
 
Last edited:

Ajay

Lifer
Jan 8, 2001
15,783
7,995
136
It is called Cluster On Die.
For AMD it is not straightforward, which is why AMD has not supported COD until now.
They split mainly by NPS (Nodes Per Socket), with interleaving done correspondingly; for EPYC it goes to NPS8, and you can guess that is because they create one NUMA node for each CCD. This is optimal because there is exactly one Home Agent per NUMA node.
AMD is adding COD support in the upcoming DF, which you can see in the MCA patches. These strongly suggest they can split the CCD, because there are multiple fabric instances running on each CCD.
Either there are multiple CCXs, or they can now perform COD per CCX, but that is not likely.
The two-CCX, 16-core CCD suggested by @uzzi38 aligns with the EDAC patches and is basically just a trip back down Zen 2 memory lane.
Zen 2 also has no COD support in Linux.

Thank you so much. It's really hard to look things up on the internet when you don't know the exact terms to search for. I take it you work, or have worked, with VMs?
I wish I hadn't passed on an IT job years ago that included managing a bunch of servers on VMWare. Would have been a good skill to have (aside from using just the Workstation product).
IIRC, it's way more automated than it was back in 2008.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
It is called Cluster On Die.
For AMD it is not straightforward, which is why AMD has not supported COD until now.
They split mainly by NPS (Nodes Per Socket), with interleaving done correspondingly; for EPYC it goes to NPS8, and you can guess that is because they create one NUMA node for each CCD. This is optimal because there is exactly one Home Agent per NUMA node.
AMD is adding COD support in the upcoming DF, which you can see in the MCA patches. These strongly suggest they can split the CCD, because there are multiple fabric instances running on each CCD.
Either there are multiple CCXs, or they can now perform COD per CCX, but that is not likely.
The two-CCX, 16-core CCD suggested by @uzzi38 aligns with the EDAC patches and is basically just a trip back down Zen 2 memory lane.
Zen 2 also has no COD support in Linux.
Is it actually listed as NPS8? I haven’t actually seen the BIOS settings. I thought it was NPS 1, 2, or 4 for no numa partitioning, each half of the IO die as separate node, and each quadrant of the die as a separate numa node. It also has a setting to make each L3 cache into a separate numa node which would be up to 16 on Rome and 8 on Milan (CCX = CCD). That is different since it is based on the cpu die, not IO die partitioning. The IO die partitioning affects the memory interleave while settings based on L3 cache do not.

If you look at the diagram of the IO die layout here:


It looks like there is a bigger penalty for going across the 2 halves of the IO die than I thought. It gets complicated to measure this due to the number of different parts. If you only have a 4 cpu die part, then setting NPS4 is equivalent to having a separate numa node for each L3 with Milan, but that is not the case with Rome or with devices with more than 4 cpu chips. I have some 7313s (4 core per CCD/CCX with 4 CCD) which I am running in NPS1. I also have a dual socket 7F32 (1 core per CCX with 8 CCX in, I assume, 4 CCD) which is currently set to NPS4. That gives me 8 numa nodes with 2 CPU each in 2 separate CCX on one CCD. I think the NPS4 setting is not optimal for this part and the software I am running.

Image above is from this article:

 
Reactions: lightmanek