Question Zen 6 Speculation Thread

OneEng2 · Nov 26, 2024

maddie said:
How does this work with AM5? Already we see Zen5 as having some main memory limitations.

If we have a 50+ % increase in cores and another 10+ % throughput increase per core, where does that leave us? DDR5 10000+ needed?

As a comparison, SP5 has 12 channels serving 128 Zen5 cores vs AM5 having 2 channels with 16 cores maximum.

Good point. I wonder if the RAM bandwidth is the limit though?

maddie said:
Yes, Strix Halo will answer many questions.

True, but the SP5 socket has max 10.67 cores/memory channel, also slower cores, still 50% more bandwidth/core than AM5 (max). Tells me, that at the limit, the cores with AM5 are starved. Strix Halo will show what's true.

Rumor suggests that Strix Halo's 256bit wide DDR 8000 interface is essentially equal to a quad channel memory setup. If true, then such a setup (quad channel moving from DDR6000 to DDR8000) would represent more than enough bandwidth for double the number of Zen 5 cores for Zen 6.
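
A back-of-envelope check on that claim, as a minimal sketch: peak theoretical bandwidth is just bus width in bytes times transfer rate. The function name and the 16-core and 32-core configurations are my own illustrative assumptions, and sustained bandwidth in practice lands well below these peaks.

```python
def peak_bandwidth_gbs(bus_width_bits: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak bandwidth in GB/s (decimal) for a memory bus."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000  # MB/s -> GB/s

# Assumed configurations (illustrative, not confirmed specs):
am5_dual = peak_bandwidth_gbs(128, 6000)   # 2 x 64-bit channels at 6000 MT/s
wide_bus = peak_bandwidth_gbs(256, 8000)   # 256-bit interface at 8000 MT/s

per_core_16c = am5_dual / 16   # hypothetical 16-core part on dual channel
per_core_32c = wide_bus / 32   # hypothetical doubled core count on the wide bus

print(f"128-bit @ 6000 MT/s: {am5_dual:.0f} GB/s, {per_core_16c:.1f} GB/s per core at 16 cores")
print(f"256-bit @ 8000 MT/s: {wide_bus:.0f} GB/s, {per_core_32c:.1f} GB/s per core at 32 cores")
```

On those assumptions, doubling the cores while going from 128-bit/6000 to 256-bit/8000 raises peak bandwidth per core rather than lowering it, which fits the argument above for CPU-only loads.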

maddie · Nov 26, 2024

OneEng2 said:
Rumor suggests that Strix Halo's 256bit wide DDR 8000 interface is essentially equal to a quad channel memory setup. If true, then such a setup (quad channel moving from DDR6000 to DDR8000) would represent more than enough bandwidth for double the number of Zen 5 cores for Zen 6.

For mainly CPU only tasks, definitely yes, but questionable if both CPU & GPU are stressed simultaneously. I eagerly await detailed testing.

soresu · Nov 27, 2024

maddie said:
How does this work with AM5? Already we see Zen5 as having some main memory limitations.

Can AM5 not use the new CUDIMMs to address that issue?

poke01 · Nov 27, 2024

soresu said:
Can AM5 not use the new CUDIMMs to address that issue?

it won’t be AM5 then. AMD needs a new chipset to support CUDIMM, AM5+?

soresu · Nov 27, 2024

poke01 said:
it won’t be AM5 then. AMD needs a new chipset to support CUDIMM, AM5+?

Chipset and socket are not linked in that way.

As long as the pin configuration is right it's the IMC on the processor package that needs to change to support a variant of DDR5.

maddie · Nov 27, 2024

soresu said:
Chipset and socket are not linked in that way.

As long as the pin configuration is right it's the IMC on the processor package that needs to change to support a variant of DDR5.

Let's take this a bit further. Would the motherboard and its memory trace specifications impose frequency limits, even if the raw CPU was capable? Might a 24C Zen 6 be throttled? Lower core count models should be OK. Not a bad position for those upgrading, if true.

soresu · Nov 27, 2024

maddie said:
Let's take this a bit further. Would the motherboard and its memory trace specifications impose frequency limits, even if the raw CPU was capable? Might a 24C Zen 6 be throttled? Lower core count models should be OK. Not a bad position for those upgrading, if true.

Interesting point.

adroc_thurston · Nov 27, 2024

poke01 said:
AMD needs a new chipset to support CUDIMM

No?
It's just a DIMM.
You don't need a new socket unless you're swapping to like LPDDR5X/6.

Joe NYC · Nov 29, 2024

Interesting tidbit on TSMC N2 wafer starts. It includes both Zen 6 and Mi400. But it shows (one of the two) already taped out, and (one of the two) starting production in early 2026.

Based on "conventional wisdom", Zen 6 is in late 2026, which would imply Mi400 in early 2026 and likely majority of the 100k wafers.

As far as Zen 6, it is safe to assume that none of this volume is for client parts, and all of the Zen 6 N2 volume is server.

https://twitter.com/x/status/1862391833738813706

DrMrLordX · Nov 30, 2024

Joe NYC said:
Based on "conventional wisdom", Zen 6 is in late 2026, which would imply Mi400 in early 2026 and likely majority of the 100k wafers.

Huh. Guess we know where AMD's priorities lie. Can you really blame them?

DisEnchantment · Nov 30, 2024

MS_AT said:
Not quite. First of all, INT scheduler has nothing to do with SSE and AVX2 integer operation, FP/SIMD scheduler is responsible for those. The throughput is not halved for those operations, but if you max the schedulers out, you will get one extra cycle of delay on one cycle instructions. Since SIMD integer adds are natively 1 cycle, they get the latency hit. Throughput stays the same, 4 int adds at whatever SIMD width you want. Speaking of desktop Zen5.

Yes they are handled by FP unit, I guess it was a weekend when I commented, and I tend to have higher levels of alcohol in my blood on weekends.
I am mostly referring to this teardown by Alexander Yee, where there is an increase in latencies for SIMD instructions

All SIMD instructions have minimum 2 cycle latency:

As awesome as Zen5's AVX512 is, not everything is perfect. So let's start with the biggest regression I found:

All formerly 1 cycle SIMD instructions have regressed to 2 cycles.

Applies to all widths, even 128-bit.

Everything that was already >= 2 does not further regress.

Throughput remains unchanged. The regression is only for latency.

Instructions that can be rename-eliminated (i.e. XOR zeroing) are unaffected and remain zero latency.

This caught me by surprise since it wasn't revealed in AMD's GCC patch. Initially I suspected that this regression was a trade-off to achieve the full 256 -> 512-bit widening. So I asked AMD about this and they gave a completely different explanation. While I won't disclose their response (which I assume remains under NDA), I'll describe it as a CPU hazard that "almost always" turns 1-cycle SIMD instructions into 2-cycle latency.
So while the 1-cycle instructions technically remain 1-cycle, for all practical purposes they are now 2 cycles. So developers and optimizing compilers should assume 2 cycles instead of 1 cycle. I believe it is possible to construct a benchmark that demonstrates the 1-cycle latency, but I have not attempted to do this.

Zen5's AVX512 Teardown + More...

Also, if you check the SOG for Z5, there are quite a few regressions in latency for SIMD ops, and (very) few regressions in throughput compared to Z4. But of course there are lots of improvements as well. (From Z3 to Z4 there were no regressions.)

https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/software-optimization-guides/58455.zip
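
To make the latency-versus-throughput distinction concrete, here is a toy cycle-count model. It is my own simplification, not something taken from the teardown or the SOG. It assumes 4 SIMD integer adds can issue per cycle, as mentioned for desktop Zen 5 above, and ignores every other pipeline effect.

```python
import math

def dependent_chain_cycles(n_ops: int, latency: int) -> int:
    """Each op waits for the previous result, so latency is fully exposed."""
    return n_ops * latency

def independent_stream_cycles(n_ops: int, latency: int, issue_width: int = 4) -> int:
    """Ops do not depend on each other, so issue width dominates;
    latency is only paid once, by the last group to issue."""
    return math.ceil(n_ops / issue_width) + latency - 1

n = 1000
for latency in (1, 2):
    print(latency, dependent_chain_cycles(n, latency), independent_stream_cycles(n, latency))
```

In this model a serial dependency chain takes twice as long when latency goes from 1 to 2 cycles, while a stream of independent ops is essentially unchanged. So the regression should only hurt code that is latency-bound.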

MS_AT · Nov 30, 2024

DisEnchantment said:
I am mostly referring to this teardown by Alexander Yee, where there is an increase in latencies for SIMD instructions

Ok, so those are related to the scheduler issues I have mentioned. You can find the confirmation in the SOG itself. Alex later confirmed this is the same phenomenon. But int adds still enjoy full throughput afaik.

Joe NYC · Nov 30, 2024

Some potential configurations of Medusa (Zen 6). It may share CCD across 3 families:
- Medusa Ridge (desktop)
- Medusa Point (higher end notebook)
- Medusa Halo (high end notebook)

What would differ between these is IO die. So likely configurations would be:
- Medusa Ridge - 2 CCDs, smaller IOD
- Medusa Point - 1 CCD, medium sized IOD (maybe adding LLC, large NPU)
- Medusa Halo - 2 CCDs, large IOD

The interposer used is not known, but presumably the inexpensive wafer with RDL. And it seems AMD feels confident in its competitive cost and power efficiency to offer it in mainstream Medusa Point.

The CCD being 12 (big) cores is interesting. Aiming mostly for the high end. There will still likely be a low end Kraken successor. I wonder if it will be monolithic or chiplet.

Since this is geared to client and notebooks, the AVX-512 implementation will likely be similar to Zen 4, saving some die area, and possibly allowing 12 big cores on a single CCD (using N3) with die size still in AMD sweet spot for CCDs - 70 to 80 mm2.
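
For a rough sense of what 70 to 80 mm2 means in candidate dies per wafer, here is a sketch using the common gross-dies-per-wafer approximation. The 300 mm wafer is an assumption, and the estimate ignores scribe lines, edge exclusion, die aspect ratio and yield, so treat the numbers as ballpark only.

```python
import math

def gross_dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    """Common approximation: wafer area / die area, minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)

for area in (70, 75, 80):
    print(f"{area} mm2 -> ~{gross_dies_per_wafer(area)} gross dies per 300 mm wafer")
```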

Kepler_L2 · Nov 30, 2024

Joe NYC said:
There will still likely be a low end Kraken successor. I wonder if it will be monolithic or chiplet.

Should be another 4+4 monolithic die.

yuri69 · Nov 30, 2024

Does it even make sense to feed a set of 2 * 12 hungrier-than-Zen-5 cores with dual channel DDR5?

Also scaling a 100+W 6GHz 12c CCD to ~35W mobile seems a bit weird.

eek2121 · Nov 30, 2024

OneEng2 said:
Granted that Intel "got rid of it" because E cores couldn't do it, but I believe that this was a die size issue for BOTH E and P cores.

Good point on the non-Intel compilers; however, my point is that if Intel creates new instructions in their processors, they don't release this fact early enough for AMD to include the same instructions in the same time window. This has given Intel a generation of added performance every time Intel did this before AMD could spin up a new design that included the support.

The question for AVX512 becomes "Is the juice worth the squeeze?". I believe it is due to the growing number of applications that support it in the desktop/laptop, and mostly the huge gains found in many applications in DC.

Intel's recent design decisions seem to leave DC concerns on the back burner. Seems like a strategic mistake to me. We will see.

If Windows itself began making heavy use of AVX-512 along with browsers, you would see a decent speed up. AVX-512 can accelerate many different workloads, the issue is that code needs to be written, you don’t get much benefit from autogen.

CakeMonster said:
I'm not overly optimistic about more cores per CCD because of the skyrocketing node cost and the slow advancement lately (except for the c core parts).

Costs go down over time FYI. N3E will cost significantly less in 2027 when N2 is expected to be out. In addition, TSMC is not going to be the only leader anymore since Intel has 18A, etc. More competition means lower prices for all.

DisEnchantment · Nov 30, 2024

Joe NYC said:
The interposer used is not known, but presumably the inexpensive wafer with RDL. And it seems AMD feels confident in its competitive cost and power efficiency to offer it in mainstream Medusa Point.

What is this gigantic interposer for? I thought they will use Fan Out with RDL.
Also, wouldn't it be odd for mobile chips with razor tight margins to have SLC but not the DT chips?

What's the consensus on the IOD process node? I would suppose a GPU would really benefit from a new node if it is to also be part of the IOD. Is the GPU rumored to be in IOD or separate chiplet?

A 75mm2 CCD is surprisingly large even with 12 cores.

LightningZ71 · Nov 30, 2024

eek2121 said:
If Windows itself began making heavy use of AVX-512 along with browsers, you would see a decent speed up. AVX-512 can accelerate many different workloads, the issue is that code needs to be written, you don’t get much benefit from autogen.

Costs go down over time FYI. N3E will cost significantly less in 2027 when N2 is expected to be out. In addition, TSMC is not going to be the only leader anymore since Intel has 18A, etc. More competition means lower prices for all.

It doesn't matter if TSMC has competition on its leading nodes if demand still surpasses aggregate capacity. Leading edge nodes will continue to be very costly going forward, getting more expensive as the actual time to complete a wafer grows due to multipatterning, etc., and will be more resistant to cost reductions over time.

LightningZ71 · Nov 30, 2024

DisEnchantment said:
What is this gigantic interposer for? I thought they will use Fan Out with RDL.
Also, wouldn't it be odd for mobile chips with razor tight margins to have SLC but not the DT chips?

What's the consensus on the IOD process node? I would suppose a GPU would really benefit from a new node if it is to also be part of the IOD. Is the GPU rumored to be in IOD or separate chiplet?

A 75mm2 CCD is surprisingly large even with 12 cores.

TSMC discussed N4C as a more cost effective node for non-leading-edge chips. I suspect that we'll see a generation of IODs use it.

Joe NYC · Nov 30, 2024

DisEnchantment said:
What is this gigantic interposer for? I thought they will use Fan Out with RDL.

I think what AMD may be planning is what is called fan-out wafer-level packaging, combining dies from multiple wafers on a reconstituted wafer.

The start of the process is getting Known Good Die for both IOD (SoC) and CCD.

Then, these are placed (in the desired arrangement, spacing, etc.) on the reconstituted wafer (which is what MLID may incorrectly be calling an interposer).

Then, on this reconstituted wafer, additional wiring layers (RDL) are applied, to make the necessary connections.

This first video shows that multiple chips can be placed in the mold:

The 2nd video shows how the reconstituted wafer is constructed (just to get the concept), but their example only shows one chip being used.

adroc_thurston · Dec 1, 2024

DisEnchantment said:
I thought they will use Fan Out with RDL.

That's still interposer.

DisEnchantment said:
Also, wouldn't it be odd for mobile chips with razor tight margins to have SLC but not the DT chips?

The margins on mobile aren't razor thin at all.
And DT parts have no real need for rather slow and high latency memside caches.

OneEng2 · Dec 1, 2024

From a desktop perspective where die size cost is very important, I find it difficult to believe there will not be variants of the CCD used that have mixed core content (or 6p+6c) for value segments.

A 2CCD solution with one CCD all P and the other all C, and possibly a high end with 2 CCD both all P.

I guess an argument could be made against the mixed CCD based on volume?

Thoughts?

marees · Dec 1, 2024

OneEng2 said:
From a desktop perspective where die size cost is very important, I find it difficult to believe there will not be variants of the CCD used that have mixed core content (or 6p+6c) for value segments.

A 2CCD solution with one CCD all P and the other all C, and possibly a high end with 2 CCD both all P.

I guess an argument could be made against the mixed CCD based on volume?

Thoughts?

Probably just Strix Point (4p + 8c) ported to Zen 6 mobile, which in turn is ported to Zen 6 desktop.

I don't expect AMD to experiment with multiple designs. One and done is more like AMD

Thibsie · Dec 1, 2024

Don't forget the low power island cores (halo I think).

FlameTail · Dec 1, 2024

Joe NYC said:
Some potential configurations of Medusa (Zen 6). It may share CCD across 3 families:
- Medusa Ridge (desktop)
- Medusa Point (higher end notebook)
- Medusa Halo (high end notebook)

What would differ between these is IO die. So likely configurations would be:
- Medusa Ridge - 2 CCDs, smaller IOD
- Medusa Point - 1 CCD, medium sized IOD (maybe adding LLC, large NPU)
- Medusa Halo - 2 CCDs, large IOD

The interposer used is not known, but presumably the inexpensive wafer with RDL. And it seems AMD feels confident in its competitive cost and power efficiency to offer it in mainstream Medusa Point.

The CCD being 12 (big) cores is interesting. Aiming mostly for the high end. There will still likely be a low end Kraken successor. I wonder if it will be monolithic or chiplet.

Since this is geared to client and notebooks, the AVX-512 implementation will likely be similar to Zen 4, saving some die area, and possibly allowing 12 big cores on a single CCD (using N3) with die size still in AMD sweet spot for CCDs - 70 to 80 mm2.

Well, it's funny that some of this lines up with the drunken speculation I made several months ago:

FlameTail said:
ZEN 6 Client (with RDNA5)
Ryzen AI 400 series

All Medusa parts use 12-core CCDs. The difference is the IOD, of which 3 unique ones exist for each Ridge/Halo/Point.

MEDUSA RIDGE (Desktop)
24C/4CU
20C/4CU
16C/4CU
12C/4CU
8C/4CU

192bit/LPDDR6-10667 LPCAMM
100 TOPS NPU

MEDUSA POINT
12C/24CU
10C/20CU
8C/16CU
6C/12CU

192bit/LPDDR6-10667 LPCAMM/Soldered
100 TOPS NPU
4 LP cores

MEDUSA HALO
24C/72CU
20C/60CU
16C/48CU

384bit/LPDDR6-10667 On-package
200 TOPS NPU
4 LP cores
___

*above is speculation

Maybe this comment of mine was secretly one of MLID's sources for that video!
