Well that AMD slide says 15+%. That is, 15% is a baseline.
Yep. The 'IPC 10-15%+' looks like another 'ST 15%+' to me. Nobody knows where it will land in the end.
> lol what? I am not defending anyone. I am aware that most of MLID leaks are usually BS but you are blaming him for a number in an AMD slide that he even said it will likely be higher in that same video.

I'm not blaming him for the number lol, I'm just saying he has trouble interpreting data he has been spoon-fed. That is why he put such a wide range of 15-25% on his "prediction": he has no idea what the changes listed on one of the slides even mean.
CCDs don't need SoIC-X; something akin to Foveros is fine.
Four tiles are possible, but it has to be more cores than Turin, so you'd just inflate the size of each tile.
Maybe the server standard CCD moves to 16 cores with Zen6, but I really doubt it.
Dense CCD 16 cores for sure, the 32 core thing I'll defer to the likes of Spec.
Also the dense part probably has some structural changes outside of the CCDs, given the target market and all.
The fanout introduced with RDNA3 is nice and cheap, and works well for this purpose. Si bridge would be overkill.
Cache isn't magic; 8 cores is basically the limit for a high-performance L3.
Venice is built for the guys who care about TCO and perf density and nothing else.
The barrier to entry will be set extremely high, MI300 pricing or so. It is more expensive to make, but volume should naturally be higher.
The slide clearly deals with CCX (Core Complex) not CCD (Core Complex Die).
> I'm not blaming him for the number lol, I'm just saying he has troubles interpreting data he has been spoon fed. That is why he put such a wide range of 15-25% on his "prediction", as he has no idea what the changes listed on one of the slides even mean.

You can't just look at a microarchitecture block diagram and come up with an accurate IPC figure. Just look at the A17: with all its changes, it ended up with just a 3% IPC uplift.
> you can't just look at a microarchitecture block diagram and come up with an accurate IPC figure. just look at A17 with all the changes and it ended up with just a 3% IPC uplift

Actually you can. With Zen 5 we have massive ALU, AGU, load/store, FP, frontend, and backend increases. This is basically a bigger core change than Zen 3 was, probably on the level of the Bulldozer-to-Zen 1 change. Anyone who knows a little bit about CPU microarchitecture design should be able to guess this will be a BIG jump in performance.
> Actually you can, as with Zen 5 we have some massive ALU, AGU, L/S, FP, frontend, backend increased. Basically this is a bigger core change than Zen 3 was, probably on the level of Bullldozer to Zen 1 change. Anyone who knows a little bit about CPU uarchitecture design should be able to guess this will be a BIG jump in performance.

You can guess it will be a large jump, but you can't guess an accurate IPC figure. Again, the A17 had a lot of changes, with an extra decoder and a wider backend, and it ended up with 3%.
> you can guess it will be a large jump but you can't guess an accurate IPC figures. again A17 had a a lot of changes with an extra decoder and a wider backend and it ended up with 3%.

You are comparing apples and oranges (ARM vs x86). Bottlenecks and overall pipeline flows are different between the two.
> I'm not blaming him for the number lol, I'm just saying he has troubles interpreting data he has been spoon fed. That is why he put such a wide range of 15-25% on his "prediction", as he has no idea what the changes listed on one of the slides even mean.

Yep. He had slides directly saying that the large Strix Point Halo SKU, albeit cut down to 24 CUs, will compete with 3050 mobile graphics, but he was happy to report that the standard 16 CU part, without MALL cache, will do just that, which will just not happen.
> BTW can anyone explain what the Zen5 Zero bubble Conditional branches might mean? Or the "Memory Profiler" for Zen6?

I assume zero-bubble conditional branches means they've found a way to make the penalty of a branch misprediction zero. Normally, a branch misprediction stalls the core: the pipeline is flushed and the execution units have no work to do until instructions from the correct path make it through the pipeline. Imagine the core is a pipe that flows water, and your goal as a CPU architect is for water to flow as fast as possible through the pipe. The gap between the flush and when the execution units have work again is the "bubble", essentially an air pocket that halts your water flow.
Edit: Formatting
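The bubble described above can be sketched with a toy model: a 2-bit saturating-counter predictor where every misprediction costs a fixed flush penalty. The penalty of 12 cycles and the predictor itself are illustrative assumptions, not AMD's actual design.

```python
def run(branch_outcomes, flush_penalty=12):
    """Simulate a 2-bit saturating-counter branch predictor.
    Returns (mispredictions, total_cycles), charging 1 cycle per
    branch plus a flush_penalty-cycle bubble per misprediction.
    The penalty value is illustrative, not a real pipeline depth."""
    state = 2  # 2-bit counter: 0-1 predict not-taken, 2-3 predict taken
    miss = 0
    for taken in branch_outcomes:
        if (state >= 2) != taken:  # prediction disagrees with outcome
            miss += 1
        # Saturating update toward the actual outcome
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return miss, len(branch_outcomes) + miss * flush_penalty

loop = [True] * 99 + [False]            # loop branch: only the exit mispredicts
alt = [i % 2 == 0 for i in range(100)]  # alternating pattern defeats the counter
print(run(loop))  # (1, 112): one bubble out of 100 branches
print(run(alt))   # (50, 700): half the branches stall the "pipe"
```

Even this crude model shows why filling or eliminating bubbles matters: the alternating pattern spends most of its cycles in flush bubbles rather than doing work.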
AMD claims no bubbles on most predictions due to the increased branch predictor bandwidth. Here I can see parallels to what Arm introduced with the Cortex-A77, where a similarly doubled-up branch predictor bandwidth could run ahead of subsequent pipeline stages and thus fill bubble gaps before they hit the execution stages and potentially stall the core.
> Also, we don't know (yet) the size of the very important uOP cache, which was big update for both Zen 3 and Zen 4. I expect this structure to get a big size increase in Zen 5.

Decode width is unknown too. Odd that it is not mentioned in the highlights. It would be a bit weird if it remained at 4.
> Also, we don't know (yet) the size of the very important uOP cache, which was big update for both Zen 3 and Zen 4. I expect this structure to get a big size increase in Zen 5.

Same. It likely holds more uops and has larger dispatch. Zen 4's uop cache already delivered 9 ops/cycle, and the core is fed from the uop cache most of the time, so it's only fair that a wider core requires more throughput from the uop cache as well. Hopefully it's something like 12 ops/cycle or more to complement the zero-bubble misprediction on conditional branches.
Meanwhile the branch predictor’s op cache has been more significantly improved. The op cache is not only 68% larger than before (now storing 6.75k ops), but it can now spit out up to 9 macro-ops per cycle, up from 6 on Zen 3. So in scenarios where the branch predictor is doing especially well at its job and the micro-op queue can consume additional instructions, it’s possible to get up to 50% more ops out of the op cache. Besides the performance improvement, this has a positive benefit to power efficiency since tapping cached ops requires a lot less power than decoding new ones.
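The percentages in that excerpt can be sanity-checked with quick arithmetic. The 4K-op capacity for Zen 3's op cache is from public reporting; "6.75k" is read as 6.75 * 1024 ops.

```python
# Zen 3's op cache held 4K (4096) macro-ops; Zen 4's "6.75k" is
# 6.75 * 1024 = 6912. Both sizes are from public reporting.
zen3_ops, zen4_ops = 4096, 6912
growth = zen4_ops / zen3_ops - 1
print(f"op cache capacity growth: {growth:.1%}")   # 68.8%, the "68% larger"
print(f"op cache bandwidth growth: {9 / 6 - 1:.0%}")  # 50%: 6 -> 9 ops/cycle
```

Both quoted figures check out: capacity up roughly 68%, peak delivery up 50%.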
> Yep. He had slides that were directly saying that its large Strix Point Halo SKU, albeit - cut down to 24 CUs will compete with 3050 mobile graphics, but he was happy to report that standard 16 CU, without MALL cache will do just that, which will just not happen.

Actually, he is not entirely wrong.
> Decode width is unknown too. Odd it is not mentioned in the highlights. Would be a bit weird if it remained at 4.

I think it's very obvious AMD wanted to leave some very important stuff out, as that presentation was bound to get leaked quickly after they shared it. If we got it now, the competition got it much earlier. I don't think anyone from Intel was taking the 10-15%+ IPC "projection" seriously.
Zen 3’s L1 BTB could track 1024 branch targets and handle them with 1 cycle latency, meaning that the frontend won’t need to stall after a taken branch if the target comes from the L1 BTB. Zen 4’s L1 BTB keeps the same 1 cycle latency, but improves capacity.
Zen 6 has three CCD types.
Zen 5 standard would be N4 with 8 cores per CCD
Zen 5c N3 with 16 cores per CCD
Zen 6 standard N3 with 16 cores per CCD
Zen 6c N2 with 32 cores per CCD