With Zen, AMD added a 4th ALU compared to previous generations, e.g. K10 (3x ALU); I don't count Bulldozer with its 2 units per integer cluster. Look how long it took AMD. They had plenty of time, considering that Intel added its 4th ALU in Haswell (2013). Likewise the 4-way x86 decoder: apart from Bulldozer, K10's was 3-way, while Intel's 4-way decoder arrived with Conroe (Core 2, 2006).

I'm still waiting for the low-ballers to explain how, from Zen 1 to Zen 4, core width didn't grow and the OoO window only increased a small amount (especially relative to others) while delivering ~50% more IPC, and yet with a move to 6-wide plus a whole new front end we know nothing about, except that it's big, shiny and new, we're supposedly getting -5 to +15% with a clock regression...
It's not like there aren't cores that size, with that much more IPC, on the market.
> AMD has got everyone used to consistently good IPC gains.

Those are OK.
> With Zen, AMD added a 4th ALU compared to previous generations, e.g. K10 (3x ALU); I don't count Bulldozer with its 2 units per integer cluster. [...]

Bulldozer would have been pretty amazing with many Zen int/fp units.
Look how long it took them. It was only because Intel ran into problems that AMD, with Zen 1 (2017), was able to catch up on IPC, and even then it couldn't beat Skylake (2015).

Zen 1-4, compared to K10, has a 33% wider decoder and 33% more ALUs.

We are in an era of diminishing returns, and AMD is only now learning to build a wider core.

AMD has got everyone used to consistently good IPC gains. So far the biggest average IPC increases are Zen 3 at +19%, Zen 2 at +15% and Zen 4 at +13%. I'm leaving out Zen+, which is an exception in the Ryzen 2000 generation.

You can see that Zen 4 is everything you can squeeze out of 4 ALUs and that front end while still posting a visible IPC increase.

There is no point counting on AMD delivering a steady gain every generation with everything going to plan.

I'm deliberately not writing here about clock-speed increases, which combined with IPC give the final performance. What I mean is the underlying microarchitecture and the IPC growth that results from it.

And yes, I know I've simplified the topic to ALU count and x86 decoder width, before anyone picks at that.

Fortunately, Zen 5 and Lion Cove are just around the corner. We won't have to wait long for the evaluation and comparison.
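A quick back-of-the-envelope on those generational numbers: compounding the +15/+19/+13% figures shows how they stack up against the "~50% more IPC from Zen 1 to Zen 4" claim. The percentages are the thread's round numbers, not official measurements.

```python
# Compound the per-generation IPC gains cited above (Zen+ excluded).
gains = {"Zen 2": 0.15, "Zen 3": 0.19, "Zen 4": 0.13}

ipc = 1.0  # normalize Zen 1 IPC to 1.0
for gen, g in gains.items():
    ipc *= 1.0 + g
    print(f"{gen}: {ipc:.3f}x Zen 1 IPC")

# Compounds to ~1.55x, i.e. ~55% cumulative, in the same ballpark as
# the ~50% Zen 1 -> Zen 4 figure quoted in this thread.
```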
> Bulldozer would have been pretty amazing with many Zen int/fp units.

No.
> Don't get me wrong, I am happy AMD released Ryzen as is. I do wish a tiny part of the company explored CMT designs a bit more.

They kinda suck.
Sounds to me like the basis of the improved front end of Zen 5 that I keep hearing about.
Ahem, I think we forgot about these very interesting patents, straight from 2022.
Anyone have a guess what the hell this is?
https://www.freepatentsonline.com/y2022/0100519.html 1st one
https://www.freepatentsonline.com/y2022/0100663.html 2nd one
> I think many people picked up on this statement by an excited AMD employee that he would like to wake up and be able to buy Zen 5,

That random unnamed "excited AMD employee" just happens to be corporate fellow and chief architect Mike Clark.
> That random unnamed "excited AMD employee" just happens to be corporate fellow and chief architect Mike Clark.

Exactly.
> That random unnamed "excited AMD employee" just happens to be corporate fellow and chief architect Mike Clark.

The relevant clip.
> No.

Is it impossible to design power-efficient CMT cores?
Construction cores had such a long laundry list of problems.
They kinda suck.
> That random unnamed "excited AMD employee" just happens to be corporate fellow and chief architect Mike Clark.

AMD is run by some rando old chick named Lisa Su, and her rando cousin is the tech sector's version of the North Korean dictator (millions of people want to kiss his feet for giving them 8 GB RTX cards, DLSS 3 and AI).
While Intel is occupied trying to increase its fab utilization with junk CPUs, AMD is pushing ahead on what's possible in CPU design.
> It's like AMD is able to DOUBLE the decode with Zen 5.
> DOUBLE.
> 2-3 pages ago we had the spreadsheet that compares the width of the cores. Do you guys even comprehend what it may mean for this architecture?

Double the decode width?
> Double the decode width?

Maybe not... The way I read the documents was not doubling, but EFFECTIVELY doubling.
That will definitely come with a clock speed regression.
> Apple, which will start to churn out huge amounts of A18 chips soon. There is also Nvidia's new AI chip on N3E as well, costing a gazillion $.

Also Qualcomm and MediaTek. 8G4 is entirely on N3E.
> I think that it is not the IPC increase but the solutions, and a very large milestone for AMD, that make Zen 5 so special.

That's pretty raw cope.
> Zen 5 is the fruit of years of research and teachings gleaned from Zen 1-4. Zen 5 is a gateway to the development of next generations and IPC gains.

That's not how semicon pathfinding even works.
> Is it impossible to design power-efficient CMT cores?

A510/520? Kinda.
> That will definitely come with a clock speed regression.

Decode piles don't really impact clock rates that much.
> The way I read the documents was not doubling, but EFFECTIVELY doubling.

That's not how OoO decode schemes work, not even remotely.
> Question is, would there be a Nirvana CPU without any Zen 5c?

Granite Ridge. Turin. Strix-Halo.
> There is also Nvidia's new AI chip on N3E as well, costing a gazillion $.

B100 isn't N3e.
> It would be difficult for AMD to secure N3e capacity ahead of Apple, which will start to churn out huge amounts of A18 chips soon.

Apple pays pennies, they're phone chips.
> Also Qualcomm and MediaTek. 8G4 is entirely on N3E.

Meme volumes; HPC long ago took over smartphones in TSMC's revenue share.
> Ahem, I think we forgot about these very interesting patents, straight from 2022.
> Anyone have a guess what the hell this is?
Some new patents about radical front-end changes from AMD. Very likely a bit too late for Zen 4 (filed in 2020), but who knows.

This is AMD's attempt to tackle the much-debated x86 decode-width issue, i.e. the claim that x86 cannot increase decode width without a massive power/area penalty.

Instead of one unit decoding many more instructions via multiple fast/slow paths, they are attempting multiple fetch-decode units decoding different branch windows of the instruction stream in parallel. Both pipelines are not always active; the second one engages only when a single pipeline can no longer keep up with the instruction stream. Instructions get decoded in parallel on all pipelines and get reordered before dispatch if needed.

From a high-level functional perspective this sounds very intriguing and scalable.
20220100519 - PROCESSOR WITH MULTIPLE FETCH AND DECODE PIPELINES
Similarly, there are multiple op-cache pipelines as well, fed to a reorder block along with the decode output from above.
20220100663 - PROCESSOR WITH MULTIPLE OP CACHE PIPELINES
The two are complementary patents and make better sense when read together.
If they keep the current decoders and op cache and simply duplicate them, with the addition of the reorder block, front-end throughput would be quite large.
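To make the idea concrete, here is a toy model of the scheme as I read the patents: split the stream into branch windows, hand alternating windows to two decode pipelines, and let a reorder block merge the decoded ops back into program order before dispatch. Everything here (the round-robin steering, the window split) is my simplification for illustration, not AMD's actual design.

```python
def split_windows(stream, taken_branches):
    """Split a linear instruction stream into 'branch windows':
    contiguous runs that end at a predicted-taken branch."""
    windows, cur = [], []
    for insn in stream:
        cur.append(insn)
        if insn in taken_branches:
            windows.append(cur)
            cur = []
    if cur:
        windows.append(cur)
    return windows

def dual_pipeline_decode(stream, taken_branches):
    """Decode alternating branch windows on two pipelines, then merge
    the results back into program order before 'dispatch'."""
    windows = split_windows(stream, taken_branches)
    pipes = [[], []]
    # Toy steering: window i goes to pipeline i % 2. A real core would
    # steer based on branch-predictor output, not round-robin.
    for i, win in enumerate(windows):
        pipes[i % 2].append((i, [f"uop({x})" for x in win]))
    # Reorder block: merge decoded windows by sequence number so
    # dispatch still sees program order.
    merged = sorted(pipes[0] + pipes[1], key=lambda t: t[0])
    return [u for _, uops in merged for u in uops]

insns = ["add", "mul", "jcc1", "ld", "jcc2", "st", "sub"]
print(dual_pipeline_decode(insns, {"jcc1", "jcc2"}))
```

The interesting property: each pipeline only ever sees one modest-sized window at a time, yet the merged output preserves program order, which is what would let total decode throughput scale without one monolithic wide decoder.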
It's like AMD is able to DOUBLE the decode with Zen 5.
DOUBLE.
2-3 pages ago we had the spreadsheet that compares the width of the cores. Do you guys even comprehend what it may mean for this architecture?
I assume if it's dual decode, and each "stream" is max 4-wide, then compilers don't need to care.
If I remember correctly, AMD has had 32 bytes/cycle of fetch but has only really been able to hit ~18 bytes of decoded instructions. So I wonder if L1i stays same-ish with the same fetch, but gets much closer to 32 bytes of instructions decoded.
> 2.9 Instruction Fetch and Decode
>
> The processor fetches instructions from the instruction cache in 32-byte blocks that are 16-byte aligned and contained within a 64-byte aligned block. The processor can perform a 32-byte fetch every cycle.
>
> The fetch unit sends these bytes to the decode unit through a 24-entry Instruction Byte Queue (IBQ), each entry holding 16 instruction bytes. In SMT mode each thread has 12 dedicated IBQ entries. The IBQ acts as a decoupling queue between the fetch/branch-predict unit and the decode unit.
>
> The decode unit scans two of these IBQ entries in a given cycle, decoding a maximum of four instructions.
>
> Only the first decode slot (of four) can decode instructions greater than 10 bytes in length. Avoid having more than one instruction in a sequence of four that is greater than 10 bytes in length.
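The ~18 bytes/cycle observation squares with simple arithmetic: a 4-wide decoder caps throughput at 4 instructions times the average instruction length. A rough sanity check, where the ~4.5-byte average x86-64 instruction length is my assumption, not a figure from the guide:

```python
FETCH_BYTES_PER_CYCLE = 32   # per the optimization guide quoted above
DECODE_WIDTH = 4             # "decoding a maximum of four instructions"
AVG_INSN_BYTES = 4.5         # assumed average x86-64 instruction length

decode_limit = DECODE_WIDTH * AVG_INSN_BYTES
print(f"decode-limited: {decode_limit} bytes/cycle")
print(f"fetch headroom: {FETCH_BYTES_PER_CYCLE - decode_limit} bytes/cycle")
# The 4-wide decoder (~18 B/cycle), not the 32 B fetch, is the
# bottleneck, so wider or dual decode is what would close the gap.
```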
Only Apple's flagship SoCs ship in any real volume, and even then: tiny dies, tiny total silicon.
> The A17 Pro isn't that small... there's a reason that N3 was 15% of TSMC's revenue in their last quarter. Probably in the low 100 range.

A17 Pro is 103 mm².
That was on A0 silicon, which had a clock regression.
> I predict IPC to be in the range of -5% to +40% of Zen 4 myself. Wins thread.

Nah, that's too specific to win the thread.