Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


AMDK11

Senior member
Jul 15, 2019
313
205
116
I'm still waiting for the low-ballers to explain how, from Zen 1 to Zen 4, core width didn't grow and the OoO window only increased a small amount (especially relative to others) while delivering ~50% more IPC, and yet going to 6-wide plus a whole new front end we know nothing about, except that it's big, shiny and new, supposedly gets us -5 to 15% with a clock regression...

It's not like there aren't cores that size with that much more IPC on the market.
With Zen, AMD added a 4th ALU compared to previous generations, e.g. K10 (3x ALU); I'm not counting Bulldozer, with its 2 ALUs per integer cluster. Look how long that took AMD. They had plenty of time, considering Intel went to 4 ALUs with Haswell (2013). Likewise the 4-way x86 decoder: apart from Bulldozer, AMD was 3-way going back to the K10, while Intel added a 4-way decoder with Conroe (Core 2, 2006).

Look how long it took them. Thanks to Intel running into problems, AMD with Zen 1 (2017) was able to catch up on IPC, and even then it couldn't beat Skylake (2015).

Zen 1-4, compared to K10, has a 33% wider decoder and 33% more ALUs.

We are in an era of diminishing returns, and AMD is only now learning how to build a wider core.

AMD has gotten everyone used to consistently good IPC gains. So far the biggest average IPC increases have come from Zen 3 (+19%), Zen 2 (+15%) and Zen 4 (+13%). I'm leaving out Zen+ (the Ryzen 2000 generation), which is an exception.
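A quick back-of-the-envelope check (my own arithmetic, simply compounding the per-generation averages above) shows how those gains stack up against the ~50% Zen 1 to Zen 4 uplift mentioned earlier in the thread:

Code:
# Compounding the quoted average IPC gains, Zen 1 -> Zen 4.
# The percentages are the figures cited above; treated as multiplicative.
gains = {"Zen 2": 0.15, "Zen 3": 0.19, "Zen 4": 0.13}

cumulative = 1.0
for gen, gain in gains.items():
    cumulative *= 1.0 + gain
    print(f"{gen}: x{cumulative:.2f} vs Zen 1")
# Ends at roughly x1.55, i.e. about +55% IPC over Zen 1.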

You can see that Zen 4 is about everything you can squeeze out of 4 ALUs and the existing front end while still getting a visible IPC increase.

There is no point counting on AMD delivering a constant gain every time, with everything going according to plan.

I'm deliberately not writing here about clock speed increases, which combined with IPC give the final performance. What I mean is the underlying microarchitecture and the IPC increase it delivers.

And yes, I know I've simplified the topic down to ALU count and the x86 decoder, before anyone picks on me for it.

Fortunately, Zen 5 and Lion Cove are just around the corner. We won't have to wait too long for the evaluation and comparison.
 

eek2121

Platinum Member
Aug 2, 2005
2,974
4,112
136
With Zen, AMD added a 4th ALU compared to previous generations, e.g. K10 (3x ALU)...
Bulldozer would have been pretty amazing with as many int/FP units as Zen.

Shoot, they could have used the design to better control die area.

Don't get me wrong, I am happy AMD released Ryzen as is. I do wish a tiny part of the company had explored CMT designs a bit more.
 

AMDK11

Senior member
Jul 15, 2019
313
205
116
I think many people latched onto that statement from an excited AMD employee, that he would like to wake up and be able to buy Zen 5, and interpreted it as promising a very large average IPC increase.

I think it's not the IPC increase but the design solutions, and a very big milestone for AMD, that make Zen 5 so special.

Zen 5 is the fruit of years of research and lessons learned from Zen 1-4. It is a gateway to the next generations and their IPC gains.

That's my opinion at the moment. If it turns out better than that, I'll be very excited.
 
Last edited:

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,668
14,676
136

FlameTail

Platinum Member
Dec 15, 2021
2,634
1,461
106
It's like AMD is able to DOUBLE the decode with Zen 5.

DOUBLE.

2-3 pages ago we had the spreadsheet that compares the width of the cores. Do you guys even comprehend what it may mean for this architecture?
Double the decode width?

That will definitely come with a clock speed regression.
 
Reactions: S'renne

trivik12

Senior member
Jan 26, 2006
279
248
116

Some of the replies to the tweet seem to correct it: only Zen 5c will be on N3E, and otherwise it's on N4. The question is whether there would be a Nirvana CPU without any Zen 5c. It would be difficult for AMD to secure N3E capacity ahead of Apple, which will start churning out huge volumes of A18 chips soon. There is also Nvidia's new AI chip on N3E, costing a gazillion dollars.
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,813
4,121
96
I think it's not the IPC increase but the design solutions, and a very big milestone for AMD, that make Zen 5 so special.
That's pretty raw cope.
Solutions and very large milestones either translate into large IPC increases or you're ARM Austin/Sofia. And you suck.
Zen 5 is the fruit of years of research and lessons learned from Zen 1-4. It is a gateway to the next generations and their IPC gains.
that's not how semicon pathfinding even works.
the new bits of Zen5 have like, 0 relation to previous Zens, that's the point.
They're not learnings, but novel crackpot concepts of doom.
Is it impossible to design power efficient CMT cores?
A510/520? Kinda.
But those are very basic, gutted designs, nothing akin to mainline Zen.
That will definitely come with a clock speed regression.
Decode piles don't really impact clock rates that much.
Now, 50% more dcache at the same latency hurts. Quite a bit.
The way I read the documents was not doubling, but EFFECTIVELY doubling.
That's not how OoO decode schemes work, not even remotely.
Decode isn't really the limiting factor here anyway.
The question is whether there would be a Nirvana CPU without any Zen 5c.
Granite Ridge. Turin. Strix-Halo.
There is also Nvidia's new AI chip on N3E, costing a gazillion dollars.
B100 isn't N3e.
It would be difficult for AMD to secure N3E capacity ahead of Apple, which will start churning out huge volumes of A18 chips soon.
Apple pays pennies, they're phone chips.
Cloud swingies pay real $$$ for Turin-D (which in itself is the first N3e product period).
Also Qualcomm and Mediatek. 8G4 is entirely on N3E.
Meme volumes; HPC long ago took over smartphones in TSM rev share.
Only aapl flagship SoCs move in any real volume, and even then, tiny dies, tiny total Si counts.
 
Last edited:
Reactions: exquisitechar

DisEnchantment

Golden Member
Mar 3, 2017
1,626
5,909
136

Ahem, I think we forgot about these very interesting patents, straight from 2022.

Anyone have a guess what the hell this is?

We discussed these multiple times in the Zen 4 and Zen 5 threads.

Some new patents from AMD about radical front-end changes.
Filed in 2020, so very likely a bit too late for Zen 4, but who knows.

This is AMD's attempt to tackle the much-debated x86 decode-width issue, i.e. the claim that x86 cannot increase its decode width without a massive power/area penalty.

Instead of one unit decoding many more instructions than they do today via multiple fast/slow paths, they are attempting multiple fetch-decode units decoding different branch windows of the instruction stream in parallel. Both pipelines are not always active; the second kicks in only when one pipeline can no longer keep up with the instruction stream.
Instructions get decoded in parallel on all pipelines and get reordered before dispatch if needed.

From a high-level functional perspective this sounds very intriguing and scalable.
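To make the idea concrete, here is a toy model (my own simplification in Python, not the patent's actual mechanism, assuming 4-wide decode per pipeline): the stream is split into branch windows, two pipelines decode different windows in the same cycle, and a reorder stage dispatches strictly in program order.

Code:
from collections import deque

DECODE_WIDTH = 4   # assumed per-pipeline decode width
NUM_PIPES = 2      # two fetch/decode pipelines, as in the patent figures

def split_windows(stream):
    """Split the instruction stream at taken branches into branch windows."""
    windows, cur = [], []
    for inst in stream:
        cur.append(inst)
        if inst.endswith(".taken"):
            windows.append(cur)
            cur = []
    if cur:
        windows.append(cur)
    return windows

def simulate(stream):
    windows = split_windows(stream)
    remaining = [deque(w) for w in windows]   # undecoded part of each window
    decoded = [deque() for _ in windows]      # decoded but not yet dispatched
    dispatched, cycle = [], 0
    while len(dispatched) < len(stream):
        cycle += 1
        # Decode: each pipe works on one of the oldest windows with work left.
        for i in [k for k, r in enumerate(remaining) if r][:NUM_PIPES]:
            for _ in range(min(DECODE_WIDTH, len(remaining[i]))):
                decoded[i].append(remaining[i].popleft())
        # Reorder block: dispatch strictly in program order across windows.
        for i in range(len(windows)):
            if remaining[i] or decoded[i]:
                dispatched.extend(decoded[i])   # oldest incomplete window only
                decoded[i].clear()
                break
    return cycle, dispatched

stream = ["add", "mul", "jnz.taken", "ld", "st", "sub", "jmp.taken", "or", "xor", "ret"]
print(simulate(stream))   # the two leading windows get decoded in parallel in cycle 1

In this toy the second pipeline is always on; per the patent it would only wake up when one pipeline can no longer keep up, and be gated off otherwise.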

20220100519 - PROCESSOR WITH MULTIPLE FETCH AND DECODE PIPELINES


Similarly, there are multiple uop cache pipelines as well, feeding a reorder block along with the decode output from above.

20220100663 - PROCESSOR WITH MULTIPLE OP CACHE PIPELINES


The two above are complementary patents and make better sense when read together.
If they keep the current decoders and uop cache and simply double them, with the addition of the reorder block, the front-end throughput would be quite large.

It bears a resemblance to Tremont-style dual decode clusters, but that's about it.
The AMD patent suggests something more comprehensive than Tremont (whose design aims at power/area efficiency rather than performance).

Some of patent 20220100519's main differences vs Tremont are:
  • Very flexible load balancing across the multiple fetch-decode pipelines instead of a round-robin policy (see the sketch below)
  • The decode units could be shared across the pipelines
  • The pipelines could be turned on or off dynamically for power saving
At a high level it looks similar, two or more pipelines concurrently working on different instruction windows but not on the same one. (Except that Tremont's dual decode clusters effectively turn into one once the taken branches start diverging too much.)
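Purely as an illustration of that first point (the policies and numbers are mine, not from the patent): round-robin alternates pipelines per branch window, so uneven window lengths pile work onto one pipeline, while flexible load balancing steers each window to whichever pipeline is least loaded.

Code:
# Hypothetical comparison of window-to-pipeline assignment policies.
def round_robin(window_sizes, pipes=2):
    load = [0] * pipes
    for i, size in enumerate(window_sizes):
        load[i % pipes] += size          # alternate pipelines blindly
    return max(load)                     # finish time ~ busiest pipeline

def least_loaded(window_sizes, pipes=2):
    load = [0] * pipes
    for size in window_sizes:
        load[load.index(min(load))] += size   # steer to the least-loaded pipeline
    return max(load)

windows = [12, 2, 12, 2, 12, 2]   # uneven branch-window lengths, in ops
print(round_robin(windows))       # 36: the long windows all land on pipeline 0
print(least_loaded(windows))      # 26: work spread far more evenly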

Patent 20220100663 adds the multiple op cache paths to the mix, by allowing the uop cache for the corresponding fetch and decode pipe to feed dispatch (which Tremont does not have at all).

It's like AMD is able to DOUBLE the decode with Zen 5.

DOUBLE.

2-3 pages ago we had the spreadsheet that compares the width of the cores. Do you guys even comprehend what it may mean for this architecture?

But within a single instruction stream, Zen 5's decode width would still be 4-wide even if this patent were implemented (as can be seen in the gcc patches so far).

I assume if it's dual decode, and each "stream" is at most 4 wide, then compilers don't need to care.

If I remember correctly, AMD has had 32 bytes per cycle of fetch but has only really been able to hit ~18 bytes of decoded instructions. So I wonder if L1i stays roughly the same with the same fetch, but gets much closer to 32 bytes of decoded instructions.

Interesting thought, because this is what is in the optimization guide:

2.9 Instruction Fetch and Decode
The processor fetches instructions from the instruction cache in 32-byte blocks that are 16-byte aligned and contained within a 64-byte aligned block. The processor can perform a 32-byte fetch every cycle.
The fetch unit sends these bytes to the decode unit through a 24 entry Instruction Byte Queue (IBQ), each entry holding 16 instruction bytes. In SMT mode each thread has 12 dedicated IBQ entries. The IBQ acts as a decoupling queue between the fetch/branch-predict unit and the decode unit.
The decode unit scans two of these IBQ entries in a given cycle, decoding a maximum of four instructions.
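Putting rough numbers on that ~18-byte figure (average instruction lengths below are my own assumption; typical integer code lands around 4-4.5 bytes per instruction):

Code:
# Why ~18 decoded bytes/cycle is about what a 4-wide decoder delivers,
# even with a 32-byte/cycle fetch. Average lengths here are assumptions.
FETCH_BYTES_PER_CYCLE = 32
DECODE_WIDTH = 4

for avg_len in (4.0, 4.5, 5.0):
    decoded = DECODE_WIDTH * avg_len
    print(f"{avg_len} B/inst -> {decoded:.0f} decoded B/cycle "
          f"({decoded / FETCH_BYTES_PER_CYCLE:.0%} of fetch bandwidth)")
# Roughly 16-20 decoded bytes/cycle, so the 4-wide decoder, not the
# 32-byte fetch, is the bottleneck.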


It seems AMD believes 4-wide decode plus the uop cache in each instruction stream is enough. The lower decode throughput could be because only the first decode slot can really decode the >10-byte instructions, or because of unaligned fetches:

Only the first decode slot (of four) can decode instructions greater than 10 bytes in length. Avoid having more than one instruction in a sequence of four that is greater than 10 bytes in length.
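A tiny model of that restriction (my own sketch, not from the guide): up to four instructions decode per cycle, but an instruction longer than 10 bytes can only occupy the first slot, so a decode group ends early whenever a long instruction would land in slots 1-3.

Code:
def decode_cycles(lengths, width=4, long_limit=10):
    """Count decode cycles for a list of instruction lengths in bytes."""
    i, cycles = 0, 0
    while i < len(lengths):
        cycles += 1
        slots = 0
        while i < len(lengths) and slots < width:
            if lengths[i] > long_limit and slots != 0:
                break              # long instruction must start the next group
            slots += 1
            i += 1
    return cycles

print(decode_cycles([4, 4, 4, 4, 4, 4, 4, 4]))      # 2: all short, full 4-wide groups
print(decode_cycles([11, 4, 11, 4, 11, 4, 11, 4]))  # 4: every group gets cut short

Which is exactly why the guide says to avoid more than one >10-byte instruction in a sequence of four.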

I'm wondering how the second fetch logic would impact L1i throughput/latency. I think this is what they meant by the 2-basic-block fetch.

 
Last edited:

jpiniero

Lifer
Oct 1, 2010
14,739
5,368
136
Apple pays pennies, they're phone chips.

The A17 Pro isn't that small... there's a reason that N3 was 15% of TSMC's revenue in their last quarter. Probably in the low-100 mm² range.

I am wondering how big the Zen 5 die is going to be and whether that's going to be a problem cost-wise.

Meme volumes; HPC long ago took over smartphones in TSM rev share.
Only aapl flagship SoCs move in any real volume, and even then, tiny dies, tiny total Si counts.

Says it's 43-43 (HPC vs smartphone) in the last quarter. Granted, smartphone is probably a lot of the 16 nm (8%) and maybe the 28 nm (7%).
 

FlameTail

Platinum Member
Dec 15, 2021
2,634
1,461
106
The A17 Pro isn't that small... there's a reason that N3 was 15% of TSMC's revenue in their last quarter. Probably in the low-100 mm² range.
A17 Pro is 103 mm².
And they shipped at least 54 million units of it last year:

And then Apple also sells the M3, M3 Pro, and M3 Max chips.

It's not without reason that Apple is TSMC's #1 customer by revenue.
 