With Zen, AMD added a 4th ALU compared to previous generations, e.g. K10 (3x ALU); I don't count Bulldozer with its 2 units per integer cluster. Look how long it took AMD. They had plenty of time, considering that Intel added its 4th ALU in Haswell (2013). Likewise the 4-way x86 decoder: apart from Bulldozer, K10's was 3-way, while Intel's 4-way decoder arrived with Conroe (Core 2, 2006).

I'm still waiting for the low-ballers to explain how, from Zen 1 to Zen 4, core width didn't grow and the OoO window only increased a small amount (especially relative to others) while delivering ~50% more IPC, and yet with a move to 6-wide plus a whole new front end we know nothing about, except that it's big, shiny and new, we're supposedly getting -5 to +15% with a clock regression...
It's not like there aren't cores that size, with that much more IPC, on the market.
> AMD has got everyone used to consistently good IPC gains.

Those are OK.
> With Zen, AMD added a 4th ALU compared to previous generations, e.g. K10 (3x ALU); I don't count Bulldozer with its 2 units per integer cluster. [...]

Bulldozer would have been pretty amazing with many Zen int/fp units.
Look how long it took them. It was only because Intel ran into problems that AMD, with Zen 1 (2017), was able to catch up on IPC, and even then it couldn't beat Skylake (2015).

Zen 1-4, compared to K10, has a 33% wider decoder and 33% more ALUs.

We are in an era of diminishing returns, and AMD is only now learning to build a wider core.

AMD has got everyone used to consistently good IPC gains. So far the biggest average IPC increases are Zen 3 at +19%, Zen 2 at +15% and Zen 4 at +13%. I'm leaving out Zen+, which is an exception in the Ryzen 2000 generation.

You can see that Zen 4 is everything you can squeeze out of 4 ALUs and that front end while still posting a visible IPC increase.

There is no point counting on AMD delivering a steady gain every generation with everything going to plan.

I'm deliberately not writing here about clock-speed increases, which combined with IPC give the final performance. What I mean is the underlying microarchitecture and the IPC growth that results from it.

And yes, I know I've simplified the topic to ALU count and x86 decoder width, before anyone picks at that.

Fortunately, Zen 5 and Lion Cove are just around the corner. We won't have to wait long for the evaluation and comparison.
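A quick back-of-the-envelope on those generational numbers: compounding the +15/+19/+13% figures shows how they stack up against the "~50% more IPC from Zen 1 to Zen 4" claim. The percentages are the thread's round numbers, not official measurements.

```python
# Compound the per-generation IPC gains cited above (Zen+ excluded).
gains = {"Zen 2": 0.15, "Zen 3": 0.19, "Zen 4": 0.13}

ipc = 1.0  # normalize Zen 1 IPC to 1.0
for gen, g in gains.items():
    ipc *= 1.0 + g
    print(f"{gen}: {ipc:.3f}x Zen 1 IPC")

# Compounds to ~1.55x, i.e. ~55% cumulative, in the same ballpark as
# the ~50% Zen 1 -> Zen 4 figure quoted in this thread.
```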
> Bulldozer would have been pretty amazing with many Zen int/fp units.

No.
> Don't get me wrong, I am happy AMD released Ryzen as is. I do wish a tiny part of the company explored CMT designs a bit more.

They kinda suck.
Sounds to me like the basis of the improved front end of Zen 5 that I keep hearing about.
Ahem, I think we forgot about these very interesting patents, straight from 2022.
Anyone have a guess what the hell this is?
https://www.freepatentsonline.com/y2022/0100519.html 1st one
https://www.freepatentsonline.com/y2022/0100663.html 2nd one
> I think many people picked up on this statement by an excited AMD employee that he would like to wake up and be able to buy Zen 5,

That random unnamed "excited AMD employee" just happens to be corporate fellow and chief architect Mike Clark.
> That random unnamed "excited AMD employee" just happens to be corporate fellow and chief architect Mike Clark.

Exactly.
> That random unnamed "excited AMD employee" just happens to be corporate fellow and chief architect Mike Clark.

The relevant clip.
> No.

Is it impossible to design power-efficient CMT cores?
Construction cores had such a long laundry list of problems.
They kinda suck.
> That random unnamed "excited AMD employee" just happens to be corporate fellow and chief architect Mike Clark.

AMD is run by some rando old chick named Lisa Su, and her rando cousin is the tech sector's version of the North Korean dictator (millions of people want to kiss his feet for giving them 8 GB RTX cards, DLSS 3 and AI).
While Intel is occupied trying to increase its fab utilization with junk CPUs, AMD is pushing ahead on what's possible in CPU design.
> It's like AMD is able to DOUBLE the decode with Zen 5.
> DOUBLE.
> 2-3 pages ago we had the spreadsheet that compares the width of the cores. Do you guys even comprehend what it may mean for this architecture?

Double the decode width?
> Double the decode width?

Maybe not... The way I read the documents was not doubling, but EFFECTIVELY doubling.
That will definitely come with a clock speed regression.
> Apple, which will start to churn out huge amounts of A18 chips soon. There is also Nvidia's new AI chip on N3E as well, costing a gazillion $.

Also Qualcomm and MediaTek. 8G4 is entirely on N3E.
> I think that it is not the IPC increase but the solutions, and a very large milestone for AMD, that make Zen 5 so special.

That's pretty raw cope.
> Zen 5 is the fruit of years of research and teachings gleaned from Zen 1-4. Zen 5 is a gateway to the development of next generations and IPC gains.

That's not how semicon pathfinding even works.
> Is it impossible to design power-efficient CMT cores?

A510/520? Kinda.
> That will definitely come with a clock speed regression.

Decode piles don't really impact clock rates that much.
> The way I read the documents was not doubling, but EFFECTIVELY doubling.

That's not how OoO decode schemes work, not even remotely.
> Question is, would there be a Nirvana CPU without any Zen 5c?

Granite Ridge. Turin. Strix-Halo.
> There is also Nvidia's new AI chip on N3E as well, costing a gazillion $.

B100 isn't N3e.
> It would be difficult for AMD to secure N3e capacity ahead of Apple, which will start to churn out huge amounts of A18 chips soon.

Apple pays pennies, they're phone chips.
> Also Qualcomm and MediaTek. 8G4 is entirely on N3E.

Meme volumes; HPC long ago took over smartphones in TSMC's revenue share.
> Ahem, I think we forgot about these very interesting patents, straight from 2022.
> Anyone have a guess what the hell this is?
Some new patents about radical front-end changes from AMD. Very likely a bit too late for Zen 4 (filed in 2020), but who knows.

This is AMD's attempt to tackle the much-debated x86 decode-width issue, i.e. the claim that x86 cannot increase decode width without a massive power/area penalty.

Instead of one unit decoding many more instructions via multiple fast/slow paths, they are attempting multiple fetch-decode units decoding different branch windows of the instruction stream in parallel. Both pipelines are not always active; the second one engages only when a single pipeline can no longer keep up with the instruction stream. Instructions get decoded in parallel on all pipelines and get reordered before dispatch if needed.

From a high-level functional perspective this sounds very intriguing and scalable.
20220100519 - PROCESSOR WITH MULTIPLE FETCH AND DECODE PIPELINES
Similarly, there are multiple op-cache pipelines as well, fed to a reorder block along with the decode output from above.
20220100663 - PROCESSOR WITH MULTIPLE OP CACHE PIPELINES
The two are complementary patents and make better sense when read together.
If they keep the current decoders and op cache and simply duplicate them, with the addition of the reorder block, front-end throughput would be quite large.
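To make the idea concrete, here is a toy model of the scheme as I read the patents: split the stream into branch windows, hand alternating windows to two decode pipelines, and let a reorder block merge the decoded ops back into program order before dispatch. Everything here (the round-robin steering, the window split) is my simplification for illustration, not AMD's actual design.

```python
def split_windows(stream, taken_branches):
    """Split a linear instruction stream into 'branch windows':
    contiguous runs that end at a predicted-taken branch."""
    windows, cur = [], []
    for insn in stream:
        cur.append(insn)
        if insn in taken_branches:
            windows.append(cur)
            cur = []
    if cur:
        windows.append(cur)
    return windows

def dual_pipeline_decode(stream, taken_branches):
    """Decode alternating branch windows on two pipelines, then merge
    the results back into program order before 'dispatch'."""
    windows = split_windows(stream, taken_branches)
    pipes = [[], []]
    # Toy steering: window i goes to pipeline i % 2. A real core would
    # steer based on branch-predictor output, not round-robin.
    for i, win in enumerate(windows):
        pipes[i % 2].append((i, [f"uop({x})" for x in win]))
    # Reorder block: merge decoded windows by sequence number so
    # dispatch still sees program order.
    merged = sorted(pipes[0] + pipes[1], key=lambda t: t[0])
    return [u for _, uops in merged for u in uops]

insns = ["add", "mul", "jcc1", "ld", "jcc2", "st", "sub"]
print(dual_pipeline_decode(insns, {"jcc1", "jcc2"}))
```

The interesting property: each pipeline only ever sees one modest-sized window at a time, yet the merged output preserves program order, which is what would let total decode throughput scale without one monolithic wide decoder.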
It's like AMD is able to DOUBLE the decode with Zen 5.
DOUBLE.
2-3 pages ago we had the spreadsheet that compares the width of the cores. Do you guys even comprehend what it may mean for this architecture?
I assume if it's dual decode, and each "stream" is max 4-wide, then compilers don't need to care.
If I remember correctly, AMD has had 32 bytes/cycle of fetch but has only really been able to hit ~18 bytes of decoded instructions. So I wonder if L1i stays same-ish with the same fetch, but gets much closer to 32 bytes of instructions decoded.
> 2.9 Instruction Fetch and Decode
>
> The processor fetches instructions from the instruction cache in 32-byte blocks that are 16-byte aligned and contained within a 64-byte aligned block. The processor can perform a 32-byte fetch every cycle.
>
> The fetch unit sends these bytes to the decode unit through a 24-entry Instruction Byte Queue (IBQ), each entry holding 16 instruction bytes. In SMT mode each thread has 12 dedicated IBQ entries. The IBQ acts as a decoupling queue between the fetch/branch-predict unit and the decode unit.
>
> The decode unit scans two of these IBQ entries in a given cycle, decoding a maximum of four instructions.
>
> Only the first decode slot (of four) can decode instructions greater than 10 bytes in length. Avoid having more than one instruction in a sequence of four that is greater than 10 bytes in length.
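The ~18 bytes/cycle observation squares with simple arithmetic: a 4-wide decoder caps throughput at 4 instructions times the average instruction length. A rough sanity check, where the ~4.5-byte average x86-64 instruction length is my assumption, not a figure from the guide:

```python
FETCH_BYTES_PER_CYCLE = 32   # per the optimization guide quoted above
DECODE_WIDTH = 4             # "decoding a maximum of four instructions"
AVG_INSN_BYTES = 4.5         # assumed average x86-64 instruction length

decode_limit = DECODE_WIDTH * AVG_INSN_BYTES
print(f"decode-limited: {decode_limit} bytes/cycle")
print(f"fetch headroom: {FETCH_BYTES_PER_CYCLE - decode_limit} bytes/cycle")
# The 4-wide decoder (~18 B/cycle), not the 32 B fetch, is the
# bottleneck, so wider or dual decode is what would close the gap.
```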
Only Apple's flagship SoCs ship in any real volume, and even then: tiny dies, tiny total silicon.
> The A17 Pro isn't that small... there's a reason that N3 was 15% of TSMC's revenue in their last quarter. Probably in the low 100 range.

A17 Pro is 103 mm².
That was on A0 silicon, which had a clock regression.
> I predict IPC to be in the range of -5% to +40% of Zen 4 myself. Wins thread.

Nah, that's too specific to win the thread.