Discussion Intel current and future Lakes & Rapids thread

Doug S · Aug 22, 2021

jpiniero said:
I would expect it to be more Cloud; either selling directly to cloud companies exclusively or maybe their own service. Or both.

Like I said I could see them using it internally for that. I even strongly predicted that a few years ago, though GW III leaving makes me feel that's much less likely, unless he wanted to build something that was 100% server focused rather than taking something designed to scale from the Macbook Pro and Mac Pro and using the same dies for cloud.

ondma · Aug 22, 2021

A/// said:
Yes, spotted. Until a retail sample gets passed around like a candy tray and we see actual benches, I don't care enough to speculate. If it's faster than a 5900X but uses 150-200 watts more, then it's DOA to me. While you don't have to worry about power consumption on a desktop, there is a limit to how silly you can get with power draw. I'm not trying to outbattle my dang central air.

Seriously, 200 watts more than 5900x? That is absurd.

itsmydamnation · Aug 22, 2021

AMDK11 said:
I dare say that GoldenCove is the biggest update since Conroe (Core2). Conroe (2006) introduced a 4-way x86 decoder, so from Conroe (Core2) to SunnyCove (2021) the x86 decoder was 4-way. I would like to add that from the time of Bulldozer (x86 CMT core (called module)) to Zen-Zen3 the x86 decoder is also 4 way. For the first time in 15 years in x86 history, Intel extended the x86 decoder by 50% to 6-way. This is really a very big x86 core upgrade. Additionally, Intel for the first time since P6 (Pentium Pro) to SunnyCove is switching from L1-I cache fetching 16 Bytes per clock cycle to fetching 32 Bytes in GoldenCove. GoldenCove is the largest redevelopment and expansion since Conroe (Core2). SandyBridge wasn't that revolutionary.

You're overselling Intel's decode. It has been 4+1+1+1 uops with 16 bytes from L1i, where heaps of basic instructions can't be handled by the simple units because they decode into 2 uops, compared to Zen, which is 2+2+2+2 plus a microcode engine with 32 bytes from L1i. Now Intel is going to ????? in terms of decode, but there are 6 of them, and the uop cache is being made much closer to Zen 2/3. So given Intel/Golden Cove can now deliver 8 uops per cycle from the uop cache just like Zen, I'm going to say their front ends are probably pretty similar in terms of uop throughput per cycle.

Intel is probably winning in scalar IPC from more L/S, 2x the ROB, and now more execution ports and register read/write ports.

Hulk · Aug 22, 2021

AMDK11 said:
I dare say that GoldenCove is the biggest update since Conroe (Core2). Conroe (2006) introduced a 4-way x86 decoder, so from Conroe (Core2) to SunnyCove (2021) the x86 decoder was 4-way. I would like to add that from the time of Bulldozer (x86 CMT core (called module)) to Zen-Zen3 the x86 decoder is also 4 way. For the first time in 15 years in x86 history, Intel extended the x86 decoder by 50% to 6-way. This is really a very big x86 core upgrade. Additionally, Intel for the first time since P6 (Pentium Pro) to SunnyCove is switching from L1-I cache fetching 16 Bytes per clock cycle to fetching 32 Bytes in GoldenCove. GoldenCove is the largest redevelopment and expansion since Conroe (Core2). SandyBridge wasn't that revolutionary.

I wish we knew a little more about this 6-way decode in Golden Cove. 3 simple plus 1 complex to 6... something? Why provide all of these details and then not detail the decoder? I guess they spent a lot of resources finding the optimum configuration and want to protect that intellectual property a bit longer?

JoeRambo · Aug 22, 2021

Hulk said:
I wish we knew a little more about this 6-way decode in Golden Cove. 3 simple plus 1 complex to 6... something? Why provide all of these details and then not detail the decoder? I guess they spent a lot of resources finding the optimum configuration and want to protect that intellectual property a bit longer?

I would not be surprised if they have lifted the 3+3 decoders from Atom and made all 6 of them work all the time with extra logic (and much extra power burned).

Btw, what I found interesting in the Atom core discussion is that it has "On Demand" instruction length decoders in each cluster, which are probably independently fed 16 bytes from a predecoded length buffer, with the 2nd one working on 16 bytes from a branch if there is any.
One can imagine such clusters working with a unified, larger predecoded length buffer; the job of the decoders is the same.

Mopetar · Aug 22, 2021

I doubt those secrets matter much. No one else is going to tear out what they've been working on for their next product to try to replace it with any supposed Intel secret sauce. Lead times are far too long on chips for a few months to matter in the grand scheme of things.

The more likely reason is that they can release the detailed information later and get another round of coverage from doing so.

Thunder 57 · Aug 22, 2021

AMDK11 said:
With this, I agree that SandyBridge is revolutionary in these respects.

I was talking about pipelines and I consider the transition from the 4-way x86 to the 6-way decoder to be a revolution. It's not just adding resources. All the control logic and the algorithms contained in it have been completely redesigned or replaced with a new, more extensive one. You have a heavily rebuilt and expanded Front-End with a completely new predictor and preselector with new mechanisms. There have been big changes to the rest of the x86 core as well, but so far Intel hasn't revealed everything yet. The fact that GoldenCove basically uses similar mechanisms as SandyBridge does not mean that it is the same microarchitecture. With each generation, new mechanisms and algorithms are added to the x86 core to increase the IPC. In my opinion SunnyCove is the biggest change since SandyBridge and GoldenCove is the biggest change since Conroe.

That's my opinion but you don't have to agree with it

I think the uOP cache alone was a massive upgrade in Sandy Bridge. AMD was at a serious disadvantage there. I think they tried some half assed solution around Steamroller but it wasn't really a thing until Zen.

Don't get me wrong, I want to see what ADL can do; it looks rather potent. It probably will be the biggest upgrade since Sandy Bridge. Sandy Bridge was lauded, though, and rightfully so. That became even more obvious after the Bulldozer disaster.

Hulk said:
I wish we knew a little more about this 6-way decode in Golden Cove. 3 simple plus 1 complex to 6... something? Why provide all of these details and then not detail the decoder? I guess they spent a lot of resources finding the optimum configuration and want to protect that intellectual property a bit longer?

6... something? I found that one funny, I guess because it's true. From what I understand the complex decoder is rarely used, so maybe 5 simple + 1 complex? Or maybe they went another way entirely?

itsmydamnation · Aug 22, 2021

Thunder 57 said:
I think the uOP cache alone was a massive upgrade in Sandy Bridge. AMD was at a serious disadvantage there. I think they tried some half assed solution around Steamroller but it wasn't really a thing until Zen.

They implemented a loop buffer, intel also had a loop buffer in Nehalem.

Thunder 57 said:
6... something? I found that one funny, I guess because it's true. From what I understand the complex decoder is rarely used, so maybe 5 simple + 1 complex? Or maybe they went another way entirely?

Anything that becomes more than 1 uop can't use the simple decoders, but even then there are more complex rules.

Real World Technologies - Forums - Thread: Alder Lake: 1st Intel/AMD CPU with 6 instruction decoders

www.realworldtech.com

Of course, the presentation doesn't talk much about the exact limits of decoding. They've often been much more complex than the ostensible "16B and 4-1-1-1", where the rules for what counts as "simple" can be quite surprising.

(For example, r-m-w instructions are exactly as simple as load-store instructions from a pure decoding angle, but count as complex because depending on the microarchitectural details they might end up being two uops).
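To make the "4-1-1-1" constraint concrete, here is a minimal toy model in Python. It is only a sketch under loud assumptions: instructions decode strictly in program order, slot 0 is the only decoder that may emit more than one uop, and the fetch-byte limit, fusion, and the microcode engine are all ignored. The function name `decode_cycles` and the 6-wide layout `(4, 1, 1, 1, 1, 1)` are hypothetical; Intel has not disclosed Golden Cove's actual decoder rules.

```python
def decode_cycles(uop_counts, decoder_caps=(4, 1, 1, 1)):
    """Count cycles needed to decode instructions in program order.

    uop_counts:   number of uops each instruction decodes into.
    decoder_caps: max uops each decoder slot can emit per cycle
                  (slot 0 plays the role of the "complex" decoder).
    """
    # Anything wider than the complex decoder would need microcode (not modeled).
    if any(n > decoder_caps[0] for n in uop_counts):
        raise ValueError("instruction needs microcode; not modeled here")
    cycles, i = 0, 0
    while i < len(uop_counts):
        cycles += 1
        for cap in decoder_caps:
            # Stop filling this cycle once the next instruction does not fit the slot;
            # it then waits for slot 0 of the next cycle.
            if i == len(uop_counts) or uop_counts[i] > cap:
                break
            i += 1
    return cycles


stream = [1, 2, 1, 2, 1, 1]  # made-up mix of 1-uop and 2-uop instructions
print(decode_cycles(stream))                      # legacy-style 4-1-1-1
print(decode_cycles(stream, (4, 1, 1, 1, 1, 1)))  # hypothetical 6-wide layout
```

In this toy model a stream sprinkled with 2-uop instructions gains little from extra simple decoders, which is the point being argued above about what counts as "simple".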

itsmydamnation · Aug 22, 2021

Thunder 57 said:
I think the uOP cache alone was a massive upgrade in Sandy Bridge. AMD was at a serious disadvantage there. I think they tried some half assed solution around Steamroller but it wasn't really a thing until Zen.

Don't get me wrong, I want to see what ADL can do; it looks rather potent. It probably will be the biggest upgrade since Sandy Bridge. Sandy Bridge was lauded, though, and rightfully so. That became even more obvious after the Bulldozer disaster.

6... something? I found that one funny, I guess because it's true. From what I understand the complex decoder is rarely used, so maybe 5 simple + 1 complex? Or maybe they went another way entirely?

Real World Technologies - Forums - Thread: Alder Lake: 1st Intel/AMD CPU with 6 instruction decoders

www.realworldtech.com

Of course, the presentation doesn't talk much about the exact limits of decoding. They've often been much more complex than the ostensible "16B and 4-1-1-1", where the rules for what counts as "simple" can be quite surprising.

(For example, r-m-w instructions are exactly as simple as load-store instructions from a pure decoding angle, but count as complex because depending on the microarchitectural details they might end up being two uops).

from the rules in CPU

New rules in CPU forum, and reminder of current rules: Rules Updated 5-4-20

OK, I will keep this short and sweet. You all know the rules, or better learn to read them. Here are cliff notes: No baiting, flaming, trolling, thread crapping, cussing or insulting. So if you have an opinion you can state without insulting people, do so, stating it as an opinion. If you have...

forums.anandtech.com

"
Updated: 5-4-20
"your post must contain your own original written content and cannot consist solely of another's content, be it a quote, image, or link" "

esquared
Anandtech Forum Director

Thunder 57 · Aug 22, 2021

itsmydamnation said:
They implemented a loop buffer, intel also had a loop buffer in Nehalem.

Anything that becomes more than 1 uop can't use the simple decoders, but even then there are more complex rules.

Real World Technologies - Forums - Thread: Alder Lake: 1st Intel/AMD CPU with 6 instruction decoders

www.realworldtech.com

That was in Steamroller though, right? When they also added another decoder? I'm surprised that didn't improve performance more. It seems like Excavator was a bigger gain, though I have never seen a "deep dive" on Excavator. I guess AMD was more interested in promoting Zen by that point.

IntelUser2000 · Aug 22, 2021

diediealldie said:
These are good points. I'm actually quite curious how Atom team put everything into such a small area. Some parts of cores are even bigger than Golden Cove (L1 cache and issue ports) We already know that the Willow cove(which is almost the same as Sunny cove) core uses more than 10mm2 of die space. 2015 Skylake used 8mm2 of die space, it'll be 3~4mm2 with 10ESF assuming 50~70% of shrink. But Gracemont uses 1.5~2mm2. Surprising.

Caches at that small a capacity take up minimal space. Part of the reason it has more issue ports is that on the Cove cores each port is more multi-function, while on the Atom-based chips the ports are dedicated.

10nm Skylake should be close to 4.5mm2.

There's also a significant die and performance per clock penalty in reaching insane clocks. Goldmont and after chips have a 13 stage pipeline. And Cove cores are at 18 with uop cache miss and around 14 with a hit. Golden Cove has 1 more. So you are not only adding a bit of extra logic, it reduces performance per clock by a noticeable amount.

Another reason for the larger area is simply the spacing required. The GPUs and Atom cores shrunk by 2.5x+ on 14nm, and probably a bit over 2x on 10nm, but the main cores always ended up in the 2x range. The spacing, I speculate, reduces hotspots, which I suspect reduce the clock headroom at such ridiculous frequencies.

Of course if you want the absolute max performance and you can have it use 200-plus watts for a desktop chip, that's the way to go for higher performance.

Maybe for Desktops they'll continue to use super high clock, high power consumption chips but for mobile it'll be dominated by -mont successors that perform only slightly below Cove in perf/clock.

AMDK11 said:
With this, I agree that SandyBridge is revolutionary in these respects.

Considering Sandy Bridge's gains are similar to what we'll get with Golden Cove, I'd say it's a way better way of doing things. Of course new ideas don't just fall from the sky.

Just expanding it is how you fall into the square root law of returns.
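The "square root law" here is presumably the rule of thumb usually called Pollack's rule: single-thread performance grows roughly with the square root of the added core complexity or area. A minimal sketch, assuming that rule holds exactly (real designs only follow it loosely, and the function name `pollack_speedup` is made up for illustration):

```python
import math


def pollack_speedup(area_ratio: float) -> float:
    """Rule-of-thumb speedup from growing a core's area/complexity by area_ratio."""
    return math.sqrt(area_ratio)


for ratio in (1.5, 2.0, 4.0):
    print(f"{ratio:.1f}x area -> ~{pollack_speedup(ratio):.2f}x performance")
```

Under that assumption, doubling a core's resources buys only about 41% more performance, which is why new ideas tend to pay off better than plain widening.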

Sandy Bridge's new ideas:
-uop cache
-Physical Register Files
-Rethinking of the branch target buffer to increase effective history size
-Using existing integer SIMD ports to double FP performance in a die efficient manner
-Use of an efficient, simple interconnect called the Ring Bus
-Significantly improved Turbo mode, Turbo 2.0

Due to those changes, in the mobile space we saw amazing gains. 60% in the H space and 30%-plus in the U space at lower power and better battery life.

They overhauled almost every aspect in an efficient manner. Golden Cove just does more of the same.

DrMrLordX · Aug 22, 2021

biostud said:
So the performance cores are faster and the efficiency cores slower, but roughly the same as 16 zen3 in MT.

. . . except . . .

SAAA said:
Honestly that bench shows almost no differences between a 8 core, 16 core and 24 core with different clocks and IPC. Whatever it measures it's moot if all current/next gen CPUs perform almost the same, I'd even take leaked Geekbench scores as more indicative than that suite.

I was sort of thinking the same thing. All three CPUs fall really close to one another, which at the very least is an indicator that core count isn't a big deal in that bench.

lobz said:
Why is that a problem? If only everyone here would try to make educated guesses

Because such guesses may or may not be all that educated. They're fodder for the kind of uninformed discussions we've had which lead up to the relevant comment.

IntelUser2000 · Aug 22, 2021

AMDK11 said:
One thing puzzles me and bothers me. According to Intel slides, GoldenCove has a 6-way x86 decoder while SunnyCove according to the same slide has a 4-way x86 decoder. Really strange because I thought that Skylake and SunnyCove have a 5-way decoder according to, among others, wikichip.

Sunny Cove is 4-wide decode, but 5-wide allocate. Golden Cove is 6-wide decode, and 6-wide allocate. That should clear the confusion.

Back with Skylake they claimed "5-wide" because of the uop. So they've been claiming 5-wide for quite a while now. It seems misleading marketing has infiltrated the high-level engineers.

itsmydamnation said:
You're overselling Intel's decode. It has been 4+1+1+1 uops with 16 bytes from L1i, where heaps of basic instructions can't be handled by the simple units because they decode into 2 uops, compared to Zen, which is 2+2+2+2 plus a microcode engine with 32 bytes from L1i.

The Complex/Simple is not a big thing at all nowadays. It was before vector instructions such as SSE existed. The "Simple" decoders are capable of decoding ALL integer and floating point vector instructions.

The goal for Intel designers has been to get the ratio between "macro" x86 instructions and internal "micro" instructions to 1, and during the Core 2 days they said it was pretty close. Things like uop and x86 fusion are attempts to get it even closer.

The obvious bottleneck going 4-wide and above is the 16-byte fetch, especially since we know the uop cache hit rate is quite a bit lower than they say it is. A 16-byte fetch with 4-wide decode suggests each x86 instruction is on average 4 bytes wide, when Linus is saying it can be 7-8 bytes. The "atoms" are more balanced, with a 16-byte fetch per 3-wide decoder.

Since it's going from 4-wide/16B to 6-wide/32B, it'll on average be able to feed 33% wider instructions (32/6 ≈ 5.3 bytes per decode slot versus 16/4 = 4).
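The 33% figure is just fetch bytes divided by decode width. A quick sketch of that arithmetic, using only the widths quoted in this post (the helper name `bytes_per_decode_slot` is made up for illustration):

```python
def bytes_per_decode_slot(fetch_bytes: int, decode_width: int) -> float:
    """Average instruction length the fetch can sustain if every decoder is kept busy."""
    return fetch_bytes / decode_width


old = bytes_per_decode_slot(16, 4)  # Sunny Cove style: 16B fetch, 4-wide decode
new = bytes_per_decode_slot(32, 6)  # Golden Cove style: 32B fetch, 6-wide decode
gain = new / old - 1

print(f"{old:.2f} B -> {new:.2f} B per slot ({gain:.0%} more)")
```

By the same measure, a 16-byte fetch feeding a 3-wide decoder sustains about 5.3 bytes per slot, which matches the remark that the "atoms" are more balanced.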

itsmydamnation said:
They implemented a loop buffer, intel also had a loop buffer in Nehalem.

Core 2 used the instruction buffer as a Loop Stream Detector. On Nehalem they put that further down the pipeline and changed it to a Loop Buffer, so more of the front end can be turned off.

majord · Aug 22, 2021

SAAA said:
Honestly that bench shows almost no differences between a 8 core, 16 core and 24 core with different clocks and IPC. Whatever it measures it's moot if all current/next gen CPUs perform almost the same, I'd even take leaked Geekbench scores as more indicative than that suite.

There are subtests that scale fine with cores and core perf, but it looks like no one grabbed those detailed scores before the entry was removed... Fail.

IntelUser2000 · Aug 22, 2021

Thunder 57 said:
Gracemont sounds too good to be true.

It sounds very good, but not so good that it's "too good".

Because you see, Zen cores beat Cove cores in perf/mm2 significantly. The ARM cores also do really really well.

The comparison is skewed because Coves are mediocre and monts are very good.

I don't think this necessarily points to Cove team or the design being terrible, but the design philosophy needing a change. The absurd focus on clock frequency is the reason. It may also be because they are using Coves all the way from 4.5W chips to 200W gaming desktop chips and multi-thread prosumer workstations.

Hulk · Aug 22, 2021

Still trying to wrap my head around all of the ADL details...

ADL Desktop, Mobile, and Ultramobile are all monolithic dies, right? No chiplets here?

I'm starting to "get" the rationale behind ADL I think. 8 big, fast cores to compete with Zen 3 head-to-head against the 5800X. I know this is dangerous water I'm treading into but I think Intel is reasoning that *most* people don't really need more than 8 cores AND there is still a lot of software that doesn't scale in a linear fashion past 8 cores.

So adding the Gracemont cores solves quite a few "competition" problems:
1. They are small so no chiplets, dies can remain monolithic.
2. They are power efficient making them great for mobile and ultramobile. Now besides downclocking laptops with power settings, that slider bar can also move compute toward the Gracemont cores when on battery if needed.
3. The Gracemont cores allow for better competition with the 5900X and 5950X.
4. They allow for lots of mix and match options in the products stack.
5. Gracemont and Golden Cove can be independently "upgraded." That is the Cove can be enhanced on one generation, Mont on the next.
6. Two development teams might lead to a more flexible Intel, able to zig or zag as required.

Just some thoughts from one of the admittedly less knowledgeable people following this thread.

A/// · Aug 22, 2021

ondma said:
Seriously, 200 watts more than 5900x? That is absurd.

While it's an exaggeration of figures, if it runs hotter and uses more energy to maintain its clocks without throttling itself at stock or very likely in an OC'd state, then it is DOA to me. That's my personal take on Alderlake, and it shouldn't affect anyone else's personal opinion on what's acceptable and what isn't.

geegee83 · Aug 22, 2021

mooreslawisnotdead said:
This. The project is still in pathfinding and Jim Keller left Intel last year. MLID trying to clickbait. Glenn Hinton is the one working on it. The so called "exciting high performance CPU project" he mentioned in his LinkedIn post.

You mean this group which posted a job ad? https://www.themuse.com/jobs/intel/cpu-rtl-design-engineer

Seems like from LinkedIn it was formed in 2019. The question is who was their boss when it was created. (https://www.linkedin.com/in/debbie-marr-1326b34)

RanFodar · Aug 23, 2021

Hulk said:
I'm starting to "get" the rationale behind ADL I think. 8 big, fast cores to compete with Zen 3 head-to-head against the 5800X. I know this is dangerous water I'm treading into but I think Intel is reasoning that *most* people don't really need more than 8 cores AND there is still a lot of software that doesn't scale in a linear fashion past 8 cores.

So adding the Gracemont cores solves quite a few "competition" problems:
1. They are small so no chiplets, dies can remain monolithic.
2. They are power efficient making them great for mobile and ultramobile. Now besides downclocking laptops with power settings, that slider bar can also move compute toward the Gracemont cores when on battery if needed.
3. The Gracemont cores allow for better competition with the 5900X and 5950X.
4. They allow for lots of mix and match options in the products stack.
5. Gracemont and Golden Cove can be independently "upgraded." That is the Cove can be enhanced on one generation, Mont on the next.
6. Two development teams might lead to a more flexible Intel, able to zig or zag as required.

Certainly. Alder Lake is the first iteration of the road to disaggregated tiles. You can see that with their "building blocks" rationale, even if their dies are monolithic. They can't afford to be inflexible in the future, as Intel manages thousands of employees with different design teams and assemblies. We'll see what they can do with Meteor Lake.

coercitiv · Aug 23, 2021

RanFodar said:
Alder Lake is the first iteration of the road to disaggregated tiles. You can see that with their "building blocks" rationale, even if their dies are monolithic.

First step is Lakefield. Not only building blocks rationale, but also actual stacked tiles. In the context of advanced design & packaging, Alder and Raptor are vestiges of the old way of doing chips at Intel.

RTX2080 · Aug 23, 2021

Intuitive diagram drawn by Cardyak:

https://twitter.com/x/status/1428643475122278400

https://twitter.com/x/status/1428388787193892871

If I'm not wrong, GoldenCove has at least ~25% more integer resources than Gracemont (5 ALU ports in GC compared to 4 in GM).

JoeRambo · Aug 23, 2021

I wonder how OC on Alder Lake will be. Big cores will probably be the same "lock @ 5.1GHz" deal, but what about the small ones? Just because Intel clocks them at 3.9GHz does not mean more volts can't push them further. Some really interesting questions about voltage and frequency domains are still open, and there are no good leaks yet.

JoeRambo · Aug 23, 2021

cortexa99 said:
Intuitive diagram drawn by Cardyak:

https://twitter.com/x/status/1428643475122278400

At least the Gracemont one contains some inaccuracies, like the FP pipes being completely wrong. There are two "symmetric" pipes that can both do FADD and FMUL. As shown in that picture, it would perform horribly, with a throughput of 1 for critical operations.

andermans · Aug 23, 2021

Is the 4 MiB L2 for the Gracemont cluster confirmed? If true, extrapolating from Willow Cove L2 just the L2 cache would take ~5.8 mm2 per cluster which is about the same as a Willow Cove core + L2 (6.11 mm2). Makes me a lot more sceptical about the rumors of a single cluster being about as large as a single Golden Cove core unless the Golden Cove core grew a lot in area. (maybe if you count the cores only without L2? However, I'd assume it needs the L2 to achieve the good perf, so core without L2 isn't that useful for a competitive analysis on die space efficiency?)
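The comparison above can be sketched as plain arithmetic. Every number below is taken from this post (the ~5.8 mm2 estimate for a 4 MiB cluster L2 and the 6.11 mm2 Willow Cove core + L2 figure); none of them is a measured Gracemont value, and the variable names are made up for illustration.

```python
# Figures quoted in the post above; treat them as estimates, not measurements.
cluster_l2_area_mm2 = 5.8       # extrapolated area of a 4 MiB Gracemont cluster L2
cluster_l2_size_mib = 4.0
willow_core_plus_l2_mm2 = 6.11  # Willow Cove core including its L2


def area_per_mib(area_mm2: float, size_mib: float) -> float:
    """Implied cache density, assuming area scales linearly with capacity."""
    return area_mm2 / size_mib


density = area_per_mib(cluster_l2_area_mm2, cluster_l2_size_mib)
share = cluster_l2_area_mm2 / willow_core_plus_l2_mm2

print(f"implied L2 density: {density:.2f} mm2 per MiB")
print(f"cluster L2 alone is {share:.0%} of a Willow Cove core + L2")
```

If the 4 MiB figure holds, the L2 by itself would eat nearly the whole area budget of one big core, which is the reason for the scepticism about the "one cluster = one Golden Cove core" rumor.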

Cardyak · Aug 23, 2021

cortexa99 said:
Intuitive diagram drawn by Cardyak:

https://twitter.com/x/status/1428643475122278400

https://twitter.com/x/status/1428388787193892871

If I'm not wrong, GoldenCove has at least ~25% more integer resources than Gracemont (5 ALU ports in GC compared to 4 in GM).

JoeRambo said:
At least the Gracemont one contains some inaccuracies, like the FP pipes being completely wrong. There are two "symmetric" pipes that can both do FADD and FMUL. As shown in that picture, it would perform horribly, with a throughput of 1 for critical operations.

Yes, it's still early days into the investigation of Golden Cove and Gracemont, so there will be some mistakes and errors.

Please bear with me over the next couple of weeks, the diagrams will be updated to further improve the accuracy.

Any feedback people can provide regarding errors/mistakes is of course greatly appreciated.
