Discussion Intel current and future Lakes & Rapids thread

JoeRambo · Aug 30, 2021

coercitiv said:
There's a number of forum users here that consider the MT performance of the hybrid solution will come with a price in performance consistency. As long as 6+0 or 8+0 SKUs are available, they would rather avoid paying for the E-cores considering their workloads fit P-cores better. Some would obviously like their chips unlocked to push P-cores further.

I am one of those users. Actually, I don't mind the small cores, as long as they bring 2.5MB of L3 per slice with them. 5MB of extra L3 is a good deal for me versus 25MB of L3 for a hypothetical 8+0 CPU.
I am now running a 10900K with HT disabled and a static OC to 5.1GHz, and I'm plenty happy with it as my main desktop and gaming machine. As long as Golden Cove has a 25% IPC advance over Skylake and I can get something like DDR5 6400CL36 or so, I will be happy to disable the small cores and HT, set the clock to 5-5.1GHz and enjoy a really smooth and responsive system.

My dream is simple: 2000 in GB5 ST without VAES-style shenanigans would be an advance of 33% for me. And I have very little faith in schedulers even in the easier setting of HT; a heterogeneous setup is a disaster waiting to bite.

dullard said:
1) Lots of software cannot transition to more MT. That is because many tasks simply are not possible to be multithreaded. Any task that is user facing must go down to one thread at some point. Having dozens of threads sitting around doing nothing waiting for the mouse to move does not improve performance, only one thread is needed for that. Also, only one thread can display on the screen (the UI thread). Having drawing calls from multiple different threads is a guaranteed way to crash the software. As for math problems, many calculations rely on the result of the previous calculation and thus must be ST. Sure, some tasks can be MT, but many cannot.

I think ~10 years ago I had a hilarious MT problem with one of our servers: after an upgrade to 2S Sandy Bridge it started having nasty periodic slowdowns running exactly the same workload as the Core2 FSB-based server. After a looooooong investigation involving a lot of digging, it turned out that the culprit was a periodic scheduled invocation of the ImageMagick command line utility to do some misc stuff. Said command was using multithreading on all CPUs to "speed up" things, except that once the number of CPUs had risen and locking became contended, 99.99% of the time was being spent on lock contention and cache line ping-pong between CPUs and over the inter-socket QPI links. And that created a NASTY slowdown on the whole system, destroying the QoL of our service big time.
It took quite some time to investigate, notice the patterns and pin the command to a single CPU to fix it.
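
For reference, pinning a command to one CPU like this can be sketched in Python on Linux. This is only a minimal sketch, not the setup used on that server: `run_pinned` is a hypothetical helper name, and the probe command simply prints its own affinity in place of the real utility. `os.sched_setaffinity` is Linux-only.

```python
import os
import subprocess
import sys


def run_pinned(cmd, cpus):
    """Run cmd with its CPU affinity restricted to the given set of CPU ids."""
    def pin():
        # Runs in the child between fork() and exec(); threads the command
        # creates later inherit this affinity mask.
        os.sched_setaffinity(0, cpus)

    # preexec_fn is not safe in multithreaded parents; fine for a simple script.
    return subprocess.run(cmd, preexec_fn=pin, capture_output=True,
                          text=True, check=True)


if __name__ == "__main__":
    # Pick a CPU this process is actually allowed to use.
    cpu = min(os.sched_getaffinity(0))
    # Stand-in for the real command: report the affinity the child sees.
    probe = [sys.executable, "-c",
             "import os; print(sorted(os.sched_getaffinity(0)))"]
    print(run_pinned(probe, {cpu}).stdout.strip())
```

With the mask limited to one CPU, a command that spawns a thread per core still runs, but its threads take turns on that CPU instead of fighting over locks across sockets.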

So not everything can be multithreaded and sometimes you don't even control the quality of multithreading in software involved.

I think that with heterogeneous cores with such a wide performance gap, scheduling will steal part of the IPC increase from the big cores and create problems with various legacy apps. Not complaining much as long as the small cores can be disabled tho.

dullard · Aug 30, 2021

coercitiv said:
Reminds me of a now famous Blizzard employee quote: "Do you guys not have phones?".

There's a number of forum users here that consider the MT performance of the hybrid solution will come with a price in performance consistency. As long as 6+0 or 8+0 SKUs are available, they would rather avoid paying for the E-cores considering their workloads fit P-cores better. Some would obviously like their chips unlocked to push P-cores further.

That is an awful lot of probably false assumptions there.
1) You are downplaying the performance gain of having the E cores.
2) You are ignoring the benefits of having background threads not take up resources on the P cores (nor the heat generated from them). Meaning with E cores taking that load, you can push the P cores further.
3) You are assuming the task scheduler is bad. It now is ~1000x faster and has way more information to use at its disposal. It might screw up, but that is just an assumption you are making.
4) You are assuming that Intel would charge less if they were forced to spend the money to make separate silicon to drop the E cores. Pricing, if done properly, is based on what the market will pay for -- not on the amount of silicon used.
5) You are ignoring the new API that lets programmers put all the code only on the P cores (or only E cores), meaning your whole point is already considered and can be addressed in full using no E cores (this does require the software to be updated though).

SAAA · Aug 30, 2021

There's really no point in going back now, making a 10 Golden cores variant wouldn't necessarily work better if the power usage gets even higher... then at that point why not remove the IGP and place 12 whole cores? Or cut HT and add more single-threaded resources?

We can keep at it till it becomes silly. The point is that their plan for the coming years looks pretty good, with 8 strong cores and whatever number of smaller ones to help parallel execution. Meteor Lake will also double the IGP from the current top tier. The next step is multi-layered logic, and who knows when silicon/quantum brains.

We aren't going back on this road. The single fat core era ended 15 years ago with the Pentiums; today the more we integrate stuff the better.

moinmoin · Aug 30, 2021

SAAA said:
single fat core era ended 15 years ago

Lakefield said hi! /scnr

Hougy · Aug 30, 2021

How likely do you think it is that Intel will say desktop Alder Lake launches in October, even if it is a paper launch?

jpiniero · Aug 30, 2021

Hougy said:
How likely do you think it is that Intel will say desktop Alder Lake launches in October, even if it is a paper launch?

The launch is October 27th. Intel didn't directly mention Alder Lake but they have an event and you can take it to the bank that it will launch then.

Intel Alder Lake to launch on October 27th at 'fully hybrid' Innovation event? - VideoCardz.com (videocardz.com)

Intel teases Alder Lake launching on October 27th. Pat Gelsinger, the CEO of Intel, announced during the Intel Accelerate webcast, which covered the changes to node naming as well as new products that will debut on these new nodes, that on October 27th to 28th the manufacturer will hold another...

JoeRambo · Aug 30, 2021

I think with the latest leaks and various hints, like Intel XTU coming out with Alder Lake support, release is imminent. How hard or soft a launch we will get is the big question here. And also the maturity of the initial DDR5 DIMMs?

coercitiv · Aug 30, 2021

SAAA said:
We aren't going back on this road. The single fat core era ended 15 years ago with the Pentiums; today the more we integrate stuff the better.

moinmoin said:
Lakefield said hi! /scnr

Diet food for thought:

Golden Cove ~ 8 mm²
Zen 3 ~ 4 mm²
Gracemont < 2 mm²

Let's see if that fat lady sings.

Hulk · Aug 30, 2021

dullard said:
1) Lots of software cannot transition to more MT. That is because many tasks simply are not possible to be multithreaded. Any task that is user facing must go down to one thread at some point. Having dozens of threads sitting around doing nothing waiting for the mouse to move does not improve performance, only one thread is needed for that. Also, only one thread can display on the screen (the UI thread). Having drawing calls from multiple different threads is a guaranteed way to crash the software. As for math problems, many calculations rely on the result of the previous calculation and thus must be ST. Sure, some tasks can be MT, but many cannot.

2) Think about the number of cores in the next 10 years. Rumors have it that Intel is already planning on 40 cores in ~3 to ~4 years (Arrow Lake). Rocket lake in 125 W over 8 cores gets at the very most 15.6 W per core (actually a bit less since the uncore takes some power). But with 40 cores, that translates to an absolute max of 3.1 W per core. You just cannot have the high speed that you have with 15 W at your disposal when you now only have 3 W. The only possible way forward is to keep adding power or to switch to efficiency cores.

3) Combine #1 and #2 and you get Intel's strategy. Some tasks must be single threaded, so you might as well have a couple high power cores for the tasks that must be single threaded. But, math is forcing you to less and less power per core, so you might as well use cores that are efficient for those tasks.

Do you think we'll be seeing Big/Middle/Little on the desktop/mobile eventually?

eek2121 · Aug 30, 2021

coercitiv said:
Reminds me of a now famous Blizzard employee quote: "Do you guys not have phones?".

There's a number of forum users here that consider the MT performance of the hybrid solution will come with a price in performance consistency. As long as 6+0 or 8+0 SKUs are available, they would rather avoid paying for the E-cores considering their workloads fit P-cores better. Some would obviously like their chips unlocked to push P-cores further.

Intel’s designs also happen to go well with gaming. Many well-threaded games have 1-3 “main” threads and several lighter/less active ones. By putting the light threads on the small cores and the big threads on the big cores, you can potentially save energy or increase performance (by improving thermals). Part of me wants to build a 12900K system at launch just to toy around with these scenarios. I don’t need a new machine, however.

I am also curious if it will be possible to disable the big cores and only use the small ones.

dullard · Aug 30, 2021

Hulk said:
Do you think we'll be seeing Big/Middle/Little on the desktop/mobile eventually?

That is beyond my knowledge. I haven't seen any rumors of it. Intel is going more towards a mix and match design philosophy. There will be various numbers of big, little, AI, media, graphics, etc for different types of users. Maybe a medium core could be put into the mix, but I don't think that will be any time soon.

dullard · Aug 30, 2021

eek2121 said:
Intel’s designs also happen to go well with gaming. With many well threaded games there are 1-3 “main” threads and several lighter/less active ones. By putting the light threads on the small cores and the big threads on the big cores, you can potentially save energy or increase performance (by improving thermals).

That is actually one example already being used to describe the benefits of big/little specifically to make games faster.

How Intel Thread Director makes Alder Lake and Windows 11 a match made in heaven (www.digitaltrends.com)

Alder Lake is the new hybrid CPU architecture from Intel, and it features a special optimization for Windows 11 in the form of Thread Director.

Similarly, a background animation in a game, maybe one that’s static and doesn’t impact performance, isn’t a high-priority task. Developers can already tune these tasks to consume less power, and now, they can do so across a hybrid architecture. “Developers can now tell the operating system ‘I know this thread is doing this, but don’t prioritize it to any performant threads.'”

SAAA · Aug 30, 2021

coercitiv said:
Diet food for thought:

Golden Cove ~ 8 mm²
Zen 3 ~ 4 mm²
Gracemont < 2 mm²

Let's see if that fat lady sings.

Looks like they almost average out area-wise: 16 "medium" cores vs 8 "big" + 8 "small" cores. Power and performance too, from the look of it.
In the end silicon is silicon, and given similar constraints they'll land very close, but I guess only going with the hybrid concept will push ST/MT much further, like Meteor Lake and Zen 5 will.

eek2121 said:
I am also curious if it will be possible to disable the big cores and only use the small ones.

I would be very interested too. If it works, it could be a way to get precise power figures for both clusters… and also to see what max performance you can get, if overclocking them is a thing. That never happened with other Atoms!

Thala · Aug 30, 2021

eek2121 said:
I am also curious if it will be possible to disable the big cores and only use the small ones.

You do not have to technically disable the big cores, just do not assign any workload to them.

LightningZ71 · Aug 30, 2021

Hulk said:
Do you think we'll be seeing Big/Middle/Little on the desktop/mobile eventually?

With HT on the P cores, we effectively already have that! The priority is the E cores get most everything, the P cores get the high priority sensitive stuff, and the HT threads get overflow threads, at least, that's on its most basic level. All of that can be tuned by the software authors and the scheduler from hardware feedback...

eek2121 · Aug 30, 2021

Thala said:
You do not have to technically disable the big cores, just do not assign any workload to them.

That is impossible unless the UEFI allows it, since the operating system would not know not to use them at startup, and cumbersome since scripts would have to be written for every single program to ensure they don’t use small cores. Since the PE core count varies and HT is involved, the scripts would need to be modified per platform, and scripts could not be used on multiple operating systems.

Hulk · Aug 30, 2021

I remember Ian's article mentioned that the thread director will also be able to control frequency. I wonder if/how the logic determines if a thread should go to an E core at full speed vs a throttled back P core? Of course this scenario would assume P and E cores are both available and the Thread Director would have to "decide."

Another thing I've been wondering about is possible power savings in the following situation.

Imagine some number of threads, fewer than the available cores. Also assume these threads are dependent on one another and have different compute requirements. With a homogeneous CPU and no Thread Director, I would assume the most compute-intensive thread would run at full speed at all times. Would the least compute-intensive thread constantly ramp from idle to full speed as compute is required, or could the current scheduler adjust frequency (lower it) for optimum efficiency? Is this something the Thread Director will do?

Finally, I am assuming, based on the nonlinear V/f curves of the cores, that if less than the max available compute is needed for a sustained period of time, it would be more efficient to run a core at some frequency below max rather than constantly go from idle to full speed?
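
A rough way to reason about that last question: in the usual first-order model, dynamic power scales as C·V²·f, so the energy for a fixed number of cycles scales with V² alone. The sketch below uses made-up V/f points (the voltages are assumptions, not measurements of any real core) and ignores leakage and uncore power, which is exactly what can favor racing to idle in practice.

```python
# Hypothetical V/f points (GHz -> volts); real curves differ per chip.
VF_CURVE = {2.5: 0.80, 5.0: 1.30}
CAPACITANCE = 1.0  # arbitrary units; cancels out in the ratio


def dynamic_energy(cycles, freq_ghz):
    """Dynamic energy (arbitrary units) to run `cycles` of work at freq_ghz."""
    volts = VF_CURVE[freq_ghz]
    power = CAPACITANCE * volts ** 2 * freq_ghz  # P ~ C * V^2 * f
    duration = cycles / freq_ghz                 # time to finish the work
    return power * duration                      # equals C * V^2 * cycles


work = 10.0  # same amount of work in both cases
sustained = dynamic_energy(work, 2.5)  # run steadily at the lower clock
burst = dynamic_energy(work, 5.0)      # sprint at max clock, then idle
print(f"sustained: {sustained:.2f}  burst: {burst:.2f}  "
      f"ratio: {burst / sustained:.2f}")
```

Under these assumed numbers the burst costs roughly 2.6x the dynamic energy for the same work, purely because of the higher voltage needed at the top of the curve.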

Thala · Aug 30, 2021

eek2121 said:
That is impossible unless the UEFI allows it, since the operating system would not know not to use them at startup, and cumbersome since scripts would have to be written for every single program to ensure they don’t use small cores. Since the PE core count varies and HT is involved, the scripts would need to be modified per platform, and scripts could not be used on multiple operating systems.

Sorry, my mistake, I used the wrong quote. I was going to answer SAAA, who was interested in power/performance figures for both clusters. For this you just have to set the appropriate core affinity for the benchmark of interest; physically disabling the cores would not be necessary.
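
As a minimal sketch of that approach on Linux: restrict the process to a chosen set of logical CPUs, time the workload, then restore the old mask. `time_on_cpus` and `busy_work` are hypothetical names, and which logical CPU ids belong to the P or E cluster is an assumption you have to look up on the actual machine; the demo just uses the first and last CPU the process is allowed on.

```python
import os
import time


def busy_work(n=200_000):
    """Small CPU-bound stand-in for a real benchmark."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_on_cpus(cpus, work=busy_work):
    """Time work() with this process restricted to cpus, then restore the mask."""
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)
    try:
        start = time.perf_counter()
        work()
        return time.perf_counter() - start
    finally:
        os.sched_setaffinity(0, previous)


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    # Placeholder CPU sets; substitute the real P-core and E-core ids.
    clusters = {"first allowed CPU": {allowed[0]},
                "last allowed CPU": {allowed[-1]}}
    for name, cpus in clusters.items():
        print(f"{name} {sorted(cpus)}: {time_on_cpus(cpus):.4f} s")
```

Pairing each run with a package power reading would give the per-cluster power/performance figures without touching the firmware settings.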

RTX · Aug 31, 2021

How much power and die space does the Thread Director take?

coercitiv · Aug 31, 2021

RTX said:
How much power and die space does the Thread Director take?

These things don't get measured in terms of area and power. The relevant cost is latency, and Intel engineers claim that in their testing it wasn’t in any way human perceivable.

I doubt we'll get better info on the subject before review day.

IntelUser2000 · Sep 1, 2021

coercitiv said:
These things don't get measured in terms of area and power. The relevant cost is latency, and Intel engineers claim that in their testing it wasn’t in any way human perceivable.

Lol.

Like how they say 60Hz is not perceivable by humans, but a lot of users see the difference switching to 75Hz, 100Hz, 144Hz, and some at 200Hz refresh rates.

Obviously there will be overhead. There has to be. It won't be 1+1. It'll be 0.9+0.9 or something. The question is whether it's good enough.

@dullard Yes it's 40 cores, but 8 coves and 32 monts. So no difference in the cove core department. Workloads usually split three ways: mostly single threaded, able to take advantage of a certain number of threads, or massively multithreaded.

In the first two scenarios Arrow Lake will be faster because it has 8 of the faster cove cores.

In the third scenario Arrow Lake will be massively faster, not just because of the updated cores but also because there are many more of the mont ones.

And the mont cores, if we take Intel's claim of 2x perf/watt over the cove cores at face value, will each use only 1/4 of the power of a cove core. So a 125W chip may have 100W for the cores: 6.25W for a cove core and 1.56W for a mont core.

In scenarios #1 and #2, power will mostly be used by the cove cores and each might be at 10W+. Uncore will easily take 20W+, so Rocket Lake cores would have 12.5W available to them.
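
The per-core numbers above follow from treating each mont core as a quarter of a cove core and dividing the core budget by the total "cove equivalents". A small sketch of that arithmetic (the function name is just for illustration; the 100W budget, the 8+32 split and the 1/4 ratio are the figures from this post):

```python
def per_core_power(core_budget_w, big_cores, small_cores, small_to_big_ratio):
    """Split a core power budget when one small core draws a fixed
    fraction of what one big core draws."""
    big_equivalents = big_cores + small_cores * small_to_big_ratio
    big_w = core_budget_w / big_equivalents
    return big_w, big_w * small_to_big_ratio


# 8 cove + 32 mont cores sharing 100 W, mont at 1/4 of a cove core.
cove_w, mont_w = per_core_power(100, 8, 32, 0.25)
print(cove_w, mont_w)  # 6.25 1.5625

# 8 identical cores sharing the same 100 W (the Rocket Lake case).
rkl_w, _ = per_core_power(100, 8, 0, 0.25)
print(rkl_w)  # 12.5
```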

RTX2080 · Sep 1, 2021

An insider talked about Golden Cove and said it has performance equal to the Cortex-X2 in SPEC.

How do you evaluate the Alder Lake series processors Intel officially announced at Architecture Day 2021? - Zhihu (via translate.google.com)

https://www.intel.com/content/www/us/en/newsroom/resources/press-kit-architecture-day-2021.html#gs…

The ARM Cortex-X2, which is about to go on the market, has an architecture performance that basically equals the goldencove, and the resources used by both parties are obviously not equal.

edit: he explained 'architecture performance' as IPC in later reply

JoeRambo · Sep 1, 2021

cortexa99 said:
An insider talked about Golden Cove and said it has performance equal to the Cortex-X2 in SPEC.

Hard to believe such claims when ARM themselves claim X1->X2 is +16% Integer IPC?

X2 is gonna be a huge hit, but beating GC? Not there yet.

EDIT: but on the topic of actual comments about Alder Lake, it is good stuff and points out an important thing I missed/got wrong about Alder Lake:
I was sure that Intel added FADD units to PORT0 and PORT1 => they already have FMA, and I thought they could now do 3-cycle latency FP ADD with less energy.

But the Chinese(?) comments pointed out to me that this is not the case, and Intel in fact can do 2xFMA on P1/P2 and FP add on the FADD unit on PORT5.
FADD units were added to Port2 and Port5.

So in typical non-FMA FP code Intel can do 3x256 ADD, or 2x256 MUL + 1x256 ADD, or 1x256 MUL + 2x256 ADD.

This is basically CPU that was built to fight ZEN3 in Cinebenches:

Currently, ZEN2 vs ZEN3 vs Sunny Cove (L = latency, T = throughput, c = cycles):
ZEN2:       1801 AVX :VMULPS ymm1.. VADDPS ymm2.. L: 0.86ns= 3.0c T: 0.24ns= 0.86c
ZEN3:       1801 AVX :VMULPS ymm1.. VADDPS ymm2.. L: 0.88ns= 3.0c T: 0.18ns= 0.62c
Sunny Cove: 1801 AVX :VMULPS ymm1.. VADDPS ymm2.. L: 1.43ns= 4.0c T: 0.36ns= 1.00c

Intel currently has both worse latency and lower throughput. Should have ~same if not better vs ZEN3 now.
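
One thing worth keeping in mind when reading the listing above: each entry gives the same measurement in ns and in cycles, so the ratio tells you the clock each chip was running at during the test. A small sketch (the function name is just for illustration; the rows are taken in the order listed):

```python
def implied_clock_ghz(latency_ns, latency_cycles):
    """Clock in GHz implied by one latency given in both ns and cycles."""
    return latency_cycles / latency_ns


# (latency in ns, latency in cycles) for the three rows, in listed order.
rows = [(0.86, 3.0), (0.88, 3.0), (1.43, 4.0)]
for i, (ns, cycles) in enumerate(rows, start=1):
    print(f"row {i}: ~{implied_clock_ghz(ns, cycles):.2f} GHz")
```

The third row works out to roughly 2.8GHz against roughly 3.4-3.5GHz for the first two, so the ns figures overstate the gap; the cycle counts are the like-for-like comparison.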

eek2121 · Sep 1, 2021

JoeRambo said:
Hard to believe such claims when ARM themselves claim X1->X2 is +16% Integer IPC?

X2 is gonna be a huge hit, but beating GC? Not there yet.

EDIT: but on the topic of actual comments about Alder Lake, it is good stuff and points out an important thing I missed/got wrong about Alder Lake:
I was sure that Intel added FADD units to PORT0 and PORT1 => they already have FMA, and I thought they could now do 3-cycle latency FP ADD with less energy.

But the Chinese(?) comments pointed out to me that this is not the case, and Intel in fact can do 2xFMA on P1/P2 and FP add on the FADD unit on PORT5.
FADD units were added to Port2 and Port5.

So in typical non-FMA FP code Intel can do 3x256 ADD, or 2x256 MUL + 1x256 ADD, or 1x256 MUL + 2x256 ADD.

This is basically CPU that was built to fight ZEN3 in Cinebenches:

Currently, ZEN2 vs ZEN3 vs Sunny Cove (L = latency, T = throughput, c = cycles):
ZEN2:       1801 AVX :VMULPS ymm1.. VADDPS ymm2.. L: 0.86ns= 3.0c T: 0.24ns= 0.86c
ZEN3:       1801 AVX :VMULPS ymm1.. VADDPS ymm2.. L: 0.88ns= 3.0c T: 0.18ns= 0.62c
Sunny Cove: 1801 AVX :VMULPS ymm1.. VADDPS ymm2.. L: 1.43ns= 4.0c T: 0.36ns= 1.00c

Intel currently has both worse latency and lower throughput. Should have ~same if not better vs ZEN3 now.

Zen 2 to Zen 3 really blew my mind. I was on a 3900X and it was a fantastic processor, but moving to the 5950X was mind-blowing. I expect ADL-S to help Intel get back in the game; however, they HAVE to rein in power usage. I am not willing to use an Intel processor on a daily basis that consumes up to twice as much power as an AMD processor that is still going to be far more efficient in terms of perf/watt. Intel has work to do.

JoeRambo · Sep 1, 2021

eek2121 said:
I am not willing to use an Intel processor on a daily basis that consumes up to twice as much power as an AMD processor that is still going to be far more efficient in terms of perf/watt. Intel has work to do.

As an owner of multiple AMD and Intel CPUs, I feel the situation with power is not as clear-cut as you put it. For a casual desktop user/gamer, I think even a 5950X vs a 10900K will consume similar total system power over time.
The thing is, ~5GHz is a 15-20W affair per core on AMD and ~25W on Intel. So casual gaming or web browsing is not that different, and only when rendering or compiling 24/7 will AMD's efficiency be fully realized.
Where AMD suffers is idle power consumption, esp. with 2-CCD chips; in the low-load regime things are frankly not rosy, and I feel that for my average desktop usage and gaming, AMD would come out not that far ahead in power efficiency.

Now where AMD really shines is when tuned for ~4-4.4GHz clocks. That is efficiency Intel's chips can't really touch, but further clock scaling and esp. the stock boosting algorithms waste power big time.

EDIT: not discussing servers here, nor mobile, just desktop as it applies for 99% of users + dGPU.
