Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022
679
559
106






With Hot Chips 34 starting this week, Intel will unveil technical information on the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the new generation of platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, which is Intel's first to use EUV lithography. Intel expects to ship the MTL mobile SoC in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024 - that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, called RibbonFET.



Comparison of Intel's upcoming U-series CPUs: Core Ultra 100U, Lunar Lake and Panther Lake

| Model | Code Name | Date | TDP | Node | Tiles | Main Tile | CPU | LP E-Core | LLC | GPU | Xe-cores |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Core Ultra 100U | Meteor Lake | Q4 2023 | 15 - 57 W | Intel 4 + N5 + N6 | 4 | tCPU | 2P + 8E | 2 | 12 MB | Intel Graphics | 4 |
| ? | Lunar Lake | Q4 2024 | 17 - 30 W | N3B + N6 | 2 | CPU + GPU & IMC | 4P + 4E | 0 | 8 MB | Arc | 8 |
| ? | Panther Lake | Q1 2026 ? | ? | Intel 18A + N3E | 3 | CPU + MC | 4P + 8E | 4 | ? | Arc | 12 |



Comparison of the die size of each tile of Meteor Lake, Arrow Lake, Lunar Lake and Panther Lake

| | Meteor Lake | Arrow Lake (20A) | Arrow Lake (N3B) | Arrow Lake Refresh (N3B) | Lunar Lake | Panther Lake |
| --- | --- | --- | --- | --- | --- | --- |
| Platform | Mobile H/U only | Desktop only | Desktop & Mobile H&HX | Desktop only | Mobile U only | Mobile H |
| Process Node | Intel 4 | Intel 20A | TSMC N3B | TSMC N3B | TSMC N3B | Intel 18A |
| Date | Q4 2023 | Q1 2025 ? | Desktop: Q4 2024, H&HX: Q1 2025 | Q4 2025 ? | Q4 2024 | Q1 2026 ? |
| Full Die | 6P + 8E | 6P + 8E ? | 8P + 16E | 8P + 32E | 4P + 4E | 4P + 8E |
| LLC | 24 MB | 24 MB ? | 36 MB ? | ? | 8 MB | ? |
| tCPU (mm²) | 66.48 | | | | | |
| tGPU (mm²) | 44.45 | | | | | |
| SoC (mm²) | 96.77 | | | | | |
| IOE (mm²) | 44.45 | | | | | |
| Total (mm²) | 252.15 | | | | | |



Intel Core Ultra 100 - Meteor Lake



As mentioned by Tom's Hardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU tile and the Foveros base tile. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)

 

Attachments

  • PantherLake.png
  • LNL.png

dullard

Elite Member
May 21, 2001
25,126
3,516
126
I was thinking it's either because there's a massive security hole in HT that Intel doesn't want to admit to right now or they are simply cutting costs by not doing the validation.
What is the latest on the performance impact of the Spectre/Meltdown mitigations? I don't follow that thread regularly. At least for a while at the start, there was a pretty big performance loss from the mitigations when hyperthreading was on.

Hyperthreading is realistically more of a 10% performance boost (ranging roughly from +30% in a few benchmarks to -10% in others, with the typical average close to +10%). And that was before the mitigation performance losses. So how does the current Spectre/Meltdown mitigation performance loss compare to the potential hyperthreading gain? Potentially a wash?
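For a rough sense of that arithmetic, here's a minimal sketch; the ~10% HT uplift is the figure above, while the mitigation penalties are hypothetical placeholders rather than measured numbers:

```python
# Back-of-the-envelope: does the HT gain survive the mitigation cost?
# The HT uplift is the ~10% average cited above; the mitigation penalties
# are illustrative placeholders, not benchmark results.

ht_gain = 0.10  # ~10% average MT uplift from hyperthreading

for mitigation_loss in (0.03, 0.08, 0.12):  # hypothetical penalties with HT enabled
    net = (1 + ht_gain) * (1 - mitigation_loss) - 1
    print(f"mitigation loss {mitigation_loss:.0%} -> net HT benefit {net:+.1%}")

# As the mitigation penalty approaches the HT uplift, the net benefit
# approaches zero -- i.e. "potentially a wash".
```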
 
Mar 8, 2024
37
110
66
Chalking up the loss of HT to Apple alone is pretty silly; I think a more reasonable explanation has something to do with Intel's utter lack of success at reining in power budgets in a way that scales with performance. If you're a company with a history of utterly catastrophic duds, you're on the back foot against AMD, and you NEED a successful generational launch to stop the coming tide of OEM mutiny - so you axe the thing that makes it harder to hit performance goals. I'm not sure how it'll play out in marketing terms though (people generally like seeing big numbers, because that's more bigger and better and gooder).
 

naukkis

Senior member
Jun 5, 2002
722
610
136
Big cores in hybrid designs are there to offer better thread performance. Using HT nullifies that, as HT splits a core's per-thread performance roughly in half. The only beneficial case for HT in those hybrid designs is massively parallelized loads where single-thread performance doesn't matter - and if power is limited, it's more beneficial to assign that power to efficiency cores anyway for better total performance. Intel was actually slow to drop HT; they should have done it as soon as they went to hybrid designs.
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,506
3,651
96
Big cores in hybrid designs are there to offer better thread performance.
Those cores are also used in DC, where SMT loss hurts.
Nope. LNL actually competes with the likes of Snapdragon X series to fill the gap left out by ARL.
no? lol.
X Elite is higher power than LNL in like all cases.
For a lot higher nT perf but I digress.
 

Hulk

Diamond Member
Oct 9, 1999
4,269
2,089
136
Anyone have a technical understanding of how the cycles that were unused for the primary thread and diverted to the secondary logical thread are going to be utilized solely for the primary thread? Is the removal of HT just to reduce die area or is something being changed to minimize lost cycles during thread stalls?

I mean other than Apple doesn't do it.
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,506
3,651
96
Is the removal of HT just to reduce die area or is something being changed to minimize lost cycles during thread stalls?
Less validation work, and you duplicate a few fewer structures.
SMT's area/power impact was overall negligible; you mostly pay in validation costs/time.
 

SiliconFly

Golden Member
Mar 10, 2023
1,062
548
96
...they should have done it as soon as they went to hybrid designs.
I totally agree with this assessment. They should have removed it in Alder Lake itself. Anyway, better late than never.

Anyone have a technical understanding of how the cycles that were unused for the primary thread and diverted to the secondary logical thread are going to be utilized solely for the primary thread? Is the removal of HT just to reduce die area or is something being changed to minimize lost cycles during thread stalls?

I mean other than Apple doesn't do it.
Reduced die space, cleaner design, fewer vulnerabilities, faster validation, higher ST, etc. HT is basically hardware-based thread context switching that's employed when the h/w scheduler (like Thread Director) can't find a free core to assign the thread to. When HT isn't available in the CPU, the OS scheduler itself executes a s/w-based context switch, which is slower but minimizes lost cycles. But since the new CPUs have tons of real cores, the s/w context-switching overhead is very minimal, which makes HT totally redundant in clients.
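To put a rough number on that s/w context-switch cost, here is a minimal lmbench-style sketch: two processes ping-pong a byte over a pair of pipes, so every round trip forces the scheduler to switch between them (Unix-only, and the result is an upper bound that includes pipe syscall overhead; absolute numbers vary a lot by machine and kernel):

```python
# Rough measurement of software context-switch cost: two processes ping-pong a
# byte over two pipes, forcing the scheduler to alternate between them.
import os, time

ROUNDS = 50_000

# Pin both processes to one CPU so they genuinely have to switch with each
# other (Linux-only; harmless to skip elsewhere).
try:
    os.sched_setaffinity(0, {0})
except (AttributeError, OSError):
    pass

r1, w1 = os.pipe()   # parent -> child
r2, w2 = os.pipe()   # child  -> parent

pid = os.fork()
if pid == 0:                        # child: echo every byte back
    for _ in range(ROUNDS):
        os.read(r1, 1)
        os.write(w2, b"x")
    os._exit(0)

start = time.perf_counter()
for _ in range(ROUNDS):
    os.write(w1, b"x")
    os.read(r2, 1)
elapsed = time.perf_counter() - start
os.waitpid(pid, 0)

# Each round trip contains at least two switches plus two pipe round trips.
print(f"~{elapsed / ROUNDS / 2 * 1e6:.1f} us per switch (upper bound)")
```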
 

naukkis

Senior member
Jun 5, 2002
722
610
136
Anyone have a technical understanding of how the cycles that were unused for the primary thread and diverted to the secondary logical thread are going to be utilized solely for the primary thread? Is the removal of HT just to reduce die area or is something being changed to minimize lost cycles during thread stalls?

I mean other than Apple doesn't do it.

Intel HT is symmetric multithreading: both threads are equal, and instruction fetch alternates between the two threads every clock cycle. There is no primary/secondary thread - both threads execute at a speed that is a bit more than half of what a single thread would get running alone on that core.
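A toy model of that alternating-fetch picture (purely illustrative; real SMT front-ends are far more dynamic than a strict round-robin, and the stall probability below is made up):

```python
# Toy model of symmetric SMT fetch: each cycle the core offers its one fetch
# slot to the threads in alternating priority order; a stalled thread passes
# the slot to the other. Purely illustrative, not a model of any real core.
import random

STALL_PROB = 0.2        # made-up chance that a thread can't use the slot this cycle
CYCLES = 1_000_000
random.seed(0)

def run(n_threads):
    retired = [0] * n_threads
    for cycle in range(CYCLES):
        # round-robin priority: the "first pick" rotates between threads
        for i in range(n_threads):
            t = (cycle + i) % n_threads
            if random.random() > STALL_PROB:    # thread has work ready
                retired[t] += 1
                break                           # only one fetch slot per cycle
    return retired

solo = run(1)
smt = run(2)
print("solo thread IPC:    %.2f" % (solo[0] / CYCLES))
print("SMT per-thread IPC: %s" % [round(r / CYCLES, 2) for r in smt])
print("SMT combined IPC:   %.2f" % (sum(smt) / CYCLES))
# Each SMT thread ends up a bit above half the solo rate, while the combined
# throughput is somewhat above the solo rate -- the usual SMT trade-off.
```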
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,506
3,651
96
Those cases where single-thread speed doesn't matter they should drop big cores and use just more e-cores. Actually Intel is doing it right now.
Atoms have a rather castrated feature set and middling perf in just a ton of workloads, and they won't replace mainline Xeon that way.
You still need big cores in many, many places. The loss of SMT hurts there.
 

dullard

Elite Member
May 21, 2001
25,126
3,516
126
Less validation work, and you duplicate a few fewer structures.
SMT's area/power impact was overall negligible; you mostly pay in validation costs/time.
Don't forget the non-negligible part: the cache. When two threads share a core's cache, each thread effectively gets half as much cache, and cache thrashing is much more likely. That means either significantly less performance per thread, or you need significantly more cache than you would otherwise (more area, more expense, and more cache latency). Hyperthreading can be a nice performance boost in some cases, but it comes with some significant drawbacks.
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,506
3,651
96
Don't forget the non-negligible part: the cache. When two threads share a core's cache, each thread effectively gets half as much cache, and cache thrashing is much more likely. That means either significantly less performance per thread, or you need significantly more cache than you would otherwise (more area, more expense, and more cache latency). Hyperthreading can be a nice performance boost in some cases, but it comes with some significant drawbacks.
SMT-friendly workloads nuke your caches anyway (stuff like server-side Java and other JITs, etc.).
The real SMT drawbacks are security (the cloud guys are anal about that) and validation time.
 

SiliconFly

Golden Member
Mar 10, 2023
1,062
548
96
Don't forget the non-negligible part: the cache. When two threads share a core's cache, each thread effectively gets half as much cache, and cache thrashing is much more likely. That means either significantly less performance per thread, or you need significantly more cache than you would otherwise (more area, more expense, and more cache latency). Hyperthreading can be a nice performance boost in some cases, but it comes with some significant drawbacks.
Cache & TLB thrashing happens even on non-HT processors (and not just on x86). When an OS executes a s/w-based context switch, it has to invalidate the cache contents and TLB entries for the incoming thread. Even in cases where the cache doesn't have to be flushed (e.g. physically tagged caches), it still has to be repopulated with the incoming thread's working set. The penalty still exists in one form or another.
 
Jul 27, 2020
16,817
10,764
106
I have been thinking about the rumored removal of HT from ARL
While HT may seem like an unnecessary headache on desktop for Intel (and maybe AMD), in mobile CPUs Intel may keep HT alive for a few more years simply because it's the cheapest way to advertise more cores to consumers without incurring the significant area penalty of replacing the HT virtual cores with physical efficiency cores. I recently used a Core i5-1235U Dell Inspiron laptop, and its BIOS had no setting to turn off HT, which I found pretty weird. It's like Intel doesn't want the majority of its users working without HT.
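(For what it's worth, on Linux you can at least inspect and toggle SMT without any BIOS option, via the kernel's standard sysfs interface; a minimal sketch - reading works as any user, writing needs root:)

```python
# Inspect (and optionally toggle) SMT through the Linux sysfs interface.
from pathlib import Path

SMT = Path("/sys/devices/system/cpu/smt")

print("SMT active: ", (SMT / "active").read_text().strip())    # "1" or "0"
print("SMT control:", (SMT / "control").read_text().strip())   # on/off/forceoff/notsupported

# Which logical CPUs are HT siblings of cpu0:
siblings = Path("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list")
print("cpu0 siblings:", siblings.read_text().strip())

# To disable SMT at runtime (as root):
#   (SMT / "control").write_text("off")
```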

Another factor is core-occupancy determination by Intel Thread Director. Suppose Windows is using a P-core for something, so that core is "awake", and a lightweight thread needs to do something at the same time. Does the ITD wake up a sleeping efficiency core, or does it allocate the virtual HT core of the active P-core to that thread? I'm thinking the latter would be a more efficient use of the available resources, and it could even save time if the lightweight thread finishes its task in less time than it takes to context switch and wake up an efficiency core.
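(That trade-off, written out as a toy decision rule - entirely hypothetical: the names, costs and threshold here are made up for illustration and are not documented Thread Director behaviour:)

```python
# Toy model of the "use the HT sibling vs. wake an E-core" trade-off above.
# All numbers and the rule itself are made up for illustration only.

E_CORE_WAKE_US = 100.0     # assumed cost to wake a sleeping E-core
SIBLING_SLOWDOWN = 0.6     # assumed per-thread speed on a shared HT sibling (1.0 = full core)

def place_light_thread(est_runtime_us: float) -> str:
    """Pick a target for a short task when a P-core is busy but its HT sibling is idle."""
    time_on_sibling = est_runtime_us / SIBLING_SLOWDOWN
    time_on_e_core = E_CORE_WAKE_US + est_runtime_us
    return "HT sibling" if time_on_sibling < time_on_e_core else "wake E-core"

for est in (20, 100, 500, 5000):
    print(f"{est:>5} us task -> {place_light_thread(est)}")
# Very short tasks finish before an E-core is even awake, so the sibling wins;
# longer tasks amortize the wake-up cost and prefer the real core.
```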

Then there's the rumor about rentable units. Let's suppose they really help increase performance similar to or even better than HT. But because adjacent idle core resources are getting rented out, this will wake those cores up more often and thus power efficiency will take some hit. What if ARL desktop has rentable units and no HT while ARL mobile has HT and no rentable units? If the silicon area dedicated to enabling the rentable unit functionality is similar to HT's silicon area requirements, Intel could put both on the compute die and enable one or the other depending on their use case and targeted market.
 

SiliconFly

Golden Member
Mar 10, 2023
1,062
548
96
Then there's the rumor about rentable units. Let's suppose they really help increase performance similar to or even better than HT...
Far, far better than HT. But I don't think Rentable Units are ready yet - at least not for the next couple of years, for sure. This technology is so radical, disruptive and complex that Intel would have to completely rethink some of its foundational design elements. There were rumors that Nova Lake might feature RU, but after reading about the tech, I don't think Nova Lake is going to get it. And most importantly, RU flies in the face of logic and promises too much. At this point, I have to say RU is just pure smoke, fellas. Don't take it too seriously until Intel says otherwise in no uncertain terms.

I mean, instead of running x86 instructions on physical cores, RU runs virtual instructions on virtual RU cores (mapped 1-to-1) after translating the x86 instructions into virtual instructions using a translation layer. Sounds tedious and inefficient, but not really - most of it is actually doable today. The problem arises when the x86 core has to emulate a virtual RU core. A CPU with physical RU cores could do all this, and they even have a proof of concept to show for it. But an x86 core emulating an RU core at the hardware level is just absurd - an architectural nightmare. It sounds insane; it's like an x86 core casually emulating an ARM core at the same time! Yikes!

But hypothetically speaking - assuming Intel can get it working, that it works at its theoretical maximum, that the threads are one hundred percent RU-friendly and perfectly sliceable, and that all conditions are optimal - here is how it goes:

(1) When the CPU executes a single thread on a single core, the thread executes at the usual speed of 1 ST. This is normal. Nothing unusual about it.

(2) But when RU kicks in, it slices that one single thread into multiple pieces and executes those pieces simultaneously on multiple RU cores!!!!! Yikes!

That is, if a single thread is cut into 4 smaller pieces and executed on 4 different RU cores simultaneously, it runs at the speed of 4 ST (i.e. in 1/4 of the time), or a "400%" boost - pick whichever number you prefer. But this is the concept of RU.

(3) And there are limitations. For example, how do we "cut" or slice a single thread? All these years I didn't even know that was possible, because it's actually not that simple. Even if it were true, I still don't think it can be done in a generic way. Maybe on specific workloads, but definitely not on everything.
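(Even granting the premise, the usual Amdahl's-law arithmetic caps the gain at whatever fraction of the thread is actually sliceable; a quick illustration with made-up fractions:)

```python
# Amdahl's-law ceiling on "slice one thread across N cores".
# The sliceable fractions are made up; the point is the shape of the curve.

def speedup(sliceable: float, n_cores: int) -> float:
    """Serial part runs as-is, the sliceable part is split across n_cores."""
    return 1.0 / ((1.0 - sliceable) + sliceable / n_cores)

for frac in (0.50, 0.90, 0.99):
    row = ", ".join(f"{n} cores -> {speedup(frac, n):.2f}x" for n in (2, 4, 8))
    print(f"sliceable {frac:.0%}: {row}")

# Only a 100% sliceable thread reaches the 4x-on-4-cores figure above;
# at 90% sliceable, 4 cores top out around 3.1x.
```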

RU promises too much, people. Hence, I don't think it's even real. Too many things just don't add up. Probably just pure smoke; a rumor that's somehow gone mainstream. I'm definitely not going to believe in it until I see something real.
 

Doug S

Platinum Member
Feb 8, 2020
2,320
3,678
136
The reason Intel did HT was to increase MT throughput, they were pretty clear about that when they introduced it. They don't need that anymore with their E cores.

Look at it this way. MT throughput is always going to be power limited - you can't run every core at its max frequency in a CPU with a lot of cores. So you (or rather Intel's chip designers) have to ask yourself, where do I get the best increase in performance for each additional watt of power I can pump into the chip?

I'll bet Intel's designers did the math/simulations/benchmarks and determined that if they disabled HT and used the power saved by that to spin up a few more E cores, they got better throughput. What's more, it wouldn't suffer from the vagaries of HT performance where on average it helps but in the wide world of MT workloads there are some where it helps more and some where it actually HURTS. One nice advantage of an extra E core is that it is almost impossible to come up with a benchmark where that will hurt. Maybe it won't help (i.e. you're maxing out memory bandwidth) but you won't see the benchmarks where it makes things worse like you do with HT.
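A back-of-the-envelope version of that trade-off, with purely hypothetical numbers chosen only to show the shape of the argument (none of these figures come from Intel):

```python
# Hypothetical comparison: keep HT on the P-cores vs. spend the same power
# budget on extra E-cores. Every number below is made up for illustration.

P_CORES = 8
P_CORE_PERF = 1.00        # MT throughput of one P-core at its MT operating point
HT_UPLIFT = 0.10          # assumed ~10% extra MT throughput per P-core with HT on
HT_POWER_COST_W = 1.0     # assumed extra watts per P-core when its second thread is busy

E_CORE_PERF = 0.55        # assumed throughput of one E-core at this power level
E_CORE_POWER_W = 2.0      # assumed watts per additional E-core

budget_w = P_CORES * HT_POWER_COST_W          # power freed by turning HT off

ht_gain     = P_CORES * P_CORE_PERF * HT_UPLIFT
e_core_gain = (budget_w / E_CORE_POWER_W) * E_CORE_PERF

print(f"power budget in play:             {budget_w:.0f} W")
print(f"MT throughput gained from HT:     {ht_gain:.2f}")
print(f"MT throughput from extra E-cores: {e_core_gain:.2f}")
# With these made-up numbers the extra E-cores win, and (unlike HT) they never
# make a workload slower; flip the assumptions and HT can win instead. That is
# the kind of math the designers would have run.
```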
 

Saylick

Diamond Member
Sep 10, 2012
3,217
6,585
136

FlameTail

Platinum Member
Dec 15, 2021
2,356
1,276
106

Probably something similar to this. We called it "reverse Hyperthreading".

That inverse hyperthreading is wild stuff.

If Intel can get that working, they'll become the undisputed king of Single Thread performance.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,804
3,266
136
Those cases where single-thread speed doesn't matter they should drop big cores and use just more e-cores. Actually Intel is doing it right now.
Except for all those workloads that have high latency but also need lots of brawn, like relational DBs or generally anything in the server space that is dealing with I/O.

I can't wait for 1000s of terribly performing Kubernetes containers running on 1000s of average-performing cores. But I'm cloud scale!!!!! 2024 IT is lit.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,828
3,659
136
Except for all those workloads that have high latency but also need lots of brawn, like relational DBs or generally anything in the server space that is dealing with I/O.

I can't wait for 1000s of terribly performing Kubernetes containers running on 1000s of average-performing cores. But I'm cloud scale!!!!! 2024 IT is lit.
Isn't Skymont targeting Golden Cove levels of performance?

Hardly average-performing if that is indeed the case.
 