Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022
720
677
106






With Hot Chips 34 starting this week, Intel will unveil technical details of the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the next-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, which uses EUV lithography, a first for Intel. Intel expects to ship MTL mobile SoCs in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, the first from Intel to use GAA transistors, called RibbonFET.



Comparison of Intel's upcoming U-series CPUs: Core Ultra 100U, Lunar Lake and Panther Lake

Model           | Code Name    | Date      | TDP     | Node              | Tiles | Main Tile       | CPU     | LP E-Cores | LLC   | GPU            | Xe-cores
Core Ultra 100U | Meteor Lake  | Q4 2023   | 15-57 W | Intel 4 + N5 + N6 | 4     | tCPU            | 2P + 8E | 2          | 12 MB | Intel Graphics | 4
?               | Lunar Lake   | Q4 2024   | 17-30 W | N3B + N6          | 2     | CPU + GPU & IMC | 4P + 4E | 0          | 12 MB | Arc            | 8
?               | Panther Lake | Q1 2026 ? | ?       | Intel 18A + N3E   | 3     | CPU + MC        | 4P + 8E | 4          | ?     | Arc            | 12



Comparison of the die size of each tile of Meteor Lake, Arrow Lake, Lunar Lake and Panther Lake

                | Meteor Lake     | Arrow Lake (N3B)                | Lunar Lake    | Panther Lake
Platform        | Mobile H/U only | Desktop & Mobile H/HX           | Mobile U only | Mobile H
Process Node    | Intel 4         | TSMC N3B                        | TSMC N3B      | Intel 18A
Date            | Q4 2023         | Desktop: Q4 2024, H/HX: Q1 2025 | Q4 2024       | Q1 2026 ?
Full Die        | 6P + 8E         | 8P + 16E                        | 4P + 4E       | 4P + 8E
LLC             | 24 MB           | 36 MB ?                         | 12 MB         | ?
tCPU (mm²)      | 66.48           | ?                               | ?             | ?
tGPU (mm²)      | 44.45           | ?                               | ?             | ?
SoC (mm²)       | 96.77           | ?                               | ?             | ?
IOE (mm²)       | 44.45           | ?                               | ?             | ?
Total (mm²)     | 252.15          | ?                               | ?             | ?



Intel Core Ultra 100 - Meteor Lake



As mentioned by Tom's Hardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU tile and the Foveros base tile. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)



 

Attachments

  • PantherLake.png
    283.5 KB · Views: 24,022
  • LNL.png
    881.8 KB · Views: 25,511
Last edited:

Gideon

Golden Member
Nov 27, 2007
1,994
4,946
136
I find it hard to believe that this will be a desktop part. 52 cores at 230W = 4.42W per core on average. Possibly an HEDT part, if it even drops.
I wouldn't be surprised if they brand it HEDT and mark it up, but this power draw shouldn't be a problem at all on the TSMC 2nm or Intel 18A node it will be built on.

1. A 7nm 3990X with an ancient interconnect did 3.0W per core at full load with 64 cores and a 280W TDP total.
2. A 5nm 7945HX can easily power 16 cores at a 55W TDP (220W for 64 cores), losing only around 10% to its desktop brethren at full MT load. Again, with an ancient interconnect.
3. Even a desktop 7950X can sustain a 3.4 GHz all-core clock while drawing only 67W package power in a Minisforum MS-A1 (where the PSU can't handle more).

At worst, Intel has to lower all-core clocks by a couple hundred MHz, but 230W is more than enough for 52 cores.

Whether it is actually useful with 2-channel memory and big-little cores is another discussion entirely.
 

Gideon

Golden Member
Nov 27, 2007
1,994
4,946
136
The standard Zen 6 desktop will likely top out at 24c/48t (2 x 12c CCDs). Beyond that, I believe more memory bandwidth will be needed. AMD's solution for higher bandwidth is a multichannel desktop platform for HPC (Threadripper). I suspect that this is how AMD will compete with the 52-core Nova Lake. I also wonder how Intel intends to feed such a monster.
This is claimed every time someone mentions core-count increases on desktop sockets, going back to pre-Ryzen days: that such CPUs would be useless (to the point of not being worth making) due to memory bandwidth constraints. I have yet to see factual evidence for this.

It should be trivial to prove: just lower the memory speed on existing platforms to a similar ratio and see what the effect is. In fact, we have plenty of comparisons to make already (the small C sketch after the lists shows the per-core math):

1. Just look at the core counts vs. the bandwidth available on server CPUs, particularly the DDR4 servers of yesteryear (all at recommended max speeds):

Server:
  • 64 core Milan running at DDR4-3200 has 3.2 GB/s per core
  • 96 core Genoa running at DDR5-4800 has 4.8 GB/s per core
  • 128 core Turin running at DDR5-6400 has 4.8 GB/s per core
Desktop:
  • 16 core Ryzen 9950X at DDR5-6000 has 6 GB/s per core
  • hypothetical 32 core Zen 6 at DDR5-8000 has 4 GB/s per core
  • 52 core Nova Lake at DDR5-4800 would have a measly 1.48 GB/s per core
  • 52 core Nova Lake at DDR5-8800 (CUDIMM) would have 2.7 GB/s per core
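(All the per-core figures above are just channels × MT/s × 8 bytes per transfer, divided by core count. A minimal C sketch of that arithmetic, with channel counts assumed from each platform's specs, in case anyone wants to plug in other configs; gbps_per_core is just my own helper name:)

Code:
#include <stdio.h>

/* Back-of-envelope math for the lists above:
   GB/s = channels * MT/s * 8 bytes per transfer / 1000 (decimal GB). */
static double gbps_per_core(int channels, int mts, int cores) {
    return (double)channels * mts * 8.0 / 1000.0 / cores;
}

int main(void) {
    printf("Milan, 64c, 8ch DDR4-3200:   %.2f GB/s per core\n", gbps_per_core(8, 3200, 64));
    printf("Turin, 128c, 12ch DDR5-6400: %.2f GB/s per core\n", gbps_per_core(12, 6400, 128));
    printf("9950X, 16c, 2ch DDR5-6000:   %.2f GB/s per core\n", gbps_per_core(2, 6000, 16));
    printf("NVL, 52c, 2ch DDR5-8800:     %.2f GB/s per core\n", gbps_per_core(2, 8800, 52));
    return 0;
}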

Now this definitely looks bandwidth-starved, and it is, but to what degree? Let's find some benches. I found two with interesting configurations:
  • a 16-core 9950X with DDR5-4800, having 4.8 GB/s per core - TechPowerUp
  • a 16-core 5950X with DDR4-2133, having 2.1 GB/s per core - Tom's (considerably worse than a 52-core Nova Lake)

In TechPowerUp's memory-scaling bench, the 9950X still beat a 7950X running 6000 MT/s EXPO (on average across apps) while losing ~7% of its max performance. TBF, in some benches, like compiling UE5, it lost by a non-trivial amount (dropping to 13700K performance).
In Tom's comparison with the 5950X, DDR4-3600 offered 7.2% more performance than DDR4-2133, and dual-rank memory offered an additional 7%. In y-cruncher and 7-Zip the difference grows to 15.6% and 25.2%. All in all, the difference in most apps was under 10% going from 2.1 GB/s to 3.6 GB/s per core (2133 → 3600 MT/s), and that's with the terrible FCLK tax Zen 3 had at lower memory frequencies.

TL;DR:

While memory bandwidth issues do exist at extreme core counts, they are often overblown to a ridiculous degree.

A 52-core Nova Lake at DDR5-4800 would be terrible, but with fast 8000+ MT/s CUDIMMs it would probably leave only about 10% of its performance on the table vs. a 4-channel equivalent in average workloads (7-Zip etc. would be way worse, of course).

32 cores would have absolutely no issues at all with 8000 MT/s memory.
 
Last edited:
Reactions: Fjodor2001

511

Golden Member
Jul 12, 2024
1,903
1,707
106
Here's why Zen 6 (TSMC N2) will eat Intel 18A:

View attachment 122492


Turin (Zen 5c) is ~20% faster than the Intel 3 Xeon (AFAIK the only Intel 3 product?).
I underlined the models with the same core count (128).


Mark my words: Intel 18A will basically be catching up with Zen 5 in all segments:
mobile
desktop
server

18A may be ~10% faster than Zen 5 in some cases, but it won't compete with Zen 6.
On this I have quite a few thoughts after looking at the servers released; it will be a long post.

First, this data is based on the mean of all the results, which means a few extreme outliers can skew the result, and I also doubt some results for both AMD and Intel, like this one.


A bug in NAMD: will it not be fixed?


Why would AMD publish a SPEC CPU 2017 result in their presentation?




Then there is the process of deciding between AMD and Intel servers:
  • The 9575F is good where you need fewer but faster cores (good for accelerators and CPU-bound workloads).
  • The 9965 is good for virtualization, where you want maximum thread density.
  • The 6980P is good where you need the highest memory bandwidth.
  • For AI, everyone is picking up Xeons because of AMX, for inferencing small models and as a node to drive GPUs in a cluster. There is no way they will beat GPUs as of now.
  • Then there are workloads like OpenSSL and databases, each of which has different requirements depending on what you use.
Then there are the platform features like CXL and PCIe, and the biggest thing: Xeon 6 has accelerators for networking, databases, etc. No one has benchmarked these, and Intel has fully open-source plugins for everything.

Now let me come to the cores in next-gen servers: Zen 6 is on N2, and DMR (Diamond Rapids) is on 18AP using Panther Cove. Zen 6 (N2) is going to be a big upgrade over Zen 5 (N4P), I have no doubts, but Panther Cove is an even bigger upgrade over Granite Rapids (Golden Cove with enhancements); it is basically Redwood Cove -> Lion Cove -> Cougar Cove -> Panther Cove. That's like two tocks and one tick, with a process enhancement.

Now to the node: I firmly believe 18A is at minimum at N3P's level of performance (the source is TSMC, so make sure you have a better counterpoint 🤣) and at most slightly weaker than N2 in perf/watt, but based on Intel's track record with what they call a '+', I have every reason to believe 18AP is slightly better than N2.
 
Last edited:
Reactions: DKR

511

Golden Member
Jul 12, 2024
1,903
1,707
106
And that's 99% of the spending; on top of that, NVIDIA has their own CPU now, tightly integrated with their GPUs, so that cuts any other CPU out of contention.

AMX got almost zero usage; it's DOA.
I think people said that about AVX-512 as well; when it came out it was worse than AMX, because it would throttle the core hard, but look at it now.

That is a very general and broad statement. Could you list more specifics, like who is 'everyone', and what kind of AI workloads?
I was referring to situations where you don't need GPUs, like inferencing small models. Also, I think I need to edit that point; as you said, it's too broad.
Also this new project
 
Last edited:
Reactions: DKR

Win2012R2

Senior member
Dec 5, 2024
895
853
96
I think people said that about AVX-512 as well; when it came out it was worse than AMX, because it would throttle the core hard, but look at it now.
AFAIK AMX throttles frequency down much like the original AVX-512 did. Maybe it does not matter if you run exclusively AMX code all the time, but what if it's mixed? It should be the same bad situation as with the original AVX-512.

AMX should be better for lower inference latency compared to a GPU, but most people will look for reasonable latency plus big throughput, which can only be achieved on a GPU.

I can't see AMD adopting it, which in practice means AMX is a dead end.
 

MS_AT

Senior member
Jul 15, 2024
599
1,253
96
I was referring to situations where you don't need GPUs, like inferencing small models. Also, I think I need to edit that point; as you said, it's too broad.
Oh, I thought you knew some specific usage data. The thing is, small models usually stick to custom data formats that are accelerated by neither AMX nor AVX-512 (AMX supports int8 and bf16, AVX-512 bf16). Most models in use are quantized to stranger data formats (something that could be thought of as int4, int5, int6 and so on), and these require custom kernels where the data is packed specifically to fit in AVX-512 registers (which should also be true for AMX). Seeing that AMX is less popular than AVX-512, I doubt the popular inferencing frameworks (ollama, LM Studio, etc., all really derivatives of llama.cpp) will get support unless Intel provides it on its own, given the custom packing involved and the lack of available hardware on the market. You could use the model packed into bf16, but then you need 4x the memory bandwidth to maintain the same token-generation speed vs. the popular int4-like quants, assuming you have enough memory to fit the model.
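To make the packing point concrete, here is a minimal, hypothetical C sketch of the int4 idea (my own illustration, loosely in the spirit of llama.cpp-style Q4 quants, with the per-block scale factors omitted, not their actual code): two 4-bit weights share one byte, which is why an int4-like quant needs roughly a quarter of the memory bandwidth of bf16 at the same token rate.

Code:
#include <stdint.h>
#include <stdio.h>

/* Two 4-bit weights packed per byte; unpacking is a shift and a mask.
   Real kernels do this across whole AVX-512 registers at once and apply
   a per-block scale; this is just the scalar idea. */
static void unpack_int4(uint8_t packed, int *lo, int *hi) {
    *lo = (packed & 0x0F) - 8;   /* low nibble, offset so the range is -8..7 */
    *hi = (packed >> 4) - 8;     /* high nibble */
}

int main(void) {
    uint8_t byte = 0xA3;         /* packs raw nibbles 3 and 10 */
    int w0, w1;
    unpack_int4(byte, &w0, &w1);
    /* bf16 needs 2 bytes per weight; int4 needs 0.5, hence the 4x gap */
    printf("w0=%d w1=%d\n", w0, w1);   /* prints w0=-5 w1=2 */
    return 0;
}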
 

511

Golden Member
Jul 12, 2024
1,903
1,707
106
Oh, I thought you knew some specific usage data. The thing is, small models usually stick to custom data formats that are accelerated by neither AMX nor AVX-512 (AMX supports int8 and bf16, AVX-512 bf16). Most models in use are quantized to stranger data formats (something that could be thought of as int4, int5, int6 and so on), and these require custom kernels where the data is packed specifically to fit in AVX-512 registers (which should also be true for AMX). Seeing that AMX is less popular than AVX-512, I doubt the popular inferencing frameworks (ollama, LM Studio, etc., all really derivatives of llama.cpp) will get support unless Intel provides it on its own, given the custom packing involved and the lack of available hardware on the market. You could use the model packed into bf16, but then you need 4x the memory bandwidth to maintain the same token-generation speed vs. the popular int4-like quants, assuming you have enough memory to fit the model.
That is true, but AMX is fully supported in llama.cpp and in PyTorch. Intel has provided support since the launch of SPR, which was nice of them, unlike AMD, who haven't bothered to provide ROCm support for RDNA4.


AFAIK AMX throttles frequency down much like the original AVX-512 did. Maybe it does not matter if you run exclusively AMX code all the time, but what if it's mixed? It should be the same bad situation as with the original AVX-512.

AMX should be better for lower inference latency compared to a GPU, but most people will look for reasonable latency plus big throughput, which can only be achieved on a GPU.

I can't see AMD adopting it, which in practice means AMX is a dead end.
LOL, AMD has copied almost everything Intel does in terms of ISA, while the opposite is not true; the only exception is the x86-64 extension.
 

fastandfurious6

Senior member
Jun 1, 2024
498
643
96
On this I have quite a few thoughts after looking at the servers released; it will be a long post.

First, this data is based on the mean of all the results, which means a few extreme outliers can skew the result, and I also doubt some results for both AMD and Intel, like this one.
View attachment 122499

A bug in NAMD: will it not be fixed?
View attachment 122500

Why would AMD publish a SPEC CPU 2017 result in their presentation?

View attachment 122501


Then there is the process of deciding between AMD and Intel servers:
  • The 9575F is good where you need fewer but faster cores (good for accelerators and CPU-bound workloads).
  • The 9965 is good for virtualization, where you want maximum thread density.
  • The 6980P is good where you need the highest memory bandwidth.
  • For AI, everyone is picking up Xeons because of AMX, for inferencing small models and as a node to drive GPUs in a cluster. There is no way they will beat GPUs as of now.
  • Then there are workloads like OpenSSL and databases, each of which has different requirements depending on what you use.
Then there are the platform features like CXL and PCIe, and the biggest thing: Xeon 6 has accelerators for networking, databases, etc. No one has benchmarked these, and Intel has fully open-source plugins for everything.

Now let me come to the cores in next-gen servers: Zen 6 is on N2, and DMR (Diamond Rapids) is on 18AP using Panther Cove. Zen 6 (N2) is going to be a big upgrade over Zen 5 (N4P), I have no doubts, but Panther Cove is an even bigger upgrade over Granite Rapids (Golden Cove with enhancements); it is basically Redwood Cove -> Lion Cove -> Cougar Cove -> Panther Cove. That's like two tocks and one tick, with a process enhancement.

Now to the node: I firmly believe 18A is at minimum at N3P's level of performance (the source is TSMC, so make sure you have a better counterpoint 🤣) and at most slightly weaker than N2 in perf/watt, but based on Intel's track record with what they call a '+', I have every reason to believe 18AP is slightly better than N2.


Very valid points. Interesting how they have similar SPECint perf.

In the Phoronix review of many use cases (https://www.phoronix.com/review/amd-epyc-9965-9755-benchmarks/6), Turin is faster in the majority of them, in most cases by 10-25% and in some even 60%+, while Granite Rapids is only faster, by up to 10%, in just a few use cases; that's how the geometric mean is reached.

Still, it's really hard to believe Intel managed to jump from Intel 7 straight to Intel 3 and 18A so quickly, lol. How did they manage it?

To answer my own question, I think it's AI/LLMs greatly accelerating R&D. Between decision-making and existing assets, they could make it work, so both TSMC and Intel got 3nm- and 2nm-class nodes working quickly; and assuming they're both dependent on ASML machinery as well, they go in tandem.

Maybe it's ASML's R&D, first and foremost, dictating how quickly better nodes come up?
 
Last edited:
Reactions: 511

Win2012R2

Senior member
Dec 5, 2024
895
853
96
LOL, AMD has copied almost everything Intel does in terms of ISA, while the opposite is not true; the only exception is the x86-64 extension.
AMD has a GPU line for AI; they have no need to bloat their CPU cores with AMX, which almost nobody is using. For low-latency inferencing, AVX-512 can also be used. Why apply a massive silicon penalty to all customers when 0.1% will use AMX? Zero chance that is happening.
 

MS_AT

Senior member
Jul 15, 2024
599
1,253
96
Yes, but I think it is one of the benefits of open source. I hope Intel has compensated him for this task.
Hopefully, after they have seen what he has done. Though I am a bit surprised that David Huang is providing support for ROCm kernels; I wonder if he is connected to AMD somehow, hmm. Nevertheless, that's getting off-topic.

Back on topic:
Have you found, by chance, any comparison where Intel is not being compared to Intel for AMX inference? For example, this guy produced very efficient kernels for AVX2/AVX-512: https://github.com/ikawrakow/ik_llama.cpp. It would be nice to see how it compares.
 
Reactions: igor_kavinski

OneEng2

Senior member
Sep 19, 2022
512
742
106
Probably not, but they could always resurrect triple channel topology from the Nehalem days.
... and compete with AMD on price? I think not. Intel is already in a big financial pinch. They can't afford to throw money at the problem any longer, IMO.
Whether it is actually useful with 2-channel memory and big-little cores is another discussion entirely.
True. My concern is more the profitability of such a configuration. More memory channels and a 52-core processor, all to solve what problem?
That such CPUs would be useless (to the point of not being worth making) due to memory bandwidth constraints. I have yet to see factual evidence for this.
Not useless, just not applicable to the vast majority of the market and VERY costly to make.
These are not 52 P-cores. The config is 16 hungry P-cores, 32 less hungry E-cores, and 4 wimpy LP-cores.
Yea, but with the E-cores having IPC close to Zen 4, they will still need a good deal of bandwidth. No way around it. I'll give you the 4 wimpy cores argument, though.
 

511

Golden Member
Jul 12, 2024
1,903
1,707
106
AMD has a GPU line for AI; they have no need to bloat their CPU cores with AMX, which almost nobody is using. For low-latency inferencing, AVX-512 can also be used. Why apply a massive silicon penalty to all customers when 0.1% will use AMX? Zero chance that is happening.
I would disagree; investing in hardware and software is the key part. NVIDIA has the CUDA moat; if they thought like that, they wouldn't have it. And AMX is available on every workstation/server part starting from 4th-gen Sapphire Rapids. Even take AVX-512 as an example: initially, when it launched, it had issues and the software was not ready; now look at it, you will find use cases for AVX-512 in almost all HPC software and a few consumer programs as well.

Hopefully, after they have seen what he has done. Though I am a bit surprised that David Huang is providing support for ROCm kernels; I wonder if he is connected to AMD somehow, hmm. Nevertheless, that's getting off-topic.

Back on topic:
Have you found, by chance, any comparison where Intel is not being compared to Intel for AMX inference? For example, this guy produced very efficient kernels for AVX2/AVX-512: https://github.com/ikawrakow/ik_llama.cpp. It would be nice to see how it compares.
Nope, not yet; if I find anything I will surely ping you.
 
Reactions: MS_AT

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,957
15,931
136
That is true, but AMX is fully supported in llama.cpp and in PyTorch. Intel has provided support since the launch of SPR, which was nice of them, unlike AMD, who haven't bothered to provide ROCm support for RDNA4.



LOL, AMD has copied almost everything Intel does in terms of ISA, while the opposite is not true; the only exception is the x86-64 extension.
Except AMD does not down-clock on a full AVX-512 implementation (Zen 5), like Intel did, and still does on AVX now.
 

fastandfurious6

Senior member
Jun 1, 2024
498
643
96
Intel is already in a big financial pinch. They can't afford to throw money at the problem any longer, IMO.

Yeah, this is Intel's #1 issue right now. They just sold off Altera at a 50% loss on the investment for a quick $4B cash injection. They threw TONS of money at this in recent years trying to aggressively keep up, and had to go to TSMC as well.

Krzanich did immeasurable damage, lmao. Is there a full list of his decisions somewhere?
 

MS_AT

Senior member
Jul 15, 2024
599
1,253
96
Except AMD does not down-clock on a full AVX-512 implementation (Zen 5), like Intel did, and still does on AVX now.
It does if you run a well-optimized kernel that is not memory bound. The problem with Intel was license-based throttling. Recent Intel and AMD CPUs show dynamic load-based throttling, which makes using AVX-512 sparingly a viable option, something that was not possible with the first Intel implementations.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,957
15,931
136
It does if you run a well-optimized kernel that is not memory bound. The problem with Intel was license-based throttling. Recent Intel and AMD CPUs show dynamic load-based throttling, which makes using AVX-512 sparingly a viable option, something that was not possible with the first Intel implementations.
I disagree with the "sparingly a viable option" part; PrimeGrid uses it a lot, and I have multiple servers running it all day for days at a time with no ill effects, including recently, when I (not fully verified yet, I will add, before I am corrected) discovered the 16th-largest known prime number using it.
 

MS_AT

Senior member
Jul 15, 2024
599
1,253
96
I disagree with the "sparingly a viable option" part; PrimeGrid uses it a lot, and I have multiple servers running it all day for days at a time with no ill effects, including recently, when I (not fully verified yet, I will add, before I am corrected) discovered the 16th-largest known prime number using it.
You misunderstood my point. With Skylake-X, unless you had heavy AVX-512 compute kernels, it was not worth it to run AVX-512 code at 512b width; that is why compilers were tuned to emit 256b instructions or to refuse to use AVX-512 at all. The problem was that even a couple of 512b instructions on Skylake-X downclocked your CPU for thousands of cycles. Recent Intel and AMD CPUs don't have this problem, so you can use AVX-512 even for small things, and it no longer makes sense to avoid these instructions on purpose.
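As an illustration of "small things" (my own toy sketch, not anything from the thread): a trivial 512-bit-wide sum like the one below would have been a bad idea to sprinkle into mixed code on Skylake-X, because license-based downclocking penalized the whole core for thousands of cycles, while on Zen 5 and recent Intel parts short bursts like this just run at full clocks.

Code:
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Tiny AVX-512 kernel (compile with -mavx512f). Sums n int32 values,
   16 per 512-bit vector, with a scalar tail for the remainder. */
int64_t sum_i32_avx512(const int32_t *a, size_t n) {
    __m512i acc = _mm512_setzero_si512();
    size_t i = 0;
    for (; i + 16 <= n; i += 16)                 /* 16 ints per vector */
        acc = _mm512_add_epi32(acc, _mm512_loadu_si512(a + i));
    int64_t s = _mm512_reduce_add_epi32(acc);    /* horizontal sum */
    for (; i < n; ++i)                           /* scalar tail */
        s += a[i];
    return s;
}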
 