Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022
720
677
106






With Hot Chips 34 starting this week, Intel will unveil technical details of the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the next-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, which uses EUV lithography, a first for Intel. Intel expects to ship MTL mobile SoCs in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, the first from Intel to use GAA transistors, called RibbonFET.



Comparison of Intel's upcoming U-series CPUs: Core Ultra 100U, Lunar Lake and Panther Lake

Model           | Code Name    | Date      | TDP     | Node              | Tiles | Main Tile       | CPU     | LP E-Cores | LLC   | GPU            | Xe-cores
Core Ultra 100U | Meteor Lake  | Q4 2023   | 15-57 W | Intel 4 + N5 + N6 | 4     | tCPU            | 2P + 8E | 2          | 12 MB | Intel Graphics | 4
?               | Lunar Lake   | Q4 2024   | 17-30 W | N3B + N6          | 2     | CPU + GPU & IMC | 4P + 4E | 0          | 12 MB | Arc            | 8
?               | Panther Lake | Q1 2026 ? | ?       | Intel 18A + N3E   | 3     | CPU + MC        | 4P + 8E | 4          | ?     | Arc            | 12



Comparison of the die size of each tile of Meteor Lake, Arrow Lake, Lunar Lake and Panther Lake

                | Meteor Lake     | Arrow Lake (N3B)                | Lunar Lake    | Panther Lake
Platform        | Mobile H/U only | Desktop & Mobile H/HX           | Mobile U only | Mobile H
Process Node    | Intel 4         | TSMC N3B                        | TSMC N3B      | Intel 18A
Date            | Q4 2023         | Desktop: Q4 2024, H/HX: Q1 2025 | Q4 2024       | Q1 2026 ?
Full Die        | 6P + 8E         | 8P + 16E                        | 4P + 4E       | 4P + 8E
LLC             | 24 MB           | 36 MB ?                         | 12 MB         | ?
tCPU (mm²)      | 66.48           | ?                               | ?             | ?
tGPU (mm²)      | 44.45           | ?                               | ?             | ?
SoC (mm²)       | 96.77           | ?                               | ?             | ?
IOE (mm²)       | 44.45           | ?                               | ?             | ?
Total (mm²)     | 252.15          | ?                               | ?             | ?



Intel Core Ultra 100 - Meteor Lake



As mentioned by Tom's Hardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU tile and the Foveros base tile. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)



 

Attachments

  • PantherLake.png
    283.5 KB · Views: 24,022
  • LNL.png
    881.8 KB · Views: 25,511
Last edited:

Gideon

Golden Member
Nov 27, 2007
1,994
4,946
136
I find it hard to believe that this will be a desktop part. 52 cores at 230W = 4.42W per core on average. Possibly an HEDT part, if it even drops.
I wouldn't be surprised if they brand it HEDT and mark it up, but this power draw shouldn't be a problem at all on the TSMC 2nm or Intel 18A node it will be built on.

1. A 7nm 3990X with an ancient interconnect did 3.0W per core at full load with 64 cores and a 280W TDP total.
2. A 5nm 7945HX can easily power 16 cores at a 55W TDP (220W for 64 cores), losing only around 10% to its desktop brethren at full MT load. Again, with an ancient interconnect.
3. Even a desktop 7950X can sustain a 3.4 GHz all-core clock while drawing only 67W package power in a Minisforum MS-A1 (where the PSU can't handle more).

At worst, Intel has to lower all-core clocks by a couple hundred MHz, but 230W is more than enough for 52 cores.

Whether it is actually useful with 2-channel memory and big-little cores is another discussion entirely.
 

Gideon

Golden Member
Nov 27, 2007
1,994
4,946
136
The standard Zen 6 desktop will likely top out at 24c/48t (2 x 12c CCDs). Beyond that, I believe more memory bandwidth will be needed. AMD's solution for higher bandwidth is a multichannel desktop platform for HPC (Threadripper). I suspect that this is how AMD will compete with the 52-core Nova Lake. I also wonder how Intel intends to feed such a monster.
This is claimed every time someone mentions core-count increases on desktop sockets, going back to pre-Ryzen days: that such CPUs would be useless (to the point of not being worth making) due to memory bandwidth constraints. I have yet to see factual evidence for this.

It should be trivial to prove: just lower the memory speed on existing platforms to a similar ratio and see what the effect is. In fact, we have plenty of comparisons to make already (the small C sketch after the lists shows the per-core math):

1. Just look at the core counts vs. the bandwidth available on server CPUs, particularly the DDR4 servers of yesteryear (all at recommended max speeds):

Server:
  • 64 core Milan running at DDR4-3200 has 3.2 GB/s per core
  • 96 core Genoa running at DDR5-4800 has 4.8 GB/s per core
  • 128 core Turin running at DDR5-6400 has 4.8 GB/s per core
Desktop:
  • 16 core Ryzen 9950X at DDR5-6000 has 6 GB/s per core
  • hypothetical 32 core Zen 6 at DDR5-8000 has 4 GB/s per core
  • 52 core Nova Lake at DDR5-4800 would have a measly 1.48 GB/s per core
  • 52 core Nova Lake at DDR5-8800 (CUDIMM) would have 2.7 GB/s per core
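(All the per-core figures above are just channels × MT/s × 8 bytes per transfer, divided by core count. A minimal C sketch of that arithmetic, with channel counts assumed from each platform's specs, in case anyone wants to plug in other configs; gbps_per_core is just my own helper name:)

Code:
#include <stdio.h>

/* Back-of-envelope math for the lists above:
   GB/s = channels * MT/s * 8 bytes per transfer / 1000 (decimal GB). */
static double gbps_per_core(int channels, int mts, int cores) {
    return (double)channels * mts * 8.0 / 1000.0 / cores;
}

int main(void) {
    printf("Milan, 64c, 8ch DDR4-3200:   %.2f GB/s per core\n", gbps_per_core(8, 3200, 64));
    printf("Turin, 128c, 12ch DDR5-6400: %.2f GB/s per core\n", gbps_per_core(12, 6400, 128));
    printf("9950X, 16c, 2ch DDR5-6000:   %.2f GB/s per core\n", gbps_per_core(2, 6000, 16));
    printf("NVL, 52c, 2ch DDR5-8800:     %.2f GB/s per core\n", gbps_per_core(2, 8800, 52));
    return 0;
}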

Now this definitely looks bandwidth-starved, and it is, but to what degree? Let's find some benches. I found two with interesting configurations:
  • a 16-core 9950X with DDR5-4800, having 4.8 GB/s per core - TechPowerUp
  • a 16-core 5950X with DDR4-2133, having 2.1 GB/s per core - Tom's (considerably worse than a 52-core Nova Lake)

In TechPowerUp's memory-scaling bench, the 9950X still beat a 7950X running 6000 MT/s EXPO (on average across apps) while losing ~7% of its max performance. TBF, in some benches, like compiling UE5, it lost by a non-trivial amount (dropping to 13700K performance).
In Tom's comparison with the 5950X, DDR4-3600 offered 7.2% more performance than DDR4-2133, and dual-rank memory offered an additional 7%. In y-cruncher and 7-Zip the difference grows to 15.6% and 25.2%. All in all, the difference in most apps was under 10% going from 2.1 GB/s to 3.6 GB/s per core (2133 → 3600 MT/s), and that's with the terrible FCLK tax Zen 3 had at lower memory frequencies.

TL;DR:

While memory bandwidth issues do exist at extreme core counts, they are often overblown to a ridiculous degree.

A 52-core Nova Lake at DDR5-4800 would be terrible, but with fast 8000+ MT/s CUDIMMs it would probably leave only about 10% of its performance on the table vs. a 4-channel equivalent in average workloads (7-Zip etc. would be way worse, of course).

32 cores would have absolutely no issues at all with 8000 MT/s memory.
 
Last edited:
Reactions: Fjodor2001

511

Golden Member
Jul 12, 2024
1,903
1,707
106
Here's why Zen 6 (TSMC N2) will eat Intel 18A:

View attachment 122492


Turin (Zen 5c) is ~20% faster than the Intel 3 Xeon (AFAIK the only Intel 3 product?).
I underlined the models with the same core count (128).


Mark my words: Intel 18A will basically be catching up with Zen 5 in all segments:
mobile
desktop
server

18A may be ~10% faster than Zen 5 in some cases, but it won't compete with Zen 6.
On this I have quite a few thoughts after looking at the servers released; it will be a long post.

First, this data is based on the mean of all the results, which means a few extreme outliers can skew the result, and I also doubt some results for both AMD and Intel, like this one.


A bug in NAMD: will it not be fixed?


Why would AMD publish a SPEC CPU 2017 result in their presentation?




Then there is the process of deciding between AMD and Intel servers:
  • The 9575F is good where you need fewer but faster cores (good for accelerators and CPU-bound workloads).
  • The 9965 is good for virtualization, where you want maximum thread density.
  • The 6980P is good where you need the highest memory bandwidth.
  • For AI, everyone is picking up Xeons because of AMX, for inferencing small models and as a node to drive GPUs in a cluster. There is no way they will beat GPUs as of now.
  • Then there are workloads like OpenSSL and databases, each of which has different requirements depending on what you use.
Then there are the platform features like CXL and PCIe, and the biggest thing: Xeon 6 has accelerators for networking, databases, etc. No one has benchmarked these, and Intel has fully open-source plugins for everything.

Now let me come to the cores in next-gen servers: Zen 6 is on N2, and DMR (Diamond Rapids) is on 18AP using Panther Cove. Zen 6 (N2) is going to be a big upgrade over Zen 5 (N4P), I have no doubts, but Panther Cove is an even bigger upgrade over Granite Rapids (Golden Cove with enhancements); it is basically Redwood Cove -> Lion Cove -> Cougar Cove -> Panther Cove. That's like two tocks and one tick, with a process enhancement.

Now to the node: I firmly believe 18A is at minimum at N3P's level of performance (the source is TSMC, so make sure you have a better counterpoint 🤣) and at most slightly weaker than N2 in perf/watt, but based on Intel's track record with what they call a '+', I have every reason to believe 18AP is slightly better than N2.
 
Last edited:
Reactions: DKR

511

Golden Member
Jul 12, 2024
1,903
1,707
106
And that's 99% of the spending; on top of that, NVIDIA has their own CPU now, tightly integrated with their GPUs, so that cuts any other CPU out of contention.

AMX got almost zero usage; it's DOA.
I think people said that about AVX-512 as well; when it came out it was worse than AMX, because it would throttle the core hard, but look at it now.

That is a very general and broad statement. Could you list more specifics, like who is 'everyone', and what kind of AI workloads?
I was referring to situations where you don't need GPUs, like inferencing small models. Also, I think I need to edit that point; as you said, it's too broad.
Also this new project
 
Last edited:
Reactions: DKR

Win2012R2

Senior member
Dec 5, 2024
895
853
96
I think people said that about AVX-512 as well; when it came out it was worse than AMX, because it would throttle the core hard, but look at it now.
AFAIK AMX throttles frequency down much like the original AVX-512 did. Maybe it does not matter if you run exclusively AMX code all the time, but what if it's mixed? It should be the same bad situation as with the original AVX-512.

AMX should be better for lower inference latency compared to a GPU, but most people will look for reasonable latency plus big throughput, which can only be achieved on a GPU.

I can't see AMD adopting it, which in practice means AMX is a dead end.
 

MS_AT

Senior member
Jul 15, 2024
599
1,253
96
I was referring to situations where you don't need GPUs, like inferencing small models. Also, I think I need to edit that point; as you said, it's too broad.
Oh, I thought you knew some specific usage data. The thing is, small models usually stick to custom data formats that are accelerated by neither AMX nor AVX-512 (AMX supports int8 and bf16, AVX-512 bf16). Most models in use are quantized to stranger data formats (something that could be thought of as int4, int5, int6 and so on), and these require custom kernels where the data is packed specifically to fit in AVX-512 registers (which should also be true for AMX). Seeing that AMX is less popular than AVX-512, I doubt the popular inferencing frameworks (ollama, LM Studio, etc., all really derivatives of llama.cpp) will get support unless Intel provides it on its own, given the custom packing involved and the lack of available hardware on the market. You could use the model packed into bf16, but then you need 4x the memory bandwidth to maintain the same token-generation speed vs. the popular int4-like quants, assuming you have enough memory to fit the model.
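To make the packing point concrete, here is a minimal, hypothetical C sketch of the int4 idea (my own illustration, loosely in the spirit of llama.cpp-style Q4 quants, with the per-block scale factors omitted, not their actual code): two 4-bit weights share one byte, which is why an int4-like quant needs roughly a quarter of the memory bandwidth of bf16 at the same token rate.

Code:
#include <stdint.h>
#include <stdio.h>

/* Two 4-bit weights packed per byte; unpacking is a shift and a mask.
   Real kernels do this across whole AVX-512 registers at once and apply
   a per-block scale; this is just the scalar idea. */
static void unpack_int4(uint8_t packed, int *lo, int *hi) {
    *lo = (packed & 0x0F) - 8;   /* low nibble, offset so the range is -8..7 */
    *hi = (packed >> 4) - 8;     /* high nibble */
}

int main(void) {
    uint8_t byte = 0xA3;         /* packs raw nibbles 3 and 10 */
    int w0, w1;
    unpack_int4(byte, &w0, &w1);
    /* bf16 needs 2 bytes per weight; int4 needs 0.5, hence the 4x gap */
    printf("w0=%d w1=%d\n", w0, w1);   /* prints w0=-5 w1=2 */
    return 0;
}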
 

511

Golden Member
Jul 12, 2024
1,903
1,707
106
Oh, I thought you knew some specific usage data. The thing is, small models usually stick to custom data formats that are accelerated by neither AMX nor AVX-512 (AMX supports int8 and bf16, AVX-512 bf16). Most models in use are quantized to stranger data formats (something that could be thought of as int4, int5, int6 and so on), and these require custom kernels where the data is packed specifically to fit in AVX-512 registers (which should also be true for AMX). Seeing that AMX is less popular than AVX-512, I doubt the popular inferencing frameworks (ollama, LM Studio, etc., all really derivatives of llama.cpp) will get support unless Intel provides it on its own, given the custom packing involved and the lack of available hardware on the market. You could use the model packed into bf16, but then you need 4x the memory bandwidth to maintain the same token-generation speed vs. the popular int4-like quants, assuming you have enough memory to fit the model.
That is true, but AMX is fully supported in llama.cpp and in PyTorch. Intel has provided support since the launch of SPR, which was nice of them, unlike AMD, who haven't bothered to provide ROCm support for RDNA4.


AFAIK AMX throttles frequency down much like the original AVX-512 did. Maybe it does not matter if you run exclusively AMX code all the time, but what if it's mixed? It should be the same bad situation as with the original AVX-512.

AMX should be better for lower inference latency compared to a GPU, but most people will look for reasonable latency plus big throughput, which can only be achieved on a GPU.

I can't see AMD adopting it, which in practice means AMX is a dead end.
LOL, AMD has copied almost everything Intel does in terms of ISA, while the opposite is not true; the only exception is the x86-64 extension.
 

fastandfurious6

Senior member
Jun 1, 2024
498
643
96
On this I have quite a few thoughts after looking at the servers released; it will be a long post.

First, this data is based on the mean of all the results, which means a few extreme outliers can skew the result, and I also doubt some results for both AMD and Intel, like this one.
View attachment 122499

A bug in NAMD: will it not be fixed?
View attachment 122500

Why would AMD publish a SPEC CPU 2017 result in their presentation?

View attachment 122501


Then there is the process of deciding between AMD and Intel servers:
  • The 9575F is good where you need fewer but faster cores (good for accelerators and CPU-bound workloads).
  • The 9965 is good for virtualization, where you want maximum thread density.
  • The 6980P is good where you need the highest memory bandwidth.
  • For AI, everyone is picking up Xeons because of AMX, for inferencing small models and as a node to drive GPUs in a cluster. There is no way they will beat GPUs as of now.
  • Then there are workloads like OpenSSL and databases, each of which has different requirements depending on what you use.
Then there are the platform features like CXL and PCIe, and the biggest thing: Xeon 6 has accelerators for networking, databases, etc. No one has benchmarked these, and Intel has fully open-source plugins for everything.

Now let me come to the cores in next-gen servers: Zen 6 is on N2, and DMR (Diamond Rapids) is on 18AP using Panther Cove. Zen 6 (N2) is going to be a big upgrade over Zen 5 (N4P), I have no doubts, but Panther Cove is an even bigger upgrade over Granite Rapids (Golden Cove with enhancements); it is basically Redwood Cove -> Lion Cove -> Cougar Cove -> Panther Cove. That's like two tocks and one tick, with a process enhancement.

Now to the node: I firmly believe 18A is at minimum at N3P's level of performance (the source is TSMC, so make sure you have a better counterpoint 🤣) and at most slightly weaker than N2 in perf/watt, but based on Intel's track record with what they call a '+', I have every reason to believe 18AP is slightly better than N2.


Very valid points. Interesting how they have similar SPECint perf.

In the Phoronix review of many use cases (https://www.phoronix.com/review/amd-epyc-9965-9755-benchmarks/6), Turin is faster in the majority of them, in most cases by 10-25% and in some even 60%+, while Granite Rapids is only faster, by up to 10%, in just a few use cases; that's how the geometric mean is reached.

Still, it's really hard to believe Intel managed to jump from Intel 7 straight to Intel 3 and 18A so quickly, lol. How did they manage it?

To answer my own question, I think it's AI/LLMs greatly accelerating R&D. Between decision-making and existing assets, they could make it work, so both TSMC and Intel got 3nm- and 2nm-class nodes working quickly; and assuming they're both dependent on ASML machinery as well, they go in tandem.

Maybe it's ASML's R&D, first and foremost, dictating how quickly better nodes come up?
 
Last edited:
Reactions: 511

Win2012R2

Senior member
Dec 5, 2024
895
853
96
LOL, AMD has copied almost everything Intel does in terms of ISA, while the opposite is not true; the only exception is the x86-64 extension.
AMD has a GPU line for AI; they have no need to bloat their CPU cores with AMX, which almost nobody is using. For low-latency inferencing, AVX-512 can also be used. Why apply a massive silicon penalty to all customers when 0.1% will use AMX? Zero chance that is happening.
 

MS_AT

Senior member
Jul 15, 2024
599
1,253
96
Yes, but I think it is one of the benefits of open source. I hope Intel has compensated him for this task.
Hopefully, after they have seen what he has done. Though I am a bit surprised that David Huang is providing support for ROCm kernels; I wonder if he is connected to AMD somehow, hmm. Nevertheless, that's getting off-topic.

Back on topic:
Have you found, by chance, any comparison where Intel is not being compared to Intel for AMX inference? For example, this guy produced very efficient kernels for AVX2/AVX-512: https://github.com/ikawrakow/ik_llama.cpp. It would be nice to see how it compares.
 
Reactions: igor_kavinski

OneEng2

Senior member
Sep 19, 2022
512
742
106
Probably not, but they could always resurrect triple channel topology from the Nehalem days.
... and compete with AMD on price? I think not. Intel is already in a big financial pinch. They can't afford to throw money at the problem any longer, IMO.
Whether it is actually useful with 2-channel memory and big-little cores is another discussion entirely.
True. My concern is more the profitability of such a configuration. More memory channels and a 52-core processor, all to solve what problem?
That such CPUs would be useless (to the point of not being worth making) due to memory bandwidth constraints. I have yet to see factual evidence for this.
Not useless, just not applicable to the vast majority of the market and VERY costly to make.
These are not 52 P-cores. The config is 16 hungry P-cores, 32 less hungry E-cores, and 4 wimpy LP-cores.
Yea, but with the E-cores having IPC close to Zen 4, they will still need a good deal of bandwidth. No way around it. I'll give you the 4 wimpy cores argument, though.
 

511

Golden Member
Jul 12, 2024
1,903
1,707
106
AMD has a GPU line for AI; they have no need to bloat their CPU cores with AMX, which almost nobody is using. For low-latency inferencing, AVX-512 can also be used. Why apply a massive silicon penalty to all customers when 0.1% will use AMX? Zero chance that is happening.
I would disagree; investing in hardware and software is the key part. NVIDIA has the CUDA moat; if they thought like that, they wouldn't have it. And AMX is available on every workstation/server part starting from 4th-gen Sapphire Rapids. Even take AVX-512 as an example: initially, when it launched, it had issues and the software was not ready; now look at it, you will find use cases for AVX-512 in almost all HPC software and a few consumer programs as well.

Hopefully, after they have seen what he has done. Though I am a bit surprised that David Huang is providing support for ROCm kernels; I wonder if he is connected to AMD somehow, hmm. Nevertheless, that's getting off-topic.

Back on topic:
Have you found, by chance, any comparison where Intel is not being compared to Intel for AMX inference? For example, this guy produced very efficient kernels for AVX2/AVX-512: https://github.com/ikawrakow/ik_llama.cpp. It would be nice to see how it compares.
Nope, not yet; if I find anything I will surely ping you.
 
Reactions: MS_AT

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,957
15,931
136
That is true, but AMX is fully supported in llama.cpp and in PyTorch. Intel has provided support since the launch of SPR, which was nice of them, unlike AMD, who haven't bothered to provide ROCm support for RDNA4.



LOL, AMD has copied almost everything Intel does in terms of ISA, while the opposite is not true; the only exception is the x86-64 extension.
Except AMD does not down-clock on a full AVX-512 implementation (Zen 5), like Intel did, and still does on AVX now.
 

fastandfurious6

Senior member
Jun 1, 2024
498
643
96
Intel is already in a big financial pinch. They can't afford to throw money at the problem any longer, IMO.

Yeah, this is Intel's #1 issue right now. They just sold off Altera at a 50% loss on the investment for a quick $4B cash injection. They threw TONS of money at this in recent years trying to aggressively keep up, and had to go to TSMC as well.

Krzanich did immeasurable damage, lmao. Is there a full list of his decisions somewhere?
 

MS_AT

Senior member
Jul 15, 2024
599
1,253
96
Except AMD does not down-clock on a full AVX-512 implementation (Zen 5), like Intel did, and still does on AVX now.
It does if you run a well-optimized kernel that is not memory bound. The problem with Intel was license-based throttling. Recent Intel and AMD CPUs show dynamic load-based throttling, which makes using AVX-512 sparingly a viable option, something that was not possible with the first Intel implementations.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,957
15,931
136
It does if you run a well-optimized kernel that is not memory bound. The problem with Intel was license-based throttling. Recent Intel and AMD CPUs show dynamic load-based throttling, which makes using AVX-512 sparingly a viable option, something that was not possible with the first Intel implementations.
I disagree with the "sparingly a viable option" part; PrimeGrid uses it a lot, and I have multiple servers running it all day for days at a time with no ill effects, including recently, when I (not fully verified yet, I will add, before I am corrected) discovered the 16th-largest known prime number using it.
 

MS_AT

Senior member
Jul 15, 2024
599
1,253
96
I disagree with the "sparingly a viable option" part; PrimeGrid uses it a lot, and I have multiple servers running it all day for days at a time with no ill effects, including recently, when I (not fully verified yet, I will add, before I am corrected) discovered the 16th-largest known prime number using it.
You misunderstood my point. With Skylake-X, unless you had heavy AVX-512 compute kernels, it was not worth it to run AVX-512 code at 512b width; that is why compilers were tuned to emit 256b instructions or to refuse to use AVX-512 at all. The problem was that even a couple of 512b instructions on Skylake-X downclocked your CPU for thousands of cycles. Recent Intel and AMD CPUs don't have this problem, so you can use AVX-512 even for small things, and it no longer makes sense to avoid these instructions on purpose.
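As an illustration of "small things" (my own toy sketch, not anything from the thread): a trivial 512-bit-wide sum like the one below would have been a bad idea to sprinkle into mixed code on Skylake-X, because license-based downclocking penalized the whole core for thousands of cycles, while on Zen 5 and recent Intel parts short bursts like this just run at full clocks.

Code:
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Tiny AVX-512 kernel (compile with -mavx512f). Sums n int32 values,
   16 per 512-bit vector, with a scalar tail for the remainder. */
int64_t sum_i32_avx512(const int32_t *a, size_t n) {
    __m512i acc = _mm512_setzero_si512();
    size_t i = 0;
    for (; i + 16 <= n; i += 16)                 /* 16 ints per vector */
        acc = _mm512_add_epi32(acc, _mm512_loadu_si512(a + i));
    int64_t s = _mm512_reduce_add_epi32(acc);    /* horizontal sum */
    for (; i < n; ++i)                           /* scalar tail */
        s += a[i];
    return s;
}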
 