Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads

Tigerick · Aug 22, 2022

With Hot Chips 34 starting this week, Intel will unveil technical information on the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the new-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile built on the Intel 4 process, which uses EUV lithography, a first for Intel. Intel expects to ship the MTL mobile SoC in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, which it calls RibbonFET.

Intel Core Ultra 100 - Meteor Lake

As mentioned by Tom's Hardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU and Foveros tiles. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)

MS_AT · Apr 24, 2025

Schmide said:
But asking how big a register is, isn't exactly cut and dry these days anyways.

I think this is where we disagree. While the physical entry size in the register file is up to the implementation, the SSE architectural register size is defined to be 128b (the xmm register), and AVX-512 with the VL extension supports xmm (128b), ymm (256b) and zmm (512b) registers, each with a well-defined bit width. How this is implemented in HW is another matter.
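To make the "architectural width" point concrete, here is a minimal sketch in C. It only encodes the widths named above; `arch_reg_bits` and `float_lanes` are made-up helper names for illustration, not anything from Intel's documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Architectural widths in bits, as fixed by the ISA. These say nothing
   about how wide a given core's physical register file entries are. */
enum { XMM_BITS = 128, YMM_BITS = 256, ZMM_BITS = 512 };

/* Hypothetical helper: width of a register class by name,
   or 0 if the name is not one of xmm/ymm/zmm. */
unsigned arch_reg_bits(const char *reg_class)
{
    assert(reg_class != NULL);
    if (strcmp(reg_class, "xmm") == 0) return XMM_BITS;
    if (strcmp(reg_class, "ymm") == 0) return YMM_BITS;
    if (strcmp(reg_class, "zmm") == 0) return ZMM_BITS;
    return 0;
}

/* Number of 32-bit float lanes that fit in one register of that class. */
unsigned float_lanes(const char *reg_class)
{
    return arch_reg_bits(reg_class) / 32u;
}
```

For example, `float_lanes("xmm")` gives 4 and `float_lanes("zmm")` gives 16, regardless of which CPU runs the code.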

Schmide said:
They just allocate another virtual register from the physical file making sure writes to memory are made in order.

The compiler cannot allocate more registers than the architecture defines. It will spill to the stack as soon as it thinks it has used all of the architectural registers. Renaming is opaque to the compiler and is used by the OoO engine to solve other problems, like write-after-write or write-after-read hazards. And it's the OoO engine that makes sure the results are observable in program order, not the compiler.
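A toy model of that limit, in C. This is not how a real register allocator works; it only shows that the spill count depends on the architectural register count and that the size of the physical register file never enters the calculation. `spills_needed` is a hypothetical name.

```c
#include <assert.h>

/* x86-64 exposes 16 architectural general-purpose registers. */
#define ARCH_GPRS 16u

/* Toy model: values that are live at the same time, beyond the number
   of registers the compiler may use, have to live in memory.
   usable_regs can be lower than ARCH_GPRS (stack pointer, reserved
   registers), but never higher. */
unsigned spills_needed(unsigned live_values, unsigned usable_regs)
{
    assert(usable_regs <= ARCH_GPRS);
    return live_values > usable_regs ? live_values - usable_regs : 0u;
}
```

With 20 simultaneously live values and 14 usable registers this gives 6 spills, however many physical registers the core has for renaming.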

Win2012R2 · Apr 24, 2025

Schmide said:
How big is a SSE register.

128 bits, as per Intel, who invented it and documented it a few aeons ago.

What's your number?

Schmide said:
Am I right?

Wrong, because you are still confusing hardware implementation with the ISA spec. Sure, there are lots of physical registers now, and obviously they have to be contiguous, so a 512-bit register can be used for 256-bit purposes. So what? Does it mean SSE is suddenly not 128-bit but 512-bit because a modern CPU decided to use one of its 512-bit registers?

No - it's still 128 bit, as per bloody spec.

Schmide said:
On Zen5 it can be 512b with a mask set to 128b. On Zen5 mobile it can be both that and a 256b

No, AVX-512 registers are 512 bits wide - on Zen 4 or Zen 5 or Zen 5C and on Zen 6 they will also be 512 bits wide. In fact even in Zen 1000 they will have to be 512 bits wide, because that's how they are specified!

Schmide · Apr 24, 2025

Schmide said:
If you look at the opcodes spit out by compilers they continuously reuse the same register references over and over rarely stacking more than a couple depths. Does this reuse block execution? No. They just allocate another virtual register from the physical file making sure writes to memory are made in order.

MS_AT said:
The compiler cannot allocate more registers than the architecture defines. It will spill to the stack as soon as it thinks it has used all of the architectural registers. Renaming is opaque to the compiler and is used by the OoO engine to solve other problems, like write-after-write or write-after-read hazards. And it's the OoO engine that makes sure the results are observable in program order, not the compiler.

You took this out of context. I said the compiler is reusing the same registers over and over.

example code

Code:

    int xvals[8] = {0,1,2,3,4,5,6,7};
    int end = sizeof(xvals) / sizeof(int);
    int i = 0;
    do {
        xvals[i]++;
    } while (++i < end);

the loop minus the array setup

Code:

.L2:
        mov     eax, DWORD PTR [rbp-4]
        cdqe
        mov     eax, DWORD PTR [rbp-48+rax*4]
        lea     edx, [rax+1]
        mov     eax, DWORD PTR [rbp-4]
        cdqe
        mov     DWORD PTR [rbp-48+rax*4], edx
        add     DWORD PTR [rbp-4], 1
        mov     eax, DWORD PTR [rbp-4]
        cmp     eax, DWORD PTR [rbp-8]
        setl    al
        test    al, al
        jne     .L2

The same 4 registers are used over and over. It is not going to spill to the stack because they are renamed by the out of order engine. My point was compilers generally don't use more than a few named registers.

MS_AT · Apr 24, 2025

Schmide said:
The same 4 registers are used over and over. It is not going to spill to the stack because they are renamed by the out of order engine

Nope, the reason they are not spilled to the stack is not renaming: the compiler is already storing results on the stack because the array is kept there. In other words, the compiler is loading a value from the stack, incrementing it by one and storing it back to the stack. There is no reason it would need to use more registers than EAX/RAX, since x64 allows memory operands. The OoO engine will rename whatever false dependencies it spots in this code, but the reason you are seeing only EAX/RAX is that your code does not need more active registers. In other words, the compiler does not have to keep the value "alive" in registers for the whole duration of your program.

Out of Order engine is a hardware property. Compiler does not know on what hardware your code will run [most of the time] so it will not assume you have X size of register file to know how well it can do register renaming. The whole magic of OoO is that it is opaque to the compiler and the programmer.
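What renaming does with a reused name like EAX can be sketched as a toy rename table in C. The sizes and all names here (`renamer`, `rename_src`, `rename_dst`) are assumptions for illustration; real hardware also recycles physical registers at retirement, which this sketch leaves out.

```c
#include <assert.h>

#define ARCH_REGS 4    /* toy architectural names, e.g. eax, ebx, ecx, edx */
#define PHYS_REGS 16   /* toy physical register file size (an assumption) */

struct renamer {
    int map[ARCH_REGS];  /* architectural name -> current physical register */
    int next_free;       /* next unused physical register */
};

void renamer_init(struct renamer *r)
{
    for (int i = 0; i < ARCH_REGS; i++)
        r->map[i] = i;               /* identity mapping at start */
    r->next_free = ARCH_REGS;
}

/* A source operand reads the physical register that currently holds
   the architectural register's value. */
int rename_src(const struct renamer *r, int arch)
{
    assert(arch >= 0 && arch < ARCH_REGS);
    return r->map[arch];
}

/* Every write gets a fresh physical register, so a later write to the
   same architectural name does not wait on earlier readers or writers
   (no WAR/WAW hazard). Returns -1 when the toy file is exhausted. */
int rename_dst(struct renamer *r, int arch)
{
    assert(arch >= 0 && arch < ARCH_REGS);
    if (r->next_free >= PHYS_REGS)
        return -1;
    r->map[arch] = r->next_free++;
    return r->map[arch];
}

/* Two back-to-back writes to the same architectural register land in
   different physical registers. Returns 1 if that holds. */
int demo_two_writes_get_distinct_regs(void)
{
    struct renamer r;
    renamer_init(&r);
    int first = rename_dst(&r, 0);
    int second = rename_dst(&r, 0);
    return first != second && rename_src(&r, 0) == second;
}
```

None of this is visible in the instruction stream: the code still says EAX each time, which is why the compiler never has to know the size of the physical file.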

Schmide · Apr 24, 2025

MS_AT said:
Nope, the reason they are not spilled to the stack is not renaming: the compiler is already storing results on the stack because the array is kept there. In other words, the compiler is loading a value from the stack, incrementing it by one and storing it back to the stack. There is no reason it would need to use more registers than EAX/RAX, since x64 allows memory operands. The OoO engine will rename whatever false dependencies it spots in this code, but the reason you are seeing only EAX/RAX is that your code does not need more active registers. In other words, the compiler does not have to keep the value "alive" in registers for the whole duration of your program.

Out of Order engine is a hardware property. Compiler does not know on what hardware your code will run [most of the time] so it will not assume you have X size of register file to know how well it can do register renaming. The whole magic of OoO is that it is opaque to the compiler and the programmer.

The OoO may be opaque to the compiler. Compilers still produce code in ways that help the OoO engine take advantage of register renaming.

I've looked at a lot of code spit out by the compilers. Rarely do you see more than 3 or 4 registers. I don't think I've ever seen a numbered register.

As for referenced values. That's the whole x86 thing. RISC requires you to load everything then operate on it. Part of what I was saying previously was labeled memory is more like named registers.

eek2121 · Apr 24, 2025

dullard said:
I personally cannot fathom a single scenario where Intel sells 100 million of the 52-core variant of Nova Lake. AMD + Intel combined often sell 70 million to 80 million desktops / workstations CPU per year. Intel's portion would be smaller and the portion of just one rumored top-end desktop CPU would be even smaller than that. Divide your number by 10 at least (probably more) to get a much more realistic value. Plus, again, Intel isn't the one buying the memory AND the new memory premium fades quickly after a few months.

There never was a rumor for only a 52-core variant. Here are rumors for 16 core and 28 core versions.

https://videocardz.com/newz/intel-nova-lake-preliminary-desktop-specs-list-52-cores-16p32e4lp-configuration

Here is another rumor from our own source that adds 24 core, 12 core, and 4 core variants.

https://videocardz.com/newz/intel-nova-lake-s-for-desktops-rumored-to-feature-2x8p16e-configuration

Plus, I can't think of a recent time where Intel sold only one desktop variant without lower end models with fewer cores. Take Arrow Lake for example. There is Ultra 9 with 8P + 16E, Ultra 7 with 8P + 12E, and Ultra 5 that is either 6P + 8E or 6P + 4E. The rumored 52-core, if it exists, is only for the very top SK processor (or maybe X if Intel brings that back).

Maybe I missed it, however I don’t think anyone was ever claiming there would only be a 52-core variant.

FWIW I have a hard time believing such a part exists at all. N2 and 18A cost significantly more than previous processes, mastering PPA is a must for both Intel and AMD. The 285k is on N3, and Intel’s core sizes will eat up any improvements that N2/18A provide.

If anything, they will probably have a 12/24 + 4 part at the top end, with 2 6 + 12 core CCDs. I have seen Intel do crazier things, however.

Just recently, they announced another round of layoffs, so I definitely wouldn’t hold my breath.

Win2012R2 · Apr 25, 2025

Abwx said:
That was all theoretical because the data path was still 64b, so only half of the SSE throughput was actually possible

The size of SSE registers, however, was, is and will always be - 128 bits, even if there is no single data type that is 128 bits, which obviously is the case because it's called SIMD for a reason.

Register renaming, compilers, spilling stack, not full data path, dinosaurs roaming the Earth, all that are entirely different matters.

Abwx · Apr 25, 2025

Win2012R2 said:
The size of SSE registers, however, was, is and will always be - 128 bits, even if there is no single data type that is 128 bits, which obviously is the case because it's called SIMD for a reason.

Register renaming, compilers, spilling stack, not full data path, dinosaurs roaming the Earth, all that are entirely different matters.

That was 128b on paper and 64b in real world, the same way as a tank with 2x the volume but with an unchanged output pipe diameter and flow speed.

Win2012R2 · Apr 25, 2025

Abwx said:
That was 128b on paper and 64b in real world, the same way as a tank with 2x the volume but with an unchanged output pipe diameter and flow speed.

So then are AVX-512 registers in Zen 4 also 512 bit only on paper because execution is double pumped and data path can't load whole register in one go, yes or no?

Abwx · Apr 25, 2025

Win2012R2 said:
So then are AVX-512 registers in Zen 4 also 512 bit only on paper because execution is double pumped and data path can't load whole register in one go, yes or no?

Not the same thing, as the Pentium III also lacked the necessary execution resources, which is not the case for Zen 4.

I put it again since you have trouble grasping all the info :

To compensate partially for implementing only half of SSE's architectural width, Katmai implements the SIMD-FP adder as a separate unit on the second dispatch port. This organization allows one half of a SIMD multiply and one half of an independent SIMD add to be issued together, bringing the peak throughput back to four floating point operations per cycle, at least for code with an even distribution of multiplies and adds. The issue was that Katmai's hardware implementation contradicted the parallelism model implied by the SSE instruction set. Programmers faced a code-scheduling dilemma: "Should the SSE code be tuned for Katmai's limited execution resources, or should it be tuned for a future processor with more resources?"

Win2012R2 · Apr 25, 2025

Abwx said:
Not the same thing, as the Pentium III also lacked the necessary execution resources, which is not the case for Zen 4.

Zen 4 totally lacks necessary resources for AVX 512 which is why it's "double pumped", you really have double standards here.

Even Zen 5 can't load 64 bytes from L3 in one go (only L1 and L2), does it make Zen 5's registers half size? No!

Anyway, SSE is 128 bit, as per Intel spec, as per physical implementation, actual execution how fast or slow it is, whether it's microcode even does not matter - spec for size is spec for size, end of.
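As a rough illustration of what "double pumped" means here, a C sketch that performs a 512-bit-wide float add as two 256-bit passes. It models the idea as described in this thread, not the actual Zen 4 pipeline; the function names and the 256-bit datapath constant are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define ZMM_FLOATS  16  /* 512 bits / 32-bit floats */
#define HALF_FLOATS 8   /* lanes per pass on an assumed 256-bit datapath */

/* One pass over a 256-bit half: 8 independent float adds. */
static void add_half(float *dst, const float *a, const float *b)
{
    for (size_t i = 0; i < HALF_FLOATS; i++)
        dst[i] = a[i] + b[i];
}

/* A 512-bit add issued as two 256-bit passes. The architectural result
   is a full 16-lane register either way. Returns the pass count. */
int add_zmm_double_pumped(float *dst, const float *a, const float *b)
{
    assert(dst != NULL && a != NULL && b != NULL);
    add_half(dst, a, b);
    add_half(dst + HALF_FLOATS, a + HALF_FLOATS, b + HALF_FLOATS);
    return 2;
}

/* Checks that the two-pass result equals a plain 16-lane add.
   Returns 1 on success. */
int demo_double_pump_matches(void)
{
    float a[ZMM_FLOATS], b[ZMM_FLOATS], out[ZMM_FLOATS];
    for (int i = 0; i < ZMM_FLOATS; i++) {
        a[i] = (float)i;
        b[i] = 1.0f;
    }
    int passes = add_zmm_double_pumped(out, a, b);
    for (int i = 0; i < ZMM_FLOATS; i++)
        if (out[i] != a[i] + b[i])
            return 0;
    return passes == 2;
}
```

The number of passes changes the throughput, not the 16-lane result, which is the distinction being argued over.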

Abwx · Apr 25, 2025

Win2012R2 said:
Zen 4 totally lacks necessary resources for AVX 512 which is why it's "double pumped", you really have double standards here.

Anyway, SSE is 128 bit, as per Intel spec, as per physical implementation, actual execution how fast or slow it is, whether it's microcode even does not matter - spec for size is spec for size, end of.

A baby needs to be fed with a spoon.
Zen 4 has no such limitation; read more carefully:

This organization allows one half of a SIMD multiply and one half of an independent SIMD add to be issued together bringing the peak throughput back to four floating point operations per cycle — at least for code with an even distribution of multiplies and adds.

Win2012R2 · Apr 25, 2025

Abwx said:
Zen 4 has no such limitation; read more carefully:

If Zen 4 had no limitations using "double pumping", then what limitation did Zen 5 solve?

Abwx · Apr 25, 2025

Win2012R2 said:
If Zen 4 had no limitations using "double pumping", then what limitation did Zen 5 solve?

You are deflecting the subject.

Win2012R2 · Apr 25, 2025

Abwx said:
You are deflecting the subject.

I am not deflecting anything. It is you who is bringing up all sorts of historic limitations in execution which have nothing to do with register size. All CPUs have limitations of different sorts, many of which get relaxed later; that does not affect register size as it is defined in the architecture (the ISA).

Anyway, I am done here, you are clearly operating on a different plane than me on this one.

P.S. Why won't you bring 64-bit memory addressing, which in reality is not 64-bit because CPUs use less real bits for memory addressing, yet the registers used for pointers are 64-bit... or are they now? Rhetorical question.
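One way to see the pointer case is the canonical-address rule for virtual addresses: the register is 64 bits wide, but only the low bits are translated and the rest must be a sign extension. A small C sketch follows; `is_canonical` is a hypothetical helper, and 48 is the common virtual address width (57 with 5-level paging).

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if addr is canonical for a virtual address width of
   va_bits, i.e. bits 63..(va_bits-1) are all copies of the top
   implemented bit. The pointer register itself is always 64 bits. */
int is_canonical(uint64_t addr, unsigned va_bits)
{
    assert(va_bits >= 1 && va_bits <= 64);
    unsigned shift = va_bits - 1;
    uint64_t top = addr >> shift;           /* bits 63..(va_bits-1) */
    uint64_t all_ones = ~UINT64_C(0) >> shift;
    return top == 0 || top == all_ones;
}
```

With `va_bits` set to 48, `0x00007FFFFFFFFFFF` and `0xFFFF800000000000` pass, while `0x0000800000000000` does not.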

OneEng2 · Apr 25, 2025

Fjodor2001 said:
Not according to this a few pages back in this thread.

Current CB24 scores appear to be quite bandwidth limited. Certainly even from that post, it is clear that bandwidth has a significant effect on performance across a wide range of benchmarks.

Assuming that Intel will shell out the dough for state-of-the-art new memory for high volume products is a fallacy IMO.

Assuming that average desktop and laptop users utilize higher core counts than today's processors provide is another fallacy IMO.

Finally, assuming that HPC and DC workloads where higher core counts are justified by the applications will not be bandwidth starved with only 2 channels is equally hard to imagine.

A 52 core Nova Lake (let's just call it a 48 core since I doubt those LP cores are worth the die space anyway) with higher IPC P and E cores like everyone is expecting will crave even more bandwidth than the current Arrow Lake per core.

I see neither the market for a 52 core desktop/laptop processor nor the technical merit of pairing such a beast with only 2 channels of DDR5.

As this is only my opinion, I suspect time will tell.

Thunder 57 · Apr 25, 2025

Why are we talking about SIMD in an Intel thread???

Schmide · Apr 25, 2025

Thunder 57 said:
Why are we talking about SIMD in an Intel thread???

Because 3DNow! registers used the MMX area and SSE had its own.

dullard · Apr 25, 2025

eek2121 said:
Maybe I missed it, however I don’t think anyone was ever claiming there would only be a 52-core variant.
FWIW I have a hard time believing such a part exists at all. N2 and 18A cost significantly more than previous processes, mastering PPA is a must for both Intel and AMD. The 285k is on N3, and Intel’s core sizes will eat up any improvements that N2/18A provide.

If anything, they will probably have a 12/24 + 4 part at the top end, with 2 6 + 12 core CCDs.

That was implied by the "multiply by 100 million" statement which was in reference to the rumored 52 core variant. The need for DDR6 is only for the 52 core variant. There is absolutely no need for the rest of the Nova Lake chips to have DDR6. They may have it, but don't need it. I can certainly see a desktop CPU with one tile of 26 or fewer cores on DDR5 and a dual tile workstation CPU with 52 cores and DDR6 (remember rumors state that Nova Lake separates the memory tile from the CPU tile). Will it happen? I have no idea. But it could.

Take the expense and time to design just one tile with 26 cores for the desktop crowd. Mass produce it for yield and cost savings. Put two of those tiles together for the workstation crowd with a different memory controller. Apple put two M1 tiles together with the M1 Ultra. AMD did it with Threadripper. Heck, Intel did it (poorly) back with the Pentium D, which started the whole glue-it-together meme. Plus, this is exactly what the rumors state: 2x26 cores.

Win2012R2 · Apr 25, 2025

Thunder 57 said:
Why are we talking about SIMD in an Intel thread???

Because SIMD got an I inside...

OneEng2 · Apr 25, 2025

dullard said:
That was implied by the "multiply by 100 million" statement which was in reference to the rumored 52 core variant. The need for DDR6 is only for the 52 core variant. There is absolutely no need for the rest of the Nova Lake chips to have DDR6. They may have it, but don't need it. I can certainly see a desktop CPU with one tile of 26 or fewer cores on DDR5 and a dual tile workstation CPU with 52 cores and DDR6 (remember rumors state that Nova Lake separates the memory tile from the CPU tile). Will it happen? I have no idea. But it could.

Take the expense and time to design just one tile with 26 cores for the desktop crowd. Mass produce it for yield and cost savings. Put two of those tiles together for the workstation crowd with a different memory controller. Apple put two M1 tiles together with the M1 Ultra. AMD did it with Threadripper. Heck, Intel did it (poorly) back with the Pentium D, which started the whole glue-it-together meme. Plus, this is exactly what the rumors state: 2x26 cores.

Seeing how they put 24 cores on N3B @ 114mm2, seems logical that 26 cores would be doable on 18A (or N2). They should be able to keep the die down to around 100mm2 or somewhere near that I would think as well. Doubling this and expecting the RAM to keep up with just 2 channels? That seems a bit optimistic IMO.

I believe I saw a rumor that AMD's 12c CCD would be around 75mm2 on N2. This next round is going to be interesting for sure.

Currently the 16c/32t Zen 5 generally bests Arrow Lake 24c/24t by 5% in multi-threaded loads (pretty close though). So essentially, 1 Zen 5 core = 1.5 Arrow Lake cores overall in MT. So if everything stays in the same proportion next generation, 24 Zen 6 cores would be equivalent to 36 Nova Lake cores in MT.
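The arithmetic behind that ratio, as a small C sketch. `per_core_ratio` is a made-up helper and the inputs are just the numbers quoted above (16 cores vs 24 cores, roughly 5% ahead).

```c
#include <assert.h>

/* Chip A (cores_a cores) scores `score_ratio` times chip B (cores_b
   cores) in multithreaded work. Returns how many B cores one A core
   is worth. */
double per_core_ratio(double score_ratio, int cores_a, int cores_b)
{
    assert(cores_a > 0 && cores_b > 0);
    return score_ratio * (double)cores_b / (double)cores_a;
}
```

Treating the two chips as equal, `per_core_ratio(1.0, 16, 24)` is 1.5, which gives 24 x 1.5 = 36. Including the 5% lead, `per_core_ratio(1.05, 16, 24)` is about 1.575, so the same 24 cores would correspond to roughly 37.8 cores on the other side.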

I believe we are entering an interesting time with the next generation of processors though. What percentage of the consumer market needs more than a 26 core Nova Lake (or 12c/24t Zen 6)? Sure, you can make one with a dual CCD and a decent memory controller (and potentially faster memory on dual channel), but how many consumers need it?

For the HPC/Workstation work, wouldn't something like Threadripper be better? Much more memory bandwidth and way more threads?

Perhaps this is what the 52 core Nova Lake will be targeting?

511 · Apr 26, 2025

Nothingness said:
I'm really surprised people fight over SSE vs 3dnow, some depending on their preference for Intel or AMD. What I will always remember is 64-bit x86, which was designed by AMD; this had a much higher impact on x86 than any SIMD extension.

Also register banks exist. They can even be spotted on floor plans. They are obviously much larger than what the ISA requires due to renaming.

Intel had it designed as well, but they were super focused on Itanium and didn't want AMD to have their money-making x86 license.

How was AMD able to beat Intel in delivering the first x86-64 instruction set? Was Intel too distracted by the Itanium project? If so, wh...

Answer (1 of 7): Itanium was to be Intel’s flagship for the x86–64 market. They did feel confident their marketshare would simply dominate the competition and establish Itanium as the de facto cpu in the x86–64 market as they had in the x86 market. The Intel problems however were many; * A powe...

www.quora.com

511 · Apr 26, 2025

Schmide said:
You took this out of context. I said the compiler is reusing the same registers over and over.

example code

Code:

    int xvals[8] = {0,1,2,3,4,5,6,7};
    int end = sizeof(xvals) / sizeof(int);
    int i = 0;
    do {
        xvals[i]++;
    } while (++i < end);

the loop minus the array setup

Code:

.L2:
        mov     eax, DWORD PTR [rbp-4]
        cdqe
        mov     eax, DWORD PTR [rbp-48+rax*4]
        lea     edx, [rax+1]
        mov     eax, DWORD PTR [rbp-4]
        cdqe
        mov     DWORD PTR [rbp-48+rax*4], edx
        add     DWORD PTR [rbp-4], 1
        mov     eax, DWORD PTR [rbp-4]
        cmp     eax, DWORD PTR [rbp-8]
        setl    al
        test    al, al
        jne     .L2

The same 4 registers are used over and over. It is not going to spill to the stack because they are renamed by the out of order engine. My point was compilers generally don't use more than a few named registers.

There are only 16 GPRs available in x86-64 to begin with.

OneEng2 said:
Seeing how they put 24 cores on N3B @ 114mm2, seems logical that 26 cores would be doable on 18A (or N2). They should be able to keep the die down to around 100mm2 or somewhere near that I would think as well. Doubling this and expecting the RAM to keep up with just 2 channels? That seems a bit optimistic IMO.

I believe I saw a rumor that AMD's 12c CCD would be around 75mm2 on N2. This next round is going to be interesting for sure.

Where did it leak? I had guessed around 80mm2, so I was close :rofl:

OneEng2 said:
Currently the 16c/32t Zen 5 generally bests Arrow Lake 24c/24t by 5% in multi-threaded loads (pretty close though). So essentially, 1 Zen 5 core = 1.5 Arrow Lake cores overall in MT. So if everything stays in the same proportion next generation, 24 Zen 6 cores would be equivalent to 36 Nova Lake cores in MT.

View attachment 122734

It has y-cruncher in the mix, which uses AVX-512. When NVL gets AVX-512, that would change. If they are using SVT-AV1 in HandBrake, that also supports AVX-512.

OneEng2 said:
I believe we are entering an interesting time with the next generation of processors though. What percentage of the consumer market needs more than a 26 core Nova Lake (or 12c/24t Zen 6)? Sure, you can make one with a dual CCD and a decent memory controller (and potentially faster memory on dual channel), but how many consumers need it?

For this, the tiles that were leaked were 8+16, 4+8 and 4+0, and the SoC tile contains 4 LPE cores no matter the config. I think a 4+4 (4 cores disabled) + 4 i3 would be ridiculous.
The SKUs can be, but are not limited to:

8+16+4
2*(8+16)+4
2*(4+8)+4
4+8+4
4+0+4

The 8+16 tile is N2, the 4+8+4/4+0 tile is 18AP, and the common SoC tile is shared across all the SKUs.
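For reference, the core totals of the configurations listed above can be worked out with a few lines of C. `total_cores` is a hypothetical helper; it assumes the last term of each configuration is the 4 LPE cores on the shared SoC tile and the rest is P+E per compute tile.

```c
#include <assert.h>

/* Total cores for `tiles` identical compute tiles of p P-cores and
   e E-cores each, plus lp LPE cores on the shared SoC tile. */
int total_cores(int tiles, int p, int e, int lp)
{
    assert(tiles >= 1 && p >= 0 && e >= 0 && lp >= 0);
    return tiles * (p + e) + lp;
}
```

That gives 28 for 8+16+4, 52 for 2*(8+16)+4, 28 for 2*(4+8)+4, 16 for 4+8+4 and 8 for 4+0+4, so the rumored 52-core part is the dual 8+16 configuration.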

OneEng2 said:
For the HPC/Workstation work, wouldn't something like Threadripper be better? Much more memory bandwidth and way more threads?

Perhaps this is what the 52 core Nova Lake will be targeting?

Probably around $1000. This should be a great buy for people looking at ST/MT performance who don't need a ton of PCIe. As for RAM, I think 64 GB DIMMs should be more available by then, so 256 GB should be enough for most enthusiasts.

Thunder 57 · Apr 26, 2025

511 said:
Intel had it designed as well, but they were super focused on Itanium and didn't want AMD to have their money-making x86 license.

View attachment 122737

100% fake. There was absolutely no 64 bit in the P4 until it needed it. I cannot believe I read such stupidity. His 64 bit just happened to be the same as AMD's? What a fool. Totally worthless fake chump. Have I made my opinion clear?

511 · Apr 26, 2025

Thunder 57 said:
100% fake. There was absolutely no 64 bit in the P4 until it needed it. I cannot believe I read such stupidity. His 64 bit just happened to be the same as AMD's? What a fool. Totally worthless fake chump. Have I made my opinion clear?

OK, so you are saying one of the chief architects of x86 of his time is lying? That's pretty bold of you.
