Discussion Future ARM Cortex + Neoverse µArchs Discussion

soresu · Apr 7, 2023

Thought the subject of ARM Ltd CPU core rumours was better discussed in its own thread, so here's what I recently posted in the Apple SoC thread to kick it off.

Found this while trolling Google for hints on ARM's future Neoverse roadmap:

I'm assuming "built on Poseidon" means that V4/Aphrodite is based on the same µArchitecture design team lineage.

Here's the link if you can do better than Google Translate for the text.

Found another nugget of information from the same website about Cortex X5, which apparently is codenamed Logan (and not Chaberton-ELP) according to their information:

This seems to imply that unlike X1 -> X4 it will not just be the Chaberton/Cortex A730 µArch with more resources, but another CPU µArch designed from the ground up primarily for performance.

A true divergence between the mid and high end of ARM CPU design.

Here's the link for the website again.

dark zero · Apr 10, 2023

Seems that Cortex X5 might be the true high performance core ARM needs.
Meanwhile any info about the A7XX and the A5XX series?

That's because the A5XX are being less and less used.

soresu · Apr 11, 2023

dark zero said:
Meanwhile any info about the A7XX and the A5XX series?

Only what was released last year:

Hunter is likely to be called Cortex A720, and Hayes likely to be A520.

We know that there will be a Hunter-ELP core corresponding to Cortex X4.

Chaberton is the 2024 successor to Hunter, so likely named A725/A730.

Hayes will be present at least in the 2023 and 2024 IP announcements - but a 2 year core refresh at the little level for A510 -> Hayes is at least twice as fast as we have seen so far with the A53 -> A55 -> A510 cadence, so hopefully this means ARM Ltd is getting more serious about this segment.

dark zero said:
That's because the A5XX are being less and less used.

On the contrary, it's used everywhere for the lower end of smart devices like streamers.

This is a segment where barebones performance for the mass market is concentrated - products like Amazon Fire TV Sticks and Chromecasts use them and they shift a lot of units.

ikjadoon · Apr 11, 2023

dark zero said:
Seems that Cortex X5 might be the true high performance core ARM needs.
Meanwhile any info about the A7XX and the A5XX series?

That's because the A5XX are being less and less used.

According to Arm,

A700: upgrades in 2023, 2024
A500: upgrades in 2023
DSU cluster: upgrades in 2023

NoC and other upgrades in 2023, too. This year seems like a total SoC-wide update.

From here: https://www.fudzilla.com/news/mobile/55064-arm-details-client-roadmap-until-2024

Rumors are high for Qualcomm’s X4 / CXC23, but I don’t read those tea leaves with any seriousness this far out from shipping products.

Arm is on a solid roll, so let’s hope they can deliver. AMD, Apple, and Intel have now all had a 2-year gap with basically 0% IPC increases.

EDIT: ninja’d by mere seconds haha

soresu · Apr 11, 2023

ikjadoon said:
Arm is on a solid roll, so let’s hope they can deliver. AMD, Apple, and Intel have now all had a 2-year gap with basically 0% IPC increases.

On the AMD side I hear that waiting on CXL was the main reason that Zen4 got delayed, though global supply chain problems and COVID likely didn't help any of them.

On Intel's side it's less a problem with IPC increases than with the power draw increases that seem to have come with them. They aren't quite in Bulldozer territory because they are still performant, but their perf/watt is terrible at the moment, which is likely why they are not releasing much in the server department. That is almost certainly hitting them hard on the bottom line while AMD made bank on Milan and now Genoa EPYC SKUs.

I agree though that it does look like ARM are executing well - that being said the latest big A and X cores from 2022 did not bring such a great perf/watt with their IPC increases (what little A715 had), so I hope the forthcoming µArchs remedy that.

soresu · Apr 11, 2023

ikjadoon said:

Also is it just me or is the white text on light blue background almost illegible?

Lol ARM need to fire their PR firm because I needed to basically invert that diagram to make it readable for my poor eyesight 😭

ikjadoon · May 30, 2023

Here we go:

Arm Unveils 2023 Mobile CPU Core Designs: Cortex-X4, A720, and A520 - the Armv9.2 Family

www.anandtech.com

soresu · May 30, 2023

ikjadoon said:
Here we go:

Arm Unveils 2023 Mobile CPU Core Designs: Cortex-X4, A720, and A520 - the Armv9.2 Family

www.anandtech.com

I can't say I'm particularly impressed by all of this.

X4 looks like a solid single gen boost, but A720 and A520 are just perf/watt focused cores along with some optimisation of security features relating to protection from bad code execution.

Also none of the PR mentions Sophia Antipolis design team involvement at all.

Perhaps they are just obscuring that information now after the whole mess over ARM cores being developed in Austin/Texas being under the remit of "US technology" and not suitable for export to ARM China with the current tech trade restrictions.

Geddagod · May 30, 2023

I'm curious, even the 'big' ARM cores removed the uOP cache with this generation of cores right? Is it easier for ARM to add decoders and remove the uOP cache while for x86, since adding decoders is harder, it's better for them to increase the uOP cache instead of adding more decoders?

soresu · Aug 30, 2023

Geddagod said:
I'm curious, even the 'big' ARM cores removed the uOP cache with this generation of cores right? Is it easier for ARM to add decoders and remove the uOP cache while for x86, since adding decoders is harder, it's better for them to increase the uOP cache instead of adding more decoders?

I don't know about 'easier' per se, but ARM Ltd's CPU core design philosophy does seem to be extremely modular by comparison to that of other µArch designers - in no small part to allow benefits from one design team to be quickly added to a project that another is working on.

JoeRambo · Aug 31, 2023

Geddagod said:
Is it easier for ARM to add decoders and remove the uOP cache while for x86, since adding decoders is harder, it's better for them to increase the uOP cache instead of adding more decoders?

Intel is up to 6 decoders now and 32 bytes per cycle from L1 Inst for big cores, and marketing cores are 6 decoders and 16 bytes per cycle. Byte "reading" includes "predecode and finding instruction boundaries".
The current limit seems to be 6 uOPs from decoders, clearly below what proper 6 decoders can generate and below what 6 "complex + simple" decoders can generate as well. But this "limit" is plenty of width before even considering uOP cache, since rename/allocate is 6 wide as well.

They are ready for future decode expansion now and not really touching anything in the front end in GNR besides improving caching by moving to a 64KB setup (and that also means less spillage to L2, which is carrying the whole company on its back).

Tup3x · Aug 31, 2023

Doesn't look too bad for Neoverse V2.

NVIDIA Grace CPU Offers Up To 2X Performance Versus AMD Genoa & Intel Sapphire Rapids x86 Chips At Same Power

NVIDIA has unveiled new benchmarks of its upcoming Arm-based Grace GPU which will power next generation data centers and servers.

wccftech.com

soresu · Aug 31, 2023

Tup3x said:
Doesn't look too bad for Neoverse V2.

NVIDIA Grace CPU Offers Up To 2X Performance Versus AMD Genoa & Intel Sapphire Rapids x86 Chips At Same Power

NVIDIA has unveiled new benchmarks of its upcoming Arm-based Grace GPU which will power next generation data centers and servers.

wccftech.com

Those slides come from a full article on the ARM company website.

This is the core layout diagram from it:

Kryohi · Aug 31, 2023

Tup3x said:
Doesn't look too bad for Neoverse V2.

NVIDIA Grace CPU Offers Up To 2X Performance Versus AMD Genoa & Intel Sapphire Rapids x86 Chips At Same Power

NVIDIA has unveiled new benchmarks of its upcoming Arm-based Grace GPU which will power next generation data centers and servers.

wccftech.com

Not too bad at all, but in the end basically the same performance as Genoa it seems. Bergamo would beat that and likely approach or surpass it in the efficiency metrics it seems.

ikjadoon · Sep 1, 2023

Kryohi said:
Not too bad at all, but in the end basically the same performance as Genoa it seems. Bergamo would beat that and likely approach or surpass it in the efficiency metrics it seems.

I'm unsure on the efficiency bit.

AMD Genoa (2S x 9654) = 192 Zen4 cores @ 720W to 800W (3.9W / core)
NVIDIA Grace (2S x Grace) = 144 V2 cores @ 500W minus 960GB LPDDR5X (<3.4W / core)
AMD Bergamo (1S x 9754) = 128 Zen4c cores @ 320W to 400W (2.8W / core)
AMD Bergamo (2S x 9754) = 256 Zen4c cores @ 720W to 800W (2.8W / core)

Per-core, V2 is ~33% faster than Zen4—so claims NVIDIA, as we have no independent benchmarks.

On power: V2 should be less than Zen4 (Genoa), but close to or somewhat more than Zen4c (Bergamo).

Or have I missed something here?
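
The per-core figures in the list above are just quoted power divided by core count. A minimal Python sketch of that arithmetic, using only the power ranges and core counts from the post, might look like this. The helper name `watts_per_core` is made up for illustration, the even split across cores is an assumption, and Grace's 500W still includes its LPDDR5X, so its result is only an upper bound:

```python
# Back-of-envelope watts per core from the figures quoted in the post.
# Assumption: power is spread evenly over cores, ignoring uncore, IO and memory.

def watts_per_core(total_watts: float, cores: int) -> float:
    """Hypothetical helper: quoted power divided by core count."""
    return total_watts / cores

# (label, cores, low watts, high watts) exactly as listed in the post
configs = [
    ("Genoa 2S x 9654", 192, 720, 800),
    ("Grace 2S (incl. LPDDR5X)", 144, 500, 500),
    ("Bergamo 1S x 9754", 128, 320, 400),
    ("Bergamo 2S x 9754", 256, 720, 800),
]

# Map each label to its (low, high) watts-per-core range
results = {
    label: (watts_per_core(lo, cores), watts_per_core(hi, cores))
    for label, cores, lo, hi in configs
}

for label, (lo, hi) in results.items():
    print(f"{label}: {lo:.2f} to {hi:.2f} W/core")
```

Run on the quoted ranges, Genoa lands at roughly 3.75 to 4.2 W per core and Grace at about 3.5 W per core before any memory power is subtracted, which is consistent with the rounded single values in the list.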

Markfw · Sep 1, 2023

ikjadoon said:
I'm unsure on the efficiency bit.

AMD Genoa (2S x 9654) = 192 Zen4 cores @ 720W to 800W (3.9W / core)
NVIDIA Grace (2S x Grace) = 144 V2 cores @ 500W minus 960GB LPDDR5X (<3.4W / core)
AMD Bergamo (1S x 9754) = 128 Zen4c cores @ 320W to 400W (2.8W / core)
AMD Bergamo (2S x 9754) = 256 Zen4c cores @ 720W to 800W (2.8W / core)

Per-core, V2 is ~33% faster than Zen4—so claims NVIDIA, as we have no independent benchmarks.

On power: V2 should be less than Zen4 (Genoa), but close to or somewhat more than Zen4c (Bergamo).

Or have I missed something here?

9554 320 watt 2S = 128 cores, but they are a LOT faster. Mine run 2.7 GHz on my 9654 and 3.5 GHz on the 9554, same load. Could make a big difference. This is why we need independent benchmarking. Both the watts and the performance are different. And how can they exclude wattage for memory, when the Genoa I think includes it?

ikjadoon · Sep 1, 2023

Markfw said:
9554 320 watt 2S = 128 cores, but they are a LOT faster. Mine run 2.7 GHz on my 9654 and 3.5 GHz on the 9554, same load. Could make a big difference. This is why we need independent benchmarking. Both the watts and the performance are different. And how can they exclude wattage for memory, when the Genoa I think includes it?

Agreed. I'd not put a single corp above cherry-picking to hell in its marketing. It's telling NVIDIA didn't provide actual SPEC #s & actual power draw.

Genoa TDP is CPU-only, right, as customers can add as much / little RAM as they want?

NVIDIA's power # includes its soldered LPDDR5, so the RAM power is known from the get go (again, NVIDIA does not tease it out).

At 500W, with memory included in that figure, it is fairly power efficient. That AMD EPYC 9654 has a 360W TDP but also has 12 memory channels which can use another 60W+.

From STH's notes, I can imagine many ways NVIDIA is playing with its efficiency / capacity-in-power-constrained-environments:

Technically, Grace doesn't require DIMMs, which add height. So is NVIDIA simply stacking more units?
NVIDIA's fabric & interconnects are claimed to be quite efficient, and LPDDR5 is definitely more efficient. So NVIDIA is likely including all that for its "5 MW" numbers.

For NVIDIA's end-customers, the full stack efficiency is relevant, as IO can eat up so much power on AI workloads.

Just not helpful to us for the CPU perf & CPU power comparison 🤣

//

Again, first-party data: Arm has claimed one Neoverse V2 core (alone) eats 1.4W @ 2.8 GHz (everything else basically unknown, so yeah). This is 2MB L2$, meanwhile Grace is only 1MB L2$.
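
For scale, here is a rough sketch of what Arm's 1.4W claim would imply if it applied to all 144 cores in the 2S Grace setup mentioned earlier. This is an assumption-heavy estimate rather than a measurement: Grace's L2 size differs from Arm's reference config, its clocks are not given here, and the variable names are only illustrative.

```python
# Rough estimate only: applies Arm's claimed per-core power to Grace's core count.
# Assumption: every core draws the quoted 1.4 W (Arm's 2.8 GHz, 2MB L2 config),
# which Grace (1MB L2) may not match.

ARM_CLAIMED_WATTS_PER_CORE = 1.4   # Arm's first-party figure quoted above
GRACE_2S_CORES = 144               # V2 core count quoted earlier in the thread
GRACE_2S_TOTAL_WATTS = 500         # NVIDIA's figure, memory included

cores_only_watts = ARM_CLAIMED_WATTS_PER_CORE * GRACE_2S_CORES
share_of_total = cores_only_watts / GRACE_2S_TOTAL_WATTS

print(f"Cores alone: ~{cores_only_watts:.0f} W "
      f"({share_of_total:.0%} of the {GRACE_2S_TOTAL_WATTS} W figure)")
```

Under those assumptions the cores alone would account for roughly 200W, or about 40% of the 500W figure, with the rest going to memory, fabric, IO and everything else that is not broken out.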

Markfw · Sep 1, 2023

ikjadoon said:
Agreed. I'd not put a single corp above cherry-picking to hell in its marketing. It's telling NVIDIA didn't provide actual SPEC #s & actual power draw.

Genoa TDP is CPU-only, right, as customers can add as much / little RAM as they want?

NVIDIA's power # includes its soldered LPDDR5, so the RAM power is known from the get go (again, NVIDIA does not tease it out).

From STH's notes, I can imagine many ways NVIDIA is playing with its efficiency / capacity-in-power-constrained-environments:

Technically, Grace doesn't require DIMMs, which add height. So is NVIDIA simply stacking more units?

NVIDIA's fabric & interconnects are claimed to be quite efficient, and LPDDR5 is definitely more efficient. So NVIDIA is likely including all that for its "5 MW" numbers.

For NVIDIA's end-customers, the full stack efficiency is relevant, as IO can eat up so much power on AI workloads.

Just not helpful to us for the CPU perf & CPU power comparison 🤣

320 watt is the default for the 9554 and 9654, but both can be set to 400 watt and the performance is quite a bit more. And yes, wattage of Genoa is without RAM, as it's 12 channel, and many servers allow as much as 24 DIMMs per CPU (mine only have 12 DIMM slots). I will wait until a non-NVIDIA source benchmarks the 2, like Phoronix, before I declare one a winner.

ikjadoon · Sep 5, 2023

Not super relevant, but an interesting disclosure about the "cornerstone investors" in Arm's IPO was released today (in alphabetical order):

AMD
Apple
Cadence
Google
Intel
MediaTek
NVIDIA
Samsung
Synopsys
TSMC

Conspicuously, neither Microsoft nor Amazon. AMD is the most curious one: beyond RDNA2 in Samsung's smartphone SoCs, what business does AMD have with Arm beyond some Xilinx stuff?

Nothingness · Sep 6, 2023

ikjadoon said:
AMD is the most curious one: beyond RDNA2 in Samsung's smartphone SoCs, what business does AMD have with Arm beyond some Xilinx stuff?

I was surprised too, but then I remembered this: https://www.anandtech.com/show/6007...cortexa5-processor-for-trustzone-capabilities
I don't know if they still use Arm CPUs for security in their processors.

And then there's this: https://www.howtogeek.com/848691/amd-made-an-arm-chip-for-space-satellites/

And there was the short lived AMD Opteron A1100: https://www.amd.com/system/files/documents/hierofalcon-product-brief.pdf

Is that enough to justify such an interest in the Arm IPO? I can't say. But AMD definitely uses Arm in many places (as does everyone).

soresu · Sep 6, 2023

Nothingness said:
I don't know if they still use Arm CPUs for security in their processors.

I'd be surprised if they didn't dump it for RISC-V eventually to keep the licensing simpler.

NTMBK · Sep 6, 2023

ikjadoon said:
AMD is the most curious one: beyond RDNA2 in Samsung's smartphone SoCs, what business does AMD have with Arm beyond some Xilinx stuff?

They still have an architectural license for ARM as far as I'm aware. It makes sense that they'd want to keep ARM independent though. In terms of "best for AMD", I'd rank the outcomes like this:

1. AMD remains part of the x86 duopoly, and ARM servers don't really penetrate the mainstream outside of hyperscalers
2. AMD competes as an ARM server CPU provider in an open and vibrant ARM market
3. AMD has to compete in a market dominated by a hostile ARM controlled by e.g. Nvidia

This investment is a hedge so that if (1) doesn't work out, they get outcome (2) instead of (3).

poke01 · Sep 6, 2023

soresu said:
I'd be surprised if they didn't dump it for RISC-V eventually to keep the licensing simpler.

RISC-V is not a magic bullet that solves everything.

soresu · Sep 6, 2023

poke01 said:
RISC-V is not a magic bullet that solves everything.

Didn't say so - my point was only that it would be one less company to license IP from.

moinmoin · Sep 6, 2023

ikjadoon said:
AMD is the most curious one: beyond RDNA2 in Samsung's smartphone SoCs, what business does AMD have with Arm beyond some Xilinx stuff?

While it's an older gen, a low-power ARM core is in every single Zen chip.

The PSP itself represents an ARM core (ARM Cortex A5) with the TrustZone extension which is inserted into the main CPU die as a coprocessor. The PSP contains on-chip firmware which is responsible for verifying the SPI ROM and loading off-chip firmware from it.

AMD Platform Security Processor - Wikipedia

en.wikipedia.org
