Discussion Future ARM Cortex + Neoverse µArchs Discussion

soresu

Diamond Member
Dec 19, 2014
3,273
2,549
136
Thought the subject of ARM Ltd CPU core rumours was better discussed in its own thread, so here's what I recently posted in the Apple SoC thread to kick it off...

Found this while trolling Google for hints on ARM's future Neoverse roadmap:


I'm assuming "built on Poseidon" means that V4/Aphrodite is based on the same µArchitecture design team lineage.

Here's the link if you can do better than Google Translate for the text.



Found another nugget of information from the same website about Cortex X5, which apparently is codenamed Logan (and not Chaberton-ELP) according to their information:

This seems to imply that unlike X1 -> X4 it will not just be the Chaberton/Cortex A730 µArch with more resources, but another CPU µArch designed from the ground up primarily for performance.

A true divergence between the mid and high end of ARM CPU design.

Here's the link for the website again.
 

dark zero

Platinum Member
Jun 2, 2015
2,655
140
106
Seems that Cortex X5 might be the true high performance core ARM needs.
Meanwhile any info about the A7XX and the A5XX series?

That's because the A5XX are being less and less used.
 

soresu

Diamond Member
Dec 19, 2014
3,273
2,549
136
Meanwhile any info about the A7XX and the A5XX series?
Only what was released last year:


Hunter is likely to be called Cortex A720, and Hayes likely to be A520.

We know that there will be a Hunter-ELP core corresponding to Cortex X4.

Chaberton is the 2024 successor to Hunter, so likely named A725/A730.

Hayes will be present at least in the 2023 and 2024 IP announcements - but a 2 year core refresh at the little level for A510 -> Hayes is at least twice as fast as we have seen so far with the A53 -> A55 -> A510 cadence, so hopefully this means ARM Ltd is getting more serious about this segment.
That's because the A5XX are being less and less used.
On the contrary, it's used everywhere for the lower end of smart devices like streamers.

This is a segment where barebones performance for the mass market is concentrated - products like Amazon's Fire TV Sticks and Chromecasts use them and they shift a lot of units.
 

ikjadoon

Senior member
Sep 4, 2006
235
513
146
Seems that Cortex X5 might be the true high performance core ARM needs.
Meanwhile any info about the A7XX and the A5XX series?

That's because the A5XX are being less and less used.

According to Arm,

A700: upgrades in 2023, 2024
A500: upgrades in 2023
DSU cluster: upgrades in 2023

NoC and other upgrades in 2023, too. This year seems like a total SoC-wide update.



From here: https://www.fudzilla.com/news/mobile/55064-arm-details-client-roadmap-until-2024

Rumors are high for Qualcomm’s X4 / CXC23, but I don’t read those tea leaves with any seriousness this far out from shipping products.

Arm is on a solid roll, so let’s hope they can deliver. AMD, Apple, and Intel have now all had a 2-year gap with basically 0% IPC increases.

EDIT: ninja’d by mere seconds haha
 

soresu

Diamond Member
Dec 19, 2014
3,273
2,549
136
Arm is on a solid roll, so let’s hope they can deliver. AMD, Apple, and Intel have now all had a 2-year gap with basically 0% IPC increases.
On the AMD side I hear that waiting on CXL was the main reason that Zen4 got delayed, though global supply chain problems and COVID likely didn't help any of them.

On Intel's side it's less a problem with IPC increases than with the power draw increases that seem to have come with them. They aren't quite in Bulldozer territory because they are still performant, but their perf/watt is terrible at the moment, which is likely why they are not releasing much in the server department, even though this is almost certainly hitting their bottom line hard while AMD made bank on Milan and now Genoa EPYC SKUs.

I agree though that it does look like ARM are executing well - that being said the latest big A and X cores from 2022 did not bring such a great perf/watt with their IPC increases (what little A715 had), so I hope the forthcoming µArchs remedy that.
 
Reactions: ikjadoon

soresu

Diamond Member
Dec 19, 2014
3,273
2,549
136
I can't say I'm particularly impressed by all of this.

X4 looks like a solid single gen boost, but A720 and A520 are just perf/watt focused cores along with some optimisation of security features relating to protection from bad code execution.

Also none of the PR mentions Sophia Antipolis design team involvement at all.

Perhaps they are just obscuring that information now after the whole mess over ARM cores developed in Austin, Texas falling under the remit of "US technology" and not being suitable for export to ARM China under the current tech trade restrictions.
 

Geddagod

Golden Member
Dec 28, 2021
1,296
1,368
106
I'm curious, even the 'big' ARM cores removed the uOP cache with this generation of cores right? Is it easier for ARM to add decoders and remove the uOP cache while for x86, since adding decoders is harder, it's better for them to increase the uOP cache instead of adding more decoders?
 

soresu

Diamond Member
Dec 19, 2014
3,273
2,549
136
I'm curious, even the 'big' ARM cores removed the uOP cache with this generation of cores right? Is it easier for ARM to add decoders and remove the uOP cache while for x86, since adding decoders is harder, it's better for them to increase the uOP cache instead of adding more decoders?
I don't know about 'easier' per se, but ARM Ltd's CPU core design philosophy does seem to be extremely modular by comparison to that of other µArch designers - in no small part to allow benefits from one design team to be quickly added to a project that another is working on.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Is it easier for ARM to add decoders and remove the uOP cache while for x86, since adding decoders is harder, it's better for them to increase the uOP cache instead of adding more decoders?

Intel is up to 6 decoders now and 32 bytes per cycle from L1 Inst for big cores, and marketing cores are 6 decoders and 16 bytes per cycle. Byte "reading" includes "predecode and finding instruction boundaries".
The current limit seems to be 6 uOPs from decoders, clearly below what proper 6 decoders can generate and below what 6 "complex + simple" decoders can generate as well. But this "limit" is plenty of width before even considering uOP cache, since rename/allocate is 6 wide as well.

They are ready for future decode expansion now and not really touching anything in the front end in GNR besides improving caching by moving to a 64KB setup (and that also means less spillage to L2 that is carrying the whole company on its back).
 

soresu

Diamond Member
Dec 19, 2014
3,273
2,549
136
Reactions: Gideon and Tlh97

Kryohi

Member
Nov 12, 2019
43
94
91
Doesn't look too bad for Neoverse V2.
Not too bad at all, but in the end basically the same performance as Genoa it seems. Bergamo would beat that and likely approach or surpass it in the efficiency metrics it seems.
 

ikjadoon

Senior member
Sep 4, 2006
235
513
146
Not too bad at all, but in the end basically the same performance as Genoa it seems. Bergamo would beat that and likely approach or surpass it in the efficiency metrics it seems.

I'm unsure on the efficiency bit.

AMD Genoa (2S x 9654) = 192 Zen4 cores @ 720W to 800W (3.9W / core)
NVIDIA Grace (2S x Grace) = 144 V2 cores @ 500W minus 960GB LPDDR5X (<3.4W / core)
AMD Bergamo (1S x 9754) = 128 Zen4c cores @ 320W to 400W (2.8W / core)
AMD Bergamo (2S x 9754) = 256 Zen4c cores @ 720W to 800W (2.8W / core)
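
For anyone who wants to check the per-core figures, here is a minimal Python sketch that divides the quoted package power range by the core count. The numbers are the ones listed above; the helper name `watts_per_core` is made up for illustration. Note that the Grace entry still includes the LPDDR5X, so the core-only number would be lower than what this prints.

```python
# Watts-per-core ranges from the package power figures quoted in the list above.
# The helper name is hypothetical; the inputs are the post's own rough numbers.

def watts_per_core(total_cores, watts_low, watts_high):
    """Return (low, high) watts per core for a given package power range."""
    return watts_low / total_cores, watts_high / total_cores

# name -> (total cores, low watts, high watts)
configs = {
    "Genoa 2S x 9654": (192, 720, 800),
    "Grace 2S (incl. LPDDR5X)": (144, 500, 500),
    "Bergamo 1S x 9754": (128, 320, 400),
    "Bergamo 2S x 9754": (256, 720, 800),
}

for name, (cores, low_w, high_w) in configs.items():
    low, high = watts_per_core(cores, low_w, high_w)
    print(f"{name}: {low:.2f} to {high:.2f} W/core")
```

Treat the output as a sanity check on the rough figures, not a measurement.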

Per-core, V2 is ~33% faster than Zen4—so claims NVIDIA, as we have no independent benchmarks.

On power: V2 should be less than Zen4 (Genoa), but close to or somewhat more than Zen4c (Bergamo).

Or have I missed something here?
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,258
15,387
136
I'm unsure on the efficiency bit.

AMD Genoa (2S x 9654) = 192 Zen4 cores @ 720W to 800W (3.9W / core)
NVIDIA Grace (2S x Grace) = 144 V2 cores @ 500W minus 960GB LPDDR5X (<3.4W / core)
AMD Bergamo (1S x 9754) = 128 Zen4c cores @ 320W to 400W (2.8W / core)
AMD Bergamo (2S x 9754) = 256 Zen4c cores @ 720W to 800W (2.8W / core)

Per-core, V2 is ~33% faster than Zen4—so claims NVIDIA, as we have no independent benchmarks.

On power: V2 should be less than Zen4 (Genoa), but close to or somewhat more than Zen4c (Bergamo).

Or have I missed something here?
9554 320 watt 2S = 128 cores, but they are a LOT faster. Mine run 2.7 GHz on my 9654 and 3.5 GHz on the 9554, same load. That could make a big difference. This is why we need independent benchmarking. Both the watts and the performance are different. And how can they exclude wattage for memory, when the Genoa figure, I think, includes it?
 

ikjadoon

Senior member
Sep 4, 2006
235
513
146
9554 320 watt 2S = 128 cores, but they are a LOT faster. Mine run 2.7 GHz on my 9654 and 3.5 GHz on the 9554, same load. That could make a big difference. This is why we need independent benchmarking. Both the watts and the performance are different. And how can they exclude wattage for memory, when the Genoa figure, I think, includes it?

Agreed. I'd not put a single corp above cherry-picking to hell in its marketing. It's telling NVIDIA didn't provide actual SPEC #s & actual power draw.

Genoa TDP is CPU-only, right, as customers can add as much / little RAM as they want?

NVIDIA's power # includes its soldered LPDDR5, so the RAM power is known from the get go (again, NVIDIA does not tease it out).

At 500W, with memory included in that figure, it is fairly power efficient. That AMD EPYC 9654 has a 360W TDP but also has 12 memory channels which can use another 60W+.
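
As a rough sketch of what that comparison looks like with memory counted on both sides: the 360W TDP, the 60W memory figure (used as a floor, since the text says 60W+), and the 500W Grace number all come from this thread. The function name is hypothetical, and 2S core counts of 192 and 144 are taken from the earlier list.

```python
# Rough platform power with memory counted on both sides.
# All inputs are figures quoted in this thread; 60 W is a floor for "60W+".

def platform_watts(sockets, cpu_tdp_w, memory_w_per_socket):
    """CPU TDP plus memory power, summed over all sockets."""
    return sockets * (cpu_tdp_w + memory_w_per_socket)

genoa_2s = platform_watts(sockets=2, cpu_tdp_w=360, memory_w_per_socket=60)
grace_2s = 500  # quoted figure, already includes the soldered LPDDR5X

print(f"Genoa 2S with memory: {genoa_2s} W, {genoa_2s / 192:.2f} W/core")
print(f"Grace 2S with memory: {grace_2s} W, {grace_2s / 144:.2f} W/core")
```

This says nothing about performance, so it only frames the power side of the argument.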

From STH's notes, I can imagine many ways NVIDIA is playing with its efficiency / capacity-in-power-constrained-environments:
  • Technically, Grace doesn't require DIMMs, which add height. So is NVIDIA simply stacking more units?
  • NVIDIA's fabric & interconnects are claimed to be quite efficient, and LPDDR5 is definitely more efficient. So NVIDIA is likely including all that for its "5 MW" numbers.
For NVIDIA's end-customers, the full stack efficiency is relevant, as IO can eat up so much power on AI workloads.

Just not helpful to us for the CPU perf & CPU power comparison 🤣

//

Again, first-party data: Arm has claimed one Neoverse V2 core (alone) eats 1.4W @ 2.8 GHz (everything else basically unknown, so yeah). That figure is with 2MB L2$, whereas Grace has only 1MB L2$.
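
To put that 1.4W claim next to the 500W package figure from earlier in the thread, here is a quick sketch. It assumes Grace's cores draw about what Arm's reference V2 core does at 2.8 GHz, which is not confirmed anywhere above; the names are illustrative only.

```python
# Cores-only power estimate for a 144-core Grace setup, using Arm's per-core claim.
# Assumption (unverified): Grace cores draw about the same as Arm's reference V2 core.

V2_CORE_WATTS = 1.4        # Arm's claim at 2.8 GHz with 2MB L2
GRACE_CORES = 144          # 2S core count quoted earlier in the thread
GRACE_PACKAGE_WATTS = 500  # quoted figure, includes LPDDR5X

def cores_only_watts(core_count, watts_per_core):
    """Total power attributed to the CPU cores alone."""
    return core_count * watts_per_core

core_watts = cores_only_watts(GRACE_CORES, V2_CORE_WATTS)
share = core_watts / GRACE_PACKAGE_WATTS
print(f"Cores alone: about {core_watts:.0f} W, roughly {share:.0%} of the package figure")
```

If the assumption holds, the remainder would be memory, fabric, and IO, which is why the package figure says little about the cores themselves.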

 
Reactions: Gideon and Tlh97

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,258
15,387
136
Agreed. I'd not put a single corp above cherry-picking to hell in its marketing. It's telling NVIDIA didn't provide actual SPEC #s & actual power draw.

Genoa TDP is CPU-only, right, as customers can add as much / little RAM as they want?

NVIDIA's power # includes its soldered LPDDR5, so the RAM power is known from the get go (again, NVIDIA does not tease it out).



From STH's notes, I can imagine many ways NVIDIA is playing with its efficiency / capacity-in-power-constrained-environments:
  • Technically, Grace doesn't require DIMMs, which add height. So is NVIDIA simply stacking more units?
  • NVIDIA's fabric & interconnects are claimed to be quite efficient, and LPDDR5 is definitely more efficient. So NVIDIA is likely including all that for its "5 MW" numbers.
For NVIDIA's end-customers, the full stack efficiency is relevant, as IO can eat up so much power on AI workloads.

Just not helpful to us for the CPU perf & CPU power comparison 🤣
320 watt is the default for the 9554 and 9654, but both can be set to 400 watt and the performance is quite a bit higher. And yes, the wattage of Genoa is without RAM, as it's 12 channel, and many servers allow as many as 24 DIMMs per CPU (mine only have 12 DIMM slots). I will wait until a non-NVIDIA source like Phoronix benchmarks the two before I declare one a winner.
 

ikjadoon

Senior member
Sep 4, 2006
235
513
146
Not super relevant, but an interesting disclosure about Arm's IPO "cornerstone investors" was released today (in alphabetical order):
  1. AMD
  2. Apple
  3. Cadence
  4. Google
  5. Intel
  6. MediaTek
  7. NVIDIA
  8. Samsung
  9. Synopsys
  10. TSMC
Conspicuously, neither Microsoft nor Amazon. AMD is the most curious one: beyond RDNA2 in Samsung's smartphone SoCs, what business does AMD have with Arm beyond some Xilinx stuff?
 

Nothingness

Diamond Member
Jul 3, 2013
3,134
2,145
136
AMD is the most curious one: beyond RDNA2 in Samsung's smartphone SoCs, what business does AMD have with Arm beyond some Xilinx stuff?
I was surprised too, but then I remembered this: https://www.anandtech.com/show/6007...cortexa5-processor-for-trustzone-capabilities
I don't know if they still use Arm CPUs for security in their processors.

And then there's this: https://www.howtogeek.com/848691/amd-made-an-arm-chip-for-space-satellites/

And there was the short lived AMD Opteron A1100: https://www.amd.com/system/files/documents/hierofalcon-product-brief.pdf

Is that enough to justify such an interest in the Arm IPO? I can't say. But AMD definitely uses Arm in many places (as does everyone).
 
Reactions: ikjadoon and Tlh97

NTMBK

Lifer
Nov 14, 2011
10,328
5,379
136
AMD is the most curious one: beyond RDNA2 in Samsung's smartphone SoCs, what business does AMD have with Arm beyond some Xilinx stuff?
They still have an architectural license for ARM as far as I'm aware. It makes sense that they'd want to keep ARM independent though. In terms of "best for AMD", I'd rank the outcomes like this:

  1. AMD remains part of the x86 duopoly, and ARM servers don't really penetrate the mainstream outside of hyperscalers
  2. AMD competes as an ARM server CPU provider in an open and vibrant ARM market
  3. AMD has to compete in a market dominated by a hostile ARM controlled by e.g. Nvidia
This investment is a hedge so that if (1) doesn't work out, they get outcome (2) instead of (3).
 

moinmoin

Diamond Member
Jun 1, 2017
5,094
8,098
136
AMD is the most curious one: beyond RDNA2 in Samsung's smartphone SoCs, what business does AMD have with Arm beyond some Xilinx stuff?
While it's an older gen, a low-power ARM core is in every single Zen chip.

The PSP itself represents an ARM core (ARM Cortex A5) with the TrustZone extension which is inserted into the main CPU die as a coprocessor. The PSP contains on-chip firmware which is responsible for verifying the SPI ROM and loading off-chip firmware from it.
 
Reactions: ikjadoon