Question: What will the CPU industry market share look like in the future? The Rise of ARM?


SpudLobby

Senior member
May 18, 2022
914
617
106
You mean to say doing your own designs and having them fabbed to get exactly what you need, plus cutting out the middleman's profit (whether Intel's, AMD's, or Ampere's), is cheaper for hyperscalers.
Haha. Yes, Doug.
Correct. But this is extremely pedantic, because (and even as a shill I will say this) merchant Arm servers are a joke, and I don’t anticipate that changing.

Ampere indeed is a meme.
 

Doug S

Platinum Member
Feb 8, 2020
2,427
3,922
136
Haha. Yes, Doug.
Correct. But this is extremely pedantic, because (and even as a shill I will say this) merchant Arm servers are a joke, and I don’t anticipate that changing.

Ampere indeed is a meme.

But that's exactly because the hyperscalers are able to do everything the merchant ARM vendors could do. That leaves Ampere etc. chasing smaller customers who aren't big enough to make rolling their own worthwhile, but are big enough that the savings from switching from "industry standard" x86 to ARM are worth the hassle. It's hard to make a living in a niche that's squeezed from both sides.
 

Mahboi

Senior member
Apr 4, 2024
741
1,313
96
But that's exactly because the hyperscalers are able to do everything the merchant ARM vendors could do. That leaves Ampere etc. chasing smaller customers who aren't big enough to make rolling their own worthwhile, but are big enough that the savings from switching from "industry standard" x86 to ARM are worth the hassle. It's hard to make a living in a niche that's squeezed from both sides.
Also Oracle.
I'm convinced that buying from Oracle or being bought by Oracle is some kind of Seal of Doom.
 

soresu

Platinum Member
Dec 19, 2014
2,888
2,098
136
Samsung is also poised to put their own chips in PCs next year.


According to rumours, there will be two versions of their flagship Exynos 2500 chip. The 2500-A and 2500-B.

The 2500-B is the version intended for Windows AI PCs. It will most likely be used in Samsung's own Galaxybooks.
If that diagram represents an accurate accounting of the silicon spent on NPUs, it augurs bad tidings for mobile GPUs going forward.
 

SpudLobby

Senior member
May 18, 2022
914
617
106
But that's exactly because the hyperscalers are able to do everything the merchant ARM vendors could do. That leaves Ampere etc. chasing smaller customers who aren't big enough to make rolling their own worthwhile, but are big enough that the savings from switching from "industry standard" x86 to ARM are worth the hassle. It's hard to make a living in a niche that's squeezed from both sides.
Sure. But it’s also that the advantages Arm (mobile) vendors with a traditional focus on low-power fabrics bring to a PC (notice it’s mostly laptops) don’t seem to be as big of a deal in servers, for obvious reasons. With the hyperscalers, they could literally ship the exact same core and design as AMD/Intel and still save money by cutting out a middleman. I mean, power is still a big deal for TCO, but not in the same way.

I also think you’re underestimating that Ampere just actually sucks. Their core IP sucks. If Qualcomm were doing Nuvia servers it might be a bit different.

Generally agree about the squeeze though.
 

xpea

Senior member
Feb 14, 2014
447
141
116
Not at all. Lol.

The 34 TOPS NPU in Apple A17 uses only 5 mm²
And that's a trashcan-level NPU. The new NV NPU block does 130 TOPS/W/nm2 on N3P for the basic CoPilot redacted app and the AI PC moniker. Serious AI will be done on Blackwell Tensor


no profanity in tech


esquared
Anandtech Forum Director
 
Last edited by a moderator:

FlameTail

Platinum Member
Dec 15, 2021
2,917
1,652
106
And that's a trashcan-level NPU. The new NV NPU block does 130 TOPS/W/nm2
130 TOPS per Watt per "nm²" ?

nanometer squared??
on N3P for the basic CoPilot redacted app and the AI PC moniker. Serious AI will be done on Blackwell Tensor
So are you talking about Nvidia's upcoming AI PC SoC, or their Blackwell GPU?
 

xpea

Senior member
Feb 14, 2014
447
141
116
130 TOPS per Watt per "nm²" ?

nanometer squared??

So are you talking about Nvidia's upcoming AI PC SoC, or their Blackwell GPU?
I was talking about next Nvidia N3P SoC for these new shiny Microsoft AI PCs.
And sorry, my memory failed me (a bit); I went back and checked. So let me explain quickly:
The test chip was on TSMC 5nm and reached nearly 100 TOPS/W (see attached slides). On the next N3P SoC, efficiency should be 30% higher, hence my 130 TOPS/W number. So that part was correct.
But I remembered the area wrong. The test chip is 0.153 mm² (still on 5nm) without the interconnect; I expect double the area for the complete block. But the final SoC is on N3P, so it should be smaller. The real number is difficult to estimate, but NV will dedicate less than 3 mm² to the NPU (with local cache) for 300~500 TOPS (I don't have the final number) at less than 5 W. By the way, for the story, this VS-Quant INT4 DL NPU test chip is present in a corner of each NVSwitch die...
All in all, the important fact is that Nvidia will bring an NPU an order of magnitude faster and more power efficient than anything on the market right now (and I expect that to still be true against the 2025-26 competition). It was one point that put Microsoft on notice to make Nvidia the next preferred vendor for AI PCs on ARM.
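Spelled out as a quick back-of-envelope (a sketch using the figures stated in this thread; the 2x interconnect factor and the node-scaling gains are assumptions, not measured results):

```python
# Back-of-envelope for the claims above; inputs are assumptions, not final specs.
tops_per_w_n5 = 100.0             # measured on the TSMC 5nm test chip (per the slides)
n3p_efficiency_gain = 0.30        # assumed N5 -> N3P power-efficiency uplift

tops_per_w_n3p = tops_per_w_n5 * (1 + n3p_efficiency_gain)
print(f"Projected efficiency on N3P: {tops_per_w_n3p:.0f} TOPS/W")   # -> 130

core_area_n5 = 0.153              # mm², test chip without interconnect
full_block_n5 = core_area_n5 * 2  # assumed 2x area for the complete block
n3p_density_gain = 0.35           # claimed N5 -> N3P density improvement
full_block_n3p = full_block_n5 / (1 + n3p_density_gain)
print(f"Estimated complete block on N3P: {full_block_n3p:.2f} mm²")  # ~0.23 mm² tile
# The quoted <3 mm² product NPU would then be many copies of this tile plus local cache.
```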
 

Attachments

  • VS-quant0.jpg
  • VS-Quant1.jpg

FlameTail

Platinum Member
Dec 15, 2021
2,917
1,652
106
The test chip was on TSMC 5nm and reached nearly 100 TOPS/W (see attached slides). On the next N3P SoC, efficiency should be 30% higher, hence my 130 TOPS/W number. So that part was correct.
But I remembered the area wrong. The test chip is 0.153 mm² (still on 5nm) without the interconnect; I expect double the area for the complete block. But the final SoC is on N3P, so it should be smaller. The real number is difficult to estimate, but NV will dedicate less than 3 mm² to the NPU (with local cache) for 300~500 TOPS (I don't have the final number) at less than 5 W. By the way, for the story, this VS-Quant INT4 DL NPU test chip is present in a corner of each NVSwitch die
Those are insane numbers.
How are they going to feed it?
LPDDR6 to the rescue?
...
All in all, the important fact is that Nvidia will bring an NPU an order of magnitude faster and more power efficient than anything on the market right now (and I expect that to still be true against the 2025-26 competition). It was one point that put Microsoft on notice to make Nvidia the next preferred vendor for AI PCs on ARM.
Do you have any numbers for the NPU efficiency of competitors (Intel/AMD/Qualcomm) ?
 

xpea

Senior member
Feb 14, 2014
447
141
116
Those are insane numbers.
How are they going to feed it?
LPDDR6 to the rescue?

Do you have any numbers for the NPU efficiency of competitors (Intel/AMD/Qualcomm) ?
INT4 needs very little bandwidth, but yes, it's LPDDR6 on package.
The marketing slides I saw say a 10+ times efficiency gain, but I guess they don't compare at the same precision...
 

FlameTail

Platinum Member
Dec 15, 2021
2,917
1,652
106
INT4 needs very little bandwidth, but yes, it's LPDDR6 on package.
Okay, this is something I have been wondering for a long time.

How much memory bandwidth does an NPU use? Can you give a ballpark figure like "1 TOPS of INT8 = 1 GB/s"?

Nobody has been able to give me a satisfactory answer to this question yet.
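My own rough framing of why a single conversion factor may not exist: it depends on arithmetic intensity, i.e. how many ops the NPU performs per byte fetched from memory. A sketch with illustrative intensity values (not measured figures):

```python
# GB/s needed to sustain a given TOPS figure at a given arithmetic intensity
# (ops executed per byte fetched from memory). Intensities here are illustrative.
def required_gbps(tops: float, ops_per_byte: float) -> float:
    return tops * 1e12 / ops_per_byte / 1e9

# Worst case, batch-1 GEMV (e.g. LLM decode): each INT8 weight byte is used for
# ~2 ops (multiply + add), so 45 TOPS would need a hopeless ~22 TB/s:
print(required_gbps(45, ops_per_byte=2))      # 22500.0 GB/s
# Convolutions / batched GEMM reuse each byte hundreds of times on-chip:
print(required_gbps(45, ops_per_byte=500))    # 90.0 GB/s -- LPDDR territory
# INT4 halves the bytes fetched per op, which is why it "needs very little bandwidth".
```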
 

FlameTail

Platinum Member
Dec 15, 2021
2,917
1,652
106
The test chip was on TSMC 5nm and reached nearly 100 TOPS/W (see attached slides). On the next N3P SoC, efficiency should be 30% higher, hence my 130 TOPS/W number. So that part was correct.
But I remembered the area wrong. The test chip is 0.153 mm² (still on 5nm) without the interconnect; I expect double the area for the complete block. But the final SoC is on N3P, so it should be smaller. The real number is difficult to estimate, but NV will dedicate less than 3 mm² to the NPU (with local cache) for 300~500 TOPS (I don't have the final number) at less than 5 W. By the way, for the story, this VS-Quant INT4 DL NPU test chip is present in a corner of each NVSwitch die...
For comparison:

Apple A17 Pro (N3B)
Neural Engine
~5 mm²
34 TOPS of FP16/BF16

Qualcomm Snapdragon 8 Gen 3 (N4P)
NPU
~7 mm²
45 TOPS* of INT8

If the Snapdragon NPU were ported to N3P, it would be about 5 mm² and ~50 TOPS.
I suppose theoretically it could then do 100 TOPS of INT4.

Still, these are far behind the numbers you are quoting for Nvidia, which sound unbelievable. 300 TOPS in 3 mm²!?

What sorcery is this?

I suppose that's what comes of being the market leader in AI, and billions of dollars of R&D.
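Putting the quoted figures side by side as area efficiency (all numbers as cited in this thread; several are estimates or leaks):

```python
# TOPS per mm² implied by each figure quoted in this thread (estimates/leaks included).
claims = {
    "Apple A17 Pro Neural Engine (N3B, FP16)": (34, 5.0),
    "Snapdragon 8 Gen 3 NPU (N4P, INT8)":      (45, 7.0),
    "Claimed Nvidia NPU (N3P, INT4?)":         (300, 3.0),
}
for name, (tops, mm2) in claims.items():
    print(f"{name}: {tops / mm2:.1f} TOPS/mm²")
# ~6.8 and ~6.4 for the shipping parts vs ~100 for the claim: a ~15x gap that a
# node shrink and INT4 alone don't come close to explaining.
```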

______

*Qualcomm hasn't officially disclosed the TOPS figure for Snapdragon 8 Gen 3. This number comes from the leaker Revegnus.
 
Last edited:

soresu

Platinum Member
Dec 19, 2014
2,888
2,098
136
But that's exactly because the hyperscalers are able to do everything the merchant ARM vendors could do. That leaves Ampere etc. chasing smaller customers who aren't big enough to make rolling their own worthwhile, but are big enough that the savings from switching from "industry standard" x86 to ARM are worth the hassle. It's hard to make a living in a niche that's squeezed from both sides.
Ampere don't really make desktop systems.

Outside of that, you have servers and datacenters, where a lot of the "industry standard" software for that market segment has long since been ported to ARM, far in advance of the consumer, DCC, and video game software that we are still waiting on to fill the void in the native ARM market.
 
Feb 17, 2020
104
282
136
Still, these are far behind the numbers you are quoting for Nvidia, which sound unbelievable. 300 TOPS in 3 mm²!?
He's reading the graph and paper wrong. He's taking the efficiency at Vmin, which is always going to be the most efficient operating point, and applying it across the whole operating range.

Due to the low frequency at that point, you'd need somewhere around 30 mm² of silicon to get to 50 TOPS. Going by another table in the paper, they're at 11.7 TOPS/mm² at Vmax (INT8), which puts them at ~4 mm² to get 45 TOPS, not including interconnect.

So potentially quite good, but nowhere close to the 300 TOPS in 3 mm² figure he cites.
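The arithmetic, for anyone checking along (table values as quoted above; silicon area scales linearly with target TOPS at a fixed operating point):

```python
# Same silicon, two operating points (values as quoted from the paper).
target_tops = 45

tops_per_mm2_vmax = 11.7                     # INT8 throughput density at Vmax
print(f"{target_tops / tops_per_mm2_vmax:.1f} mm² at Vmax")  # ~3.8 mm², sans interconnect

tops_per_mm2_vmin = 50 / 30                  # implied by ~30 mm² for 50 TOPS at Vmin
print(f"{target_tops / tops_per_mm2_vmin:.1f} mm² at Vmin")  # ~27 mm²
# Vmin maximizes TOPS/W, Vmax maximizes TOPS/mm²; quoting peak TOPS/W alongside
# peak area density mixes two incompatible operating points.
```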

 

Doug S

Platinum Member
Feb 8, 2020
2,427
3,922
136
Apple used INT8 on A17 Pro and M4.
M3 is FP16.

I wonder what drove that change, aside from "others are using INT8, we have to use it too so we don't look like we're behind"? Maybe M3's NPU didn't support INT8 so it had to be reported in FP16, but the new NPU used in A17P and M4 did, which doubled the score?

Use of INT4 for inference is becoming more common, since it hasn't been shown to be all that much worse than INT8 (or INT16/FP16, for that matter), and there's research on going below INT4 that supposedly still holds up. So NPUs that today report TOPS based on INT8 might double their number tomorrow if they support INT4.
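For the curious, the basic mechanics: a minimal symmetric per-tensor weight quantization sketch (illustrative only, not any particular vendor's scheme):

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor quantization of FP weights into the INT4 range [-7, 7]."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)  # packed two per byte in real HW
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int4(w)
err = np.abs(w - dequantize(q, scale)).max()
print(f"worst-case error: {err:.4f} (half a quantization step, ~max|w|/14)")
```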

For many years the acronym MIPS (millions of instructions per second) was derided as a "meaningless indicator of processor speed". We need something similar for TOPS. Terribly Overblown Performance Specification? Any other ideas?
 
Reactions: Tlh97 and FlameTail

xpea

Senior member
Feb 14, 2014
447
141
116
He's reading the graph and paper wrong. He's taking the efficiency at Vmin, which is always going to be the most efficient operating point, and applying it across the whole operating range.

Due to the low frequency at that point, you'd need somewhere around 30 mm² of silicon to get to 50 TOPS. Going by another table in the paper, they're at 11.7 TOPS/mm² at Vmax (INT8), which puts them at ~4 mm² to get 45 TOPS, not including interconnect.

So potentially quite good, but nowhere close to the 300 TOPS in 3 mm² figure he cites.

The classic story of the half-empty glass...
First, the efficiency I mentioned is true and measured. I didn't twist anything.
Second, the NPU will have its own dedicated power island. That's obvious.
Third, Vmax (1760 MHz) only shows how the circuit reacts at its limit. The final speed of production silicon will be well under that. So you can't use Vmax to estimate the efficiency, as it's far from the optimal point on the curve.
The final number is obviously somewhere in between. And we still have to add that N3P brings a 40% power improvement and a 35% density improvement over N5.

PS: Obviously I don't remember this marketing slide well (it was a live presentation, not a document), but the NV rep was very bullish on the NPU. I will check and come back later.
 

FlameTail

Platinum Member
Dec 15, 2021
2,917
1,652
106
As the OP of this thread, I just wanted to make an adjustment and a clarification.

This thread is to discuss the future of the CPU market, particularly in the context of ARM in general. I have changed the thread title to reflect that. I also added to the first post: (1) Samsung's upcoming Exynos SoC for ARM PCs and (2) rumours that Huawei is working on its own ARM PC processor.

For those who want to discuss Nvidia's ARM SoC exclusively, another thread already exists for that purpose.


Cheers
 
Reactions: SpudLobby

FlameTail

Platinum Member
Dec 15, 2021
2,917
1,652
106

No, Windows on ARM PCs aren't likely to feature Thunderbolt ports, but that does not mean you can't use Thunderbolt accessories
No way.

The new wave of Windows on ARM Copilot+ Windows PCs may seem at risk of losing access to Thunderbolt because of this, but we've already seen new Windows on ARM PCs boasting USB4 ports, which can do all the same things in theory.
The issue with USB4 is that OEMs might put in the lowest-spec version of USB4 and call it a day. Even if it is branded as 40 Gbps USB4, it may not have features like PCIe tunneling or DisplayPort Alt Mode, which are mandatory in the Thunderbolt 4 specification.
 

coercitiv

Diamond Member
Jan 24, 2014
6,341
12,598
136
The issue with USB4 is that OEMs might put in the lowest-spec version of USB4 and call it a day. Even if it is branded as 40 Gbps USB4, it may not have features like PCIe tunneling or DisplayPort Alt Mode, which are mandatory in the Thunderbolt 4 specification.
Think of it this way: having access to TB would not have stopped OEMs from using the lowest USB4 implementation. As always, it's up to them to bring the appropriate connectivity based on their asking price.
 