Discussion Future ARM Cortex + Neoverse µArchs Discussion

SpudLobby · Oct 3, 2023

Antey said:
That comparison is not fair.

Snapdragon 865 was released in 2019 (november 15) vs 2019 (september 10) for apple's A13.

Snapdragon 8 gen 2 was released 2022 (november 15) vs 2023 (september 26) for apple's A17.

For a more accurate comparison you need to wait until november for the Snapdragon 8 gen 3.

I'm estimating 2200-2300 score (IPC + high freq increase), that would means the A17 is 'just' 28-30% faster.

Apple does not retain its perf advantages, apple cores are stagnant.

Fwiw, 8 Gen 3 is rumored to be 3.2GHz at the baseline. So Apple will have a 15% clock advantage. Adjusting them for 3.2GHz, the A17 is at about 2400-2500.

8 Gen 2 phones can hit the 2000’s, even at 3.2GHz iirc, so with a 10-12% gain I think probably 2200-2300 is about right. That’s what rumors show too.

Gap between the X4 and A17 will be like 10%. More interesting to me is power performance curves.

FlameTail · Oct 3, 2023

GUYS THIS IS INSANE!

https://www.reddit.com/r/hardware/comments/125zh6n/_/je8fsk2

So they mention how Cortex X3 with 16 MB L3 is 34% faster than Intel i7 1260P. That is incredible.

What's your thoughts on this thread guys? (Read the entire comment chain)

If increasing the L3 cache from 8 MB to 16 MB can bring such huge gains, why aren't mobile SoC vendors doing it? They could massively close the gap with Apple.

SpudLobby · Oct 3, 2023

FlameTail said:
GUYS THIS IS INSANE!

https://www.reddit.com/r/hardware/comments/125zh6n/_/je8fsk2

So they mention how Cortex X3 with 16 MB L3 is 34% faster than Intel i7 1260P. That is incredible.

What's your thoughts on this thread guys? (Read the entire comment chain)

If increasing the L3 cache from 8 MB to 16 MB can bring such huge gains, why aren't mobile SoC vendors doing it? They could massively close the gap with Apple.

Lmfao don’t read into that. Vince also hypes up the L3 gains and a lot of that is probably dependent on what you’re doing which for the average user besides gaming you are not getting “such huge gains”. The “faster” is also true but at a particular wattage and some benchmark and we know Intel’s aren’t meaningful there anyways — Arm is just taking the flipside of that almost certainly e.g. capping the TDP. It’s not new that a jacked out X3 — especially X4 now — could make a good laptop chip.

The main difference L3 would make is in terms of power, not general performance on a smartphone. The gap is already closing with Apple to some extent anyways btw, but ultimately Apple do have better cores with larger L1s, massive L2s that shift their memory hierarchy down a set. It’s a better design for the purpose (great performance and low power) but costly.

FlameTail · Oct 3, 2023

SpudLobby said:
Lmfao don’t read into that. Vince also hypes up the L3 gains and a lot of that is probably dependent on what you’re doing which for the average user besides gaming you are not getting “such huge gains”. The “faster” is also true but at a particular wattage and some benchmark and we know Intel’s aren’t meaningful there anyways — Arm is just taking the flipside of that almost certainly e.g. capping the TDP. It’s not new that a jacked out X3 — especially X4 now — could make a good laptop chip.

Yeah I was a bit skeptical, hence why I posted here to see a different viewpoint.

SpudLobby said:
The main difference L3 would make is in terms of power, not general performance on a smartphone. The gap is already closing with Apple to some extent anyways btw, but ultimately Apple do have better cores with larger L1s, massive L2s that shift their memory hierarchy down a set. It’s a better design for the purpose (great performance and low power) but costly.

But there would still be a minor improvement in performance, would it not? If we are measuring by benchmarks like Geebench 6, could you estimate the % performance gain going from 8 MB to 16 MB?

FlameTail · Oct 15, 2023

Question: Is reduced yield (apart from cost), the reason why Android SoCs cheap out on the cache?

I came across someone who said putting more SRAM on the die reduces the overall yield.

soresu · Oct 15, 2023

FlameTail said:
Question: Is reduced yield (apart from cost), the reason why Android SoCs cheap out on the cache?

I came across someone who said putting more SRAM on the due reduces the overall yield.

More likely die area and therefore profitability from more SoC's for any given wafer regardless of yield.

If there was a possibility of just switching to a new memory type like advanced MRAM variants over SRAM that had much better bit density without it impacting the wafer cost then any SoC designer would do so in a heartbeat if the performance hit wasn't drastic.

SRAM is simply huge for what you get out of it.

Doug S · Oct 15, 2023

The fact that cache basically doesn't shrink with new nodes anymore makes it a lot more costly to add more. You used to be able to nearly double cache "for free" with a new process, but those days ended with N5 and are completely over with N3E - which has identical cache density to N5. Heck even Apple backed off a previous increase of their SLC to 32 MB down to 24 MB. Though that may have been less about cost but about them hitting diminishing returns where the extra 8 MB wasn't helping enough to be worth the extra die area.

There are also power budgets to keep in mind, while SRAM doesn't require refresh cycles like DRAM they are all active transistors meaning they all leak. You can power gate a CPU core that's not in use or even parts of a CPU core that is in use, but you can't power gate SRAM unless you disable blocks of cache and run with less (and if you do that very often why did you increase your cache size in the first place?)

In addition to that leakage power, the more you use SRAM the more power it draws. So sure using a larger cache might increase performance but in doing so it also burns more power, leaving less for your CPU, GPU, etc. It isn't as simple as "I'll pay 10% more for a 10% bigger die that's got lots of cache and get more performance with no downside". Everything has tradeoffs in the CPU design world.

DrMrLordX · Oct 15, 2023

Phone margins are usually pretty low. OEMs want to cut corners wherever possible.

qmech · Oct 16, 2023

Doug S said:
There are also power budgets to keep in mind, while SRAM doesn't require refresh cycles like DRAM they are all active transistors meaning they all leak. You can power gate a CPU core that's not in use or even parts of a CPU core that is in use, but you can't power gate SRAM unless you disable blocks of cache and run with less (and if you do that very often why did you increase your cache size in the first place?)

In addition to that leakage power, the more you use SRAM the more power it draws. So sure using a larger cache might increase performance but in doing so it also burns more power, leaving less for your CPU, GPU, etc. It isn't as simple as "I'll pay 10% more for a 10% bigger die that's got lots of cache and get more performance with no downside". Everything has tradeoffs in the CPU design world.

Just a note on the power consumption. While massive low-level caches can certainly increase power consumption, ARM claimed significant savings when doubling L2 on their X4 core.

Last level cache is also generally considered a net power saving feature, as it reduces jumps to DRAM. As with most things, there is bound to be an inflection point where increasing cache size hurts power consumption more than it helps.

NTMBK · Oct 16, 2023

FlameTail said:
Question: Is reduced yield (apart from cost), the reason why Android SoCs cheap out on the cache?

I came across someone who said putting more SRAM on the die reduces the overall yield.

You also need a very good design to be able to get low latency with larger cache sizes. The whole reason why we have separate L1 and L2 caches is because there is a trade-off between capacity and latency- the larger a cache, the longer it takes to search. L1 is blazing fast but low capacity, L2 is larger but noticeably slower.

The fact that Apple can get such good latencies with large L1 caches is a testament to the talent of their CPU design teams. Zen 3 cores have a total of 64KB L1, but Apple M1 cores have 288KB L1. Cache design is hard.

Doug S · Oct 16, 2023

NTMBK said:
You also need a very good design to be able to get low latency with larger cache sizes. The whole reason why we have separate L1 and L2 caches is because there is a trade-off between capacity and latency- the larger a cache, the longer it takes to search. L1 is blazing fast but low capacity, L2 is larger but noticeably slower.

The fact that Apple can get such good latencies with large L1 caches is a testament to the talent of their CPU design teams. Zen 3 cores have a total of 64KB L1, but Apple M1 cores have 288KB L1. Cache design is hard.

Apple has it easier than Intel and AMD because they are targeting clock rates ~4 GHz while the latter are targeting ~6 GHz. If for example you have a latency of one nanosecond to your cache that means Apple has 4 cycles of latency while Intel/AMD have 6, or if Apple was OK with a 6 cycle latency they could have a bigger cache, more ways, or more read/write ports.

FlameTail · Oct 17, 2023

Seems like Cortex A510 and Samsung 4nm is not a good mix...

RIP Tensor G3.

hemedans · Oct 17, 2023

FlameTail said:
Seems like Cortex A510 and Samsung 4nm is not a good mix...

RIP Tensor G3.
View attachment 87297

That's old A510, The one used with A715 is refreshed one.

FlameTail · Oct 17, 2023

hemedans said:
That's old A510, The one used with A715 is refreshed one.

Which is measly 5% more efficient, according to ARM themselves.

🤣

qmech · Oct 18, 2023

FlameTail said:
Seems like Cortex A510 and Samsung 4nm is not a good mix...

RIP Tensor G3.
View attachment 87297

Those are, unfortunately, not comparable A510 cores. The Snapdragon 8g1 features a merged core design and less cache at every level above L1 (and possibly even at L1, since ARM allows 64 kB to 128 kB total L1). The Dimensity also features more advanced memory, leading to (according to MediaTek) 17% faster transfer speeds and 15% lower latency - in addition to (according to Samsung) 20% lower power consumption. The merged core design also shares the NEON SIMD engine between pairs of cores, the bandwidth of which can additionally be halved.

So, basically, the two A510 "systems" are *completely* different.

FlameTail · Oct 18, 2023

qmech said:
The Dimensity also features more advanced memory, leading to (according to MediaTek) 17% faster transfer speeds and 15% lower latency - in addition to (according to Samsung) 20% lower power consumption.

By 'memory', you mean RAM ?

So yes the Mediatek *supports* LPDDR5X but the phone in question should have it to take advantage of the gains you mentioned.

qmech said:
The merged core design also shares the NEON SIMD engine between pairs of cores, the bandwidth of which can additionally be halved.

So, basically, the two A510 "systems" are *completely* different.

But shouldn't the merged core design theoretically be more efficient?

qmech · Oct 18, 2023

FlameTail said:
By 'memory', you mean RAM ?

So yes the Mediatek *supports* LPDDR5X but the phone in question should have it to take advantage of the gains you mentioned.

But shouldn't the merged core design theoretically be more efficient?

Yes, by memory I mean RAM. Without knowing the exact model used, one can obviously not be sure what kind of memory is in either phone nor the bandwidth.

As for the merged core design being more "efficient"... That depends very much on which definition of efficient you are using. It is certainly more efficient in terms of die space. It also has the potential to be more energy efficient at minimal performance. Running GB5 on the cores is not in this range of "minimal performance".

However, if you look at the power required to reach a certain level of performance, the merged design would likely need to run at a higher frequency. If the benchmark makes extensive use of NEON SIMD instructions, that frequency differential could be quite substantial. The merged design also has half the L2 cache bandwidth and half the L2 cache per core (256 kB per core *pair* on the 8g1, as opposed to 256 kB per core on the Dimensity). See the annotated die shot in here. Less L2 cache is, generally, going to hurt performance / watt at higher CPU utilization rates. I haven't seen A510 latency tests, but I also wouldn't be surprised if the merged core has a higher latency to L2.

Bottom line is that the merged A510 does have the potential to be more efficient at very low power on cache-light load that doesn't use too much SIMD. The GB5 benchmark is unlikely to illustrate that. (And, frankly, that's not what the cores were meant for.)

hemedans · Oct 18, 2023

FlameTail said:
Which is measly 5% more efficient, according to ARM themselves.

🤣

In real Life A715+A510 combo has really good battery life Compare to A710+A510. Most Reviewers don't have standardized review for battery life, the only one I can find now is Phonearena

Google Pixel 8 battery and charging: Everything you need to know

The Pixel 8 series received a bump in battery capacity, along some other battery and charging improvements. Learn all about it here.

www.phonearena.com

and you can see there in some test Pixel 8 battery life is significantly better.

Let's wait for Notebookcheck and Gsmarena reviews to get better picture.

FlameTail · Oct 18, 2023

These are next generation of Android SoCs. The Tensor is already here, and others will be out in the coming months.

Tensor G3
(1 × X3) + (4 × A715) + (4 × A510)

Exynos 2400
(1 × X4) + (5 × A720) + (4 × A520)

Snapdragon 8 Gen 3
(1 × X4) + (5 × A720) + (2 × A520)

Dimensity 9300
(1× X4) + (3 × X4) + (4 × A720)

Now the interesting thing is that Samsung LSI seems to continue putting stock on the A5xx cores, while the others are moving away from them. Qualcomm just uses 2 A520s and Mediatek has entirely ditched the A5xx.

Meanwhile Samsung LSI (Exynos 2400 as well as Tensor G3- which is technically designed by them), is using 4 A5xx cores.

I remember reading that even ARM themselves recommends only 2 A5xx cores for the TCS23 generation. So why is Samsung doing otherwise? Is there a plausible explanation?

FlameTail · Oct 19, 2023

Found some spicy stuff guys...

https://x.com/QaM_Section31/status/1689899527250251776?s=20

https://x.com/QaM_Section31/status/1689509759957999616?s=20

FlameTail · Oct 19, 2023

Snapdragon 8 Gen 2 : 118.4 mm^2

Dimensity 9200/9200+** : 120.5 mm^2

Geddagod · Oct 19, 2023

A

FlameTail said:
View attachment 87428 View attachment 87429
Snapdragon 8 Gen 2 : 118.4 mm^2

Dimensity 9200/9200+** : 120.5 mm^2

Are these both TSMC N4?

FlameTail · Oct 19, 2023

Geddagod said:
A

Are these both TSMC N4?

The tweet attached says SD8G2 is N4 and D9200 is N4P.

Which... sounds kinda dubious.

Geddagod · Oct 19, 2023

FlameTail said:
The tweet attached says SD8G2 is N4 and D9200 is N4P.

Which... sounds kinda dubious.

Slightly unrelated question, do different companies that use ARM IP customize the layout of the same core for better perf or efficiency? Because the A715 in the snapdragon chip doesnt look like the A715 in the dimensity chip

qmech · Oct 20, 2023

Geddagod said:
Slightly unrelated question, do different companies that use ARM IP customize the layout of the same core for better perf or efficiency? Because the A715 in the snapdragon chip doesnt look like the A715 in the dimensity chip

They are the same.

The quality isn't great (ideally you would take multiple shots at different exposures and combine them). All the pretty die shots from AMD/Intel/Apple are composite shots with individually adjusted exposures.

Compare the top right 715 on the Snapdragon to the bottom right on the Dimensity. If you flip one on the vertical, they appear virtually identical, save for the different exposures. The middle square along the right (with a "notch" in it) is darker on the Dimensity, but it is there.

Discussion Future ARM Cortex + Neoverse µArchs Discussion

Senior member

Diamond Member

Senior member

Diamond Member

Diamond Member

Diamond Member

Platinum Member

Lifer

Member

Lifer

Platinum Member

Diamond Member

Senior member

Diamond Member

Member

Diamond Member

Member

Senior member

Diamond Member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Golden Member

Member