Discussion Apple Silicon SoC thread

Eug · Nov 10, 2020

M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as they do with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:

M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:

Page 78 - Discussion - Apple Silicon SoC thread

M1 Ultra discussion here:

Page 109 - Discussion - Apple Silicon SoC thread

M2 discussion here:

Page 127 - Discussion - Apple Silicon SoC thread

Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, HEVC (H.265), ProRes

M3 Family discussion here:

Page 215 - Discussion - Apple Silicon SoC thread

M4 Family discussion here:

Page 263 - Discussion - Apple Silicon SoC thread

igor_kavinski · May 16, 2024

Eug said:
As for old machines, they run GB on them because they can. Like why not, esp. as a comparison when they get new hardware? It’s free and easy and takes just a couple of minutes.

It's frustrating searching for results. Yesterday the oldest result was from 2022 and there were 300 pages and most of them were these old macbooks. GB browser badly needs filtering options.

Eug · May 16, 2024

igor_kavinski said:
It's frustrating searching for results. Yesterday the oldest result was from 2022 and there were 300 pages and most of them were these old macbooks. GB browser badly needs filtering options.

That is a valid complaint. While the database is freely searchable, it needs finer-grained filtering options. Actually, it was easier to search in GB 4 than it is now with GB 6.

Nothingness · May 16, 2024

I don't know if this was previously posted: https://scalable.uni-jena.de/opt/sme/micro.html

So it is confirmed that M4 has SME with streaming SVE. Vector length is 512-bit. A single core can reach 31 FP32 GFLOPS with SVE fmla (that's mul+add), 111 GFLOPS with NEON, and >2000 GFLOPS with SME fmopa.
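As a rough sanity check on those figures, here is a small sketch that converts the quoted GFLOPS into implied instruction rates. It assumes 512-bit streaming vectors (as stated above), 128-bit NEON registers, and that a fused multiply-add counts as 2 FLOP per FP32 lane. The function names are made up for illustration and nothing here comes from the linked benchmark code.

```python
FP32_BITS = 32

def fmla_flop(vector_bits: int) -> int:
    """FLOP for one vector fmla: one multiply and one add per FP32 lane."""
    return (vector_bits // FP32_BITS) * 2

def fmopa_flop(vector_bits: int) -> int:
    """FLOP for one FP32 outer product: lanes x lanes multiply-adds."""
    lanes = vector_bits // FP32_BITS
    return lanes * lanes * 2

# GFLOPS figures as quoted in the post (the SME value is a lower bound).
quoted_gflops = {"SSVE fmla": 31, "NEON fmla": 111, "SME fmopa": 2000}
flop_per_instr = {
    "SSVE fmla": fmla_flop(512),
    "NEON fmla": fmla_flop(128),
    "SME fmopa": fmopa_flop(512),
}

for name, gflops in quoted_gflops.items():
    # Implied instruction rate, in billions of instructions per second.
    rate = gflops / flop_per_instr[name]
    print(f"{name}: {flop_per_instr[name]} FLOP/instr, ~{rate:.2f} G instr/s")
```

Under these assumptions the SSVE result works out to roughly one 512-bit fmla per nanosecond, while the fmopa result implies several outer products per nanosecond, which is where the large gap comes from.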

igor_kavinski · May 16, 2024

Nothingness said:
I don't know if this was previously posted: https://scalable.uni-jena.de/opt/sme/micro.html

Excellent find!

poke01 · May 16, 2024

Nothingness said:
I don't know if this was previously posted: https://scalable.uni-jena.de/opt/sme/micro.html

So it is confirmed that M4 has SME with streaming SVE. Vector length is 512-bit. A single core can reach 31 FP32 GFLOPS with SVE fmla (that's mul+add), 111 GFLOPS with NEON, and >2000 GFLOPS with SME fmopa.

Funny, Intel gets rid of 512-bit vector extensions due to hybrid architecture in client cpus but Apple's like nah, here's 512-bit vectors in a base M chip with a hybrid architecture as well.

roger_k · May 16, 2024

Nothingness said:
I don't know if this was previously posted: https://scalable.uni-jena.de/opt/sme/micro.html

So it is confirmed that M4 has SME with streaming SVE. Vector length is 512-bit. A single core can reach 31 FP32 GFLOPS with SVE fmla (that's mul+add), 111 GFLOPS with NEON, and >2000 GFLOPS with SME fmopa.

The SSVE result is surprisingly low. I wonder whether there is a problem with their code or whether the coprocessor is indeed not useful for vector operations.

roger_k · May 16, 2024

poke01 said:
Funny, Intel gets rid of 512-bit vector extensions due to hybrid architecture in client cpus but Apple's like nah, here's 512-bit vectors in a base M chip with a hybrid architecture as well.

Well, Apple shipped a 512-bit outer product engine in iPhones since 2018 if I remember correctly?

igor_kavinski · May 16, 2024

poke01 said:
Funny, Intel gets rid of 512-bit vector extensions due to hybrid architecture in client cpus but Apple's like nah, here's 512-bit vectors in a base M chip with a hybrid architecture as well.

Intel dropped the ball at the wrong moment, on their feet!

Intel just didn't want to bother finding a solution. Coz they knew that AVX-512 and E-cores working together would bring the performance down due to thermal throttling. It's not impossible to design software that queries which cores are available, then distribute its AVX-512 optimized threads among P-cores and non-AVX-512 threads to E-cores. But that could potentially cause not enough power available for the P-cores to properly accelerate the AVX-512 workload.
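A minimal sketch of the "query the cores, then distribute" idea described above. The `Core` record and its `has_avx512` flag are hypothetical: the sketch assumes the capability of each core is already known, and it only computes a placement rather than pinning real threads.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Core:
    core_id: int
    has_avx512: bool  # hypothetical per-core capability flag

def assign_tasks(cores, tasks):
    """Map each (name, needs_avx512) task to a core id, round-robin per pool.

    Tasks that need AVX-512 only go to cores that report it. Other tasks
    prefer cores without it and fall back to the wide cores if none exist.
    """
    if not cores:
        raise ValueError("no cores available")
    wide = [c.core_id for c in cores if c.has_avx512]
    narrow = [c.core_id for c in cores if not c.has_avx512] or wide
    placement = {}
    wide_i = narrow_i = 0
    for name, needs_avx512 in tasks:
        if needs_avx512:
            if not wide:
                raise RuntimeError(f"no AVX-512 capable core for task {name!r}")
            placement[name] = wide[wide_i % len(wide)]
            wide_i += 1
        else:
            placement[name] = narrow[narrow_i % len(narrow)]
            narrow_i += 1
    return placement

# Hypothetical layout: 2 P-cores with AVX-512, 4 E-cores without.
cores = [Core(0, True), Core(1, True)] + [Core(i, False) for i in range(2, 6)]
tasks = [("fft", True), ("ui", False), ("gemm", True), ("io", False), ("conv", True)]
print(assign_tasks(cores, tasks))
```

On Linux the resulting core ids could be applied with `os.sched_setaffinity`, but that part is left out here. The sketch also says nothing about the power-budget concern raised above, which placement alone does not solve.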

roger_k · May 16, 2024

igor_kavinski said:
Intel dropped the ball at the wrong moment, on their feet!

Intel just didn't want to bother finding a solution. Coz they knew that AVX-512 and E-cores working together would bring the performance down due to thermal throttling. It's not impossible to design software that queries which cores are available, then distribute its AVX-512 optimized threads among P-cores and non-AVX-512 threads to E-cores. But that could potentially cause not enough power available for the P-cores to properly accelerate the AVX-512 workload.

IMO, their problem was that they chose an approach that does not scale. Apple's strategy of splitting up wide vector functionality into a separate hardware unit that feeds from L2 instead of L1 makes much more sense to me.

igor_kavinski · May 16, 2024

roger_k said:
Apple's strategy of splitting up wide vector functionality into a separate hardware unit that feeds from L2 instead of L1 makes much more sense to me.

They benefit from their huge L2.

roger_k · May 16, 2024

igor_kavinski said:
They benefit from their huge L2.

They benefit from not needing a wide data path to L1. Huge L2 or not is a secondary question.

igor_kavinski · May 16, 2024

roger_k said:
They benefit from not needing a wide data path to L1.

Interesting. If you don't mind, would you like to compare and contrast Intel's and Apple's approaches? How is AMD's different, since they are able to do AVX-512 more energy efficiently? Is it only because they don't have dedicated 512-bit units like Intel?

Mopetar · May 16, 2024

poke01 said:
Funny, Intel gets rid of 512-bit vector extensions due to hybrid architecture in client cpus but Apple's like nah, here's 512-bit vectors in a base M chip with a hybrid architecture as well.

I think Intel did it because their hardware was a physical 512-bit implementation, which was costly both in terms of transistors and power to run it. AMD used a 256-bit physical hardware unit to support the 512-bit operations.

There's nothing that requires Apple to have a particularly large hardware unit to execute the vector instructions. If they just had a regular 64-bit wide unit, they'd just need to run it 8 times to process a 512-bit vector operation. Or they could have four 64-bit wide execution units that the instruction gets split across over two cycles.

The only reason to have a full 512-bit hardware unit is that you want to support operations on 512-bit operands that are actually that large. That's definitely something that you'd want in certain scientific workloads, but consumer software is just using it to process 16x 32-bit floats with a single instruction.
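The arithmetic behind the examples above can be sketched in a few lines. This is an idealized model that assumes the work splits evenly and ignores latency and scheduling overhead; `passes_needed` is a made-up helper name.

```python
import math

def passes_needed(vector_bits: int, unit_bits: int, units: int = 1) -> int:
    """Passes needed to cover one vector op with `units` units of `unit_bits` each."""
    if min(vector_bits, unit_bits, units) <= 0:
        raise ValueError("all widths and counts must be positive")
    return math.ceil(vector_bits / (unit_bits * units))

print(passes_needed(512, 64))      # one 64-bit unit        -> 8
print(passes_needed(512, 64, 4))   # four 64-bit units      -> 2
print(passes_needed(512, 256))     # one 256-bit unit       -> 2
print(passes_needed(512, 512))     # native 512-bit unit    -> 1
print(512 // 32)                   # FP32 lanes per vector  -> 16
```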

roger_k · May 16, 2024

Mopetar said:
There's nothing that requires Apple to have a particularly large hardware unit to execute the vector instructions. If they just had a regular 64-bit wide unit, they'd just need to run it 8 times to process a 512-bit vector operation. Or they could have four 64-bit wide execution units that the instruction gets split across over two cycles.

The only reason to have a full 512-bit hardware unit is that you want to support operations on 512-bit operands that are actually that large. That's definitely something that you'd want in certain scientific workloads, but consumer software is just using it to process 16x 32-bit floats with a single instruction.

They do want to have good performance for wide vector and matrix workloads, so using a wide outer product engine makes sense. For them apparently it made enough sense to even include it on an iPhone.

What makes Apple's solution a bit more special though is that it is a coprocessor and not part of the CPU core.

Nothingness · May 16, 2024

roger_k said:
What makes Apple's solution a bit more special though is that it is a coprocessor and not part of the CPU core.

I think Intel's AMX unit is also external and shared by multiple cores.

Mopetar · May 16, 2024

The vector width and the underlying hardware that backs it don't really matter. A 512-bit vector unit in hardware can crunch a 512-bit vector in one cycle (or whatever it takes) but you can use much smaller hardware units depending on what instructions you want to support. Vector instructions just let the hardware know that there's no data dependencies on the vector. You could do the whole thing at once or the individual pieces of data in any order.

I haven't looked at Apple's ISA to see what they support, but hasn't the overall trend been in the opposite direction, where people don't care as much about operations on large data values but instead want more granularity so that more operations in total can be done? If you just wanted to support INT8 operations you could have a single 8-bit execution unit chew through the entire 512-bit vector. If you only want to do a bunch of INT4 operations, having hardware that only allows as low as 8-bit granularity means that half of that hardware is effectively wasted. It doesn't matter if the matrix itself is large if you just want all of the values to be 4-bit integers.
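To put numbers on the INT4 versus INT8 point, here is a small illustrative model. It assumes each element occupies a lane at least as wide as the hardware's minimum granularity, and `lane_stats` is a hypothetical helper, not a description of any real ISA.

```python
def lane_stats(vector_bits: int, element_bits: int, min_lane_bits: int):
    """Return (elements per vector, fraction of each lane doing useful work)."""
    lane_bits = max(element_bits, min_lane_bits)
    return vector_bits // lane_bits, element_bits / lane_bits

for element_bits in (8, 4):
    for min_lane_bits in (8, 4):
        count, used = lane_stats(512, element_bits, min_lane_bits)
        print(f"INT{element_bits} data, {min_lane_bits}-bit granularity: "
              f"{count} elements per vector, {used:.0%} of lane width used")
```

In this model, INT4 data on hardware with 8-bit granularity fits 64 elements per 512-bit vector at 50% utilization, versus 128 elements at full utilization with native 4-bit lanes.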

There's also no reason it has to be a part of the core either though. I think the only reason that Intel and AMD have done that is that they first added vector instructions when they only had single core CPUs so it made sense for them to carry that decision forward. The x86 code those CPUs execute is probably similarly structured around these same assumptions.

Apple at least knows to what extent their own code makes use of these vector instructions and they probably realized that it wasn't enough for each core to duplicate the same hardware resources. It's no different than having the NPU/GPU separate from the CPU cores. It probably also makes building a larger physical vector unit a lot easier since Intel ran into issues where any core that was actively using it had to drop frequency to stay within TDP limits, which is going to create a performance penalty for mixed workloads.

SarahKerrigan · May 16, 2024

Nothingness said:
I think Intel's AMX unit is also external and shared by multiple cores.

This is essentially true; AFAIK TMUL is a shared resource, but not all AMX ops use TMUL, and every core has a Tiles register block.

Doug S · May 16, 2024

roger_k said:
The SSVE result is surprisingly low. I wonder whether there is a problem with their code or whether the coprocessor is indeed not useful for vector operations.

Doesn't SME require SSVE? I think Apple doesn't care about SSVE, they only implemented it because they had to but they don't intend for anyone to use it - with NEON being so much faster that's what they want you to use.

They implemented their proprietary AMX to get what they wanted when they wanted, and once they saw SME as a suitable replacement that could be "fully supported" from the ISA, unlike AMX, they made the switch. Probably very little changed in the AMX unit; other than a few additions, I understand it to be almost identical to AMX. Heck, for all we know Apple delivered AMX to ARM when they completed their internal spec and said "we'd like this", and ARM got feedback from others and added a few extras but in the end pretty much standardized what had been Apple proprietary instructions. SSVE coming along for the ride must not have been something they asked for. Maybe in the future they'll expand the "AMX" unit to better handle it, or maybe they'll leave it as a red-headed stepchild and expect people to continue using NEON.

roger_k · May 16, 2024

Doug S said:
Doesn't SME require SSVE? I think Apple doesn't care about SSVE, they only implemented it because they had to but they don't intend for anyone to use it - with NEON being so much faster that's what they want you to use.

They implemented their proprietary AMX to get what they wanted when they wanted, and once they saw SME as a suitable replacement that could be "fully supported" from the ISA, unlike AMX, they made the switch. Probably very little changed in the AMX unit; other than a few additions, I understand it to be almost identical to AMX. Heck, for all we know Apple delivered AMX to ARM when they completed their internal spec and said "we'd like this", and ARM got feedback from others and added a few extras but in the end pretty much standardized what had been Apple proprietary instructions. SSVE coming along for the ride must not have been something they asked for. Maybe in the future they'll expand the "AMX" unit to better handle it, or maybe they'll leave it as a red-headed stepchild and expect people to continue using NEON.

The AMX unit also accelerates the vector algebra routines, and it was previously much faster than these new results. Notably, their result is consistent with the use of a single accumulator. I looked at the code and they use the Zx registers; maybe one gets better performance by using the ZA array as the accumulator (this would fit the AMX theme). Here you have results for M1 Max, 180 GFLOPS for a single thread in vector mode: https://github.com/corsix/amx/blob/main/fma.md
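Why a single accumulator would cap throughput can be shown with a toy model: each accumulator forms a dependent chain, so it can only accept a new FMA once the previous one has completed. The latency and issue width below are hypothetical placeholders, not measured values for any Apple core or coprocessor.

```python
def fma_per_cycle(accumulators: int, latency_cycles: int, issue_width: int) -> float:
    """Steady-state FMAs per cycle when each accumulator is a dependent chain.

    Each chain retires one FMA every `latency_cycles`; the unit cannot
    issue more than `issue_width` FMAs per cycle in total.
    """
    if min(accumulators, latency_cycles, issue_width) <= 0:
        raise ValueError("all arguments must be positive")
    return min(accumulators / latency_cycles, float(issue_width))

# Hypothetical unit: 4-cycle FMA latency, 1 FMA issued per cycle.
for n in (1, 2, 4, 8):
    print(f"{n} accumulator(s): {fma_per_cycle(n, latency_cycles=4, issue_width=1)} FMA/cycle")
```

With these placeholder numbers, one accumulator reaches only a quarter of peak, and at least four independent accumulators are needed to saturate the unit.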

At any rate, all this tells us just how leaky all these abstractions can be. I was very enthusiastic about the idea of scalable vectors, now I become increasingly skeptical about how feasible it is to code hardware-agnostic high performance algorithms. Maybe RVV has the right idea after all with their model.

carancho · May 16, 2024

This is always overlooked, probably because the controversies created by measurement error fuel the reviews and benchmarks content industry. It is amazing, however, that no one produces content based on scraped GB scores. Imagine all that you can wrangle out of that database.

carancho · May 16, 2024

roger_k said:
Regarding the discussion about IPC improvements... I am becoming increasingly convinced that most of the discourse is useless because of the deeply flawed methodology. In different Geekerwan videos the results reported for A17 Pro have a relative error of almost 5%. Combine this with the non-precise frequency estimation and you end up with huge relative error for the IPC estimates.

We need clearer methodology, results from multiple devices (to circumvent device bias), and most importantly, we need to start looking at the variance instead of point estimates. I tried to do this earlier for some of the GB6 data and I hope I could illustrate how much more useful this approach is.

Bottom line: the data is crap, methodology is crap, the relative error is crap, meaning that the results are crap.

carancho said:
This is always overlooked, probably because the controversies created by measurement error fuel the reviews and benchmarks content industry. It is amazing, however, that no one produces content based on scraped GB scores. Imagine all that you can wrangle out of that database.

I was replying to roger's message.
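To illustrate the relative-error point in the quoted message, here is a sketch of standard error propagation for an IPC estimate computed as score divided by frequency. The 5% score error is the figure quoted above; the 3% frequency uncertainty is an assumption picked purely for illustration, and the errors are treated as independent.

```python
import math

def combined_relative_error(*relative_errors: float) -> float:
    """Relative error of a product or quotient of independent quantities."""
    return math.sqrt(sum(e * e for e in relative_errors))

score_err = 0.05  # ~5% spread in reported scores, as quoted above
freq_err = 0.03   # assumed uncertainty of the frequency estimate (illustrative)

# IPC ~ score / frequency, so the two relative errors combine.
ipc_err = combined_relative_error(score_err, freq_err)
# Comparing two chips divides one IPC estimate by another.
gain_err = combined_relative_error(ipc_err, ipc_err)

print(f"IPC estimate: +/- {ipc_err:.1%}")
print(f"IPC ratio between two chips: +/- {gain_err:.1%}")
```

Under these assumptions the uncertainty on a generation-over-generation IPC ratio comes out around 8%, which is the same order of magnitude as the gains people usually argue about.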

Eug · May 16, 2024

M4 iPad Pro teardown

The M4 chip is directly underneath the Apple logo, under a thin layer of copper and graphite paper.

igor_kavinski · May 17, 2024

Mopetar said:
consumer software is just using it to process 16x 32-bit floats with a single instruction.

Examples of such software if you don't mind?

Nothingness · May 17, 2024

carancho said:
This is always overlooked, probably because the controversies created by measurement error fuel the reviews and benchmarks content industry. It is amazing, however, that no one produces content based on scraped GB scores. Imagine all that you can wrangle out of that database.

GB database has too many overclocked systems and plain wrong or fake results. Of course there are methods to detect and exclude outliers, but it's not trivial.
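One common way to exclude such outliers is a median/MAD filter (the "modified z-score"). The sketch below is a generic technique, not something used by Geekbench or anyone in this thread, and the scores are made-up numbers for illustration only.

```python
import statistics

def filter_outliers(scores, threshold=3.5):
    """Keep scores whose modified z-score (median/MAD based) is within `threshold`."""
    scores = list(scores)
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    if mad == 0:
        # No spread estimate available; keep everything rather than guess.
        return scores
    # 0.6745 scales the MAD so the score is comparable to a normal z-score.
    return [s for s in scores if 0.6745 * abs(s - med) / mad <= threshold]

# Made-up single-core scores: five plausible runs, one overclocked, one broken.
sample = [3750, 3780, 3810, 3765, 3790, 5200, 1200]
print(filter_outliers(sample))
```

The median and MAD are barely affected by a few extreme entries, so the overclocked and broken runs do not distort the cutoff the way they would with a mean and standard deviation. This does nothing about fake results that fall inside the normal range.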

roger_k · May 17, 2024

Nothingness said:
GB database has too many overclocked systems and plain wrong or fake results. Of course there are methods to detect and exclude outliers, but it's not trivial.

I don't think that dealing with outliers is the biggest challenge. Scraping the results is probably the most annoying part IMO.
