Discussion Apple Silicon SoC thread

Eug · Nov 10, 2020

M1
5 nm
Unified memory architecture - LP-DDR4
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-effiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core number). Basically, Apple is taking the same approach which these chips as they do the iPhones and iPads. Just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:

M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:

Page 78 - Discussion - Apple Silicon SoC thread

Page 78 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M1 Ultra discussion here:

Page 109 - Discussion - Apple Silicon SoC thread

Page 109 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M2 discussion here:

Page 127 - Discussion - Apple Silicon SoC thread

Page 127 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K h.264, h.264, ProRes

M3 Family discussion here:

Page 215 - Discussion - Apple Silicon SoC thread

Page 215 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M4 Family discussion here:

Page 263 - Discussion - Apple Silicon SoC thread

Page 263 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

ThatBuzzkiller · Nov 16, 2020

Carfax83 said:
Maybe not outright replacement, but what about a renewal similar to what x86-64 did, but more ambitious?

x86-64 is just one of the many extensions to the x86 ISA. There's nothing inherently special about it compared to the other extensions that we see ...

Carfax83 said:
Eventually I think they should remove the really old legacy stuff from the ISA, and maybe add some more general purpose registers.

Sure they could remove it but unless AMD, Intel, and Microsoft collectively agree to deprecate those instructions they won't accept the optics of having weaker compatibility. They are politically pressured to support legacy functionality because if either AMD or Intel don't implement them then one of them will and use it as material against their competitor having worse compatibility so they don't dare remove hardware support ...

Doing emulation brings in even more questions than solutions. Who will write and maintain this said emulator ? Who's microachitecture is this emulator going to be optimized for ? Which x86 vendor between AMD or Intel is going to end up with superior emulation performance ? If one x86 vendor does see noticeably inferior performance in emulation compared to their competitor do they just accept the performance hit or implement these instructions in hardware instead ?

As far as new functionality is concerned, I'd like to see AVX-512 being standardized and I'd also like to see AMD's implementation of transactional memory since Intel's TSX has been a disaster so far and see if we can standardize their implementation instead. AMD had a proposal for Advanced Synchronization Facility in the past. Double dereference instructions is another potentially interesting possibility too ...

tempestglen · Nov 16, 2020

wlee15 said:
We have our first M1 Cinebench R23 courtesy of Bits and Chips.

Do you know the result of R23 on M1?

tempestglen · Nov 16, 2020

tempestglen said:
Do you know the result of R23 on M1?

Apple M1 Cinebench R23 benchmark scores revealed

And an Affinity Photo dev has tested the new M1 SoC against an Intel Core i5-9600.

hexus.net

Please check the A12Z's score, 987! So I suspect the M1 frequency was degraded to 2.5Ghz during 10 minutes test, was it from macbook Air?

TheGiant · Nov 16, 2020

CB single core test can't overheat macbook air...or yes?

even with passive cooling it should be very cool based on that tdp

unless we have tdp and tdp as with our beloved 2 x86 manufacturers

amrnuke · Nov 16, 2020

TheGiant said:
cibenech single core test can't overheat macbook air...or yes?

even with passive cooling it should be very cool based on that tdp

unless we have tdp and tdp as with our beloved 2 x86 manufacturers

Heat issues were huge on with the Intel Macbooks.

While voltage-frequency curves on the A12 don't look stellar, and on my iPhone 11 Pro, it gets hot way more than my iPhone 8 Plus, we also have reviews the iPhone 12 / A14 seems to have solved some of the power issues. So I'd be surprised if it had that much of a heat issue. If it does, there are a lot of explanations, one may be that Arm ISA and the uarchs that best leverage it may simply have a voltage/frequency curve inflection point that's intrinsically at a lower frequency than on x86 for any given process node.

Eug · Nov 16, 2020

TheGiant said:
cibenech single core test can't overheat macbook air...or yes?

even with passive cooling it should be very cool based on that tdp

unless we have tdp and tdp as with our beloved 2 x86 manufacturers

With extended high workloads, it's pretty likely the M1 MacBook will throttle. However, if my Intel 12" MacBook is any indication, it may hopefully throttle gracefully so that performance gradually decreases somewhat in inverse proportion to the temperature (as opposed to the machine just shutting down completely). For users like me that do their heavier stuff on a different computer* and usually need the portable machines for lighter stuff, that's fine.

I posted this graph earlier. What it demonstrates are the Cinebench scores of the 2017 MacBooks at each repeated pass of Cinebench.

I finally got Big Sur and Cinebench R23 installed on that last Apple fanless MacBook, a 16 GB 2017 Core m3-7Y32, and here are my results. Note that this is an average of two passes, because I can't seem to set the application to do just one pass in multi-core. It seems to require a minimum of 10 minutes (and it takes a long time for that Mac to do each pass, so just 2 multi-core passes is longer than 10 minutes).

606 single-core
1458 multi-core

Basically, I'm guessing the change from the last Intel fanless MacBook to the first fanless Arm MacBook Air is going to result a 4X to 5X the multi-core performance.

*My "heavier stuff" computer is an 2017 iMac 27" Core i5-7600. The fanless M1 MacBook Air will blow it out of the water.

TheGiant · Nov 16, 2020

amrnuke said:
Heat issues were huge on with the Intel Macbooks.

While voltage-frequency curves on the A12 don't look stellar, and on my iPhone 11 Pro, it gets hot way more than my iPhone 8 Plus, we also have reviews the iPhone 12 / A14 seems to have solved some of the power issues. So I'd be surprised if it had that much of a heat issue. If it does, there are a lot of explanations, one may be that Arm ISA and the uarchs that best leverage it may simply have a voltage/frequency curve inflection point that's intrinsically at a lower frequency than on x86 for any given process node.

ofc macbook air icelake had heat issues, the cooling system is not a cooling system
I wanted to buy it for my son for light youtuber editing and music, but after some youtube videos where they opened the case ...no
just watch the icelake macbook air videos..but prepare a bucket if you can't make to the toilet

Eug said:
With extended high workloads, it's pretty likely the M1 MacBook will throttle. However, if my Intel 12" MacBook is any indication, it may hopefully throttle gracefully so that performance gradually decreases somewhat in inverse proportion to the temperature (as opposed to the machine just shutting down completely). For users like me that do their heavier stuff on a different computer* and usually need the portable machines for lighter stuff, that's fine.

I posted this graph earlier. What it demonstrates are the Cinebench scores of the 2017 MacBooks at each repeated pass of Cinebench.

View attachment 33944

I finally got Big Sur and Cinebench R23 installed on that last Apple fanless MacBook, a 16 GB 2017 Core m3-7Y32, and here are my results. Note that this is an average of two passes, because I can't seem to set the application to do just one pass in multi-core. It seems to require a minimum of 10 minutes (and it takes a long time for that Mac to do each pass, so just 2 multi-core passes is longer than 10 minutes).

606 single-core
1458 multi-core

View attachment 33942

View attachment 33943

Basically, I'm guessing the change from the last Intel fanless MacBook to the first fanless Arm MacBook Air is going to result a 4X to 5X the multi-core performance.

*My "heavier stuff" computer is an 2017 iMac 27" Core i5-7600. The fanless M1 MacBook Air will blow it out of the water.

I am surprised than the single core result in throttling, the m1 is presented as mightly efficient
MT loads that is logical, but single in 10mins of cinepeenbench ? not really

Eug · Nov 16, 2020

TheGiant said:
ofc macbook air icelake had heat issues, the cooling system is not a cooling system
I wanted to buy it for my son for light youtuber editing and music, but after some youtube videos where they opened the case ...no
just watch the icelake macbook air videos..but prepare a bucket if you can't make to the toilet

I am surprised than the single core result in throttling, the m1 is presented as mightly efficient
MT loads that is logical, but single in 10mins of cinepeenbench ? not really

You misunderstood. I’m guessing the fanless M1 MacBook Air will throttle with extended high multi-core workloads.

shady28 · Nov 16, 2020

wlee15 said:
CPU encoders usually produce better quality output than hardware encoders. I occasionally do a bit of encoding an I use x264.

Yes, that's what the same "enthusiast" sites use to try to justify their use of CPU for encoding benchmarks, which no sane person does.

The quality loss occurs at lower bit rates. Once the bit rate is high, like >6Mbit/s which it is many times that on these benchmarks, the quality differences disappear. You can also run at higher bit rates on GPU than CPU, encode in the same or less time, and have higher quality.

The only real argument for CPU encode is where the HW encoder doesn't support the format you want to target. For a professional shop who has to target multiple platforms this might matter, but it's not going to matter for normal people.

Heartbreaker · Nov 16, 2020

wlee15 said:
CPU encoders usually produce better quality output than hardware encoders. I occasionally do a bit of encoding an I use x264.

NVenc at least has come a long way and is now the recommended option for Streamers.

LightningZ71 · Nov 16, 2020

There's no intrinsic reason that the ARM ISA nor a generic hardware implementation of said ISA should have a voltage and frequency cut off that's unique. That's just not how this works. Instead, it is likely a function of Apple's design direction wh the Firestorm core and the process node that it's designed for. They chose to go "wide" and attack every possible cause of instruction stalls to get maximum IPC. It is my opinion that their approach to having large, highly associative, low latency caches is limiting their to clock their processors as high as we see in desktop x86 products.

I'm not saying this is bad or good. If you want to save on power usage, which in mobile is king, then their design makes a bunch of sense. It's also going to make them very competitive in portable devices in all markets, and in devices that have space constraints or noise constraints, such as uSFF products, set top boxes, and trim line AIO devices like the iMac line. What we don't know yet is how this plays out on the top end of the performance market. With AMD able to put 64 cores in a threadripper workstation with eight channel ram in the WX series, that's going to be hard to beat for what we believe to be thir best case scenario for Mac Pro, which is a pair of 8+8 die linked with a high speed link in the same package. That's a maximm of 16 full speed threads and 16 efficiency threads against 64 SMT cores. While having an IPC advantage and complete stack control are certainly useful things, as they say in the car world, there is no replacement for displacement.

Eug · Nov 16, 2020

LightningZ71 said:
I'm not saying this is bad or good. If you want to save on power usage, which in mobile is king, then their design makes a bunch of sense. It's also going to make them very competitive in portable devices in all markets, and in devices that have space constraints or noise constraints, such as uSFF products, set top boxes, and trim line AIO devices like the iMac line. What we don't know yet is how this plays out on the top end of the performance market. With AMD able to put 64 cores in a threadripper workstation with eight channel ram in the WX series, that's going to be hard to beat for what we believe to be thir best case scenario for Mac Pro, which is a pair of 8+8 die linked with a high speed link in the same package. That's a maximm of 16 full speed threads and 16 efficiency threads against 64 SMT cores. While having an IPC advantage and complete stack control are certainly useful things, as they say in the car world, there is no replacement for displacement.

Even if you're right, I don't think Apple really cares.

They're fine to concede the multi-core crown to AMD, for the 0.01% of their potential customer base that actually needs 64 cores.

I am more interested to see what kind of GPU and off-CPU stuff they have for the Mac Pros. For example, for an video editor, having a high end high quality super fast hardware 10-bit HEVC encoder might be a heluvalot more useful than having 64 general purpose CPU cores.

thunng8 · Nov 16, 2020

tempestglen said:
Apple M1 Cinebench R23 benchmark scores revealed

And an Affinity Photo dev has tested the new M1 SoC against an Intel Core i5-9600.

hexus.net

Please check the A12Z's score, 987! So I suspect the M1 frequency was degraded to 2.5Ghz during 10 minutes test, was it from macbook Air?

There are no m1 results yet. The hexus site is propagating the incorrect Twitter post with results from the a12z

Heartbreaker · Nov 16, 2020

LightningZ71 said:
T With AMD able to put 64 cores in a threadripper workstation with eight channel ram in the WX series, that's going to be hard to beat for what we believe to be thir best case scenario for Mac Pro, which is a pair of 8+8 die linked with a high speed link in the same package. That's a maximm of 16 full speed threads and 16 efficiency threads against 64 SMT cores. While having an IPC advantage and complete stack control are certainly useful things, as they say in the car world, there is no replacement for displacement.

Who is this "we" you are speaking for, because I certainly don't believe the Mac Pro will with 16 performance cores is the best case.

I don't think outsiders have any idea at all, what kind of Topology Apple will use for an ARM based Mac Pro/iMac Pro, but I would expect them to abandon efficiency cores, since the benefit from the die space used will largely be wasted on a non mobile, high performance workstation.

As far as 64 Core AMD, outperforming the iMac Pro in number crunching, that is already the case today, when Mac Pro maxes out with a 28 core Xeon. I don't think it's an issue for the target Audience.

What Apple will need to show is outperforming the 28 Core Xeon Mac Pros. Which I expect they will do.

I am very excited to see what the ARM Mac Pro will look like. So many unknowns on topology of both the CPU and GPU portions.

Qwertilot · Nov 16, 2020

They won't want just 'outperforming' either. If sanely possible they'll aim for ridiculing, much in the same way that the M1 does to their previous Intel notebook chips. Could be fascinating.

name99 · Nov 16, 2020

[mistyping]

jeanlain · Nov 16, 2020

wlee15 said:
We have our first M1 Cinebench R23 courtesy of Bits and Chips.

990

As others said, it's the A12Z. But we can extrapolate these results to the M1.
The A12Z give 1121 in geekbench and the M1 in the Mac mini 1741 (based on the Mac and iOS benchmark chart pages).
Hence, the Mac mini should produce a Cinebench score of about 990*1741/1121 = 1537, of course assuming that the A12Z and A14 "behave" similarly in both tests.
How does that compare to X86 cores? Pretty well I think, but maybe not as impressive as on geekbench.

IvanKaramazov · Nov 16, 2020

jeanlain said:
As others said, it's the A12Z. But we can extrapolate these results to the M1.
The A12Z give 1121 in geekbench and the M1 in the Mac mini 1741 (based on the Mac and iOS benchmark chart pages).
Hence, the Mac mini should produce a Cinebench score of about 990*1741/1121 = 1537, of course assuming that the A12Z and A14 "behave" similarly in both tests.
How does that compare to X86 cores? Pretty well I think, but maybe not as impressive as on geekbench.

Yeah, I tried doing this as well, extrapolating potential Cinebench scores based on the Geekbench delta between A12Z and M1. It's not a stupid way to guess, as if you compare GB and Cinebench for A12Z and the Intel chips they track pretty well. A (wildly) speculative, extrapolated 1537 would put the M1 faster than the 15w Tiger Lake ST, and in the ballpark of the 28w Tiger Lake ST. Which sounds about right. The same math on the MT scores would put the M1 just shy of 7000, while the 28w TGL sits around 6000. Again, sounds likely.

The real wildcard though is how much of the MT score is related to thermal constraints. The 4800u, for example, has an MT score of around 10,500, despite being very similar to the M1 in Geekbench MT. It's an open question whether M1 has a sustained MT profile closer to the Ryzen or Tiger Lake chips.

name99 · Nov 16, 2020

amrnuke said:
Thank you for bringing things back around to technical discussion.

One area is most interesting to me, brought up in two separate posts of yours but seem related:

I don't think there's any question that Apple are using way prediction. For any of the major players to not be using way-predicted set-associative cache would be absurd, no? AMD and Intel have been using microtag way prediction since W's presidency. In any case, how does Apple's implementation differ, if at all, and do you have any explanation for why you pinpoint that exact comment about way prediction, speculative scheduling, and replay?

Some things I was thinking about:

As far as I was aware, if flushing were the "problem" and by "flushing" you mean non-selective replay (which you may not - in which case, please explain!) then way predictors haven't been "problematic" since at least Pentium 4, which doesn't use non-selective replay; however, Seznac and Michaud indicate that non-selective replay may be viable as long as an ROB or buffer (a la the replay queue on P4) is available and efficient. As for needing a quality replay mechanism, isn't it better to prevent the need for a replay in the first place by designing better speculative scheduling? Seeing as AMD switched from a neural network branch predictor to TAGE, I doubt they missed those papers... and since Apple were using TAGE since 2013 at least, it would make sense that Apple is also aware of the benefit of preventing replay in the first place with better speculative scheduling.

That being said, given the width of Apple's core and the large ROB do you imagine they're using token-based selective replay (or something similar, or, since 2013 they called their core Cyclone, maybe they're using the cyclone replay scheme!) rather than other mechanisms? Wouldn't be the first time they've used WARF research! This may explain the large ROB. Though, a large ROB would solve a lot of replay issues regardless of replay scheme, including those associated with flushing.

Edit: Andrei says the ROB size "outclasses" any other design and questions how Apple can "achieve" this design, but it's not that simple, is it? A large ROB could be a Band-Aid for a larger speculation problem or could be relieving a bottleneck, or could reflect a different speculative scheduling/replay mechanism... or, my suspicion is that it's larger by virtue of Apple having such a wide core. Indeed, the next largest ROBs are on Intel chips, and the smallest on Zen3. I don't know enough about X1 to comment on why it'd be the smallest, but if this is indeed a bottleneck and X1 uses a similar branch prediction / replay scheme, then it wouldn't make much sense to have such a small ROB.

It feels to me like you don't quite grasp the essential distinctions here.

All speculation is a (statistics informed) guess about how the program will probably behave, but requiring a way to recover if your guess was incorrect.
The most obvious and well-established form of speculation is branch prediction. The speculation is ultimately about the sequence of instructions through the CPU, and the recovery mechanism has two main pieces:
- various mechanisms (physical vs logical registers, and the LSQ [as a secondary function]) hold calculated values provisionally, until each branch is resolved, at which point all values calculated prior to that branch can be graduated from provisional state to correct state.
- the ROB holds the provisional ordering of instructions, connecting the instruction ordering (which instructions are now known to be valid, given that the branch that led to them is correct) to the values described above (sitting in physical registers and the LSQ).

What's important is the dependency graph: what set of subsequent speculated values are dependent on a speculated branch, and thus will be shown to be incorrect if that branch was incorrect.
Now the nature of control speculation (speculation on the sequence of instructions) is that once you make a branch for practical purposes EVERY instruction after that branch depends on that branch. Which means that if a branch was guessed incorrectly (direction or target) everything after it needs to be flushed.
Now you might protest that this is not true, that there are branch structures (diamonds) like
if (a<0){
b=-a
}else{
b=a
}
where there's a tiny amount of constrained control divergence, after which flow reconverges. This is true. But it doesn't help. Even if you want to correct speculation that led down the wrong half of the diamond just by flushing the instruction b=-a, and executing b=a, everything after the diamond closes is dependent on the value of b and also is now incorrect. It's just not practical to track everything that does or does not depend on a particular branch and selectively flush that branch and its dependents and nothing else
(a) almost EVERYTHING is dependent, so this buys you very little and
(b) branches are so dense (think ~1/6 of instructions) that you'd need tremendously complicated accounting to track what is dependent on this branch not that.

So end result is: control speculation as a practical matter has to recover by flushing *everything* (all instructions, all calculated values) after a misprediction.
If you think about this in detail, it leads to a whole set of issues.
- Of course you want an accurate predictor, that's given. But you also want to catch mispredicts if you can, up to decode and rename, but before they enter the OoO machinery, because catching them there only flushes the instructions queued after them in various buffers sitting between fetch, decode, up to rename. Hence the value of a long latency (but even more accurate) secondary branch detection mechanism using even larger pools of storage.
- you want to avoid branches with the characteristic that they are hard to predict and do very little work (like the diamond I described above). Things like MAX or ABS. This leads to the value of predicated instructions and things like CSEL/CMOV. The whole story of CMOV in the x86 world is a tragedy, and since this is supposed to be purely technical I won't cover it. But the fallout is that much of the x86 world, even today, is convinced that predication is a bad idea (and imagine that tiny micro-benchmarks prove this). But microbenchmarks miss the big picture. The value of predication is that it converts a branch (which becomes massively expensive if it's mis-predicted) into straightline execution with no speculation and no penalties. Fortunately ARM CSEL was well designed and implemented from the start so ARM doesn't have this weird x86 aversion. IBM even converts short branches to predication and I suspect Apple does the same (just on the grounds that Apple seems to have implemented every good idea that has ever been invented).

There's vastly more that can be said about branches (some of it said by me here, going forwards and backwards from this anchor link):
Look for the posts by me, Maynard Handley
https://www.realworldtech.com/forum/?threadid=196054&curpostid=196130
Even apart from that, you can start asking how exactly you recover from a misspeculation... The first ways of doing this in the late 80s based on walking the ROB (not very large at the time) were adequate but didn't scale. So then came checkpoints but there's a whole subspecialty in when you implement checkpoints... EVERYTHING in this space is much more than just the words -- I can implement checkpoints, and you can implement checkpoints, but I can get much more value out of my checkpoints than you (ie mispredicts are a lot cheaper) if I am smarter in when I create each checkpoint.

But all the above was throat clearing. The point is that control speculation has characteristics that mean recovering from a misprediction require flushing everything after the misprediction. But there are other forms of speculation.
One of the earliest, which you know something about, is speculative scheduling (ie guess that a load will hit in cache, and schedule subsequent instructions based on that).
Another is load/store aliasing: if a store doesn't yet know its address, so it's just sitting in the store queue waiting, what do we do with a subsequent load? We could delay it until we know the address of every store, but chances are the load doesn't actually load from the address of that store, so we are delaying for nothing. Classical speculation territory... (One way to do this were what I referred to with store sets and the Moshovos patent) But once again, if your speculation is incorrect, then the contents of the load, and everything that depends on it, are now invalid.
A third possibility is value prediction. This notes that there are some loads that occur over and over again but what's loaded never changes, so you can bypass the load and just supply that value. This is the kind of thing you'd say "that's dumb, how often does it happen?" Well, unfortunately it happens way more than it should... Value prediction is in a kinda limbo right now. For years it was talked about but not practical (in the sense that, with limited resources, transistors were better spent elsewhere). But we are getting close to the point where it might start making sense. QC have been publishing on it for a few years now, but as far as I know no-one has stated that a commercial product is using it, though who knows -- maybe Apple have already implemented an initial version?

For each of these cases, we again need recover in the case of misspeculation. Recovery from these kinds of misspeculation is generically called replay. The important difference compared to control speculation is that data speculation lends itself much better to tracking the precise dependencies of successor instructions on the speculated instruction: the dependency chains are shorter and less dense. This means that, although the easiest response, in the face of a data misspeculation, is to reuse the control misspeculation machinery, that's not the only possible response.
But even if you accept this idea and so want to engage in some sort of selective replay, there are still many different ways to do it, and getting the details wrong can have severe consequences (as you are aware from P4, and Cyclone as a suggested mechanism for doing a lot better. However I'd see Cyclone best thought of as an OoO scheduler [something I've not discussed at all] rather than as a *generic* replay mechanism. And Cyclone was Michigan, not Wisconsin, though I also frequently confuse the two!)

So some points from all this:
- the sheer size of structures is not that important. Of course it is important, but not in the way the fanboy internet thinks. Almost every structure of interest consists of not just the structure size but an associated usage algorithm. And the structure is almost always sized optimally for that algorithm, in the sense that growing the structure, even substantially (like doubling) buys you very little in improved performance. The best paper demonstrating this is

https://arxiv.org/pdf/1906.08170.pdf

which shows that even quadrupling Skylake resources in a naive way gets you only about 1.5x extra performance. People latch onto these numbers, like size of the ROB, or amount of storage provided for branch prediction, because they're available. But that's the drunk looking for his keys under the lamp because that's where the light is! The numbers are made available (not by Apple, but by ARM. AMD, Intel, ...) precisely because they are NOT important, they don't offer much of a competitive advantage. The magic is in the algorithm of how that storage is used. Saying 620-entry ROB implies there's a single way that something called a ROB is used, and anyone can just double a ROB and get much better performance. NO! ROB essentially means how much control provisional state can be maintained and scaling that up involves scaling up many many independent pieces.

This is one reason I get so furious at people who say that Apple performance is "just" from a larger ROB or larger BTB or whatever. Such thought display utter ignorance as to the problems that each piece of functionality solves and what goes into the solution. So, consider the ROB. The point of the ROB, as I explained is to track what needs to be flushed if a misprediction goes wrong. So why not just double the ROB?
Well, consider instruction flow. Instruction go into the Issue queue(s), wait for dependencies to be resolved, issue, execute. Executing takes worst case a few cycles, why not have a ROB of 30 or 40 entries?
Because SOME instructions (specifically loads that miss in cache) take a long time, and the instruction at the head of the ROB cannot be removed from the ROB until it completes. So with a short ROB, after you've filled up the 40 slots with the instruction after that load, you stop and wait till the load returns.
OK, so the ROB is just a queue, easily scaled up. Why not make it 4000 entries?
Because almost every instruction that goes into the ROB also requires some other resource, and that resource is held onto until the instruction move to the head of the ROB and completes. (What is called register rename is essentially resource allocation. During rename the resources an instruction will require -- a ROB slot, probably a destination register, perhaps an entry in the load or store queues -- are allocated, and they are held onto until completion.) So sure, you can have 4000 ROB slots, but if you only have 128 physical registers, then after those 128 are all allocated as destination registers, your ROB is filled with ~128 entries and the rest of them are empty because the machine is going to stall until a new physical register becomes available.)
So the first constraint on growing the ROB is that to do it you also need to grow the number of physical registers and the size of the LSQ. Neither of these are at all easy. And both of them involve their own usage algorithms where, yes, you can grow either the register file or the LSQ if you switch to using them in a novel way. But again this novel usage model is not captured by saying "of they have a 144 entry load queue" as though that's some trivial achievement, just a little harder than a 72 entry load queue.

But even THAT is not the end of the game. Because even if you can grow the other resources sufficiently (or can bypass them in other ways: my favorite solution for this is Long Term Parking https://hal.inria.fr/hal-01225019/document ) you have the problem that a good branch predictor is not a perfect branch predictor. Branches 1 in 6; a 600 entry ROB will hold ~100 branches. Even if each of those has only a 1% chance of misprediction that means there's close to 100% certainty that there is A misprediction somewhere in the ROB in all those instructions that piled up behind the load that missed. It's (to good enough accuracy) equally likely anywhere, meaning that half the work in your ROB, 300 instructions, is likely wasted along with wasted energy, done after a bad branch and will have to be flushed. 99% accurate sounds great for a branch predictor (and neither AMD nor Intel hit that, nor Apple who are somewhat closer) -- but by itself a huge ROB and an imperfect branch predictor just mean you're doing a lot more work that will eventually get flushed.
A12 results here:

https://twitter.com/x/status/1307645405883183104

Since Apple have this huge ROB and clearly get value from it
(they didn't start that way, they started with designs that were apparently very similar to say x86 at the time, scaled 1.5x wider. Best thing I've found showing the evolution over the years is here: Not great, but gets the gist and I know nothing better
https://users.nik.uni-obuda.hu/sima...e_2019/Apple's_processor_lines_2018_12_23.pdf )
they are clearly NOT just scaling all the structures up without changing the underlying algorithms. I have some hypotheses for how they get value from that massive ROB abd imperfect branch predictor, but this is already too long!

But I hope you read all this and think about it. And then realize why I get so angry, why saying "oh they *just* doubled the size of the ROB [or cache or branch predictor or whatever" is such an ignorant statement. It's not just that doubling the size of those structures is hard (though it IS), it's that doubling them is not nearly enough, it just doesn't get you very much without accompanying changes in the way those structures are used.
And Apple has been engaged in these changes at an astonishing rate -- looks like pretty much every two years or so they swap out some huge area of functionality like the PRF or the ROB or the branch predictor or the LSQ and swap into something new and, after the first round or two of this, something that has never been built before.

Second point
- your post keeps confusing replay with flushing. Read what I said. These are different concepts, implemented in different ways. Likewise you seem to think that a large ROB helps deal with low quality speculation. Precisely backward! A large ROB is only valuable if your (control) speculation is extremely high quality and provides for extremely fast recovery from mis-speculation. Likewise you seem to think that the ROB is somehow related to the machine width. Not at all.

I suggest that, based on all the reading I have given you, start thinking this stuff out in your head. Consider the flow of instructions from fetch to completion -- at first you don't even need OoO or superscalar, just consider an in-order single issue CPU that utilizes branch prediction. Think about the type of functionality that will HAVE TO be present for this to possibly work, how it will recover when something goes wrong. And then you can start adding additional functionality (superscalar, then OoO) into your mental model.

name99 · Nov 16, 2020

LightningZ71 said:
What we don't know yet is how this plays out on the top end of the performance market. With AMD able to put 64 cores in a threadripper workstation with eight channel ram in the WX series, that's going to be hard to beat for what we believe to be thir best case scenario for Mac Pro, which is a pair of 8+8 die linked with a high speed link in the same package. That's a maximm of 16 full speed threads and 16 efficiency threads against 64 SMT cores. While having an IPC advantage and complete stack control are certainly useful things, as they say in the car world, there is no replacement for displacement.

Please don't report this sort of stuff as "we believe".
I have no idea on what basis you believe that the best case Mac Pro scenario is "a pair of 8+8 die linked with a high speed link in the same package", but plenty of us do not believe that...

This is uninformed speculation, not any sort of sensible (historically or engineering informed) guess.

Doug S · Nov 16, 2020

IvanKaramazov said:
It is entirely possible that Apple is pushing the M1 to its top possible frequency, and that no further ST performance is to be had there by simply improving thermals (e.g. putting it in a desktop form factor). In fact I suspect that's true.

I very much doubt that's the case. Apple is not binning their M1s for clock rate, they are all running at the same clock just like on the iPhone. When you fab chips you get a range of power/performance from the 'working' chips. You can do like AMD & Intel and bin them on speed/power and sell the ones able to run faster or at lower power for more more, in several bins down to the ones that 'work' but at the bottom end of the curve end up being sold in the cheapo Best Buy specials. Or you can do like Apple and have them all run at the same speed and within the same power budget.

In order for the M1 to be running at its top possible frequency, Apple would have to be throwing out the large majority of working M1s and selecting only the top bin. There's no way that's the case, it would cost a fortune just to get a few hundred extra MHz. They are likely selecting closer to the bottom of the curve so some M1s are running as fast as they can, but others have plenty of headroom (where "plenty" depends on how wide the power/performance curve from TSMC's N5 process is)

So far the only binning we've seen Apple do is on A12X/A12Z and M1 running with 7 instead of 8 GPU cores. When the A14X iPad Pro ships we'll have to see if it gets the same score, indicating it is running at the same clock rate as the M1, or is a little slower indicating they are at least binning between Mac and iPad Pro.

IvanKaramazov · Nov 16, 2020

Doug S said:
I very much doubt that's the case. Apple is not binning their M1s for clock rate, they are all running at the same clock just like on the iPhone. When you fab chips you get a range of power/performance from the 'working' chips. You can do like AMD & Intel and bin them on speed/power and sell the ones able to run faster or at lower power for more more, in several bins down to the ones that 'work' but at the bottom end of the curve end up being sold in the cheapo Best Buy specials. Or you can do like Apple and have them all run at the same speed and within the same power budget.

In order for the M1 to be running at its top possible frequency, Apple would have to be throwing out the large majority of working M1s and selecting only the top bin. There's no way that's the case, it would cost a fortune just to get a few hundred extra MHz. They are likely selecting closer to the bottom of the curve so some M1s are running as fast as they can, but others have plenty of headroom (where "plenty" depends on how wide the power/performance curve from TSMC's N5 process is)

So far the only binning we've seen Apple do is on A12X/A12Z and M1 running with 7 instead of 8 GPU cores. When the A14X iPad Pro ships we'll have to see if it gets the same score, indicating it is running at the same clock rate as the M1, or is a little slower indicating they are at least binning between Mac and iPad Pro.

That's fair. My suspicion has been that they're binning precisely as you said, with slower chips for the iPad next spring. I guess we'll see!

shady28 · Nov 16, 2020

IvanKaramazov said:
Yeah, I tried doing this as well, extrapolating potential Cinebench scores based on the Geekbench delta between A12Z and M1. It's not a stupid way to guess, as if you compare GB and Cinebench for A12Z and the Intel chips they track pretty well. A (wildly) speculative, extrapolated 1537 would put the M1 faster than the 15w Tiger Lake ST, and in the ballpark of the 28w Tiger Lake ST. Which sounds about right. The same math on the MT scores would put the M1 just shy of 7000, while the 28w TGL sits around 6000. Again, sounds likely.

The real wildcard though is how much of the MT score is related to thermal constraints. The 4800u, for example, has an MT score of around 10,500, despite being very similar to the M1 in Geekbench MT. It's an open question whether M1 has a sustained MT profile closer to the Ryzen or Tiger Lake chips.

And we should know in about 48 hours. Right now Tiger Lake is the fastest single thread sub 35W laptop CPU you can buy in the x86 world along with the fastest iGPU. I would bet that on a sustained throughput metric like CineBench Tiger Lake will beat M1 on the passive cooled MacBook Air, but may lose to the Mini and MacBook Pro with their active cooling.

It will be interesting to compare and contrast M1 / TGL along with the older 8 core 4700U and much more common / lower price 6 core / 12 thread 4600U.

Here's a good comparative model, ZenBook 13 16GB RAM / 1TB SSD / Thunderbolt 4 using i7-1165G7. The 1165G7 is 100Mhz slower than the top Tiger 1185G7. This one sells for $999, same as MacBook Air, but with twice the RAM and 4x the SSD size of the $999 Air model. 13 Hr 47 Min battery test at Tomshardware. I think this is pretty close to the best the x86 world has to offer in this price range for thin and light iGPU laptops.

Review :

Asus ZenBook 13 (11th Gen) Review: Tiger Lake’s Impressive Debut

The Tiger Lake ZenBook 13 is stronger and cheaper than competitors.

www.tomshardware.com

Geekbench scores for this model :

Well maybe not the *best*, 9310 seems to take that (on Linux):

shady28 · Nov 16, 2020

Sorry M1, this is not even the top TGL SKU. This is a Linux platform though :

IvanKaramazov · Nov 16, 2020

shady28 said:
Sorry M1, this is not even the top TGL SKU. This is a Linux platform though :

View attachment 33970

One has to trawl through about 28 pages of GB results to find that 1716 score, which is paired with an anomalously low MT score for some reason. It's 8 pages at the moment to the first score above 1600. The theoretically faster 1185G7 is hardly shipping anywhere and only has 3 pages of results. Very few are above 1600 and the absolute highest reported is 1607.

The highest score currently reported for the new MBP is 1740, and the absolute lowest yet is 1559. And that is the sole score below 1600. And of course that leaves MT entirely out of the discussion, as well as the much faster GPU, and the likelihood that M1 is achieving its results at literally half the power of 28w TGL. Perhaps when a wider variety of metrics are available to compare, TGL will be competitive with M1. But based purely on what we know so far it doesn't seem likely.

Discussion Apple Silicon SoC thread

Lifer

Golden Member

Member

Member

Senior member

Golden Member

Lifer

Senior member

Lifer

Platinum Member

Diamond Member

Platinum Member

Lifer

Member

Diamond Member

Golden Member

Senior member

Member

Member

Senior member

Senior member

Diamond Member

Member

Platinum Member

Platinum Member

Member