Thank you for bringing things back around to technical discussion.
One area is most interesting to me, brought up in two separate posts of yours that seem related:
I don't think there's any question that Apple are using way prediction. For any of the major players not to be using way-predicted set-associative caches would be absurd, no? AMD and Intel have been using microtag way prediction since W's presidency. In any case, how does Apple's implementation differ, if at all, and do you have any explanation for why you pinpoint that exact comment about way prediction, speculative scheduling, and replay?
Some things I was thinking about:
As far as I was aware, if flushing were the "problem", and by "flushing" you mean non-selective replay (which you may not - in which case, please explain!), then way predictors haven't been "problematic" since at least the Pentium 4, which doesn't use non-selective replay; however, Seznec and Michaud indicate that non-selective replay may be viable as long as an ROB or buffer (a la the replay queue on the P4) is available and efficient. As for needing a quality replay mechanism, isn't it better to prevent the need for replay in the first place by designing better speculative scheduling? Seeing as AMD switched from a neural-network branch predictor to TAGE, I doubt they missed those papers... and since Apple have been using TAGE since at least 2013, it would make sense that Apple is also aware of the benefit of preventing replay in the first place with better speculative scheduling.
That being said, given the width of Apple's core and the large ROB, do you imagine they're using token-based selective replay (or something similar - or, since they called their core Cyclone back in 2013, maybe they're using the Cyclone replay scheme!) rather than other mechanisms? It wouldn't be the first time they've used WARF research! This may explain the large ROB. Though a large ROB would solve a lot of replay issues regardless of replay scheme, including those associated with flushing.
Edit: Andrei says the ROB size "outclasses" any other design and questions how Apple can "achieve" this design, but it's not that simple, is it? A large ROB could be a Band-Aid for a larger speculation problem, or could be relieving a bottleneck, or could reflect a different speculative scheduling/replay mechanism... or, my suspicion is that it's larger by virtue of Apple having such a wide core. Indeed, the next-largest ROBs are on Intel chips, and the smallest on Zen3. I don't know enough about X1 to comment on why it'd be the smallest, but if this is indeed a bottleneck and X1 uses a similar branch prediction / replay scheme, then it wouldn't make much sense to have such a small ROB.
It feels to me like you don't quite grasp the essential distinctions here.
All speculation is a (statistics-informed) guess about how the program will probably behave, together with a way to recover if your guess was incorrect.
The most obvious and well-established form of speculation is branch prediction. The speculation is ultimately about the sequence of instructions through the CPU, and the recovery mechanism has two main pieces:
- various mechanisms (physical vs logical registers, and the LSQ [as a secondary function]) hold calculated values provisionally until the branches they depend on are resolved, at which point those values can be graduated from provisional state to correct state.
- the ROB holds the provisional ordering of instructions, connecting the instruction ordering (which instructions are now known to be valid, given that the branch that led to them is correct) to the values described above (sitting in physical registers and the LSQ).
What's important is the dependency graph: what set of subsequent speculated values are dependent on a speculated branch, and thus will be shown to be incorrect if that branch was incorrect.
Now the nature of control speculation (speculation on the sequence of instructions) is that once you predict a branch, for practical purposes EVERY instruction after that branch depends on that branch. Which means that if a branch was guessed incorrectly (direction or target), everything after it needs to be flushed.
Now you might protest that this is not true, that there are branch structures (diamonds) like
if (a < 0) {
    b = -a;
} else {
    b = a;
}
where there's a tiny amount of constrained control divergence, after which flow reconverges. This is true. But it doesn't help. Even if you wanted to correct speculation that led down the wrong half of the diamond just by flushing the instruction b=-a and executing b=a, everything after the diamond closes is dependent on the value of b and is now also incorrect. It's just not practical to track everything that does or does not depend on a particular branch and selectively flush that branch and its dependents and nothing else:
(a) almost EVERYTHING is dependent, so this buys you very little, and
(b) branches are so dense (think ~1/6 of instructions) that you'd need tremendously complicated accounting to track what is dependent on this branch rather than that one.
So the end result is: control speculation, as a practical matter, has to recover by flushing *everything* (all instructions, all calculated values) after a misprediction.
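To make that concrete, here's a toy sketch of what that recovery looks like, treating the ROB as a circular buffer of provisional results. This is just the generic textbook shape of the idea, not any particular shipping design, and every name in it (RobEntry, flush_after, free_preg) is mine:

#include <stdbool.h>
#include <stdint.h>

#define ROB_SIZE 64   /* toy size; real windows are in the hundreds */

typedef struct {
    bool     is_branch;
    bool     has_dest;        /* does this instruction hold a destination register? */
    uint16_t dest_phys_reg;   /* physical register allocated at rename time */
} RobEntry;

typedef struct {
    RobEntry entries[ROB_SIZE];
    int head;   /* oldest in-flight instruction */
    int tail;   /* next free slot */
} Rob;

/* Recover from a mispredicted branch sitting at index branch_idx:
   every younger entry is simply discarded, and any resource it was
   holding (here, just a physical register) is handed back. */
void flush_after(Rob *rob, int branch_idx, void (*free_preg)(uint16_t)) {
    int i = (branch_idx + 1) % ROB_SIZE;
    while (i != rob->tail) {
        if (rob->entries[i].has_dest)
            free_preg(rob->entries[i].dest_phys_reg);
        i = (i + 1) % ROB_SIZE;
    }
    rob->tail = (branch_idx + 1) % ROB_SIZE;   /* the branch stays; everything after it is gone */
}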
If you think about this in detail, it leads to a whole set of issues.
- Of course you want an accurate predictor, that's a given. But you also want to catch mispredicts, if you can, at decode or rename, before they enter the OoO machinery, because catching them there only flushes the instructions queued after them in the various buffers sitting between fetch and rename. Hence the value of a long-latency (but even more accurate) secondary branch prediction mechanism using even larger pools of storage.
- you want to avoid branches with the characteristic that they are hard to predict and do very little work (like the diamond I described above). Things like MAX or ABS. This leads to the value of predicated instructions and things like CSEL/CMOV (see the sketch just below this list). The whole story of CMOV in the x86 world is a tragedy, and since this is supposed to be purely technical I won't cover it. But the fallout is that much of the x86 world, even today, is convinced that predication is a bad idea (and imagines that tiny micro-benchmarks prove this). But microbenchmarks miss the big picture. The value of predication is that it converts a branch (which becomes massively expensive if it's mis-predicted) into straight-line execution with no speculation and no penalties. Fortunately ARM CSEL was well designed and implemented from the start, so ARM doesn't have this weird x86 aversion. IBM even converts short branches to predication and I suspect Apple does the same (just on the grounds that Apple seems to have implemented every good idea that has ever been invented).
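To illustrate (my example, nothing Apple-specific): the ABS diamond from earlier can be written either way in C, and on ARM64 a compiler will typically turn the select form (and usually the branchy form too, via if-conversion) into a compare plus a conditional instruction like CSEL/CNEG, so there is simply nothing for the branch predictor to get wrong:

/* Branchy version: a diamond that is hard to predict if the sign of a is effectively random. */
int abs_branchy(int a) {
    int b;
    if (a < 0) {
        b = -a;
    } else {
        b = a;
    }
    return b;
}

/* Predicated version: the same result expressed as a select.
   On ARM64 this typically compiles to cmp + cneg, no branch at all,
   so a random sign pattern costs nothing extra. */
int abs_select(int a) {
    return (a < 0) ? -a : a;
}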
There's vastly more that can be said about branches (some of it said by me here, going forwards and backwards from this anchor link):
Look for the posts by me, Maynard Handley
https://www.realworldtech.com/forum/?threadid=196054&curpostid=196130
Even apart from that, you can start asking how exactly you recover from a misspeculation... The first ways of doing this, in the late 80s, based on walking the ROB (not very large at the time), were adequate but didn't scale. So then came checkpoints, but there's a whole subspecialty in when you implement checkpoints... EVERYTHING in this space is much more than just the words -- I can implement checkpoints, and you can implement checkpoints, but I can get much more value out of my checkpoints than you (ie mispredicts are a lot cheaper) if I am smarter about when I create each checkpoint.
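As a rough sketch of that last point (generic textbook idea, not any particular core, and all the names are mine): a checkpoint is essentially a snapshot of the rename map, there are only a handful of them, and you get the most out of them by spending them on branches the predictor itself flags as low confidence:

#include <stdbool.h>
#include <stdint.h>

#define NUM_ARCH_REGS   32
#define MAX_CHECKPOINTS 8   /* checkpoints are expensive, so there are few */

typedef struct {
    uint16_t map[NUM_ARCH_REGS];  /* architectural -> physical register map */
} RenameMap;

typedef struct {
    RenameMap saved;
    uint64_t  branch_id;
    bool      in_use;
} Checkpoint;

/* Only snapshot the rename map for branches the predictor is unsure about.
   High-confidence branches fall back to the slower ROB-walk recovery,
   so the scarce checkpoints go where mispredicts are actually likely. */
bool maybe_checkpoint(Checkpoint cps[], const RenameMap *current,
                      uint64_t branch_id, int predictor_confidence,
                      int confidence_threshold) {
    if (predictor_confidence >= confidence_threshold)
        return false;                 /* likely correct: don't burn a checkpoint */
    for (int i = 0; i < MAX_CHECKPOINTS; i++) {
        if (!cps[i].in_use) {
            cps[i].saved = *current;  /* recovery later is just restoring this map */
            cps[i].branch_id = branch_id;
            cps[i].in_use = true;
            return true;
        }
    }
    return false;                     /* out of checkpoints: fall back to ROB walk */
}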
But all the above was throat clearing. The point is that control speculation has characteristics that mean recovering from a misprediction requires flushing everything after the misprediction. But there are other forms of speculation.
One of the earliest, which you know something about, is speculative scheduling (ie guess that a load will hit in cache, and schedule subsequent instructions based on that).
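A toy sketch of what that guess looks like from the scheduler's point of view (made-up latency, made-up names): dependents are woken up assuming the common-case L1 hit latency, and a few cycles later the tag check either confirms the guess or tells you those dependents ran with garbage and must be replayed:

#include <stdbool.h>

#define L1_HIT_LATENCY 4   /* assumed common case: wake dependents this many cycles after the load issues */

typedef struct {
    int  issue_cycle;
    bool is_load;
    bool l1_hit;           /* not actually known until the tag check completes */
} Uop;

/* Speculative wakeup: schedule a dependent as if the load will hit in L1. */
int speculative_wakeup_cycle(const Uop *load) {
    return load->issue_cycle + L1_HIT_LATENCY;
}

/* Verification, a few cycles later: if the guess was wrong, the dependents
   that already issued consumed garbage and must be replayed once the data
   really arrives. How cheaply you can do that replay is exactly the question. */
bool needs_replay(const Uop *load) {
    return load->is_load && !load->l1_hit;
}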
Another is load/store aliasing: if a store doesn't yet know its address, so it's just sitting in the store queue waiting, what do we do with a subsequent load? We could delay it until we know the address of every older store, but chances are the load doesn't actually load from the address of that store, so we are delaying for nothing. Classical speculation territory... (One way to do this is what I referred to earlier with store sets and the Moshovos patent.) But once again, if your speculation is incorrect, then the contents of the load, and everything that depends on it, are now invalid.
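The naive version of that speculation, sketched out (ignoring access sizes and overlap; the store-set idea refines the "let it go" decision by remembering which load PCs have conflicted with which stores in the past):

#include <stdbool.h>
#include <stdint.h>

/* Naive memory-dependence speculation: let a load go ahead of an older
   store whose address isn't known yet, then check when that store
   finally resolves its address. */

typedef struct {
    uint64_t addr;
    bool     addr_known;
    uint64_t seq;          /* position in program order */
} StoreQueueEntry;

/* Called when an older store finally computes its address: any younger
   load that already executed from the same address read stale data and
   must be replayed (along with everything that consumed its result). */
bool load_was_misspeculated(const StoreQueueEntry *store,
                            uint64_t load_addr, uint64_t load_seq,
                            bool load_already_executed) {
    return store->addr_known &&
           load_already_executed &&
           load_seq > store->seq &&      /* load is younger than the store */
           load_addr == store->addr;     /* and they really do alias */
}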
A third possibility is value prediction. This notes that there are some loads that occur over and over again but what's loaded never changes, so you can bypass the load and just supply that value. This is the kind of thing you'd say "that's dumb, how often does it happen?" Well, unfortunately it happens way more than it should... Value prediction is in a kinda limbo right now. For years it was talked about but not practical (in the sense that, with limited resources, transistors were better spent elsewhere). But we are getting close to the point where it might start making sense. QC have been publishing on it for a few years now, but as far as I know no-one has stated that a commercial product is using it, though who knows -- maybe Apple have already implemented an initial version?
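A minimal last-value predictor, just to show the shape of the idea (this is the simplest member of the family; real proposals are much more careful about confidence, and none of this is a claim about what any shipping core does):

#include <stdbool.h>
#include <stdint.h>

#define VP_ENTRIES   1024
#define VP_CONFIDENT 3     /* only predict after several consecutive confirmations */

typedef struct {
    uint64_t last_value;
    uint8_t  confidence;
} VpEntry;

static VpEntry vp_table[VP_ENTRIES];

/* Predict: if this load PC has returned the same value several times in a
   row, hand that value to dependents immediately instead of waiting. */
bool vp_predict(uint64_t load_pc, uint64_t *predicted) {
    VpEntry *e = &vp_table[load_pc % VP_ENTRIES];
    if (e->confidence >= VP_CONFIDENT) {
        *predicted = e->last_value;
        return true;
    }
    return false;
}

/* Train/verify: when the real value arrives, either raise confidence or
   reset it. Returns true if a prediction was handed out and turned out wrong. */
bool vp_update(uint64_t load_pc, uint64_t actual, bool was_predicted) {
    VpEntry *e = &vp_table[load_pc % VP_ENTRIES];
    bool mispredicted = was_predicted && (actual != e->last_value);
    if (actual == e->last_value) {
        if (e->confidence < VP_CONFIDENT) e->confidence++;
    } else {
        e->last_value = actual;
        e->confidence = 0;
    }
    return mispredicted;
}

And the recovery question is the same as before: if the prediction was wrong, everything that consumed the predicted value has to be redone.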
For each of these cases, we again need to recover in the case of misspeculation. Recovery from these kinds of misspeculation is generically called replay. The important difference compared to control speculation is that data speculation lends itself much better to tracking the precise dependencies of successor instructions on the speculated instruction: the dependency chains are shorter and less dense. This means that, although the easiest response in the face of a data misspeculation is to reuse the control misspeculation machinery, that's not the only possible response.
But even if you accept this idea and so want to engage in some sort of selective replay, there are still many different ways to do it, and getting the details wrong can have severe consequences (as you are aware from the P4, and from Cyclone as a suggested mechanism for doing a lot better. However, I'd say Cyclone is best thought of as an OoO scheduler [something I've not discussed at all] rather than as a *generic* replay mechanism. And Cyclone was Michigan, not Wisconsin, though I also frequently confuse the two!)
So some points from all this:
- the sheer size of structures is not that important. Of course it is important, but not in the way the fanboy internet thinks. Almost every structure of interest consists of not just the structure size but an associated usage algorithm. And the structure is almost always sized optimally for that algorithm, in the sense that growing the structure, even substantially (like doubling) buys you very little in improved performance. The best paper demonstrating this is
which shows that even quadrupling Skylake resources in a naive way gets you only about 1.5x extra performance. People latch onto these numbers, like the size of the ROB or the amount of storage provided for branch prediction, because they're available. But that's the drunk looking for his keys under the lamp because that's where the light is! The numbers are made available (not by Apple, but by ARM, AMD, Intel, ...) precisely because they are NOT important; they don't offer much of a competitive advantage. The magic is in the algorithm for how that storage is used. Saying "620-entry ROB" implies there's a single way that something called a ROB is used, and that anyone can just double a ROB and get much better performance. NO! The ROB essentially determines how much provisional control state can be maintained, and scaling that up involves scaling up many, many independent pieces.
This is one reason I get so furious at people who say that Apple performance is "just" from a larger ROB or larger BTB or whatever. Such statements display utter ignorance of the problems that each piece of functionality solves and what goes into the solution. So, consider the ROB. The point of the ROB, as I explained, is to track what needs to be flushed if a prediction goes wrong. So why not just double the ROB?
Well, consider instruction flow. Instructions go into the issue queue(s), wait for dependencies to be resolved, issue, and execute. Execution takes, worst case, a few cycles, so why not have a ROB of 30 or 40 entries?
Because SOME instructions (specifically loads that miss in cache) take a long time, and the instruction at the head of the ROB cannot be removed from the ROB until it completes. So with a short ROB, after you've filled up the 40 slots with the instructions after that load, you stop and wait till the load returns.
OK, so the ROB is just a queue, easily scaled up. Why not make it 4000 entries?
Because almost every instruction that goes into the ROB also requires some other resource, and that resource is held onto until the instruction moves to the head of the ROB and completes. (What is called register rename is essentially resource allocation. During rename the resources an instruction will require -- a ROB slot, probably a destination register, perhaps an entry in the load or store queues -- are allocated, and they are held onto until completion.) So sure, you can have 4000 ROB slots, but if you only have 128 physical registers, then after those 128 are all allocated as destination registers, your ROB is filled with ~128 entries and the rest of them are empty, because the machine is going to stall until a new physical register becomes available.
So the first constraint on growing the ROB is that to do it you also need to grow the number of physical registers and the size of the LSQ. Neither of these is at all easy. And both of them involve their own usage algorithms where, yes, you can grow either the register file or the LSQ if you switch to using them in a novel way. But again, this novel usage model is not captured by saying "oh, they have a 144-entry load queue" as though that's some trivial achievement, just a little harder than a 72-entry load queue.
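A back-of-the-envelope way to see the constraint (illustrative numbers and fractions only, nothing measured from any real core): the usable window is bounded by whichever resource runs out first, so a huge ROB by itself buys nothing:

#include <stdio.h>

/* Rough effective-window estimate: every in-flight instruction holds a ROB
   slot, most hold a destination physical register, and loads/stores hold an
   LSQ entry. Whichever pool exhausts first caps the real window size. */
int effective_window(int rob, int free_pregs, int lsq,
                     double frac_with_dest, double frac_mem) {
    int limit_pregs = (int)(free_pregs / frac_with_dest);
    int limit_lsq   = (int)(lsq / frac_mem);
    int w = rob;
    if (limit_pregs < w) w = limit_pregs;
    if (limit_lsq   < w) w = limit_lsq;
    return w;
}

int main(void) {
    /* A 4000-entry ROB backed by only 128 free physical registers and a
       72-entry LSQ behaves like a window of roughly 160 instructions;
       the other ~3840 ROB slots can never fill. */
    printf("%d\n", effective_window(4000, 128, 72, 0.8, 0.35));
    return 0;
}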
But even THAT is not the end of the game. Because even if you can grow the other resources sufficiently (or can bypass them in other ways: my favorite solution for this is Long Term Parking
https://hal.inria.fr/hal-01225019/document ) you have the problem that a good branch predictor is not a perfect branch predictor. Branches are roughly 1 in 6 instructions, so a 600-entry ROB will hold ~100 branches. Even if each of those has only a 1% chance of misprediction, that works out to roughly a 63% chance (1 - 0.99^100) that there is A misprediction somewhere in the ROB, in all those instructions that piled up behind the load that missed. It's (to good enough accuracy) equally likely anywhere, meaning that when it happens, on average half the work in your ROB, ~300 instructions, is wasted, along with the energy spent on it: done after a bad branch and due to be flushed. 99% accurate sounds great for a branch predictor (and neither AMD nor Intel hit that, nor Apple, who are somewhat closer) -- but by itself a huge ROB and an imperfect branch predictor just mean you're doing a lot more work that will eventually get flushed.
A12 results here:
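Spelling out that arithmetic (same illustrative numbers as above, nothing measured):

#include <math.h>
#include <stdio.h>

int main(void) {
    int    rob_size        = 600;
    double branch_fraction = 1.0 / 6.0;   /* roughly one branch per six instructions */
    double mispredict_rate = 0.01;        /* an optimistic 99%-accurate predictor */

    double branches_in_window = rob_size * branch_fraction;              /* ~100 */
    double p_clean_window = pow(1.0 - mispredict_rate, branches_in_window);
    printf("P(at least one mispredict in the window) = %.2f\n",
           1.0 - p_clean_window);          /* ~0.63 with these numbers */
    return 0;
}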
Since Apple have this huge ROB and clearly get value from it
(they didn't start that way; they started with designs that were apparently very similar to, say, x86 of the time, scaled 1.5x wider. The best thing I've found showing the evolution over the years is here; not great, but it gets the gist and I know of nothing better:
https://users.nik.uni-obuda.hu/sima...e_2019/Apple's_processor_lines_2018_12_23.pdf )
they are clearly NOT just scaling all the structures up without changing the underlying algorithms. I have some hypotheses for how they get value from that massive ROB and imperfect branch predictor, but this is already too long!
But I hope you read all this and think about it. And then realize why I get so angry, why saying "oh they *just* doubled the size of the ROB [or cache or branch predictor or whatever]" is such an ignorant statement. It's not just that doubling the size of those structures is hard (though it IS), it's that doubling them is not nearly enough; it just doesn't get you very much without accompanying changes in the way those structures are used.
And Apple has been engaged in these changes at an astonishing rate -- it looks like pretty much every two years or so they swap out some huge area of functionality like the PRF or the ROB or the branch predictor or the LSQ and swap in something new and, after the first round or two of this, something that has never been built before.
Second point
- your post keeps confusing replay with flushing. Read what I said. These are different concepts, implemented in different ways. Likewise you seem to think that a large ROB helps deal with low quality speculation. Precisely backward! A large ROB is only valuable if your (control) speculation is extremely high quality and provides for extremely fast recovery from mis-speculation. Likewise you seem to think that the ROB is somehow related to the machine width. Not at all.
I suggest that, based on all the reading I have given you, you start thinking this stuff out in your head. Consider the flow of instructions from fetch to completion -- at first you don't even need OoO or superscalar, just consider an in-order single-issue CPU that uses branch prediction. Think about the type of functionality that will HAVE TO be present for this to possibly work, and how it will recover when something goes wrong. And then you can start adding additional functionality (superscalar, then OoO) into your mental model.