Originally posted by: imgod2u
Unfortunately, whether the cache blocks or not, the processor will still run into problems when the data it needs is not there. This is the major bottleneck in modern systems, with processor idle times in the 200+ cycle range. That's way more of a hit than any branch mispredict.
I think you missed the significance of a nonblocking cache, but you do bring up another good point. Although stalls on memory accesses are pretty rare (ignoring compulsory misses), their associated penalties are large enough to be a problem.
You don't necessarily have to resolve all branches. But in the 200+ cycle wait time, that's at least 10 or so branches you could resolve. Granted, a lot of them will be discarded, but performance will still be higher.
Let me try to be a bit more clear. First off, a branch mispredict does not necessarily incur a memory access penalty. It does, however, result in a pipeline flush. Current branch prediction schemes can easily hit over 95% accuracy, and averages around 98% are entirely possible.
Let us see if performance would be higher. Assume an ideal processor that always has a full pipeline. A branch mispredict flushes 120 micro-ops. Given a simplistic algorithm like gshare that yields 93% accuracy, that means 7% of the time the processor mispredicts the branch and flushes the entire pipe.
Take a particular instruction stream with an average of 6 instructions between branches and 1.5 micro-ops per instruction; that means you have a branch instruction every 9 micro-ops. It is very likely a branch itself is broken down into at least two micro-ops, one for address calculation and one for branch resolution. This yields a nice round number, 10 micro-ops, from one branch to the next. With 120 micro-ops in flight, that means we have 12 branches in flight at any time.
Put it together and 0.7% of all micro-ops (7% of the one in ten that are branches) incur a pipeline flush. At 120 micro-ops thrown away per mispredict, the result is a performance penalty of 84%.
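To keep myself honest, here's that arithmetic as a quick back-of-the-envelope Python sketch; every constant is just one of the assumptions above, not a measured number:

# Rough model: 120 micro-ops in flight, one branch per 10 micro-ops,
# a gshare-style predictor at 93% accuracy (all assumptions from above).
UOPS_IN_FLIGHT = 120
UOPS_PER_BRANCH = 10        # 6 instructions x 1.5 uops, plus the extra branch uop
MISPREDICT_RATE = 0.07      # 1 - 0.93

branch_fraction = 1.0 / UOPS_PER_BRANCH               # one in ten uops is a branch
flush_fraction = branch_fraction * MISPREDICT_RATE    # 0.7% of uops flush the pipe
penalty = flush_fraction * UOPS_IN_FLIGHT             # wasted uops per useful uop
print(f"baseline penalty: {penalty:.0%}")             # -> 84%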
Now assume that one branch in flight is always correct: every twelfth branch is guaranteed to be right because both possible outcomes are issued. Assume, for the sake of simplicity, that the extra hardware incurs zero penalty. This means every twelfth branch effectively halves the issue rate of the processor until that branch is retired.
One in ten micro-ops is a branch. One twelfth of those will never mispredict, but the duplicated path halves the issue rate until retirement; that's one extra segment of work for every twelve useful segments, a penalty of 8.33...%. The other eleven twelfths still use the branch predictor, so their penalty drops to 77% (eleven twelfths of the 84% above). The total performance penalty is now about 85%.
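Extending the same sketch to this scenario (again, only the assumed numbers from above):

# Every twelfth branch issues both paths, so it never mispredicts, but the
# duplicated path is one extra segment of work per twelve useful segments.
BRANCHES_IN_FLIGHT = 12
eager_penalty = 1.0 / BRANCHES_IN_FLIGHT           # ~8.33% of the work is the doomed copy
predicted_penalty = (11.0 / 12.0) * 0.007 * 120    # 77% from the branches still predicted
total = eager_penalty + predicted_penalty
print(f"one eager branch: {total:.0%}")            # -> ~85%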
Now assume that all branch outcomes are interleaved to yield 100% branch prediction. 12 branches are in flight. The first branch spawns two threads, which each spawn two threads, which each spawn two more. However, once the first branch is retired, about half the pipeline is flushed, regardless of the outcome. I say about half because only 12 segments can fit, yet there are 2+4+8=14 segments. The performance penalty is now about 100%.
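In the same back-of-the-envelope terms, with the "about half" as a rough guess rather than anything measured:

# With every path issued, roughly half the in-flight work sits on the side of
# the oldest branch that gets thrown away when it retires ("about half the
# pipeline is flushed"), so the machine wastes about one uop per useful uop.
wasted_fraction = 0.5                                  # the "about half" estimate
penalty = wasted_fraction / (1.0 - wasted_fraction)    # wasted uops per useful uop
print(f"all paths in flight: {penalty:.0%}")           # -> 100%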
Redoing the math for half the ideal issue rate, which gives 60 micro-ops in flight (the reason for HT), the numbers come out to 42%, 38.5% and 0%. This assumes that all required resources are available on any clock cycle. However, it also means everything takes twice as long to finish compared to the max issue rate with 120 micro-ops in flight. Relative performance compared to 120 micro-ops in flight with standard branch prediction is 65% with half the issue rate alone, 67% with half the issue rate and one eager branch, and 92% with every branch path in flight.
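The relative numbers fall out of the same model, with performance taken as issue rate divided by (1 + penalty) and normalized to the 120 micro-op baseline with ordinary prediction:

# Penalties at half the ideal issue rate (60 uops in flight): 42%, 38.5%, 0%.
def rel_perf(issue_rate, penalty, base_penalty=0.84):
    """Performance relative to full issue rate with standard prediction."""
    return (issue_rate / (1.0 + penalty)) / (1.0 / (1.0 + base_penalty))

print(f"{rel_perf(0.5, 0.42):.0%}")    # standard prediction at half rate -> ~65%
print(f"{rel_perf(0.5, 0.385):.0%}")   # plus one eager branch -> ~66% (close to the 67% above)
print(f"{rel_perf(0.5, 0.0):.0%}")     # every path in flight -> 92%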
Since reality is rarely ideal, it's quite possible to get different numbers. I assumed the processor issues and retires 5 micro-ops per cycle. I ignored memory accesses due to cache misses since I don't know the miss rates. That means I also ignored the penalty of decoding x86 instructions versus hitting in the trace cache. I also ignored penalties associated with the additional logic. For branch prediction, 93% is a decade-old figure. The Pentium 4 branch predictor is better; if I remember correctly, it uses a tournament predictor with accuracy around 97% or better. Speaking of which, I also ignored pipeline bubbles at the fetch and decode stages from an overridden prediction.
MUXes make up a very small part of the circuitry, though. The control units will have to be bigger, but the overall increase in complexity isn't as high.
I simply used MUXes as an example of exponential growth in complexity.
As for keeping track of instructions in the stream, each branch would add maybe 5-7 instructions to monitor (as opposed to the original method of predicting), since they're all part of the same instruction stream. You wouldn't need to double your instruction window or your reorder window.
That only applies to loops. Most branches will require the processor to fetch more instructions from non-sequential memory segments. This is why the trace cache was originally developed. Storing entire traces per line instead of memory segments means that no single instruction straddles two cache lines. Caching micro-ops was simply a great idea.
And if you run out of places to put those instructions, you can always stall. But I think the bottleneck with modern MPUs isn't a lack of execution resources but the means to fill them. This is an excellent way to find extra instructions to issue (most of them will be discarded, but those that aren't will improve performance).
It is perhaps the easiest way to find instructions to issue. However, it is also the easiest way to waste resources. If you do it for only one branch, you're essentially duplicating everything up to the ROB. If you're doing this with multiple branches, it gets worse and worse as the pipeline lengthens. Going with every single branch would require duplicating every single stage prior to retirement just to handle branch-intensive code. Imagine the memory traffic.
Each mispredicted branch that is avoided because the right path was already taken will save 31 cycles on a Prescott, and that's time that would've been wasted anyway with the processor sitting idle waiting for memory.
I think your numbers are mixed up. A pipeline flush wastes 31 cycles, but a memory access requires hundreds of cycles. How many cycles would be saved by 100% branch prediction is difficult to calculate without hard data. I'm pretty sure the L2 cache will hit greater than 90% of the accesses that miss the L1, which means less than 1% of all instructions in any given program will go all the way to memory (ignoring compulsory misses).
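Roughly, and only if you grant me a ballpark 10% L1 miss rate (a guess on my part, since as I said I don't have real miss rates), the arithmetic behind that is:

# Only accesses that miss the L1 ever reach the L2, so the fraction of
# instructions that go all the way to memory is the product of the two rates.
L1_MISS_RATE = 0.10     # assumed placeholder, not a measured number
L2_MISS_RATE = 0.10     # from the ">90% L2 hit rate" above
to_memory = L1_MISS_RATE * L2_MISS_RATE
print(f"{to_memory:.1%} of instructions go to main memory")   # -> 1.0% at the boundary, less with better hit rates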