Is x86 architecture hampering performance?


Vee

Senior member
Jun 18, 2004
689
0
0
Originally posted by: oconnect
While I don't know a whole lot about other architectures I have heard there are/were better types of architecture that can beat our current architecture. What do you guys think?
No.
That game is over. The increased number of transistors available, plus pipelining, register renaming, register files, vector operations, and OoO execution, have basically cancelled all of RISC's advantages. Someone at AMD put it something like this: x86 was not the best place to start, nor the most rational route along the way, but it is the right destination in the end.
Today, the instruction execution speed is not the issue. The memory system is.
Today, a "threat" comes from IBM's cell processor, but x86 will go down that way too, - multicore, multi cpu.

P.S. However, there was one extremely alien/radical idea for a software-OS-CPU architecture that is still secret (so I can't tell you anything). Just to give you an idea of how alien it was: no existing programming language could be used to write source for it! No existing software could even be ported. You'd have to start over from scratch. You all understand the obvious reason it was abandoned.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Originally posted by: Sahakiel
Last I checked, x87 is IEEE compliant. The internal 80-bit precision is not.

i got one of my posts wrong. x87 and sse/sse2 are ieee 754 compliant (although with sse/sse2 you can turn off strict compliance). ieee 754 only describes the way floating point numbers are laid out in binary and the operations on those numbers.
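For illustration, a small C sketch of the 80-bit point: with x87, intermediates can be kept at extended precision, so the same expression can give a different answer than strict double (SSE2) arithmetic. Exactly what it prints depends on the compiler and which FP unit it targets; this is just a sketch, not anyone's posted code.

/* Why x87's 80-bit internal precision can change results.
   Behavior depends on whether the compiler keeps intermediates in x87
   registers (e.g. gcc -mfpmath=387) or computes in plain 64-bit doubles
   (gcc -mfpmath=sse -msse2). */
#include <stdio.h>

int main(void)
{
    volatile double big = 1e16;    /* the spacing (ulp) at 1e16 is 2.0 in double */
    double r = (big + 1.0) - big;  /* exact with an 80-bit intermediate,
                                      the +1.0 rounds away in pure double */

    /* x87 with extended intermediates typically prints 1, SSE2 prints 0 */
    printf("(1e16 + 1.0) - 1e16 = %g\n", r);
    return 0;
}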
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Vee
Today, the instruction execution speed is not the issue. The memory system is.
Then why do so many applications perform differently on 2GHz vs 1.5GHz CPUs? Modern caches do a good enough job at hiding a lot of the memory performance issues that execution speed does matter.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: CTho9305
Originally posted by: Vee
Today, the instruction execution speed is not the issue. The memory system is.
Then why do so many applications perform differently on 2GHz vs 1.5GHz CPUs? Modern caches do a good enough job at hiding a lot of the memory performance issues that execution speed does matter.

I think what he's trying to say is that modern technology has pretty much factored out ISA design from the equation. You'd pretty much have to be trying to make ISA an issue in order to have it hamper performance.
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
Originally posted by: Vee
P.S. However, there was one extremely alien/radical idea for a software-OS-CPU architecture that is still secret (so I can't tell you anything). Just to give you an idea of how alien it was: no existing programming language could be used to write source for it! No existing software could even be ported. You'd have to start over from scratch. You all understand the obvious reason it was abandoned.

Photo of the facility where this "alien CPU" was discovered

(well sort of, if you look at the URL of the image itself, you see that this location is "special" in terraserver, and doesn't match the way the images of all other locations are put together.)

This view about a half mile east of the supposed coordinates is more interesting
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Sahakiel
Originally posted by: CTho9305
Originally posted by: Vee
Today, the instruction execution speed is not the issue. The memory system is.
Then why do so many applications perform differently on 2GHz vs 1.5GHz CPUs? Modern caches do a good enough job at hiding a lot of the memory performance issues that execution speed does matter.

I think what he's trying to say is that modern technology has pretty much factored out ISA design from the equation. You'd pretty much have to be trying to make ISA an issue in order to have it hamper performance.

3-5 stages of decode for x86 is a pretty decent hit if your branch predictor isn't doing well.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: CTho9305
3-5 stages of decode for x86 is a pretty decent hit if your branch predictor isn't doing well.

Heh.. if your branch predictor isn't doing well, you're probably having loads of other problems as well.
The point was that modern design technologies compensate for ISA deficiencies to the point where either you're really trying to make the ISA a significant issue or your design team has no clue.
 

Vee

Senior member
Jun 18, 2004
689
0
0
Originally posted by: CTho9305
Originally posted by: Vee
Today, the instruction execution speed is not the issue. The memory system is.
Then why do so many applications perform differently on 2GHz vs 1.5GHz CPUs? Modern caches do a good enough job at hiding a lot of the memory performance issues that execution speed does matter.
Caches are part of the memory system, as are prefetch etc.
Otherwise, I don't understand you. Of course applications perform differently on different architectures.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: CTho9305
Originally posted by: Sahakiel
Originally posted by: CTho9305
Originally posted by: Vee
Today, the instruction execution speed is not the issue. The memory system is.
Then why do so many applications perform differently on 2GHz vs 1.5GHz CPUs? Modern caches do a good enough job at hiding a lot of the memory performance issues that execution speed does matter.

I think what he's trying to say is that modern technology has pretty much factored out ISA design from the equation. You'd pretty much have to be trying to make ISA an issue in order to have it hamper performance.

3-5 stages of decode for x86 is a pretty decent hit if your branch predictor isn't doing well.

With a trace cache, it's much less of a problem. Look at it this way. Even on a 31-stage (from the trace cache), superpipelined MPU, a branch mispredict will cost you about 31 cycles (probably less, as you know the branch target before the branch instruction retires). On the same MPU, a cache miss and subsequent fetch to memory will cost you 300+ cycles (assuming your processor is clocked at 3+ GHz and your memory runs at 200MHz, with all the latencies of the memory controller and memory fetch system). You tell me which is worse.

Memory is, as of right now, the dominant factor in performance. Just look at the wonders an integrated memory controller did for the K8. Of course, if you clock your CPU higher, it'll still bring an increase in performance, as some parts of the running program still fit into cache. That doesn't mean the occasional cache miss doesn't still hurt performance, and far more than any other hazard.
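A rough back-of-the-envelope in C makes the point; the 31-cycle and 300-cycle figures are the ones quoted above, and the rates are illustrative assumptions, not measurements of any real chip:

/* Back-of-the-envelope stall accounting, per instruction. */
#include <stdio.h>

int main(void)
{
    double branch_freq     = 0.10;  /* assume ~1 in 10 instructions is a branch  */
    double mispredict_rate = 0.05;  /* assume a 95%-accurate predictor           */
    double flush_cost      = 31.0;  /* mispredict penalty quoted above, cycles   */

    double miss_per_insn   = 0.01;  /* assume 1% of instructions miss all caches */
    double miss_cost       = 300.0; /* trip to memory quoted above, cycles       */

    double branch_penalty = branch_freq * mispredict_rate * flush_cost;
    double memory_penalty = miss_per_insn * miss_cost;

    printf("branch stalls: %.2f cycles per instruction\n", branch_penalty); /* ~0.16 */
    printf("memory stalls: %.2f cycles per instruction\n", memory_penalty); /* ~3.00 */
    return 0;
}

Even with a modest 1% miss rate assumed, the memory term dominates by more than an order of magnitude.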
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: Vee
Originally posted by: CTho9305
Originally posted by: Vee
Today, the instruction execution speed is not the issue. The memory system is.
Then why do so many applications perform differently on 2GHz vs 1.5GHz CPUs? Modern caches do a good enough job at hiding a lot of the memory performance issues that execution speed does matter.
Caches are part of the memory system, as are prefetch etc.
Otherwise, I don't understand you. Of course applications perform differently on different architectures.

Basically, he's saying take a P4C at 2.4 GHz and compare it to a P4C 3.2 GHz and you'll get a performance boost with the faster proc despite both running the memory bus at 200MHz. Caches compensate for slow memory systems and these days they work well enough to allow for measurable performance differences.
 

Jeff7181

Lifer
Aug 21, 2002
18,368
11
81
I think parallel is the way to go. We're already approaching the limits of the speed we can run things at before they create so much heat it's not worth the cooling solution necessary to get the processor to run that fast. The only problem I see with creating highly parallel hardware optimized for certain things is you lose flexibility. Look at a GPU... it's insanely powerful... use 3DMark2003 for comparison... the CPU benchmark tests use the same demo as the GPU benchmarks, but in lower resolution and the GPU isn't being used to render the images... the GPU does a more complicated task (higher resolution) 4, 5, 6, sometimes up to 10 times faster.

So there's no doubt if someone made a CPU specifically designed to encode video, you could encode a whole movie in 5-10 minutes. But what else would it be good for? Probably not much.

Look at the xbox and the playstation 2. Everyone says how crappy the PS2 is and how difficult it is to program for it. Those same people are saying how easy the xbox is to program for... and it just so happens that the processor in an xbox is an x86 CPU. Go figure...
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Jeff7181
Look at a GPU... it's insanely powerful... use 3DMark2003 for comparison... the CPU benchmark tests use the same demo as the GPU benchmarks, but in lower resolution and the GPU isn't being used to render the images... the GPU does a more complicated task (higher resolution) 4, 5, 6, sometimes up to 10 times faster.
I thought the CPU test still used the GPU, but with low resolution in order to 1) minimize the delays from the video card and 2) keep the CPU doing the same type of work it does in normal gaming. If you had the CPU do software rendering in the CPU test, you'd be giving it a VERY different workload from the type it normally has, and producing useless benchmark results.

So there's no doubt if someone made a CPU specifically designed to encode video, you could encode a whole movie in 5-10 minutes. But what else would it be good for? Probably not much.
You'll find a lot more specializing in low-power parts. For example, AMD's Alchemy Au1550 has a dedicated hardware encryption unit, which makes it VERY fast at encrypting/decrypting data. Via's newest C3s also have hardware AES support. The problem with including that hardware in a general purpose CPU, as you said, is that it's not used the vast majority of the time, but it does increase manufacturing and design costs.

Look at the xbox and the playstation 2. Everyone says how crappy the PS2 is and how difficult it is to program for it. Those same people are saying how easy the xbox is to program for... and it just so happens that the processor in an xbox is an x86 CPU. Go figure...

The playstation 2 is not difficult to program for just because it isn't x86. If that was the reason, the Gamecube would also be hard to program for (as would Sparc, other PPC, and MIPS-based systems). The PS2 is hard to program for because it's got many processors, and they aren't all the same - as I understand it, some are vector processors and some aren't. This means that to effectively use its power, not only do you have to be able to parallelize your task, you have to write it in a way that works well on each of the processor types in the PS2.
 

Spencer278

Diamond Member
Oct 11, 2002
3,637
0
0
Originally posted by: jhu
Originally posted by: Sahakiel
Last I checked, x87 is IEEE compliant. The internal 80-bit precision is not.

i got one of my posts wrong. x87 and sse/sse2 are ieee 754 compliant (although with sse/sse2 you can turn off strict compliance). ieee 754 only describes the way floating point numbers are laid out in binary and the operations on those numbers.

I don't think the sse/sse2 instructions implement all the rounding modes from the IEEE standard, but I don't know if they are required or optional. Most people don't care about rounding modes, but I'm sure someone does.
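For what it's worth, C99 exposes IEEE 754 rounding-mode control through <fenv.h>, whichever unit (x87 or SSE) the compiler maps it onto. A minimal sketch; on glibc you may need to link with -lm, and whether every mode is honoured is up to the platform:

/* Switch IEEE 754 rounding modes at run time via C99 <fenv.h>. */
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    volatile double x = 1.0, y = 3.0;   /* volatile keeps the divisions at run time */

    fesetround(FE_DOWNWARD);
    printf("1/3 rounded toward -inf: %.20f\n", x / y);

    fesetround(FE_UPWARD);
    printf("1/3 rounded toward +inf: %.20f\n", x / y);

    fesetround(FE_TONEAREST);           /* restore the default mode */
    return 0;
}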
 

Vee

Senior member
Jun 18, 2004
689
0
0
Originally posted by: Sahakiel
Originally posted by: Vee
Originally posted by: CTho9305
Originally posted by: Vee
Today, the instruction execution speed is not the issue. The memory system is.
Then why do so many applications perform differently on 2GHz vs 1.5GHz CPUs? Modern caches do a good enough job at hiding a lot of the memory performance issues that execution speed does matter.
Caches are part of the memory system, as are prefetch etc.
Otherwise, I don't understand you. Of course applications perform differently on different architectures.

Basically, he's saying take a P4C at 2.4 GHz and compare it to a P4C 3.2 GHz and you'll get a performance boost with the faster proc despite both running the memory bus at 200MHz. Caches compensate for slow memory systems and these days they work well enough to allow for measurable performance differences.

Well. I'm saying a core's execution power can be made almost arbitrarily powerful, given transistors. The problem is feeding it.
So what, a slower CPU will be slower?
P4 is not a perfect example, since it concentrates on doing only the easy things fast.
But consider instead a 2.4GHz Celeron, 2.4GHz P4A, 2.4GHz P4B, and a 2.4GHz P4C.
Basically the same execution at the same clock, but different memory systems. No, I don't think caches work well enough, or do a good enough job.

Currently, I believe additional transistors are best used by exploiting software multithreading, with multiple cores. But considering a single core, most additional transistors would have to be used in the memory system (remember I consider prefetch and caches part of the memory system) in order to utilize additional execution unit power. For that reason, there is no gain in using another ISA, since they (RISC) primarily attempt to make better use of a limited number of transistors for the speed of the execution unit, often at a penalty in the memory system. So, regarding the OP's question: no, I don't think x86 is hampering performance today.
Consider what a new ISA brought with IA64, aka EPIC: more than 400 million transistors, most of it cache, and still only SPECint comparable to a 100-million-transistor Opteron.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: Vee
Originally posted by: Sahakiel
Originally posted by: Vee
Originally posted by: CTho9305
Originally posted by: Vee
Today, the instruction execution speed is not the issue. The memory system is.
Then why do so many applications perform differently on 2GHz vs 1.5GHz CPUs? Modern caches do a good enough job at hiding a lot of the memory performance issues that execution speed does matter.
Caches are part of the memory system, as are prefetch etc.
Otherwise, I don't understand you. Of course applications perform differently on different architectures.

Basically, he's saying take a P4C at 2.4 GHz and compare it to a P4C 3.2 GHz and you'll get a performance boost with the faster proc despite both running the memory bus at 200MHz. Caches compensate for slow memory systems and these days they work well enough to allow for measurable performance differences.

Well. I'm saying a core's execution power can be made almost arbitrarily powerful, given transistors. The problem is feeding it.
So what, a slower CPU will be slower?
P4 is not a perfect example, since it concentrates on doing only the easy things fast.
But consider instead a 2.4GHz Celeron, 2.4GHz P4A, 2.4GHz P4B, and a 2.4GHz P4C.
Basically the same execution at the same clock, but different memory systems. No, I don't think caches work well enough, or do a good enough job.

Currently, I believe additional transistors are best used by exploiting software multithreading, with multiple cores. But considering a single core, most additional transistors would have to be used in the memory system (remember I consider prefetch and caches part of the memory system) in order to utilize additional execution unit power. For that reason, there is no gain in using another ISA, since they (RISC) primarily attempt to make better use of a limited number of transistors for the speed of the execution unit, often at a penalty in the memory system. So, regarding the OP's question: no, I don't think x86 is hampering performance today.
Consider what a new ISA brought with IA64, aka EPIC: more than 400 million transistors, most of it cache, and still only SPECint comparable to a 100-million-transistor Opteron.

Until you look at its SpecFP score. Even Deerfield with 1.5MB of L3 cache (less than that of some of the Xeons) performs very admirably with a lower TDP than x86.

As time goes on, power limitations will become dominant in terms of performance, and x86 does not have that advantage. In fact, it is significantly disadvantaged over a simple core such as a VLIW core. Itanium runs hot because it has so much more execution power than comparable x86 cores and yet, its TDP is still somewhat reasonable.
 

sao123

Lifer
May 27, 2002
12,653
205
106
not to change the subject, well yes ok to change it...

x86 of today seems to be hindered by branch prediction and pipeline stalls as a result of misprediction...

would it be possible to engineer a (partial dual core) processor which would have 2 identical units running the same code...

then... when a branch is encountered...one core takes route a, and the other takes route b.
They both continue upon their respective paths... Then when the decision actually occurs in the pipe, the core containing the incorrect path would resync with the other core, and start the process all over again.

This would provide much less stalling, and 100% accuracy in branch prediction... causing far fewer penalty hits.

I understand the design could be complex and cost a lot, but is it possible, and how would you design it?
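A toy per-branch cost model shows the tradeoff that idea runs into; every number here is an assumption picked for illustration, not data from any real design:

/* Predict one path vs. execute both paths: wasted cycles per branch. */
#include <stdio.h>

int main(void)
{
    double accuracy       = 0.95;  /* assumed predictor hit rate                */
    double flush_cost     = 31.0;  /* assumed mispredict flush penalty, cycles  */
    double resolve_window = 20.0;  /* assumed cycles a branch stays unresolved  */

    /* Predict one path: pay the flush only when the prediction is wrong. */
    double predict_cost = (1.0 - accuracy) * flush_cost;

    /* Execute both paths: never flush, but the wrong path ties up half the
       duplicated issue/execute resources until the branch resolves. */
    double eager_cost = 0.5 * resolve_window;

    printf("predict one path: %.2f wasted cycles per branch\n", predict_cost);
    printf("execute both:     %.2f wasted cycles per branch\n", eager_cost);
    return 0;
}

With a predictor in the mid-90s, the prediction column wins easily under this model; running both paths only pays off when prediction accuracy is poor, which is roughly where the replies below end up.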
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: Vee

Well. I'm saying a core's execution power can be made almost arbitrarily powerful, given transistors.
Technically, there is a limit to die size, but yes, adding transistor logic ad infinitum yields infinitely powerful processors.
At the same time, a memory system can also be made incredibly powerful. Current cache hierarchies provide almost ideal memory latencies most of the time. Making the memory faster is then a simple problem of adding more caches and replacing DRAM with SRAM. Old Cray supercomputers used that philosophy to build entire systems with only SRAM; not sure if they still do.
If each memory cell isn't fast enough, then simply widen the data path. Even the worst branch predictor in the world can be compensated with a very wide data path. Again, cost is the issue.
So, while I would agree that the actual ISA design isn't as much of a limiting factor as it has been historically, I also think it's still not a negligible issue. As far as I know, any shortcomings in any ISA design can be compensated with other design tradeoffs. The primary concern is money, which means indirectly, the ISA does have a noticeable effect on processor performance.


Consider what a new ISA brought with IA64, aka EPIC: more than 400 million transistors, most of it cache, and still only SPECint comparable to a 100-million-transistor Opteron.
SPEC is good, but it is still a benchmark. Much like MIPS ratings, they can also be meaningless indicators of processor speed.
As mentioned before, SPECint may be comparable to an Opteron, but SPECfp is much stronger. To me, that simply means the software is immature. EPIC can be seen as geared towards massively parallel code. Historically, integer-based code is highly serial.

Originally posted by: sao123
would it be possible to engineer a (partial dual core) processor which would have 2 identical units running the same code...
I think it's been done before, but if I remember correctly, the benefits were dwarfed by the sheer cost. I forget exactly which CPU used that design, but I believe it was simply a replication of the first half of the pipeline.
Technically, it gave 100% branch prediction. However, I think it was a very short 7 stage pipeline, which meant it would rarely, if ever, encounter multiple branches within the replicated stages. It was also very expensive, having to duplicate about half the entire logic core. At the time, it was a relatively poor investment.
I'm not sure how well it would work with today's pipeline lengths. I imagine it would bring somewhat limited benefit considering the extra power and logic required as well as the 95+% branch predictors already in place. You'd also have two copies of a lot of logic, and in some cases four or even eight.
For something like the Pentium 4, I think interleaving possible branch outcomes would virtually eliminate branch mispredicts. However, it would also lower the effective performance. It would run akin to a processor with clock speed between one half to maybe one eighth the Pentium 4 and require much, much more logic.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: Sahakiel
Originally posted by: Vee

Well. I'm saying a core's execution power can be made almost arbitrarily powerful, given transistors.
Technically, there is a limit to die size, but yes, adding transistor logic ad infinitum yields infinitely powerful processors.
At the same time, a memory system can also be made incredibly powerful. Current cache hierarchies provide almost ideal memory latencies most of the time. Making the memory faster is then a simple problem of adding more caches and replacing DRAM with SRAM. Old Cray supercomputers used that philosophy to build entire systems with only SRAM; not sure if they still do.
If each memory cell isn't fast enough, then simply widen the data path. Even the worst branch predictor in the world can be compensated with a very wide data path. Again, cost is the issue.
So, while I would agree that the actual ISA design isn't as much of a limiting factor as it has been historically, I also think it's still not a negligible issue. As far as I know, any shortcomings in any ISA design can be compensated with other design tradeoffs. The primary concern is money, which means indirectly, the ISA does have a noticeable effect on processor performance.

Current caching is getting farther and farther away from "ideal". Even cache latencies nowadays are multiple cycles. Ideally, memory should only offer a latency of 1 (whereas registers have 0 latency). However, even the L1 data cache on the P4 Prescott has been given a latency hike in order to provide better clockspeed headroom (and the results of this are apparent in the performance penalty). L2 caches put up a whopping 20+ cycle latency nowadays. Even the best OoOE window (Prescott) can't hide that. Caches are no longer an ideal solution, and you certainly can't just keep adding more of them and expect performance increases. You're limited by latency.
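One way to put numbers on that is a dependent pointer-chase loop, where every load's address comes from the previous load, so nothing can be overlapped or prefetched. A rough sketch; the array size, step count, and timing method here are arbitrary choices, not a calibrated benchmark:

/* Dependent pointer chase: average latency of a load that misses cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (1u << 22)   /* 4M slots (~32 MB of pointers): far beyond L2 */
#define STEPS  (1u << 24)   /* number of dependent loads to time            */

int main(void)
{
    size_t *next = malloc((size_t)N * sizeof *next);
    if (!next) return 1;

    /* Sattolo shuffle: builds one random cycle through all N slots,
       so the chase visits everything and prefetchers see no pattern. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    size_t p = 0;
    clock_t t0 = clock();
    for (size_t s = 0; s < STEPS; s++)
        p = next[p];            /* each load depends on the previous one */
    clock_t t1 = clock();

    double ns = (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)STEPS;
    printf("~%.1f ns per dependent load (sink: %zu)\n", ns, p);
    free(next);
    return 0;
}

Shrink N so the working set fits in L1 or L2 and the per-load time drops to a few cycles; grow it past the caches and it jumps to main-memory latency.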

Originally posted by: sao123
would it be possible to engineer a (partial dual core) processor which would have 2 identical units running the same code...
I think it's been done before, but if I remember correctly, the benefits were dwarfed by the sheer cost. I forget exactly which CPU used that design, but I believe it was simply a replication of the first half of the pipeline.

Itanium. It attempts to execute both execution paths when it hits a branch and tosses away the one that was incorrect. This has more or less been proven to be ineffective as there simply isn't enough execution hardware at times to execute both branches.
This might, however, be an interesting solution to getting better single-threaded performance out of dual-core processors in the future. If the second core is idle (being sent noop), there might be a mechanism to "switch" the core to "assist" mode in which it simply executes the other execution path of a branch that the main core is currently executing. You have that hardware there anyway, why not use it instead of having it sit idle?

Technically, it gave 100% branch prediction. However, I think it was a very short 7 stage pipeline, which meant it would rarely, if ever, encounter multiple branches within the replicated stages. It was also very expensive, having to duplicate about half the entire logic core. At the time, it was a relatively poor investment.

Hardly. It was better, but when you ran into conditional and/or nested branches, it crapped out and just did prediction.

I'm not sure how well it would work with today's pipeline lengths. I imagine it would bring somewhat limited benefit considering the extra power and logic required as well as the 95+% branch predictors already in place. You'd also have two copies of a lot of logic, and in some cases four or even eight.
For something like the Pentium 4, I think interleaving possible branch outcomes would virtually eliminate branch mispredicts. However, it would also lower the effective performance. It would run akin to a processor with clock speed between one half to maybe one eighth the Pentium 4 and require much, much more logic.

Considering how much of modern day cores remains unused due to branches and/or data dependencies, usage of the hardware during idle time (say, when a cache miss occurs) to process secondary branches may be an effective use of the processor time. The hardware for executing multiple code streams at the same time is already on modern day P4's. SMT anyone?
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: imgod2u
Current caching is getting farther and farther away from "ideal". Even cache latencies nowadays are multiple cycles. Ideally, memory should only offer a latency of 1 (whereas registers have 0 latency). However, even the L1 data cache on the P4 Prescott has been given a latency hike in order to provide better clockspeed headroom (and the results of this are apparent in the performance penalty). L2 caches put up a whopping 20+ cycle latency nowadays. Even the best OoOE window (Prescott) can't hide that. Caches are no longer an ideal solution, and you certainly can't just keep adding more of them and expect performance increases. You're limited by latency.
Caches were never an ideal solution. The introduction of caching was a stop-gap measure. It is now considered a normal part of any system design because the original problem has never been solved.
Latency is that original problem. Every type of memory ever designed has had latency problems. That is how register-based architectures came to be. By the time lithography hits 60nm, clock scaling will force memory access latencies of at least two clock cycles to access memories as small as 256 Bytes.
The only ways to hide that latency have been wider data paths, OoOE, ILP, and pipelining. Three of those four options are hitting the wall hard. The fourth is limited by cost.

Itanium. It attempts to execute both execution paths when it hits a branch and tosses away the one that was incorrect. This has more or less been proven to be ineffective as there simply isn't enough execution hardware at times to execute both branches.
I wasn't talking about Itanium. I was thinking of an older architecture I came across in a book only a few months ago. Unfortunately, I don't have the book with me, so I'm not even sure if it was a RISC architecture.
As for Itanium, if I remember correctly, EPIC does not define the number and type of execution units in hardware. Its primary concern is the delineation between explicitly parallel code bundles separated by the highly-advanced stop bit. If predicated branching is truly being limited by hardware, it's a simple matter to rectify. However, I get the feeling it's not too much of a problem.

This might, however, be an interesting solution to getting better single-threaded performance out of dual-core processors in the future. If the second core is idle (being sent noop), there might be a mechanism to "switch" the core to "assist" mode in which it simply executes the other execution path of a branch that the main core is currently executing. You have that hardware there anyway, why not use it instead of having it sit idle?
Speaking of latencies, I'm pretty sure that modern CPUs would require more clock cycles to transfer data to the neighboring core and back than it would take to simply do it the old-fashioned way. If you do SMT instead of dual-core, you'll reduce the latencies for this one action. However, you'll also decrease overall clock frequencies.

Technically, it gave 100% branch prediction. However, I think it was a very short 7 stage pipeline, which meant it would rarely, if ever, encounter multiple branches within the replicated stages. It was also very expensive, having to duplicate about half the entire logic core. At the time, it was a relatively poor investment.

Hardly. It was better, but when you ran into conditional and/or nested branches, it crapped out and just did prediction.
Besides the fact we're talking about different architectures, I find it exceedingly difficult, if not impossible, to implement better than 100% branch prediction.

Considering how much of modern day cores remains unused due to branches and/or data dependencies, usage of the hardware during idle time (say, when a cache miss occurs) to process secondary branches may be an effective use of the processor time. The hardware for executing multiple code streams at the same time is already on modern day P4's. SMT anyone?

Cache misses are most often hidden using non-blocking caches.
Hyperthreading isn't quite the same as resolving multiple branch outcomes. SMT for the Pentium 4 simply needs to keep track of two execution streams rather than one. It's relatively simpler than interleaving branch outcomes.
If you were to try to predicate every single branch by interleaving code, you'd need to keep track of up to 31 streams. If you assume an average of 4-7 instructions between branches and 1.5 micro-ops per instruction, you'll only need to keep track of about 4-6 possible outcomes. However, you'll not only tie up all resources on chip assigned to this single instruction stream, you'll still have other resources lying idle. If you decide to execute possible target streams in parallel, you'll quickly run out of resources. If you execute in parallel and interleave, you'll increase chip complexity to the point where it affects clock speeds. Either way it's done, you're gonna end up with one very large and complex re-order buffer. After all is said and done, I'm not so sure it would increase performance as much relative to its current incarnation.
 

Vee

Senior member
Jun 18, 2004
689
0
0
Originally posted by: Sahakiel
Originally posted by: imgod2u
Itanium. It attempts to execute both execution paths when it hits a branch and tosses away the one that was incorrect. This has more or less been proven to be ineffective as there simply isn't enough execution hardware at times to execute both branches.
I wasn't talking about Itanium. I was thinking of an older architecture I came across in a book only a few months ago. Unfortunately, I don't have the book with me, so I'm not even sure if it was a RISC architecture.
Lots of interesting views, guys. I might want to add to the conversation, but for now, I think I might help clear this up. CISC. Motorola 68040. It had a 'shadow pipe' to handle branches. The concept was abandoned for the 68060, in favor of 'branch folding' and using multiple pipes for superscalar execution instead. Apparently, that's a better use of the transistors. 68k pipes were very short, 6 stages for the 68040, I think, and clock rates were slow compared to Intel. But the 68040 averaged something outrageous (for a single pipe); I don't remember for certain, but maybe 1.18 clock cycles per instruction. Or something like that.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: Sahakiel
Cache misses are most often hidden using non-blocking caches.

Depends. If you mean non-blocking caches as in an L3 type cache such as that of the Power4, then you're in for a disappointment. Those caches are extraordinarily slow. So much so that some *memory* subsystems actually have a lower latency.
Besides, having multi-MB of fast L3 cache (ala Madison) would be very expensive and not really practical. And even then, there's still cache misses in applications with really low data locality.

Hyperthreading isn't quite the same as resolving multiple branch outcomes. SMT for the Pentium 4 simply needs to keep track of two execution streams rather than one. It's relatively simpler than interleaving branch outcomes.

No, it's not, but it does execute multiple instruction streams, which is exactly what multiple branch resolution would do. The only difference is that instead of having 2 different register sets for each stream, you'd need 2 that were identical to each other and keep track of which outcome was correct. Both instruction streams would still be completely independent just like 2 threads.

If you were to try to predicate every single branch by interleaving code, you'd need to keep track of up to 31 streams.

How so? Only 2 streams stem from a branch, taken or not taken. You'd need to keep track of 120+ instructions-in-flight for each stream (on the P4), but that's no different than modern SMT on the P4.

If you assume an average of 4-7 instructions between branches and 1.5 micro-ops per instruction, you'll only need to keep track of about 4-6 possible outcomes. However, you'll not only tie up all resources on chip assigned to this single instruction stream, you'll still have other resources laying idle. If you decide to execute possible target streams in parallel, you'll quickly run out of resources. If you execute in parallel and interleave, you'll increase chip complexity to the point where it affects clock speeds. Either way it's done, you're gonna end up with one very large and complex re-order buffer. After all is said and done, I'm not so sure it would increase performance as much relative to its current incarnation.

You need to keep track of multiple instruction streams anyway. The window is something like 120+ instructions on modern P4's. If you were to interleave branch execution, it'd be no different than interleaving 2 thread streams. The only extra thing you would need to keep track of was that they were, indeed, 2 branch paths and that one should be discarded in the end. As for nested branches, again, you wouldn't have to do this for every branch but most branches are rather simple and aren't nested. As for performance increase, well, only a simulator will tell. Sounds like an interesting project though for anyone out there with access to FPGA's.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: imgod2u
Depends. If you mean non-blocking caches as in an L3 type cache such as that of the Power4, then you're in for a disappointment. Those caches are extraordinarily slow. So much so that some *memory* subsystems actually have a lower latency.
Besides, having multi-MB of fast L3 cache (ala Madison) would be very expensive and not really practical. And even then, there's still cache misses in applications with really low data locality.

A non-blocking cache is one where read and write misses do not stall the processor. I believe most, if not all, modern processors use non-blocking caches. Some may decide not to because of MP behavior.

No, it's not, but it does execute multiple instruction streams, which is exactly what multiple branch resolution would do. The only difference is that instead of having 2 different register sets for each stream, you'd need 2 that were identical to each other and keep track of which outcome was correct. Both instruction streams would still be completely independent just like 2 threads.
My apologies for not being clear. I am more concerned about hitting multiple branches. With such a long pipeline, having more than one branch in the pipeline is almost a certainty. That means you need more hardware than what is currently allocated to HT in order to keep track of multiple branches.

How so? Only 2 streams stem from a branch, taken or not taken. You'd need to keep track of 120+ instructions-in-flight for each stream (on the P4), but that's no different than modern SMT on the P4.
It is a different beast when keeping track of instruction streams as opposed to instructions in flight. With a single thread, you only bother with keeping instructions in order. The massive number of micro-ops in flight is primarily due to the length of the pipeline. Therefore, the problem of keeping track of even more instructions scales almost linearly given the same number of execution units. Multi-threading scales exponentially due to hardware complexity and circuit behavior. A simple four-way MUX would be more than twice as slow as a two-way MUX.

Sounds like an interesting project though for anyone out there with access to FPGA's.

Yeah, I would also love to try it out and see. However, not only would an FPGA of the required size be very expensive, the results would also be somewhat inaccurate, though good enough for testing purposes.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: Sahakiel
Originally posted by: imgod2u
Depends. If you mean non-blocking caches as in an L3 type cache such as that of the Power4, then you're in for a disappointment. Those caches are extraordinarily slow. So much so that some *memory* subsystems actually have a lower latency.
Besides, having multi-MB of fast L3 cache (ala Madison) would be very expensive and not really practical. And even then, there's still cache misses in applications with really low data locality.

A non-blocking cache is one where read and write misses do not stall the processor. I believe most, if not all, modern processors use non-blocking caches. Some may decide not to because of MP behavior.

Unfortunately, whether the cache blocks or not, the processor will still run into problems when the data needed is not there. This is the major bottleneck in modern systems with processor idle time being in the 200+ cycles. That's way more of a hit than any branch mispredict.

No, it's not, but it does execute multiple instruction streams, which is exactly what multiple branch resolution would do. The only difference is that instead of having 2 different register sets for each stream, you'd need 2 that were identical to each other and keep track of which outcome was correct. Both instruction streams would still be completely independent just like 2 threads.
My apologies for not being clear. I am more concerned about hitting multiple branches. With such a long pipeline, having more than one branch in the pipeline is almost a certainty. That means you need more hardware than what is currently allocated to HT in order to keep track of multiple branches.

You don't necessarily have to resolve all branches. But in the 200+ cycle wait time, that's at least 10 or so branches you could resolve; granted, a lot of them will be discarded, but performance will still be higher.

How so? Only 2 streams stem from a branch, taken or not taken. You'd need to keep track of 120+ instructions-in-flight for each stream (on the P4), but that's no different than modern SMT on the P4.
It is a different beast when keeping track of instruction streams as opposed to instructions in flight. With a single thread, you only bother with keeping instructions in order. The massive number of micro-ops in flight is primarily due to the length of the pipeline. Therefore, the problem of keeping track of even more instructions scales almost linearly given the same number of execution units. Multi-threading scales exponentially due to hardware complexity and circuit behavior. A simple four-way MUX would be more than twice as slow as a two-way MUX.

MUX's make up a very small part of the circuitry though. The control units will have to be bigger but the overall increase in complexity isn't as high.
As for keeping track of instructions in stream, each branch would add maybe 5-7 instructions to monitor (as opposed to the original method of predicting) as they're all part of the same instruction stream. You wouldn't need to double your instruction window or your reorder window. And if you run out of places to put those instructions, you can always stall, but I think the bottleneck with modern MPUs isn't the lack of execution resources but the means to fill them. This is an excellent way to find excess instructions (albeit, most of them will be discarded, but those that aren't will improve performance). Each mispredicted branch that is avoided because the right path was already taken will save 31 cycles on a Prescott and that was time that would've been wasted anyway having the processor sit idle waiting for memory.

Sounds like an interesting project though for anyone out there with access to FPGA's.

Yeah, I would also love to try it out and see. However, not only would an FPGA of the required size be very expensive, the results would also be somewhat inaccurate, though good enough for testing purposes.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: imgod2u
Unfortunately, whether the cache blocks or not, the processor will still run into problems when the data needed is not there. This is the major bottleneck in modern systems with processor idle time being in the 200+ cycles. That's way more of a hit than any branch mispredict.
I think you missed the significance of a nonblocking cache, but you do bring up another good point. Although stalls on memory accesses are pretty rare (ignoring compulsory misses), their associated penalties are large enough to be a problem.

You don't necessarily have to resolve all branches. But in the 200+ cycle wait time, that's at least 10 or so branches you could resolve; granted, a lot of them will be discarded, but performance will still be higher.
Let me try to be a bit more clear. First off, a branch mispredict does not necessarily incur a memory access penalty. It does, however, result in a pipeline flush. Current branch prediction schemes can hit over 95% accuracy easily, with numbers of 98% on average being entirely possible.
Let us see if performance would be higher. Assume an ideal processor that always has a full pipeline. A branch mispredict flushes 120 micro-ops. Given a simplistic algorithm like gshare that yields 93% accuracy, that means 7% of the time the processor mispredicts the branch and flushes the entire pipe.
Given a particular instruction stream, average of 6 instructions between branches and 1.5 micro-ops per instruction, that means you have a branch instruction every 9 micro-ops. It is very likely a branch itself is broken down into at least two micro-ops, one for address calculation and one for branch resolution. This yields a nice round number, 10 micro-ops, from one branch to another. That means we have 12 branches in flight at any time.
Put those two together and 0.7% of all micro-ops incur a pipeline flush. The result is a performance penalty of 84%.
Now assume that one branch is always correct. Every twelfth branch is predicted correctly because both possible outcomes are issued. Assume, for the sake of simplicity, the extra hardware incurs zero penalty. This means every twelfth branch effectively halves the issue rate of the processor until the branch is retired.
One tenth of all micro-ops branch. One twelfth of one tenth will never mispredict, but will halve the issue rate until retirement. This gives a penalty of 8.33...%. The rest of the one tenth will use the branch predictor. The associated penalty is now 77%. The total performance penalty is now 85%.
Now assume that all branch outcomes are interleaved to yield 100% branch prediction. 12 branches are in flight. The first branch spawns two threads, which each spawn two threads, which each spawn two more. However, once the first branch is retired, about half the pipeline is flushed, regardless of the outcome. I say about half because only 12 segments can fit, yet there are 2+4+8=14 segments. The performance penalty is now about 100%.
Redoing the math for an ideal issue rate that gives 60 micro-ops in flight (the reason for HT), the numbers come out to 42%, 38.5% and 0%. This is assuming that all required resources are available on any clock cycle. However, this also means it takes twice as long to finish as compared to the max issue rate with 120 micro-ops in flight. Relative speedup in relation to 120 micro-ops with standard branch prediction is 65% with half the issue rate, 67% with half and one branch, and 92% with every branch in flight.
Since reality is rarely ideal, it's quite possible to get different numbers. I assumed the processor issues and retires 5 micro-ops. I ignored memory accesses due to cache misses since I don't know the miss rates. That means I also ignored the penalties of x86 decoding vs a hit in the trace cache. I also ignored penalties associated with additional logic. For branch prediction, 93% is a decade-old figure. The Pentium 4 branch predictor is better; if I remember correctly, it uses a tournament predictor with accuracy approaching 97% or better. Speaking of which, I also ignored pipeline bubbles at the fetch and decode stages from an overridden prediction.
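The first part of that accounting is easy to sanity-check in a few lines of C, reusing the same assumed figures (10 micro-ops per branch, a 120 micro-op flush); the 97% row is not in the text above, it just reruns the same formula with the better predictor figure mentioned:

/* Pipeline-flush penalty = mispredicts per micro-op * micro-ops flushed. */
#include <stdio.h>

int main(void)
{
    double uops_per_branch = 10.0;    /* ~6 instructions * 1.5 micro-ops */
    double flush_uops      = 120.0;   /* assumed full-pipeline flush     */
    double accuracy[]      = { 0.93, 0.97 };

    for (int i = 0; i < 2; i++) {
        double mispredicts_per_uop = (1.0 - accuracy[i]) / uops_per_branch;
        double penalty = mispredicts_per_uop * flush_uops;  /* wasted work per useful micro-op */
        printf("accuracy %.0f%% -> penalty %.0f%%\n",
               accuracy[i] * 100.0, penalty * 100.0);
    }
    return 0;
}

The 93% row reproduces the 84% figure above; the rest of the analysis (halved issue rates, interleaved outcomes) layers further assumptions on top of this.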

MUX's make up a very small part of the circuitry though. The control units will have to be bigger but the overall increase in complexity isn't as high.
I simply used MUX's as an example of exponential growth in complexity.

As for keeping track of instructions in stream, each branch would add maybe 5-7 instructions to monitor (as opposed to the original method of predicting) as they're all part of the same instruction stream. You wouldn't need to double your instruction window or your reorder window.
That only applies to loops. Most branches will require the processor to fetch more instructions from non-sequential memory segments. This is why the trace cache was originally developed. Storing entire traces per line instead of memory segments means that no single instruction straddles two cache lines. Caching micro-ops was simply a great idea.

And if you run out of places to put those instructions, you can always stall, but I think the bottleneck with modern MPUs isn't the lack of execution resources but the means to fill them. This is an excellent way to find excess instructions (albeit, most of them will be discarded, but those that aren't will improve performance).
It is perhaps the easiest way to find instructions to issue. However, it is also the easiest way to waste resources. If you do it for only one branch, you're essentially duplicating everything up to the ROB. If you're doing this with multiple branches, it gets worse and worse as the pipeline lengthens. Going with every single branch would require duplicating every single stage prior to retire just to handle branch-intensive code. Imagine the memory traffic.

Each mispredicted branch that is avoided because the right path was already taken will save 31 cycles on a Prescott and that was time that would've been wasted anyway having the processor sit idle waiting for memory.
I think your numbers are mixed up. A pipeline flush wastes 31 cycles, but a memory access requires hundreds of cycles. How many cycles would be saved by 100% branch prediction is difficult to calculate without hard data. I'm pretty sure the L2 cache will hit greater than 90% of instructions, which means less than 1% of all instructions in any given program will miss (ignoring compulsory misses).
 