Pipeline???

AthlonXP2

Junior Member
May 4, 2002
5
0
0
Can someone please clear some things up?

Why is it that a longer pipeline does less per cycle, and how does it allow an increase in MHz? And why is it that some stages in the pipeline repeat, e.g. P6 pipeline = fetch fetch decode decode decode...?


Thanks in advance
 

SteelCityFan

Senior member
Jun 27, 2001
782
0
0

It's been a long time since I did anything relating to CPU pipelines, but at one point I designed a pipelined CPU using a software simulation during college. (I even had to make one that ran program code while being interrupted by a modem to send data to another instance of the PC simulation. It was fun converting all the instructions to binary (16 bits), haha.)

Here is my understanding of it, which I believe to be at least mostly correct.


Having a longer pipeline would, in theory, not have any adverse effects. Once the pipeline is full, theory says that you would be finishing one instruction every cycle. The problem with a longer pipeline is that you have more chances for things called "hazards" to occur. Basically, a hazard is when an instruction in one stage can't move on to the next stage because a resource it needs is in conflict. This causes a stall, or hold, in the pipeline (not sure what the term is anymore). You also run into cases that call for the repeating stages.

As far as repeating pipeline stages go, we did not cover this, but my assumption would be that certain steps take longer than one clock cycle to finish, so that step is split into 2 stages. This gives it more time to finish without slowing down the chip. It is done because the clock speed can only be as fast as the slowest stage allows. If you have 19 stages that can finish in 3 ns but one that takes 4 ns, all of your stages would have to be given 4 ns to complete, even though you would be wasting 1 ns in each of the other 19 stages (19 ns of wasted time). If you split the slow stage into 2 stages of 2.5 ns each, you can cut the time per stage down to 3 ns. You then waste only 0.5 ns in each of the two new stages, plus the 3 ns for the extra stage, for about 4 ns wasted in total.
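
The arithmetic above can be checked with a small sketch. It assumes, as the post does, that the clock period is set by the slowest stage; the function names are made up for illustration. Note that it counts only idle time inside the stages, so it reports 1 ns for the split design, where the post's figure of 4 ns also counts the 3 ns spent in the added stage.

```python
def clock_period_ns(stage_delays_ns):
    """The clock can tick no faster than the slowest stage allows."""
    return max(stage_delays_ns)


def idle_time_ns(stage_delays_ns):
    """Idle time summed over all stages for one trip through the pipeline."""
    period = clock_period_ns(stage_delays_ns)
    return sum(period - delay for delay in stage_delays_ns)


# Numbers taken from the post: 19 stages at 3 ns and one slow stage at 4 ns.
before = [3.0] * 19 + [4.0]
# The slow stage split into two stages of 2.5 ns each.
after = [3.0] * 19 + [2.5, 2.5]

for name, stages in (("before split", before), ("after split", after)):
    period = clock_period_ns(stages)
    print(f"{name}: {len(stages)} stages, period {period} ns, "
          f"max clock {1000.0 / period:.1f} MHz, idle {idle_time_ns(stages)} ns")
```

Splitting the one slow stage shortens the period for every stage, which is why the whole chip can be clocked faster even though one more stage was added.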

That is my understanding, but it has been a long time since I dove that deep into CPU design.
 

AndyHui

Administrator Emeritus | Elite Member | AT FAQ M
Oct 9, 1999
13,141
16
81
Here's an answer that pm and I came up with a while ago:

What do you need to do when processing an instruction? In other words, what happens in a pipeline? The pipeline breaks up the processing of an instruction into smaller stages. In each clock cycle the instruction will move through the pipeline to the next stage. A typical pipeline consists of something like Fetch, Decode, Rename, Dispatch, Execute, Retire.

Doing one of these stages takes time: time for the processor to do its work. If you have more stages, the amount of work that needs to be done in one clock cycle is less, which is what is meant by a lower Instructions Per Clock (IPC). Each stage is easier for the processor to work through, so fewer gates per stage are needed. The lower number of gates per stage means that the clock speed can be increased without compromising stability or generating too much heat.

In one "tick" of the clock, an instruction will move through one part of the pipeline and then progress to the next stage in the next "tick" of the clock. Take my very short 6 stage pipeline example that I gave above: Fetch, Decode, Rename, Dispatch, Execute, Retire.

Clock cycle 1: Fetch
Clock cycle 2: Decode
Clock cycle 3: Rename
Clock cycle 4: Dispatch
Clock cycle 5: Execute
Clock cycle 6: Retire

As you can see, it takes six clock cycles to complete an instruction.

Of course, it would be silly to have just one instruction going through the pipeline at a time. The processor can line up another instruction right behind the first, and have a total of 6 instructions going through the pipeline all at once, each at a different stage of execution.
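
To make the overlap concrete, here is a minimal sketch of the 6 stage example. It assumes an ideal pipeline in which one new instruction enters every cycle and nothing ever stalls; the helper names are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Rename", "Dispatch", "Execute", "Retire"]


def stage_at(instr_index, cycle, num_stages=len(STAGES)):
    """Stage index held by instruction `instr_index` (0-based) during
    `cycle` (1-based), or None if it is not in the pipeline."""
    stage = cycle - 1 - instr_index
    return stage if 0 <= stage < num_stages else None


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed to finish all instructions on the ideal pipeline."""
    return num_stages + num_instructions - 1


def occupancy_table(num_instructions):
    """One row per clock cycle, one column per instruction."""
    rows = []
    for cycle in range(1, total_cycles(num_instructions) + 1):
        row = []
        for i in range(num_instructions):
            stage = stage_at(i, cycle)
            row.append(STAGES[stage] if stage is not None else "-")
        rows.append(row)
    return rows


for cycle, row in enumerate(occupancy_table(6), start=1):
    print(f"cycle {cycle:2d}: " + " ".join(f"{name:9s}" for name in row))
```

In cycle 6 every stage is busy with a different instruction, and from then on one instruction retires per cycle even though each one still takes six cycles from start to finish.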

If I had a longer, 20 stage pipeline, it would take 20 clock cycles to completely finish an instruction. With 20 stages in the pipeline, the Pentium 4 does approximately 1/3 of the work in a single clock cycle that my 6 stage processor does.


Here's pm's somewhat more technical look:

Let's take the example of a one-stage pipeline CPU. Making the numbers easy to work with (but unrealistically slow, so stick with me here), you might find that it takes 1s to complete the instruction decode, the add operation and then write the result back to memory. Since the clock needs to wait for the data to be ready, you would find that you could clock this theoretical CPU at 1Hz. Now, if we could chop the logic neatly in half, we would find that we can complete the instruction decode and one half of the add in 0.5s, and then finish the add and write the result back to memory in another 0.5s. Nothing has really changed - it still takes 1s to complete one add operation, but now we can clock the design at 2Hz. So, if you have back-to-back instructions filling up the pipeline, we can now complete them twice as fast. In theory, we have doubled the performance of this theoretical CPU. This is pipelining.

But if you don't take advantage of the ability to clock this new design at 2Hz instead of 1Hz, then what have you accomplished? Basically nothing. You are still finishing one thing per clock (assuming the pipeline is full), but you are clocking the thing at the exact same speed as before. Since there are plenty of things which make pipelining a CPU less than 100% efficient, in reality you have managed to cripple your design slightly by pipelining. At the lower clock frequency, it's actually slower than the original one.

This is why it's crazy to do clock-for-clock comparisons of CPUs with different pipelines. The Athlon has a 10-stage pipeline; the Pentium 4 has one that's 20 stages. If you clock the Pentium 4 at really slow frequencies for comparison, of course it's going to look bad. It's not supposed to run that slowly. The pipeline stage increase allows you to clock it faster, so it should be run faster for comparison; otherwise you are purposely defeating the point of having a lot of pipeline stages.
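
The 1Hz versus 2Hz example can be put into numbers with a minimal sketch. It assumes an ideal pipeline (back-to-back instructions, no stalls, no overhead from splitting the logic), and the function name and the choice of 1000 instructions are only for illustration.

```python
def run_time_s(num_instructions, num_stages, clock_hz):
    """Seconds to finish a stream of back-to-back instructions on an
    ideal, stall-free pipeline: fill the pipe once, then one per cycle."""
    cycles = num_stages + num_instructions - 1
    return cycles / clock_hz


N = 1000  # arbitrary instruction count for the comparison

one_stage = run_time_s(N, num_stages=1, clock_hz=1.0)       # original design
two_stage_fast = run_time_s(N, num_stages=2, clock_hz=2.0)  # pipelined and clocked up
two_stage_slow = run_time_s(N, num_stages=2, clock_hz=1.0)  # pipelined, clock left alone

print(f"1 stage  @ 1 Hz: {one_stage} s")
print(f"2 stages @ 2 Hz: {two_stage_fast} s")
print(f"2 stages @ 1 Hz: {two_stage_slow} s")
```

A single instruction still takes 1s either way (`run_time_s(1, 2, 2.0)` is 1.0); the gain only shows up as throughput, and only when the clock is actually raised. Left at 1Hz, the two-stage design comes out slightly behind the original because of the extra cycle needed to fill the pipeline.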

But back to pipelining, you might think, "well, what's the limit? Why can't we put 100+ pipeline stages into a CPU and make it 100x faster?". Aside from the obvious one, there are plenty of reasons, and I'm not going to go into clock skew/uncertainty, CK -> Q vs. logic delays and other really in-depth stuff. The obvious reason is what everyone has mentioned: branch prediction.

Let's say we have all of these instructions in the pipeline and one of them is a branch instruction... say we are comparing two numbers, and if they are equal we will execute one section of code, and if they aren't equal then we will execute another. We want to fill the pipeline, but we won't know the outcome of the branch until later. What do we do? We make an educated guess, which in CPU terms is "branch prediction". If we get it right, the pipeline stays full and everything continues on like before. If we get it wrong, then we need to dump all the instructions that we started after the branch, and then load in the other branch. This is the big downside of pipelining. There are others, but this is the biggie. Since we can't always get branches right, we will take a misprediction penalty when we screw up. So you definitely don't want a 40 stage pipeline, because then you may have to wait 38-39 cycles before everything is back to normal on a misprediction.

Devices that don't really have branches to worry about (DSPs spring to my mind immediately) tend to have really long pipelines. Theoretical studies back in the early to mid 90's said that the practical limit for a CPU, based on the branch prediction methods current at the time, was approx. 16 stages (Computer Architecture: A Quantitative Approach, Hennessy and Patterson). But the branch predictor on the Pentium 4 is pretty good, so it could push past this a little.

Patrick Mahoney
IPF Microprocessor Design
Intel Corp.
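
The misprediction argument above can be turned into a rough back-of-the-envelope model. This is only a sketch under loudly stated assumptions: a wrong guess costs about one pipeline refill (stages minus one cycles, in line with the 38-39 cycle figure for 40 stages), the clock scales linearly with the number of stages, and the branch fraction (0.2) and misprediction rate (0.1) are made-up illustrative values, not measurements of any real CPU.

```python
def mispredict_penalty_cycles(num_stages):
    """Assumption: a wrong guess costs roughly one pipeline refill."""
    return max(num_stages - 1, 0)


def effective_cpi(num_stages, branch_fraction, mispredict_rate):
    """Average cycles per instruction: 1 when the pipeline stays full,
    plus the penalty weighted by how often a branch is mispredicted."""
    penalty = mispredict_penalty_cycles(num_stages)
    return 1.0 + branch_fraction * mispredict_rate * penalty


def relative_speed(num_stages, branch_fraction=0.2, mispredict_rate=0.1):
    """Speed relative to a 1-stage design, assuming (ideally) that the
    clock can be raised in proportion to the number of stages."""
    return num_stages / effective_cpi(num_stages, branch_fraction, mispredict_rate)


for n in (6, 10, 20, 40, 100):
    print(f"{n:3d} stages: CPI {effective_cpi(n, 0.2, 0.1):.2f}, "
          f"relative speed {relative_speed(n):.1f}x")
```

Even in this optimistic model, 100 stages land far short of 100x, because every added stage also lengthens the refill after a wrong guess. The other limits mentioned above, such as clock skew and CK -> Q delays, are not modelled at all and would reduce the gains further.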

 