http://www.anandtech.com/video/showdoc.aspx?i=2031
Using this article, I've been trying to get an understanding of the architecture of NVIDIA's pixel pipeline. One part I'm confused about is this:
"In both NV3x and NV40 architectures, z and color can be calculated per pixel at the same time. In addition, rather than coloring a pixel, a z or stencil operation can be performed in the color unit. This allows NV3x to perform 8 z or stencil ops per clock and NV40 to perform 32 z or stencil ops per clock. NVIDIA has started to call this "8x0" and "32x0", respectively, as no new pixels are drawn. This mode is very useful if a z only pass is performed first, or if stencil shadows are used (as is the case with Doom 3)."
First I thought that the NV3x could do 8 of those operations per clock because it has two texture units in each of its four pipelines. But then it says the NV40 can do 32 of them per clock, when it only has one texture unit per pipeline. Then I realized it's not saying they're being done in the texture unit, they're being done in the "color" unit. I've never heard of this one before. Then later on in the article a "math/shader" unit is mentioned.
So from what I gather, it breaks down like this. In the pixel pipeline of the NV40 you have all of these units:
a texture unit
two math/shader units
a color unit
Is this correct? And if it is, are there more unit types in the pipeline I'm not picking up on?
------------------------------------------------------------------------------------------------------------------------
Now for the second half of this. The article says:
"If we had enough processing power, we could actually process every single pixel on the screen at the same time. Even though going to such extremes is currently not an option (I wonder where we'll be in another decade or two), currently graphics cards are able to process multiple pixels at a time."
In other words, current graphics cards can't process all the pixels on the screen at once. Then it says later on:
"The way NVIDIA overcame these [scheduling] issues in NV40 was to revamp the internals of their shader pipelines by adding an extra math unit to all the pixel pipes (pixel shaders can now execute two math instructions at the same time, or a math and texture instruction), and expanding the number of registers available for shader programs to use."
So, without any texturing going on (i.e. a pure pixel shader environment), the NV40 architecture can run two math operations per clock. On top of that, the register space was increased.
Common resolutions by today's standards are 1024x768, 1280x960, and 1600x1200, assuming we're talking about high-end graphics cards (which the NV40 is). The total pixel counts for these resolutions are as follows:
1024x768 = 786432 total pixels
1280x960 = 1228800 total pixels
1600x1200 = 1920000 total pixels
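As a sanity check on those totals (nothing fancy, just the multiplications):

```python
# Total pixel counts for the resolutions listed above.
resolutions = [(1024, 768), (1280, 960), (1600, 1200)]
totals = {f"{w}x{h}": w * h for w, h in resolutions}
print(totals)  # {'1024x768': 786432, '1280x960': 1228800, '1600x1200': 1920000}
```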
So 786,432 is the minimum amount we're talking about. Since it's said that current graphics cards can't process all the pixels on the screen at the same time, that's going to be my upper threshold. This, along with the statement that the NV40 can have up to 8 times the shader performance of the NV3x, is my guideline for estimating how many (unspecified) registers are in the NV3x and NV40 architectures. By trying different variables, I've come up with these as the most likely numbers:
NV3x
4 pipelines
2 texture units (not used in pixel shading)
1 math unit
35 registers (amount of space)
5900 Ultra: 450 (clockspeed) x 4 (pipes) x 1 (math) x 35 (reg's) = 63000 pixels per sec
5950 Ultra: 475 (clockspeed) x 4 (pipes) x 1 (math) x 35 (reg's) = 66500 pixels per sec
NV40
16 pipelines
1 texture unit (not used in pixel shading)
2 math units
40 registers (amount of space)
6800 Ultra: 400 (clockspeed) x 16 (pipelines) x 2 (math) x 40 (reg's) = 512000 pixels per sec
Keep in mind, of course, that I pulled the register counts here out of my ass; I have no idea how accurate they are. But if you round things out, the NV40 in that equation ends up with roughly 8x the pixel shader performance of the NV3x.
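To make the formula I'm using explicit, here it is as a quick Python sketch. The register counts (35 and 40) are my guesses, not published specs:

```python
# Speculative formula: clock (MHz) x pipelines x math units x register count.
# Register counts are guesses, not real NVIDIA specs.
def shader_score(clock_mhz, pipes, math_units, registers):
    return clock_mhz * pipes * math_units * registers

nv3x = shader_score(450, 4, 1, 35)    # 5900 Ultra: 63000
nv40 = shader_score(400, 16, 2, 40)   # 6800 Ultra: 512000
print(nv40 / nv3x)  # roughly 8.1, i.e. ~8x
```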
Now, I don't know nearly as much about ATI's pixel-pipeline architecture as I do for NVIDIA, but let's just say they're the same, except that the R42x architecture doesn't have a second math unit. So this is what it comes out to:
R42x
16 pipelines
1 texture unit (not used in pixel shading)
1 math unit
40 registers (amount of space)
X850XT PE: 540 (clockspeed) x 16 (pipelines) x 1 (math) x 40 (reg's) = 345600 pixels per sec
So this is suggesting that in a pure pixel shader environment, the 6800 Ultra is faster than the X850XT PE (by 48%; the X800XT PE would be 54% slower). This goes along with my theory that the R42x is faster in vertex processing (higher fillrate), but slower in pixel shading than the NV40.
HOWEVER, most games aren't using pixel shading for all of their effects. They're also using texturing (normal mapping, for instance). Since the NV40 architecture can do either one math op and one texture op per clock, or two math ops per clock, in a real-world setting where texturing does occur the NV40 isn't able to do two math ops at once. So the numbers come down to:
6800 Ultra: 400 (clockspeed) x 16 (pipes) x 1 (math op) x 40 (reg's) = 256000 pixels per sec
X850XT PE: 540 (clockspeed) x 16 (pipes) x 1 (math op) x 40 (reg's) = 345600 pixels per sec
This then suggests that the X850XT PE is faster in real-world shader performance than the 6800 Ultra (by 35%; the X800XT PE would be 30% faster).
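Plugging those into the same toy formula (the 40-register figure is still my made-up number):

```python
# Re-running the "real-world" (texturing) numbers, where the NV40 can only
# issue one math op per clock. Register count (40) is still a guess.
def shader_score(clock_mhz, pipes, math_units, registers):
    return clock_mhz * pipes * math_units * registers

gf6800u = shader_score(400, 16, 1, 40)    # 256000
x850xtpe = shader_score(540, 16, 1, 40)   # 345600
print(round((x850xtpe / gf6800u - 1) * 100))  # 35 (percent faster)
```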
Keep in mind these are all theoretical numbers. They don't take into account driver and application efficiency (along with influences from other pieces of hardware).
So then, how far off base am I here?