greater than 100% scaling? really?

ArchAngel777 · Dec 22, 2010

Without getting pulled into a debate (is not worth my time) I'll just say that Seero is incorrect. However, that does not mean there is no such thing as synergy. The debate is silly anyway.

The fact is: When SLI or Xfire performance is greater than double, it is the margin of error within the benchmark itself. There are no mircles here folks...

Maximilian · Dec 22, 2010

Ben90 said:
Just make sure you buy 8 times as much soapy water to professionally clean those cards with.

Good advice! :thumbsup:

taltamir · Dec 22, 2010

@OP: I see it often enough myself. possible reasons vary from case to case but generally fall into:
1. The most common cause is that the drivers contain a bug/lack a feature/intentional cheating to get better reviews such that a certain calculation is not performed in SLI/xfire (aka, you set the game to do HDR, it will do HDR on single GPU, but NOT on SLI/xfire, giving you lower quality image but greater than 2x FPS on SLI/xfire).
2. the drivers/game contain a bug that cripples performance in single GPU (ex: NWN2 had a bug that caused an infinite loop when used with nvidia 8800 series cards, which caused extremely lower performance), that bug somehow does not get manifested in multi GPU.
3. There is always slight variations on repeated testings, it might be that those lead to it measuring a 100+% scaling when it is in fact actually 98 or 97 or some such.. This should not be significant though and requires a baseline of near 100% scaling which is not likely.
IIRC there are a few other reasons but I am really really tired right now and I can't concentrate enough to remember them.

Seero said:
The reviews are not wrong, and it is theoretically possible. The total processing power did not exceed 2, but the processing time may decrease more than half. The theoretical maximium of dual play is double I/O * double memory * double processing power = 8 times.

that is nonsense.

Seero · Dec 23, 2010

taltamir said:
Seero said:

The reviews are not wrong, and it is theoretically possible. The total processing power did not exceed 2, but the processing time may decrease more than half. The theoretical maximium of dual play is double I/O * double memory * double processing power = 8 times.

Click to expand...

that is nonsense.

maybe if you read what you quote next time, it will make sense. Is it a nonsense that the total processing power won't get more than 2? Or did you missed that part?

In case you simply don't have the knowledge to understand simple sentences, I will explain more to you. Lets say FLOP is the unit of computing power, assume that there are no other bottleneck in the system but GPU itself, assume that GPU is running at maximum possible speed and utilization, and assume that 1 GPU can process 1* 1024^3 Flops at the best possible scenario.

Under all the assumption above, 2 GPU cannot exceed 2 times the performance of 1 GPU because the bottleneck is on GPU. That means, 2 GPU not exceed 2 * 1024^3 Flops even if it is running at the best possible scenario. I believe that is probably what you have in your mind, but I have stated that at the 2nd sentence of my post which you have quoted.

Now, usually GPU doesn't run on the best possible scenario because it is usually not the bottleneck. If GPU is not the bottleneck, adding one more will not increase performance. Even if GPU is the bottleneck, based on the above there is no possible way to get double performance by adding a card, so let alone more than double. However, instead of saying that the statistics are wrong, would it be something else that is wrong?

Now suppose other than GPU, there are other factors that are holding the overall performance of the card from producing 1x1024^3 flops, namely i/o performance and memory size. Although there can be one bottleneck at any given time, bottlenecks can occur on varies area in different times. Now those benchmarking take Frame per second as unit, isn't it possible that varies bottlenecks can occur while producing a single image? If so, than those so called average FPS is not going to show the maximum computing power which video card can produce, but more like the expected performance of a video card, which is exactly what benchmarking programs are designed for.

Now if it isn't the maximum performance, then bottleneck can be somewhere else. Lets keep it simple and concentrate on video card itself and find possible bottlenecks within a video card. There can be many, but the ones that are scaled by adding another new card which may impact performance is quantitative. I can see i/o bandwidth, memory size, and number of processors being the dominating factors to performance when it comes to scaling. There may be more, or the ones that I have selected actually doesn't scale, but at least we can start somewhere.

So under scenarios were bottleneck occurs on the video card itself, where a single video card produce an arbitrary number 100 flops at time t, where 0 <= t <= n, and n is the number of seconds that the benchmarking program actually ran. I say that it is NOT possible that the number of flops at time t can scale more than 8 times. Note that 8 x 100 = 800 flops < 2*1024^3 flops. If flops isn't a familiar term, try FPS(frame per second). Let say at time t, 2 fps was produced by a single card, I am saying that the scaling can not bring it higher than 16 fps by adding one more card given that a single card can produce a theoretical maximum of 1000 FPS.

If you are still confused, I am saying that if a single card can produce a theoretical maximum of 1000FPS, then 2 cards can't exceed 2000FPS. However, if one card is producing 2 fps at time t, adding one more card can bring x fps at time t where x is between 0 and 16.

The more I try, the more confusing it will become. If you are trying to understand it, but ain't able to, I am sorry. If you are simply trolling, then screw it.

taltamir · Dec 23, 2010

@Seero: it is still total and utter nonsense. You do not have a theoretical 8x the performance with dual GPU.
Your notion is that just doubling the flops gives a theoretical 100% improvement in OVERALL performance, doubling the bandwidth gives another 100% improvement, and so on.
The problem is that it doesn't. Aside from architectural limitation (memory is duplicated across the ram of both cards, so effectively it does NOT double the ram), there is the simple issue that your deductions are still utter nonsense.

If a certain calculation takes 5ms to transfer the data to the video card, and 10ms to execute, then doubling the execution capacity with 100% scaling on it would make it into 5ms transfer and 5ms execute. (effectively, 67% of the time is taken up by the processor, and 33% by the memory subsystem)
if you also double bandwidth (lets not get into access time yet) (again, with 100% efficiency) then your transfer rate goes from 5ms to 2.5ms. total time for those two metrics goes from 15ms to 7.5ms
Your notion is that if you double flops it magically halfs the time it takes to do EVERYTHING, including transfer data, then when you double the ability to transfer data it magically halves EVERYTHING again, including time to perform calculations.
I over simplified, there are actually more things that take up time in the whole equation, but basically, each one is a percentage of overall execution time, and doubling its capacity via multiple gpus would only hypothetically half that value alone, not all values. So once you add everything up, the theoretical maximum performance increase is 100%. except, in real life you have things that multiple GPUs DO NOT double (ram amount, speed of light and distance to system ram, etc) so its actually a little less than 100% theoretical performance increase.

In conclusion, what you say is complete and utter total nonsense

happy medium · Dec 23, 2010

Kenmitch said:
Just noticed this must be the only AMD related thread that Happy and Wreckage hasn't posted in....What gives?

The right answer was given in post #10, what more is to say?.

taltamir · Dec 23, 2010

happy medium said:
The right answer was given in post #10, what more is to say?.

I disagree, the answer in @10 is the least likely scenario IMAO. A driver issue causing it to have lower IQ by not performing some calculations makes far more sense.

oh, and for nvidia there is physX, which does not increase its gpu usage when you add a second GPU.

Seero · Dec 23, 2010

taltamir said:
@Seero: it is still total and utter nonsense. You do not have a theoretical 8x the performance with dual GPU.
Your notion is that just doubling the flops gives a theoretical 100% improvement in OVERALL performance, doubling the bandwidth gives another 100% improvement, and so on.
The problem is that it doesn't. Aside from architectural limitation (memory is duplicated across the ram of both cards, so effectively it does NOT double the ram), there is the simple issue that your deductions are still utter nonsense.

If a certain calculation takes 5ms to transfer the data to the video card, and 10ms to execute, then doubling the execution capacity with 100% scaling on it would make it into 5ms transfer and 5ms execute. (effectively, 67% of the time is taken up by the processor, and 33% by the memory subsystem)
if you also double bandwidth (lets not get into access time yet) (again, with 100% efficiency) then your transfer rate goes from 5ms to 2.5ms. total time for those two metrics goes from 15ms to 7.5ms
Your notion is that if you double flops it magically halfs the time it takes to do EVERYTHING, including transfer data, then when you double the ability to transfer data it magically halves EVERYTHING again, including time to perform calculations.
I over simplified, there are actually more things that take up time in the whole equation, but basically, each one is a percentage of overall execution time, and doubling its capacity via multiple gpus would only hypothetically half that value alone, not all values. So once you add everything up, the theoretical maximum performance increase is 100%. except, in real life you have things that multiple GPUs DO NOT double (ram amount, speed of light and distance to system ram, etc) so its actually a little less than 100% theoretical performance increase.

In conclusion, what you say is complete and utter total nonsense

It is nonsense because you don't understand the difference between expected running time vs theoretical maximum. To give you an idea, you can look at Upper and lower bound.
Clearly, if it can't go beyond 8 times, than 8 is an upper bound. 8 is not an arbitrary number from thin air, but a product of 3 factors that are related in the given case. You can argue that it is too big, but unless you can find a smaller upperbound, or a theory or even prove that in can in fact go beyond 8, then you really can't say I am wrong. It is like I am saying 2 < 10 and you say "nonsense, 2 can never reach 10."

taltamir · Dec 23, 2010

Seero said:
It is nonsense because you don't understand the difference between expected running time vs theoretical maximum. To give you an idea, you can look at Upper and lower bound.
Clearly, if it can't go beyond 8 times, than 8 is an upper bound. 8 is not an arbitrary number from thin air, but a product of 3 factors that are related in the given case. You can argue that it is too big, but unless you can find a smaller upperbound, or a theory or even prove that in can in fact go beyond 8, then you really can't say I am wrong. It is like I am saying 2 < 10 and you say "nonsense, 2 can never reach 10."

I understand it perfectly, you are applying it incorrectly. You remind me of my highschool precalc teacher who once tried to show me with a limit function how over an infinite duration a 1 liter container will leak out 3 liters of water and when I pointed out that it fails basic sanity check thought I was trying to claim the limits don't work (I wasn't, I was pointing out that she is using the limit incorrectly).
You can solve an equation correctly and still be wrong if you set it up incorrectly or even used the wrong equation. This is exactly what you are doing, you are simply setting up the equation all wrong, and this is why you are getting a theoretical maximum with 2 GPUs of 8x of the performance of a single GPU instead of a little under 2x the performance of a single GPU.

Voo · Dec 23, 2010

There a several factors I can see that can play into seeing a more than linear growth (and really none of them comes out to "if you take anything times two, you get exponential growth", heck if THAT was how it worked we'd have a really big problem).

- imprecission of the benchmark - benchmarks can't be perfect and even the best, will still only be accurate to what.. 3-5%?
- different code paths - even minimal changes to code can have some small effect - I've seen HPC benchmarks where adding some code resulted in a overall faster execution
- different code paths probably also mean that we get different optimizations

That won't explain a 150% performance boost, but 110%? Yeah can imagine that. And if it's more we can always blame the usual suspect:
- Human error

PS: But no really, if you double your resources you won't get exponential growth, just imagine what that would mean in a giant cluster (yay why didn't we add CPU 10k earlier?)

taltamir · Dec 23, 2010

@Voo, what do you think of my theory to explain 150+% performance boost?
this one: the drivers contain a bug/lack a feature/intentional cheating to get better reviews such that a certain calculation is not performed in SLI/xfire (aka, you set the game to do HDR, it will do HDR on single GPU, but NOT on SLI/xfire, giving you lower quality image but greater than 2x FPS on SLI/xfire).

if your single GPU is doing 8x AA and your SLI/xfire is doing no AA (or HDR, or or whatever) then you will see a disproportionate increase of SLI/xfire performance.

SlowSpyder · Dec 23, 2010

taltamir said:
I understand it perfectly, you are applying it incorrectly. You remind me of my highschool precalc teacher who once tried to show me with a limit function how over an infinite duration a 1 liter container will leak out 3 liters of water and when I pointed out that it fails basic sanity check thought I was trying to claim the limits don't work (I wasn't, I was pointing out that she is using the limit incorrectly).
You can solve an equation correctly and still be wrong if you set it up incorrectly or even used the wrong equation. This is exactly what you are doing, you are simply setting up the equation all wrong, and this is why you are getting a theoretical maximum with 2 GPUs of 8x of the performance of a single GPU instead of a little under 2x the performance of a single GPU.

I agree. You just don't get 8x the performance even in theory. I think the problem is that Seero is looking at the individual components of a video card as providing a certain number of FPS, when in actuality a video card is a system that provides a certain amount of FPS (and is part of the entire rig as a whole). When everything on your card is added up, your card can produce x amount of FPS. Adding another card can only produce x amount more.

Think about it, if you have a SlowSpyder GeRadeon GTX6970 and you get 30FPS with it in a game at your settings, by adding another you can maybe get 60FPS in theory. You don't get possibly 30 more FPS from the second memory, 30 more FPS from the second set of shaders, 30 more FPS from the second set of ROP's, etc.

Martimus · Dec 23, 2010

taltamir said:
@OP: I see it often enough myself. possible reasons vary from case to case but generally fall into:
1. The most common cause is that the drivers contain a bug/lack a feature/intentional cheating to get better reviews such that a certain calculation is not performed in SLI/xfire (aka, you set the game to do HDR, it will do HDR on single GPU, but NOT on SLI/xfire, giving you lower quality image but greater than 2x FPS on SLI/xfire).
2. the drivers/game contain a bug that cripples performance in single GPU (ex: NWN2 had a bug that caused an infinite loop when used with nvidia 8800 series cards, which caused extremely lower performance), that bug somehow does not get manifested in multi GPU.
3. There is always slight variations on repeated testings, it might be that those lead to it measuring a 100+% scaling when it is in fact actually 98 or 97 or some such.. This should not be significant though and requires a baseline of near 100% scaling which is not likely.
IIRC there are a few other reasons but I am really really tired right now and I can't concentrate enough to remember them.

You are oversimplifying it.

There is overhead that the GPU does that doesn't need to be done twice, and because of that the second GPU does not have that same overhead to overcome. So other than statistical error (which no-one seems to take into account on these reviews anyway), it is possible to get better than 100% scaling as two GPU's are sharing the overhead, versus one processing the entire amount.

Seero · Dec 23, 2010

taltamir said:
I understand it perfectly, you are applying it incorrectly. You remind me of my highschool precalc teacher who once tried to show me with a limit function how over an infinite duration a 1 liter container will leak out 3 liters of water and when I pointed out that it fails basic sanity check thought I was trying to claim the limits don't work (I wasn't, I was pointing out that she is using the limit incorrectly).
You can solve an equation correctly and still be wrong if you set it up incorrectly or even used the wrong equation. This is exactly what you are doing, you are simply setting up the equation all wrong, and this is why you are getting a theoretical maximum with 2 GPUs of 8x of the performance of a single GPU instead of a little under 2x the performance of a single GPU.

You preception to my post changes from "nonsense" to "highschool precalc teacher" while the post didn't change. It appears that your understanding to the post has changed, although still way off, you are on the right track.

First, I believe you reconstructed what your teacher was trying to tell you back then with your own understanding to the material, and since your understanding is limited (the fact that you were a student at the time), therefore the reconstructed story may have absolutely nothing to do with what your teacher was trying to tell you at the time. It can't be what your teacher was trying to say. I wasn't in your class at the time or have any other information about that, but based upon what you said, unless you are in some handicap school, you won't be the only one you spotted the problem.

Questioning is good, denial is bad when it comes to learning. If you question me, then I will try to answer you. I may not be capable of giving you the correct answer, but eventually, you will get your answer. This is called learning. Please do respect your teacher because if it wasn't him/her, you will never ask the right question. However, it isn't any of your teacher's fault that you don't ask questions or simply stop listening.

Maybe trying to look at the scenario where benchmark > 2x performance instead of tyring to dig your memory for something to put me down, you can actually end up gaining something from this thread.

Let P(t) represents the performance of 1 video card, let Q(t) represents the performance of 2 video cards, let t represents any given point in time during the benchmark, where 0<=t<=N, where N is the total duration of the benchmark.

You believe that Q(t) cannot possibly be bigger than 2*P(t). So 0<=Q(t)<=2*P(t). So
Summation of Q(t) over all t between 0 to N <= summation of 2*P(t) over all t between 0 to N. Okay, but we now have a case where
Summation of Q(t) over all t between 0 to N > summation of 2*P(t) over all t between 0 to N knowing that there exist some t there Q(t) <= P(t).

Yes, some may simply claim that the statistic/benchmark is wrong and call it a day, but some may simply challenge the fact that it is possible that Q(t) > 2*P(t) at some t. If that is the challenge, than what is an upperbound of Q(t) with respect to P(t)?

You can re-read the threads of how I come up with 8. If you believe that 8 is too big, then share your PoV of another theoretical maximum. So far, no one stated that it can even be possible get close to that, so it is a good sign, but doesn't mean that it is indeed the max as I could have missed some other independent factors which a)gets increased by adding a video card and b)has possitive impact on performance.

Some agrued that memory did not get doubled as data gets reprecated to the memories of video cards. However, the data does not assume all the memory on those video cards.

For example, let say X is the amount of data that are sent to video card, Y is the capacity of memory of the video card, and C is the available(unused) memory. When there is one card, c1 = y-x. When there are 2 cards, then c2 = 2y - 2x. Clearly c2 = 2*c1. Yes, there may be a W, which only occurs when there are more than one card to make SLI/CF to work. I can't say whether or not W has any relation to X, but theoretically speaking both W and X can be 0.

I am not saying 2xmem size = 2xperformance, I am saying that doubling memory does not scale more than 200% when it comes to performance theoretically.

Some claim that doubling each individual factor can only have 133% performance at best, not 200%. I didn't say that is wrong, because I really don't know enough about it. However, if the same rule applies to mem size and number of GPUs, then each can individually increase performance by up to 33%, which in total 133% * 133% * 133% = 235.26%. If 235.26% is indeed a theoretical maximum, meaning that there is no possible implementation which performance can exceed that, then it is a better theoretical max from mine. However, I believe he simply got the number by 1/3 and didn't realized the number he is looking for is cuberoot of 2. Maybe he believed that since those factors are dependent, and therefore should be addition instead of multiplication. I will let him explain.

My take on his comment is rather simple. Suppose i/o takes 1 cycle to send 1 unit of data and GPU takes 1 cycle to process data and require 2 units of data to begin the process, then clearly GPU can only be working 50% efficiency, and doubling i/o will double GPU's efficiency. Suppose GPU requires Y units of data to process, then its efficiency will be 1/Y, and doubling i/o will bring it up to 2/Y, or scaled 200%. I am not saying 200% is what we will get all the time, I am saying that, theoretically, it is the best it can get.

JAG87 · Dec 23, 2010

It's so funny how way off everyone is.

The concept is simple: no two frames are ever identical.

If one card renders 30 frames in one second, the second card is rendering a different set of frames, which might be a little easier to render, and might give the card the chance to render 32 for example instead of 30. Hence you end up with 30+32 = 62, greater than 100% scaling.

That's all there is to it.

Voo · Dec 23, 2010

taltamir said:
@Voo, what do you think of my theory to explain 150+% performance boost?
this one: the drivers contain a bug/lack a feature/intentional cheating to get better reviews such that a certain calculation is not performed in SLI/xfire (aka, you set the game to do HDR, it will do HDR on single GPU, but NOT on SLI/xfire, giving you lower quality image but greater than 2x FPS on SLI/xfire).

Falls under the "different code path" point in my opinion, but yeah that way it could explain larger differences as well.

@JAG87: So you're saying that almost identical frames (after all they're rendered successively, so we can assume they're most of the time pretty similar) would take a noticeable different amount to render?
I personally find that unlikely, but even if that was the case, you're assuming that the 2nd card gets the easier to render frames which the single card skips.. but it could be exactly the other way round.
So without further information I think we'd have to assume that both effects cancel each other out in a large enough sample (and tens of thousands of frames are a "large enough" sample imho)

@Seero: No it just doesn't work that way. Or would you like to argue that the 10k added cpu brings exponentially more than the 2nd one? Considering that adding other subsystems to your formula even further increases the growth (which just doesn't make any sense).
Let's say you've got 3 components A, B, C and all 3 bottleneck the computation equally (obviously the best case szenario).
If we double all resources equally the bottleneck will go from Min(A, B, C) to Min(A*2, B*2, C*2), not to Min(2^3*A, 2^3*B, 2^3*C).

The simple solution is stable, yours is not. We don't have to draw magic numbers out of our hats, when there are perfectly fine rational explanations.

Seero · Dec 23, 2010

Voo said:
@Seero: No it just doesn't work that way. Or would you like to argue that the 10k added cpu brings exponentially more than the 2nd one? Considering that adding other subsystems to your formula even further increases the growth (which just doesn't make any sense).
Let's say you've got 3 components A, B, C and all 3 bottleneck the computation equally (obviously the best case szenario).
If we double all resources equally the bottleneck will go from Min(A, B, C) to Min(A*2, B*2, C*2), not to Min(2^3*A, 2^3*B, 2^3*C).

Again, you have missed the 2nd sentense of my first post. Say a single CPU can process up to 1 TFlop, adding one more can't exceed 2 TFlop. However, it is not usual(IMO impossible in practice) that you can completely utilize a CPU due to bottlenecks, that means it is almost impossible to aquire 1 TFlop. However, doubling the architectural state (you can think of it as reduce i/o delay by half) of a CPU, the CPU can be utilize better. It can utilize twice as good, although the actually number of processor did not get doubled. Such technology is called Hyper-threading.
Now you can immediately claim that if the performance a single core CPU with HT is X, then 2 CPU with HT independently can't exceed 2*x. In practice, it does exceed 2*x. You should know that, in theory, dual core + HT is like having 4xcpu, but in practice, it is no where near that. But you have to becareful though, although the maximum possible performance does not exceed 2xCPUs, practically, it does do better than 2xCPU because in practice, CPU spend most of its time waiting. When people see all cores at 100%, they think that all cores are being completely utilized, and therefore running at maximum capacity, where all it means is there they are no available cores at the time when task manager does the scan, that doesn't mean all cores are working though. Because of that, although there are only 2 CPUs, performance can actually gets scale more than 200%. In theory, it can scale up to 400%. This is why you see 4 cores on a dual core chip.

Voo · Dec 23, 2010

Seero said:
Again, you have missed the 2nd sentense of my first post. Say a single CPU can process up to 1 TFlop, adding one more can't exceed 2 TFlop. However, it is not usual(IMO impossible in practice) that you can completely utilize a CPU due to bottlenecks, that means it is almost impossible to aquire 1 TFlop.

Oh so the time when the computation is CPU bound can be at most halfed with twice the CPU power.
And too bad that "time the computation is CPU bound" can be at most 1 (i.e. all the time). And that works out to: 1 * 2 * x.
In the end the union of all those "time compution is X bound" will be exactly 1, if you double every resource in the end you'll end up at 1 * 2 * old_val. That's the best case szenario.

But since you're now somehow stock on a "multiple steps with different bottlenecks"..
Say the computation is 30% bottlenecked by res. A, 60% by res. B and 10% by res. C and it's pipelined (though if you want 3 resources in parallel, hey that's even easier because that's just the longest execution of one part)
Doubling all ressources in that case would bring.. still not more than 2x the performance.

Seero said:
However, doubling the architectural state (you can think of it as reduce i/o delay by half) of a CPU, the CPU can be utilize better. It can utilize twice as good, although the actually number of processor did not get doubled. Such technology is called Hyper-threading

Uh, what? HT is something completely different, could you stop throwing stuff in this discussion?
HT can give at most as much performance increase as the not used resources by the first thread. If the first thread uses 70% of all units, you can at most increase the performance by 30% (not considering cache contention, scheduling overhead,.. which will lower it a bit).
The performance improvement by HT is easily explained without any weird exponential stuff in it (although I must applaud your fantasy, not many people would even think about bringing that up xX), really no reason to make it more complicated than it is.

JAG87 · Dec 23, 2010

Voo said:
@JAG87: So you're saying that almost identical frames (after all they're rendered successively, so we can assume they're most of the time pretty similar) would take a noticeable different amount to render?
I personally find that unlikely, but even if that was the case, you're assuming that the 2nd card gets the easier to render frames which the single card skips.. but it could be exactly the other way round.
So without further information I think we'd have to assume that both effects cancel each other out in a large enough sample (and tens of thousands of frames are a "large enough" sample imho)

So what if it's the other way around?

If a single card renders 30 fps, it has X work load. If two cards render 60 fps, even if the primary card is rendering 30 fps again it will never have the same X work load as before. And the second card will not have the same X work load either. You have workloads Y and Z now.

Work load Y on card 1 generates 32 fps, workload Z on card 2 generates 31fps, and you end up with 63 in total. And this is just one second, one second is an instance of a benchmark comprised of many seconds where an average is taken. That's how you end up with what looks like above 100% scaling.

Do you ever watch your frame rate while you play even when you aren't moving anything? Do you see how quickly it variates?

Voo · Dec 23, 2010

JAG87 said:
So what if it's the other way around?

If a single card renders 30 fps, it has X work load. If two cards render 60 fps, even if the primary card is rendering 30 fps again it will never have the same X work load as before. And the second card will not have the same X work load either. You have workloads Y and Z now.

Well you're saying that some frames take longer to render than others. They still render the same frames, just more of them.
Let's say there are two frames, one takes 5ms the second 7ms to render.
The sli configuration renders both frames, so on average it'll need 6ms to render a frame, the single GPU renders one of them which means it'll need either 5 or 7ms.
Statistically speaking I don't see any reason why any of the two variants would be more probable, so in the end the effect should cancel it out for a large enough sample.

digitaldurandal · Dec 23, 2010

JAG87 said:
It's so funny how way off everyone is.

The concept is simple: no two frames are ever identical.

If one card renders 30 frames in one second, the second card is rendering a different set of frames, which might be a little easier to render, and might give the card the chance to render 32 for example instead of 30. Hence you end up with 30+32 = 62, greater than 100% scaling.

That's all there is to it.

This would work if the benchmark lasted only for .5 seconds or less.

Fortunately reviewers are smarter than this and most benchmarks last one minute or longer. The times at which what you are saying would apply would be rare - as for a single card the frames wouldn't be the same difficulty either - so the same scene the card would speed up on those frames as well.

Also to the person who said the theoretical maximum was 8x including because of 2x memory - the memory of the second card does not work like RAID.

Seero · Dec 23, 2010

Voo said:
Oh so the time when the computation is CPU bound can be at most halfed with twice the CPU power.
And too bad that "time the computation is CPU bound" can be at most 1 (i.e. all the time). And that works out to: 1 * 2 * x.
In the end the union of all those "time compution is X bound" will be exactly 1, if you double every resource in the end you'll end up at 1 * 2 * old_val. That's the best case szenario.

But since you're now somehow stock on a "multiple steps with different bottlenecks"..
Say the computation is 30% bottlenecked by res. A, 60% by res. B and 10% by res. C and it's pipelined (though if you want 3 resources in parallel, hey that's even easier because that's just the longest execution of one part)
Doubling all ressources in that case would bring.. still not more than 2x the performance.

You are right about bottleneck, but bottleneck is only one of the factors performance depend upon, and most importantly, it isn't a linear relation.

Uh, what? HT is something completely different, could you stop throwing stuff in this discussion?

I am not the one who brought CPU into the discussion.

HT can give at most as much performance increase as the not used resources by the first thread. If the first thread uses 70% of all units, you can at most increase the performance by 30% (not considering cache contention, scheduling overhead,.. which will lower it a bit).

But that is not the best case. The best case is the time needed to compute, X, is much smaller than the time needed to pass data into CPU, Y. Depending on how much clock cycled needed to populate those registers, performance can increase by more than 30%, not concidering any other variables. In practice, X may or may not be smaller than Y, but when X is much bigger than Y, then next to no scaling occurs. So the theoretical maximum is 200% and the theoretical minimum is 100% assuming the implementation of HTT is perfect, otherwise, the theoretical min can be below 100%.

The performance improvement by HT is easily explained without any weird exponential stuff in it (although I must applaud your fantasy, not many people would even think about bringing that up xX), really no reason to make it more complicated than it is.

Now everyone understand why the sky is dark at night, until one day, someone actually challenge that and ended up with an expanding universe theory.

Again, you say for sure that it can't scale up more than 2 times, but have you explain why benchmarks done by different people shows otherwise? If it is due to margins of errors, what is the source? Lots of scienists simply remove statistics that isn't fit their model and claim that it is simply errors. This action lead to all kinds of disasters, and unfortunately, will continue to occur because people trust whatever they believe more than what happen.

Do you even have an idea how big the margin of error is needed to show more than 200% performance given that it is virtually impossible for perfect linear scaling? Throughout the benchmarks there must be some t where scaling does not occur, making it impossible to see 200% at the end. May be during the first run all reviewers coincidentally run anti-virus at the back, pushing down the numbers with a single card, therefore showing that doubling the video card actually show > 2x performance, but the chances of that is ... low.
May be it is due to round off error. But the chances of round down hurting performance of a single card and round up favoring performance of 2 cards to an extend it appears as if 2x card > 2x performance is low, plus the chance of those card actually scales perfectly is ... low.
What other kinds of errors can you think of that may contribute to this? If you can't think of a reasonable explanation that it is actually in the margin of errors, may be it that isn't an error.

Athadeus · Dec 23, 2010

Can anybody explain why on a stable system, a video/gaming benchmark result cannot be replicated perfectly? I would think it would mostly be because of how little control there is over the darn Windows background processes.

All the Prime95, Folding, SETI results are repeatable and hence can be used for stability testing against right?

WelshBloke · Dec 23, 2010

Seero said:
It is nonsense because you don't understand the difference between expected running time vs theoretical maximum. To give you an idea, you can look at Upper and lower bound.
Clearly, if it can't go beyond 8 times, than 8 is an upper bound. 8 is not an arbitrary number from thin air, but a product of 3 factors that are related in the given case. You can argue that it is too big, but unless you can find a smaller upperbound, or a theory or even prove that in can in fact go beyond 8, then you really can't say I am wrong. It is like I am saying 2 < 10 and you say "nonsense, 2 can never reach 10."

Yes it is.

Seero · Dec 23, 2010

digitaldurandal said:
Also to the person who said the theoretical maximum was 8x including because of 2x memory - the memory of the second card does not work like RAID.

Well, if the word RAID is your cup of tea, i'll use it. I didn't say it works like RAID0, but it doesn't work like RAID1 either.
I remember I said:

For example, let say X is the amount of data that are sent to video card, Y is the capacity of memory of the video card, and C is the available(unused) memory. When there is one card, c1 = y-x. When there are 2 cards, then c2 = 2y - 2x. Clearly c2 = 2*c1. Yes, there may be a W, which only occurs when there are more than one card to make SLI/CF to work. I can't say whether or not W has any relation to X, but theoretically speaking both W and X can be 0.

Now have you ever wonder why the memory size can't scale? I know the current implementation doesn't, but why not? If I/O is faster enough, then it is actually a good idea right?

Athadeus said:
Can anybody explain why on a stable system, a video/gaming benchmark result cannot be replicated perfectly? I would think it would mostly be because of how little control there is over the darn Windows background processes.

All the Prime95, Folding, SETI results are repeatable and hence can be used for stability testing against right?

Prime 95 stresses CPU, but when dual core first arrive, people actually need to open 2 instances of prime95 to stress both CPU, but even than, prime95 does not completely stress CPU due to bottlenecks. Note that the original purpose of prime95 is to calculate prime numbers. Using it to test the system is free, using it to calculate big prime numbers is not.

Now GPU is very different from CPU in terms of architecture. To completely stress GPU, the program must be capable of utilizes the fact that there are a lot of small cpus. Also, GPU isn't designed around having all cores doing something, so the GPU may not be able to take a new instruction while most of the cores are not doing anything.

Again, prime95 is not built to stress CPU, but people found that prime95 stress CPU very well. I don't remember names, but there exists programs that designed for this purpose. However, using those programs are known to be capable of killing a perfectly stable computer.

greater than 100% scaling? really?

Diamond Member

Lifer

Lifer

Golden Member

Lifer

Lifer

Lifer

Golden Member

Lifer

Golden Member

Lifer

Lifer

Diamond Member

Golden Member

Diamond Member

Golden Member

Golden Member

Golden Member

Diamond Member

Golden Member

Golden Member

Golden Member

Senior member

Lifer

Golden Member