Bulldozer Folding @ Home Performance Numbers

hamunaptra · Mar 7, 2011

JFAMD said:
Interlagos has 16 cores.

128 bit mode = 4 x 16 = 64 single precision operations per cycle
256 bit mode = 8 x 8 = 64 single precision operations per cycle

So, whether you are in legacy mode or AVX mode, the number of operations are the same.

Now, for the fun part.

If you are utilizing FMA4, you can do a fused multiply accumulate (a=b+c*d) all in one cycle, where it would take SB 2 cycles to achieve the same thing.

And here are a few of the things that BD's FMACs can do that SB can't:

1. Run a 128-bit AVX and an SSE operation on the same cycle
2. Run two 128-bit AVX operations on the same cycle
3. Run two 128-bit AVX and an SSE operation on the same cycle
4. Run a 256-bit AVX and an SSE on the same cycle

I know so little about the folding software, but if it utilizes a lot of SSE, you should see a big boost from BD because intel recommends recompiling for AVX and changing all SSE instructions to AVX-128. Plus, for AVX-128 they recommend actually padding the instruction (adding all zeros between 128 and 256) which means that even though you have 256-bit wide registers, you can only run one 128-bit through at a time.

Plus, with FMACs being more flexible (they can do an FADD or an FMUL) on any cycle, you are better off because you get higher efficiency.

Whoa, what it can run a 256bit AVX and SSE in parallel in a single clock cycle?? thats news to me!

I thought it wouldnt be able to do such a thing. I know there are 2 other pipes other than the FMAC but afaik those were for INT SIMD MMX type workloads...

How can the BD do a single 256bit AVX and an SSE alongside each other?

hamunaptra · Mar 7, 2011

JFAMD said:
Most of the compilers support FMA already today. I can think of one holdout and I wouldn't expect it for a few more years.

LOL intels compiler? since theyre chips dont have FMA4 yet...

Arkadrel · Mar 7, 2011

A Bulldozer is so my next CPU upgrade.

theAnimal · Mar 7, 2011

996GT2 said:
4x16 core Bulldozer: 780,000 PPD
4x8 core Magny Cours (Opteron 6134): 180,000 PPD

Not really that impressive. Those PPD numbers are not linear due to an early return bonus.

As an example, with 1 CPU my system would get around 60k PPD whereas with 2 CPUs I am around 160k PPD.

A system with quad 12 core Opterons @ 2.2GHz currently gets 357k PPD, and if we scale that to 64 cores it would be 550k PPD and further scale to 2.8GHz it would give 789k PPD.

Edit: This is of course assuming the same WU. If the BD number is with the dreaded 2684 WU (or at a much lower clockspeed), then it would be impressive.

JFAMD · Mar 7, 2011

hamunaptra said:
Whoa, what it can run a 256bit AVX and SSE in parallel in a single clock cycle?? thats news to me!

I thought it wouldnt be able to do such a thing. I know there are 2 other pipes other than the FMAC but afaik those were for INT SIMD MMX type workloads...

How can the BD do a single 256bit AVX and an SSE alongside each other?

Actually, I think you are right, the other 2 pipes are MMX. But the 128-bit AVX with SSE is possible for us.

Until I can get a clarification back from the engineer assume:

1. Run a 128-bit AVX and an SSE operation on the same cycle
2. Run two 128-bit AVX operations on the same cycle

Khato · Mar 7, 2011

JFAMD said:
And here are a few of the things that BD's FMACs can do that SB can't:

1. Run a 128-bit AVX and an SSE operation on the same cycle
2. Run two 128-bit AVX operations on the same cycle

Perhaps I'm remembering incorrectly, but isn't the difference that BD's FMACs can do two operations of the same kind per cycle, while SB can only do one add and one multiply? There's no question that this is an advantage for BD, but not quite so dramatic as would otherwise be the case.

Same is technically true for the 'advantage' that supporting the FMA4 instruction gives. Yes, it can complete a multiply-add in one cycle while SB takes two cycles... but SB can effectively 'pipeline' the operations to complete a multiply-add every cycle since it can do a 256 bit add and a 256 multiply in the same cycle.

IntelUser2000 · Mar 7, 2011

In the following scenario, AVX should still be somewhat faster in real world applications.

16x128-bit SSE
8x256-bit AVX

Pentium 4 moved to 128-bit with SSE2, but it still gained a decent amount even though the FPUs were 64-bit and needed 2 cycles to complete one 128-bit SSE2 instructions.

Imagine between 2x RAM capacity and RAM frequency. Roughly speaking its like that. Now that doesn't change anything theoretical or in linear gain code.

For SNB I'm still puzzled as to why Intel isn't touting AVX capabilities with new programs. Wasn't SP1 the version that brings AVX support?

Khato said:
Perhaps I'm remembering incorrectly, but isn't the difference that BD's FMACs can do two operations of the same kind per cycle, while SB can only do one add and one multiply? There's no question that this is an advantage for BD, but not quite so dramatic as would otherwise be the case.

I guess that's complicated by the whole module vs. cores issue.

Each module in Bulldozer has two 128-bit FPUs with FMA.

Each Sandy Bridge core has two 128-bit FPUs without FMA(two 256-bit with AVX).

Module-wise, Bulldozer has an advantage with 128-bit, and is identical to Sandy Bridge with 256-bit AVX.

hamunaptra · Mar 7, 2011

JFAMD said:
Actually, I think you are right, the other 2 pipes are MMX. But the 128-bit AVX with SSE is possible for us.

Until I can get a clarification back from the engineer assume:

1. Run a 128-bit AVX and an SSE operation on the same cycle
2. Run two 128-bit AVX operations on the same cycle

Ive always been curious as to why MMX has received 2 of its own dedicated pipes. Isnt MMX an afterthought pretty much nowadays? like noone uses it anymore do they?...
Wouldnt that just be wasted die space or .. does it literally not take up much at all...lol?

I thought it would have been piggy backed onto the other 2 FPU pipes...lol more efficiently.

drizek · Mar 7, 2011

So an overclocked 8-core bulldozer will get 100,000ppd?

Considering that I'm barely getting 2000ppd with my 3.6Ghz PII X3, thats pretty impressive.

HW2050Plus · Mar 7, 2011

There is some important thing about Bulldozer and FPU/SSE/AVX processing.

This comes into play when all cores of the module are used.

In reality there is no real pure FPU/SSE/AVX code. The code is mixed with integer instructions and/or memory operations.

That means if you have a core which could theoretically feed 2 SSE / cycle it will not do that all the time, because the code contains also more or less cycles with integer instructions. So in a real CPU with one thread feeding the FPU/SSE/AVX line you could not fill the pipelines.

Now here comes the beauty of Bulldozer. Not only that it can do the integer stuff mostly in parallel due to the independent scheduler. It can also fill up gaps in the FPU/SSE/AVX pipeline by code from the other thread.

Of course this effect is not a 100% speedup, but it is a speedup. The exact speedup is very difficult to predict because very code dependand and also not constant even with the same code. The effect could be even increased in future by increasing the scheduler depth.

Another important thing is memory bandwith. In Bulldozer it is possible to read and write each cycle all data and results to memory (L1/L2/L3/memory/disk), especially if the type of data can be cached well. That is very important, because otherwise your FPU/SSE/AVX gets memory stalls.

Therefore 2 128 Bit units per module gives significantly more performance than 2 seperate cores with 128 Bit units each, though on paper it might look that you have the same performance (because of same unit count)!

And float code is great for Bulldozer becauser it is mixed with integer and that means that Bulldozer can do both in parallel because they have separate schedulers. In Integer only code the additional FP scheduler etc. don't give any benefit.

In pure theory (means at peak performance) a Bulldozer module can do 8 Integer MacroOPS + 8 FPU ops (with FMAC, 4 without) means 16 operations per cycle (plus 6(!) 128 Bit memory operations). Of corse this peak performance is not sustained!

Therefore code with ~40% float/SSE and ~60% integer with many memory operations to L1 would be code where Bulldozer would shine at it's best.

For comparison 2 SB cores peak performance: 8 Integer yops or 6 Integer yops + 4 * FPU ops (10 ops/cycle). To be fair it must be said that SB peak capability is lower but their average throughput does not drop as much as that of Bulldozer. That means that in the vast majority of applications 2 SB cores are faster than 1M/2C of Bulldozer. However the more and the better the integer code is mixed with memory and/or FPU/SSE/AVX operations the more Bulldozer will shine.

Don't know if that is the case here since I do not know the code of Folding@Home.

I can now say for sure that F@H does not use AVX nor FMAC. The current version was compiled in november 2008. And I doubt that you get PPD score with beta software.

RobertPters77 · Mar 8, 2011

So what does this mean for the average Jim?

What performance can I expect relative to sandy bridge? +/- or = clock for clock?

Edit: What is sandy bridge's fpu performance ?

PreferLinux · Mar 8, 2011

Sandy bridge gets 80 GFLOPS with AVX at stock.

RobertPters77 · Mar 8, 2011

PreferLinux said:
Sandy bridge gets 80 GFLOPS with AVX at stock.

Which one? i7 2600?

bandgit · Mar 8, 2011

RobertPters77 said:
So what does this mean for the average Jim?

What performance can I expect relative to sandy bridge? +/- or = clock for clock?

Edit: What is sandy bridge's fpu performance ?

I'm with you. I got a migraine reading this thread, so what I want to know is what SB vs BD results average Joes can get on their Windows 7 rigs running regular, everyday apps like Office and Photoshop. :thumbsup:

BD231 · Mar 8, 2011

drizek said:
So an overclocked 8-core bulldozer will get 100,000ppd?

Considering that I'm barely getting 2000ppd with my 3.6Ghz PII X3, thats pretty impressive.

LOL jesus these threads make for so much miss information. OP is talking about server chips in clusters of 4.

4 x 8 = 32 cores at work and 4 x 16 = 64 cores at work. These numbers are far beyond what you'll be getting out of a single socket BullDozer platform.

hamunaptra · Mar 8, 2011

So, wouldnt it be fair to multiply the frame time by x8 and put it into the f@h calc and it would spit out the appropriate PPD including bonus for a single socket BD? (desktop 8 core)

drizek · Mar 8, 2011

BD231 said:
LOL jesus these threads make for so much miss information. OP is talking about server chips in clusters of 4.

4 x 8 = 32 cores at work and 4 x 16 = 64 cores at work. These numbers are far beyond what you'll be getting out of a single socket BullDozer platform.

That's why I divided by 8. I didn't realize that bonus points had such a big effect on it that it would end up being so non-linear though

I just did what humanaptra suggested and got 35797PPD.

According to the chiphell thread, turbo was not used. Also, since this is a server, I bet the frequencies were much lower than what an overclocker can get in a desktop.I used 30m56s/frame, but if you use 20m/frame then that is closer to about 70,000PPD.

HW2050Plus · Mar 8, 2011

bandgit said:
I'm with you. I got a migraine reading this thread, so what I want to know is what SB vs BD results average Joes can get on their Windows 7 rigs running regular, everyday apps like Office and Photoshop. :thumbsup:

If somebody did not tell already you should look at Solid State Disks (SSD) which are important regarding the scenario you describe. Don't see how your user experience scenario would be bound by CPUs anyway.

HW2050Plus · Mar 8, 2011

Some News about that:

According to some sources the AMD Folding@Home score was achieved with Interlagos parts running at only 1.8 GHz! Also that TURBO was switched off.

Now I did some math on that. That means that Interlagos performs 58% faster than Magny Cours at same core count at same clock and with TURBO switched off. That could be hefty as actual Interlagos has a large per core advanatage and has more cores and more clock and TURBO compared to actual Magny Cours. In numbers Interlagos will be around 175% faster than Magny Cours (in this benchmark)!

Now comparison with Intel Xeon X5650:
Interlagos is per clock per CMT core 27% faster than a Xeon x5650 core with(!) HT enabled which increases the per core throughput.
If you take into account that Interlagos has 16 cores and Xeon X5650 only 6 you get a whopping 240% gain over Xeon X5650 clock for clock.

And all that with old code from 2008 that does not use the new AMD feature of FMAC which could give another 30-50% gain.

Really whopping, that means that regarding FPU a BD CMT core will be faster than a SB real core including HT! And AMD will offer twice the amount of cores with their Vision Black FX8000 compared to i7 2600.

BD231 · Mar 8, 2011

drizek said:
That's why I divided by 8. I didn't realize that bonus points had such a big effect on it that it would end up being so non-linear though

I just did what humanaptra suggested and got 35797PPD.

According to the chiphell thread, turbo was not used. Also, since this is a server, I bet the frequencies were much lower than what an overclocker can get in a desktop.I used 30m56s/frame, but if you use 20m/frame then that is closer to about 70,000PPD.

You can't account for the number of FPU's lost in that equation.

hamunaptra · Mar 8, 2011

HW2050Plus said:
Some News about that:

According to some sources the AMD Folding@Home score was achieved with Interlagos parts running at only 1.8 GHz! Also that TURBO was switched off.

Now I did some math on that. That means that Interlagos performs 58% faster than Magny Cours at same core count at same clock and with TURBO switched off. That could be hefty as actual Interlagos has a large per core advanatage and has more cores and more clock and TURBO compared to actual Magny Cours. In numbers Interlagos will be around 175% faster than Magny Cours (in this benchmark)!

Now comparison with Intel Xeon X5650:
Interlagos is per clock per CMT core 27% faster than a Xeon x5650 core with(!) HT enabled which increases the per core throughput.
If you take into account that Interlagos has 16 cores and Xeon X5650 only 6 you get a whopping 240% gain over Xeon X5650 clock for clock.

And all that with old code from 2008 that does not use the new AMD feature of FMAC which could give another 30-50% gain.

Really whopping, that means that regarding FPU a BD CMT core will be faster than a SB real core including HT! And AMD will offer twice the amount of cores with their Vision Black FX8000 compared to i7 2600.

Whats the sources if I may ask? Cuz Ive been searching for this 1.8ghz source too as Ive seen it randomly pop up on other forums threads about the F@H scores...but the source of that score mentions nothing of clockspeed, unless he mentioned it later on his thread.

Also, did you take into account the crazy bonus being applied in this case?

If any of this is damn true...then its too good to be true but it might just be true lol in which case....Intel is in a WORLD of HURT, at least for FPU performance. Now we just gotta wait for INT performance...

Also, lets just not hope this is the only type of workload AMD is gonna shine in...lol , well at least it will make a section of the community happy = Distributed Computing.

HW2050Plus · Mar 8, 2011

hamunaptra said:
Whats the sources if I may ask? Cuz Ive been searching for this 1.8ghz source too as Ive seen it randomly pop up on other forums threads about the F@H scores...but the source of that score mentions nothing of clockspeed, unless he mentioned it later on his thread.

It comes from ChipHell itself. I cannot read the language there therefore I took that info from another translation (also not English but a language I can read). Therefore the 1.8 GHz statement is in the OP source itself. There is also the calculation I did, though their result of 62% gain per core per clock is higher than my calculation of 58%. Therefore all the numbers I calculated above are a bit lower than the results on ChipHell.

hamunaptra said:
Also, did you take into account the crazy bonus being applied in this case?

No. I used the calculation times not the PPD score, so the bonus is excluded.

hamunaptra said:
If any of this is damn true...then its too good to be true but it might just be true lol in which case....Intel is in a WORLD of HURT, at least for FPU performance. Now we just gotta wait for INT performance...

Integer will be definitely lower. But Bolldozer will be a float monster.

hamunaptra said:
Also, lets just not hope this is the only type of workload AMD is gonna shine in...lol , well at least it will make a section of the community happy = Distributed Computing.

For all pure integer workloads it will not look that amazing for AMD. But for all float type workloads it appears to be absolute amazing.

Let's hope that integer performance will at least do well though I am absolutly sure that Bulldozer will be slower in integer than SB per core. That is not bad as BD has more cores because of CMT.

bandgit · Mar 8, 2011

HW2050Plus said:
If somebody did not tell already you should look at Solid State Disks (SSD) which are important regarding the scenario you describe. Don't see how your user experience scenario would be bound by CPUs anyway.

I regularly manipulate files of over 1GB on Photoshop. I currently run an i7 920 with 12GB RAM and Velociraptor. Not only do I need lots of RAM and fast HD, but the quickest possible CPU as well.

HW2050Plus · Mar 8, 2011

bandgit said:
I regularly manipulate files of over 1GB on Photoshop. I currently run an i7 920 with 12GB RAM and Velociraptor. Not only do I need lots of RAM and fast HD, but the quickest possible CPU as well.

Okay. First to say you hopefully know that a Velociraptor is nothing compared to SSD. SSDs are up to 100 times faster than even a Velociraptor. With no CPU upgrade you can get such large performance improvements. I use Velociraptors as well but that did only little improvement.

On the other hand I would be interested by which operations you feel CPU bound and what time you wait for the result of the operation (half, one, 10 seconds?).

I mean a Core i7 920 is already a really fast CPU.

I still believe that you are not CPU bound especially if you say that you manipulate such large files.

Idontcare · Mar 8, 2011

The hype in here is going to 11.

Bulldozer Folding @ Home Performance Numbers

Senior member

Senior member

Diamond Member

Diamond Member

Senior member

Golden Member

Elite Member

Senior member

Golden Member

Member

Senior member

Senior member

Senior member

Member

Lifer

Senior member

Golden Member

Member

Member

Lifer

Senior member

Member

Member

Member

Elite Member