Bulldozer Folding @ Home Performance Numbers

996GT2 · Mar 7, 2011

Not sure if legit, but:

http://www.chiphell.com/thread-169088-1-1.html

Supposedly:

4x16 core Bulldozer: 780,000 PPD
4x8 core Magny Cours (Opteron 6134): 180,000 PPD

Arkadrel · Mar 7, 2011

Folding@home is VERY floating point dependent, I believe.

Bulldozers are "floating point monsters" compared to AMD's older CPUs, so yeah, it could be "legit".

I believe there is a guy on that site who explains the math and why the numbers add up.

Flexible FPU (flexible floating point unit): one per module, with its own scheduler. With the extension called AVX, it can handle 256-bit FP execution. Single precision commands are 32-bit and double precision are 64-bit. With today’s standard 128-bit FPUs, you execute four single precision commands or two double precision commands in parallel per cycle. With AVX you can double that, executing eight 32-bit commands or four 64-bit commands per cycle.

"The beauty of the Flex FP is that it is a single 256-bit FPU that is shared by two integer cores. With each cycle, either core can operate on 256 bits of parallel data via two 128-bit instructions or one 256-bit instruction, OR each of the integer cores can execute 128-bit commands simultaneously."

Flex FPU makes bulldozer a floating point monster.
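
The lane counts quoted above follow from dividing the register width by the element width. A minimal sketch of that arithmetic (the function name `simd_lanes` is made up for illustration, not taken from any AMD material):

```python
# Lanes per vector operation = register width / element width.
# `simd_lanes` is a hypothetical helper name used only for this sketch.
def simd_lanes(register_bits: int, element_bits: int) -> int:
    return register_bits // element_bits

SINGLE_BITS = 32   # single precision element
DOUBLE_BITS = 64   # double precision element

for width in (128, 256):
    print(
        f"{width}-bit: "
        f"{simd_lanes(width, SINGLE_BITS)} single, "
        f"{simd_lanes(width, DOUBLE_BITS)} double per operation"
    )
```

This only counts elements per instruction; it says nothing about how many such instructions a given chip can issue per cycle.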

Mopetar · Mar 7, 2011

Approximately 116% performance improvement per core; of course this doesn't take clock speed into account so it's likely somewhat less considering clock per clock per core.

Not bad, if it's true, of course.
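
For anyone who wants to check the per-core figure, here is the arithmetic as a small sketch. It uses only the PPD numbers from the first post and assumes both systems really are four-socket boxes with 16 and 8 cores per socket; clock speed is ignored, as noted above.

```python
# PPD figures as reported in the first post (unverified).
bulldozer_ppd = 780_000
magny_cours_ppd = 180_000

# Assumed core counts: 4 sockets x 16 cores and 4 sockets x 8 cores.
bulldozer_cores = 4 * 16
magny_cours_cores = 4 * 8

bd_per_core = bulldozer_ppd / bulldozer_cores
mc_per_core = magny_cours_ppd / magny_cours_cores

# Relative per-core improvement, in percent.
improvement_pct = (bd_per_core / mc_per_core - 1) * 100

print(f"Bulldozer:    {bd_per_core:.1f} PPD per core")
print(f"Magny Cours:  {mc_per_core:.1f} PPD per core")
print(f"Improvement:  {improvement_pct:.1f}% per core")
```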

Edrick · Mar 7, 2011

Mopetar said:
Approximately 116% performance improvement per core; of course this doesn't take clock speed into account so it's likely somewhat less considering clock per clock per core.

Not bad, if it's true, of course.

That is about right since AVX can almost double performance. We are seeing similar (80%-100%) improvements in SB with AVX FP tests.

Mopetar · Mar 7, 2011

Edrick said:
That is about right since AVX can almost double performance. We are seeing similar (80%-100%) improvements in SB with AVX FP tests.

Given that, napkin math suggests the alleged Bulldozer CPU has a clock speed around 3 GHz to 3.5 GHz.

HW2050Plus · Mar 7, 2011

Any link showing that Folding@Home uses AVX? I would not be so sure that it does.

So it would be interesting to know whether they use FPU, SSE, or AVX.

Edrick · Mar 7, 2011

Mopetar said:
Given that, napkin math suggests the alleged Bulldozer CPU has a clock speed around 3 GHz to 3.5 GHz.

Which sounds about right given what we have seen from 32nm thus far.

IntelUser2000 · Mar 7, 2011

Reading further, it says the 12-core Opteron scores 400 seconds and the 16-core Interlagos scores ~230 seconds. Given the FMA support and 1/3 more cores, it's quite plausible.

Server part doesn't clock that high.
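
A rough sketch of what those two timings imply, assuming lower is better and that the only things compared are total time and core count (the ~230 seconds figure is approximate, and clocks are unknown):

```python
# Timings as quoted from the linked thread (lower is better, unverified).
opteron_seconds = 400      # 12-core Opteron
interlagos_seconds = 230   # 16-core Interlagos, approximate

opteron_cores = 12
interlagos_cores = 16

# Overall speedup of Interlagos over the 12-core part.
speedup = opteron_seconds / interlagos_seconds

# How much of that comes from simply having more cores.
core_ratio = interlagos_cores / opteron_cores

# What is left over per core after accounting for the extra cores.
per_core_speedup = speedup / core_ratio

print(f"Overall speedup:  {speedup:.2f}x")
print(f"Core count ratio: {core_ratio:.2f}x")
print(f"Per-core speedup: {per_core_speedup:.2f}x")
```

So the extra cores alone would not cover the gap; the remainder would have to come from per-core throughput or clock speed.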

hamunaptra · Mar 7, 2011

I doubt F@H uses anything but normal SSE or FPU math. It doesn't use AVX or FMA AFAIK... If it did, then F@H would be significantly faster on SB, which I don't think it is...

They are always behind in code implementation.

OTOH, BD would perform the same with AVX code or non AVX right?
Either 2x128bit SSE math or 1x256bit AVX...correct?

So, all I can say is either the posted F@H results for BD are fake.... OR they're real and BD is an FPU monster!!!
Now about the INT.....

Diogenes2 · Mar 7, 2011

A much more useful example would be PPD for an 8 core BD running Win7 or Linux..

Something much more likely to be in the hands of a possible user..

The OP is kind of like going on a Cessna forum and singing the praises of an F-22a...

Nemesis 1 · Mar 7, 2011

Thanks for that link. Not! Spyware flags went up all over the place on my PC.

PreferLinux · Mar 7, 2011

Edrick said:
That is about right since AXV can almost double performance. We are seeing similar (80%-100%) improvements in SB with AVX FP tests.

That won't happen with Bulldozer, as using AVX effectively halves the number of FPUs.

Arkadrel · Mar 7, 2011

PreferLinux said:
That won't happen with Bulldozer, as using AVX effectively halves the number of FPUs.

16 x 128-bit FPU units
vs
8 x 256-bit FPU units

I believe the point of this thread is just that.... AMD floating point performance exploded with Bulldozer, compared to their older CPUs.

Khato · Mar 7, 2011

Arkadrel said:
I believe the point of this thread is just that.... AMD floating point performance exploded with Bulldozer, compared to their older CPUs.

It did? Here I thought that for anything that doesn't make use of the new FMAD, FP performance was effectively halved per 'core' since it's the exact same FP resources as a Phenom II now shared between two integer cores - http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/5

Arkadrel · Mar 7, 2011

Khato said:
It did? Here I thought that for anything that doesn't make use of the new FMAD, FP performance was effectively halved per 'core' since it's the exact same FP resources as a Phenom II now shared between two integer cores - http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/5

That's not true. The floating point units are TWICE as big now (256-bit), so even if they're shared (128 bits per core), each core still has the same amount as the old processor had per core.

The difference is that now there are 16 units vs. the old CPU's 6 (not sure, but I believe it to be like this).

AMD Phenom II x6 core processor:

6 x 128-bit FPU

AMD "interlagos" x16 core processor:

16 x 128-bit FPU
OR
8 x 256-bit FPU

6 x 128-bit floating point units ---> 16 x 128-bit Floating point units.
= big increase in Floating point performance.
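
The unit counts above can be tallied like this. It is only a width count under the figures given in this post (one 128-bit FPU per Phenom II core, one 256-bit FPU per two-core Interlagos module); it ignores clock speed and what each pipeline can actually execute.

```python
# FPU counts and widths as claimed in this post (not verified).
chips = {
    "Phenom II X6": {"cores": 6, "fpus": 6, "fpu_bits": 128},
    "Interlagos":   {"cores": 16, "fpus": 8, "fpu_bits": 256},
}

totals = {}
for name, c in chips.items():
    total_bits = c["fpus"] * c["fpu_bits"]     # total FP datapath width
    bits_per_core = total_bits // c["cores"]   # share per integer core
    totals[name] = (total_bits, bits_per_core)
    print(f"{name}: {total_bits} bits total, {bits_per_core} bits per core")
```

By this count the per-core width is unchanged and the total grows only because there are more cores, which is consistent with both sides of the argument here.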

jiffylube1024 · Mar 7, 2011

Khato said:
It did? Here I thought that for anything that doesn't make use of the new FMAD, FP performance was effectively halved per 'core' since it's the exact same FP resources as a Phenom II now shared between two integer cores - http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/5

That's exactly what I thought too. I thought that Bulldozer's strength was that it's supposed to be an Integer monster, because it's effectively got 8 cores for Int, while for FP it's more like a quad core...

Arkadrel · Mar 7, 2011

Scenario 1:

2 "guys" walk into a shop and buy 1 bag of chips each. Each bag weighs 100 grams.

2 bags / 2 guys = 1 bag per person, 100 grams of chips per person.

Scenario 2:

2 "guys" walk into a shop and buy 1 bag of chips. They "share" this bag, and it weighs 200 grams.

1 bag / 2 guys = 1/2 bag per person (of 200 grams), 100 grams of chips per person.

It doesn't matter if they share. If what they share is TWICE as big, they're still getting the same thing.

Phenom II X6 has 6 units that are 128-bit.

Interlagos x16 has 8 units that are 256-bit, which can work like 16 units.

We can all agree that 16 is a bigger number than 6, right?

Yes, Bulldozer is gonna have better floating point performance than the Phenoms.

Mopetar · Mar 7, 2011

What if one of the blokes knobs the other guy over the head and takes all the chips? Then it's one guy with twice as many chips which is better, unless you're the guy that got robbed.

Arkadrel · Mar 7, 2011

@Mopetar,

That only happens when "the store?" (software) isn't working well with "many" blokes (threads).
But even then he "eats?" his chips just as fast as 2 normal "Phenom" blokes do.

Also, I heard this "Bulldozer" bloke is an athlete or something. Apparently he "runs" faster than that other guy, too.

hamunaptra · Mar 7, 2011

Another thing to keep in mind: the FPU in a BD module is WAY more flexible than the previous generation's. It can do different types of float math no matter which half of the pipe it goes down. FMUL and FADD can be handled by both pipes, whereas on the Stars core, I believe one pipe did FMUL, another FADD, and the third FMISC.
So, BD is more flexible...

At least that's how I see it based on my very limited uarch knowledge =P and just going off uarch diagrams =)

Khato · Mar 7, 2011

hamunaptra said:
Another thing to keep in mind: the FPU in a BD module is WAY more flexible than the previous generation's. It can do different types of float math no matter which half of the pipe it goes down. FMUL and FADD can be handled by both pipes, whereas on the Stars core, I believe one pipe did FMUL, another FADD, and the third FMISC.
So, BD is more flexible...

At least that's how I see it based on my very limited uarch knowledge =P and just going off uarch diagrams =)

Correct. The Phenom/Phenom II architecture had fixed FP pipelines for multiply, add, and misc operations. So theoretically, it could actually have higher throughput than the two 'general purpose' FMAC pipelines of Bulldozer, depending entirely upon the workload.

As an interesting aside since FMA was mentioned earlier... Whatever happened to volume 6 of AMD's architecture programmer's manual that documented both FMA4 and other bulldozer instruction set additions? Combined with the fact that no presentations that I recall actually promised FMA4 makes me wonder if it didn't make it.

JFAMD · Mar 7, 2011

Interlagos has 16 cores.

128 bit mode = 4 x 16 = 64 single precision operations per cycle
256 bit mode = 8 x 8 = 64 single precision operations per cycle

So, whether you are in legacy mode or AVX mode, the number of operations is the same.
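
The two lines of arithmetic above, written out as a small sketch. It assumes the breakdown given in this post: 16 cores each issuing one 128-bit operation in legacy mode, or 8 shared FPUs each issuing one 256-bit operation in AVX mode, with 32-bit single precision elements.

```python
SINGLE_BITS = 32  # width of one single precision element

# 128-bit mode: each of the 16 cores gets a 128-bit slice of its module's FPU.
ops_128 = (128 // SINGLE_BITS) * 16

# 256-bit mode: each of the 8 shared FPUs handles one full 256-bit operation.
ops_256 = (256 // SINGLE_BITS) * 8

print(f"128-bit mode: {ops_128} single precision operations per cycle")
print(f"256-bit mode: {ops_256} single precision operations per cycle")
```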

Now, for the fun part.

If you are utilizing FMA4, you can do a fused multiply accumulate (a=b+c*d) all in one cycle, where it would take SB 2 cycles to achieve the same thing.
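
To illustrate what "fused" means numerically, here is a software sketch of a=b+c*d with a single rounding versus a separate multiply and add. It emulates the fused result with exact rational arithmetic; it is not FMA4 itself and says nothing about cycle counts. The function names and input values are made up for the illustration.

```python
from fractions import Fraction

def fused_multiply_add(b: float, c: float, d: float) -> float:
    # Compute b + c*d exactly, then round to a double once at the end.
    exact = Fraction(b) + Fraction(c) * Fraction(d)
    return float(exact)

def separate_multiply_add(b: float, c: float, d: float) -> float:
    # Ordinary double arithmetic: c*d is rounded, then the sum is rounded.
    return b + c * d

# Inputs chosen so that the rounding of c*d throws away the interesting bits.
c = d = 1.0 + 2.0**-27
b = -(1.0 + 2.0**-26)

fused = fused_multiply_add(b, c, d)
separate = separate_multiply_add(b, c, d)

print("fused:   ", fused)
print("separate:", separate)
```

With these inputs the separate version loses the small term when the product is rounded, while the single-rounding version keeps it.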

And here are a few of the things that BD's FMACs can do that SB can't:

1. Run a 128-bit AVX and an SSE operation on the same cycle
2. Run two 128-bit AVX operations on the same cycle
3. Run two 128-bit AVX and an SSE operation on the same cycle
4. Run a 256-bit AVX and an SSE on the same cycle

I know so little about the folding software, but if it utilizes a lot of SSE, you should see a big boost from BD because Intel recommends recompiling for AVX and changing all SSE instructions to AVX-128. Plus, for AVX-128 they recommend actually padding the instruction (adding all zeros between 128 and 256), which means that even though you have 256-bit wide registers, you can only run one 128-bit operation through at a time.

Plus, with FMACs being more flexible (they can do an FADD or an FMUL on any cycle), you are better off because you get higher efficiency.

sandorski · Mar 7, 2011

I have just calculated 8 WTFs, 3 TSiRs, and 1 FML reading this thread. :sneaky:

I just hope it's good. Don't plan on an upgrade for a few years...but if BD is really good I might be tempted to do so earlier.

IntelUser2000 · Mar 7, 2011

hamunaptra said:
I doubt F@H uses anything but normal SSE or FPU math. It doesn't use AVX or FMA AFAIK... If it did, then F@H would be significantly faster on SB, which I don't think it is...

SpecCPU results show approximately the same gains for Sandy Bridge compared to Westmere. Yet Bulldozer is supposed to show a far higher gain for FP compared to its predecessor. That means either it's easier to optimize for FMA or it already works with certain applications.

JFAMD · Mar 7, 2011

Most of the compilers support FMA already today. I can think of one holdout and I wouldn't expect it for a few more years.
