Bulldozer Folding @ Home Performance Numbers

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
Folding@home is VERY floating point dependent, I believe.

Bulldozers are "floating point monsters" compared to AMD's older CPUs, so yeah, it could be "legit".

I believe there is a guy on that site who explains the math and why the numbers add up.






Flexible FPU (flexible floating point unit): one per module (with its own scheduler). With an extension called AVX, it can handle 256-bit FP execution. Single precision commands are 32-bit and double precision are 64-bit. With today’s standard 128-bit FPUs, you execute four single precision commands or two double precision commands in parallel per cycle. With AVX you can double that, executing eight 32-bit commands or four 64-bit commands per cycle.

"The beauty of the Flex FP is that it is a single 256-bit FPU that is shared by two integer cores. With each cycle, either core can operate on 256 bits of parallel data via two 128-bit instructions or one 256-bit instruction, OR each of the integer cores can execute 128-bit commands simultaneously."

Flex FP makes Bulldozer a floating point monster.
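
A minimal sketch of the lane arithmetic described above, in Python. The function name `lanes` is made up for illustration; it only divides the vector width by the element width and says nothing about how the hardware actually schedules work.

```python
# Lane counts for packed floating point operations.
# `lanes` is an illustrative helper, not part of any real API.

SINGLE_BITS = 32  # single precision
DOUBLE_BITS = 64  # double precision


def lanes(vector_bits: int, element_bits: int) -> int:
    """Return how many elements of element_bits fit in one vector."""
    if vector_bits % element_bits != 0:
        raise ValueError("vector width must be a multiple of element width")
    return vector_bits // element_bits


for width in (128, 256):
    print(
        f"{width}-bit: {lanes(width, SINGLE_BITS)} single precision, "
        f"{lanes(width, DOUBLE_BITS)} double precision per operation"
    )
```

Going from 128-bit to 256-bit doubles both counts, which is the "you can double that" part.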
 

Mopetar

Diamond Member
Jan 31, 2011
8,416
7,593
136
Approximately 116% performance improvement per core; of course this doesn't take clock speed into account so it's likely somewhat less considering clock per clock per core.

Not bad, if it's true, of course.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Approximately 116% performance improvement per core; of course this doesn't take clock speed into account so it's likely somewhat less considering clock per clock per core.

Not bad, if it's true, of course.

That is about right, since AVX can almost double performance. We are seeing similar (80%-100%) improvements in SB with AVX FP tests.
 

Mopetar

Diamond Member
Jan 31, 2011
8,416
7,593
136
That is about right, since AVX can almost double performance. We are seeing similar (80%-100%) improvements in SB with AVX FP tests.

Given that, napkin math suggests the alleged Bulldozer CPU has a clock speed around 3 GHz to 3.5 GHz.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
Reading further, the 12-core Opteron scores 400 seconds and the 16-core Interlagos scores ~230 seconds. Given the FMA support and 1/3 more cores, it's quite plausible.

The server part doesn't clock that high.
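
A rough sketch of the arithmetic behind those two numbers, assuming both times are seconds for the same amount of work (lower is better) and treating "~230" as exactly 230. Variable names are illustrative, clock speed is ignored, and perfect scaling across cores is assumed, so this is not necessarily the same basis as the per-core figure quoted earlier in the thread.

```python
# Times and core counts as quoted in the post; lower time is better.
opteron_cores, opteron_seconds = 12, 400
interlagos_cores, interlagos_seconds = 16, 230  # "~230" in the post


def speedup(old_seconds: float, new_seconds: float) -> float:
    """Ratio of old time to new time; above 1.0 means the new part is faster."""
    return old_seconds / new_seconds


total = speedup(opteron_seconds, interlagos_seconds)
core_ratio = interlagos_cores / opteron_cores  # the "1/3 more cores"
per_core = total / core_ratio                  # assumes perfect core scaling

print(f"overall speedup:  {total:.2f}x")
print(f"core count ratio: {core_ratio:.2f}x")
print(f"per-core speedup: {per_core:.2f}x")
```

With these inputs the overall speedup is roughly 1.74x, and about 1.30x once the extra cores are divided out.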
 

hamunaptra

Senior member
May 24, 2005
929
0
71
I doubt F@H uses anything but normal SSE or FPU math. It doesn't use AVX or FMA AFAIK... If it did, then F@H would be significantly faster on SB, which I don't think it is...

They are always behind in code implementation.

OTOH, BD would perform the same with AVX code or non-AVX code, right?
Either 2x128-bit SSE math or 1x256-bit AVX... correct?


So, all I can say is either the posted F@H results for BD are fake... OR they're real and BD is an FPU monster!!!
Now about the INT.....
 

Diogenes2

Platinum Member
Jul 26, 2001
2,151
0
0
A much more useful example would be PPD for an 8 core BD running Win7 or Linux..

Something much more likely to be in the hands of a possible user..

The OP is kind of like going on a Cessna forum and singing the praises of an F-22a...
 

PreferLinux

Senior member
Dec 29, 2010
420
0
0
That is about right, since AVX can almost double performance. We are seeing similar (80%-100%) improvements in SB with AVX FP tests.
That won't happen with Bulldozer, as using AVX effectively halves the number of FPUs.
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
That won't happen with Bulldozer, as using AVX effectively halves the number of FPUs.

16 x 128-bit FPU units
vs
8 x 256-bit FPU units


I believe the point of this thread is just that... AMD's floating point performance exploded with Bulldozer, compared to their older CPUs.
 

Khato

Golden Member
Jul 15, 2001
1,251
321
136
I believe the point of this thread is just that... AMD's floating point performance exploded with Bulldozer, compared to their older CPUs.

It did? Here I thought that for anything that doesn't make use of the new FMAD, FP performance was effectively halved per 'core' since it's the exact same FP resources as a Phenom II now shared between two integer cores - http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/5
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
It did? Here I thought that for anything that doesn't make use of the new FMAD, FP performance was effectively halved per 'core' since it's the exact same FP resources as a Phenom II now shared between two integer cores - http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/5


That's not true. The floating point units are TWICE as big now (256-bit), so even if they're shared (128 bits per core), each core still has the same amount as the old processor had per core.

The difference is that now there are 16 units vs. the old CPU's 6 (not sure, but I believe it to be like this).



AMD Phenom II x6 core processor:

6 x 128-bit FPU


AMD "interlagos" x16 core processor:

16 x 128-bit FPU
OR
8 x 256-bit FPU



6 x 128-bit floating point units ---> 16 x 128-bit Floating point units.
= big increase in Floating point performance.
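
To put numbers on the comparison above, here is a small sketch that just multiplies unit count by unit width. The unit counts are the ones claimed in this post (flagged above as unverified), the dictionary and function names are illustrative, and raw width says nothing about clocks or how well real code keeps the units busy.

```python
# (number of FPUs, width in bits, number of integer cores), as claimed in the post.
configs = {
    "Phenom II X6": (6, 128, 6),
    "Interlagos, 128-bit mode": (16, 128, 16),
    "Interlagos, 256-bit mode": (8, 256, 16),
}


def total_fp_bits(units: int, width_bits: int) -> int:
    """Total floating point datapath width across the whole chip."""
    return units * width_bits


totals = {}
per_core = {}
for name, (units, width_bits, cores) in configs.items():
    totals[name] = total_fp_bits(units, width_bits)
    per_core[name] = totals[name] // cores
    print(f"{name}: {totals[name]} bits total, {per_core[name]} bits per core")
```

Per core the width is the same in every row; the chip-wide total goes up because there are more cores.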
 

jiffylube1024

Diamond Member
Feb 17, 2002
7,430
0
71
It did? Here I thought that for anything that doesn't make use of the new FMAD, FP performance was effectively halved per 'core' since it's the exact same FP resources as a Phenom II now shared between two integer cores - http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/5


That's exactly what I thought too. I thought that Bulldozer's strength was that it's supposed to be an Integer monster, because it's effectively got 8 cores for Int, while for FP it's more like a quad core...
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
scenario 1:

2 "guys" walk into a shop, buy 1 bag of chips each, each bag weighs 100 grams.

2 bags / 2 guys = 1 bag per person, 100 grams of chips per person.



scenario 2:

2 "guys" walk into a shop, buy 1 bag of chips, they "share" this bag, it weighs 200 grams.

1 bag / 2 guys = 1/2 bag per person (of 200 grams), 100 grams of chips per person.






It doesn't matter if they share; if what they share is TWICE as big, they're still getting the same thing.


Phenom II x6 has 6 units, that are 128bit.

Interlagos x16 has 8 units, that are 256bit, that can work like 16 units.


We can all agree that 16 is a bigger number than 6, right?

Yes, Bulldozer is going to have better floating point performance than the Phenoms.
 

Mopetar

Diamond Member
Jan 31, 2011
8,416
7,593
136
What if one of the blokes knobs the other guy over the head and takes all the chips? Then it's one guy with twice as many chips which is better, unless you're the guy that got robbed.

 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
@Mopetar,

That only happens when "the store" (software) isn't working well with "many" blokes (threads).
But even then he "eats" his chips just as fast as 2 normal "Phenom" blokes do.

Also, I heard this "Bulldozer" bloke is an athlete or something; apparently he "runs" faster than that other guy, too.
 

hamunaptra

Senior member
May 24, 2005
929
0
71
Another thing to keep in mind: the FPU in a BD module is WAY more flexible than the previous generation. It can do different types of float math no matter which half of the pipe it goes down. FMUL and FADD can be handled by both pipes. Whereas on the Stars core, I believe one pipe did FMUL, another FADD, and another FMISC.
So, BD is more flexible...

At least that's how I see it based on my very limited uarch knowledge =P and just going off uarch diagrams =)
 

Khato

Golden Member
Jul 15, 2001
1,251
321
136
Another thing to keep in mind: the FPU in a BD module is WAY more flexible than the previous generation. It can do different types of float math no matter which half of the pipe it goes down. FMUL and FADD can be handled by both pipes. Whereas on the Stars core, I believe one pipe did FMUL, another FADD, and another FMISC.
So, BD is more flexible...

At least that's how I see it based on my very limited uarch knowledge =P and just going off uarch diagrams =)

Correct. The Phenom/Phenom II architecture had fixed FP pipelines for multiply, add, and misc operations. So theoretically, it could actually have higher throughput than the two 'general purpose' FMAC pipelines of Bulldozer; it depends entirely upon the workload.

As an interesting aside, since FMA was mentioned earlier... Whatever happened to volume 6 of AMD's architecture programmer's manual, which documented both FMA4 and other Bulldozer instruction set additions? That, combined with the fact that no presentation I recall actually promised FMA4, makes me wonder if it didn't make it.
 

JFAMD

Senior member
May 16, 2009
565
0
0
Interlagos has 16 cores.

128 bit mode = 4 x 16 = 64 single precision operations per cycle
256 bit mode = 8 x 8 = 64 single precision operations per cycle

So, whether you are in legacy mode or AVX mode, the number of operations is the same.
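
The two lines of arithmetic above can be sketched like this. It assumes, as the "4 x 16" and "8 x 8" figures imply, one 128-bit stream per core in legacy mode and one 256-bit stream per two-core module in AVX mode; the function and constant names are illustrative.

```python
CORES = 16
MODULES = CORES // 2  # two integer cores share one FPU per module
SP_BITS = 32          # width of one single precision value


def sp_ops_per_cycle(vector_bits: int, streams: int) -> int:
    """Single precision operations per cycle across the whole chip."""
    return (vector_bits // SP_BITS) * streams


legacy_mode = sp_ops_per_cycle(128, CORES)  # 4 lanes x 16 cores
avx_mode = sp_ops_per_cycle(256, MODULES)   # 8 lanes x 8 modules

print(f"128-bit mode: {legacy_mode} single precision ops per cycle")
print(f"256-bit mode: {avx_mode} single precision ops per cycle")
```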

Now, for the fun part.

If you are utilizing FMA4, you can do a fused multiply accumulate (a=b+c*d) all in one cycle, where it would take SB 2 cycles to achieve the same thing.
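
As a rough illustration of why fusing matters, here is a sketch that counts arithmetic instructions for a dot product done with separate multiply and add steps versus one multiply-accumulate per element. It is only a counting model: the function names are made up, Python does not actually issue a fused instruction for `acc + x * y`, and instruction counts are not the same thing as cycles on any particular CPU.

```python
def dot_separate(xs, ys):
    """Dot product with a separate multiply and add per element."""
    acc, instructions = 0.0, 0
    for x, y in zip(xs, ys):
        product = x * y      # one multiply
        acc = acc + product  # one add
        instructions += 2
    return acc, instructions


def dot_fused(xs, ys):
    """Same dot product, counted as one multiply-accumulate per element."""
    acc, instructions = 0.0, 0
    for x, y in zip(xs, ys):
        acc = acc + x * y    # stands in for a single a = b + c*d instruction
        instructions += 1
    return acc, instructions


xs = [1.0, 2.0, 3.0]
ys = [4.0, 5.0, 6.0]
print(dot_separate(xs, ys))
print(dot_fused(xs, ys))
```

Both versions return the same sum here; the fused count is half the separate count.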

And here are a few of the things that BD's FMACs can do that SB can't:

1. Run a 128-bit AVX and an SSE operation on the same cycle
2. Run two 128-bit AVX operations on the same cycle
3. Run two 128-bit AVX and an SSE operation on the same cycle
4. Run a 256-bit AVX and an SSE on the same cycle

I know very little about the folding software, but if it utilizes a lot of SSE, you should see a big boost from BD, because Intel recommends recompiling for AVX and changing all SSE instructions to AVX-128. Plus, for AVX-128 they recommend actually padding the instruction (adding all zeros between 128 and 256), which means that even though you have 256-bit wide registers, you can only run one 128-bit operation through at a time.

Plus, with FMACs being more flexible (they can do an FADD or an FMUL on any cycle), you are better off because you get higher efficiency.
 

sandorski

No Lifer
Oct 10, 1999
70,670
6,246
126
I have just calculated 8 WTFs, 3 TSiRs, and 1 FML reading this thread. :sneaky:

I just hope it's good. Don't plan on an upgrade for a few years...but if BD is really good I might be tempted to do so earlier.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
I doubt F@H uses anything but normal SSE or FPU math. It doesn't use AVX or FMA AFAIK... If it did, then F@H would be significantly faster on SB, which I don't think it is...

SPEC CPU results show approximately the same gains for Sandy Bridge compared to Westmere. Yet Bulldozer is supposed to show a far higher FP gain compared to its predecessor. That means either it's easier to optimize for FMA, or it already works with certain applications.
 

JFAMD

Senior member
May 16, 2009
565
0
0
Most of the compilers support FMA already today. I can think of one holdout and I wouldn't expect it for a few more years.
 