Bulldozer Folding @ Home Performance Numbers


bandgit

Member
Mar 7, 2011
36
0
66
Okay. First off, you hopefully know that a Velociraptor is nothing compared to an SSD. SSDs are up to 100 times faster than even a Velociraptor, so you can get a huge performance improvement without any CPU upgrade. I use Velociraptors as well, but they only made a small difference.

On the other hand, I would be interested to know which operations make you feel CPU bound and how long you wait for the result (half a second, one second, ten seconds?).

I mean a Core i7 920 is already a really fast CPU.

I still believe that you are not CPU bound, especially since you say you manipulate such large files.

Yes, I'm well aware of the speed difference between a Velociraptor and an SSD, but when I purchased the rig SSDs were just coming onto the market and they were still having a fair amount of controller problems. Applying filters to large files is just one of the functions where Photoshop looks for a hefty CPU, and there are some functions where it can pass off processing to the GPU as well. The biggest Photoshop file I've worked with to date was just over 3GB, and I was maxing out my 12GB of RAM. I'm looking to populate the 8 slots on the LGA2011 SB board with 4GB sticks, so at least I'll be able to toss more than enough RAM at that Photoshop hog. Don't get me wrong, my i7 is plenty fast in these processes, but I'm a speed freak so I want the max.
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
The hype in here is going to 11.

This had me laughing, I like the humor in this thread; it's very light-hearted.

JF-AMD would be proud of that one... yes they do indeed go to 11! ^-^

*edit

How did this get to an SSD vs. HDD thing? Also... 100x faster? I know SSDs have good read and decent write speeds... but 100x? That SSD is clearly going to 11 too, then.
 

hamunaptra

Senior member
May 24, 2005
929
0
71
It comes from ChipHell itself. I cannot read the language there, so I took that info from another translation (also not English, but a language I can read). So the 1.8 GHz statement is in the OP's source itself. There is also the calculation I did, though their result of a 62% gain per core per clock is higher than my calculation of 58%. So all the numbers I calculated above are a bit lower than the results on ChipHell.


No. I used the calculation times, not the PPD score, so the bonus is excluded.


Integer will definitely be lower. But Bulldozer will be a float monster.


For pure integer workloads it will not look that amazing for AMD. But for float-type workloads it appears to be absolutely amazing.

Let's hope that integer performance will at least do well, though I am absolutely sure that Bulldozer will be slower in integer than SB per core. That is not bad, as BD has more cores because of CMT.

Well I find that quite odd, because if BD was designed to be an INT throughput CPU (for server workloads), why would they let it shine in FPU yet be sucky in INT? ...

I think JF has stated somewhere that they focused more on INT throughput than FPU throughput, and talked about how server workloads are more INT than anything while the FPU usually sits around not doing much. So why is the FPU so much better? LOL!

OR could it be that the advantage in FPU throughput is all the parts around the FPU, which have gotten so much more efficient, in which case this should translate to INT as well?
Such as FE (front-end) efficiency and so forth...
 

BD231

Lifer
Feb 26, 2001
10,568
138
106
For pure integer workloads it will not look that amazing for AMD. But for float-type workloads it appears to be absolutely amazing.

Let's hope that integer performance will at least do well, though I am absolutely sure that Bulldozer will be slower in integer than SB per core. That is not bad, as BD has more cores because of CMT.

AMD believes that 80%+ of all normal server workloads are purely integer operations

http://www.anandtech.com/show/2881

Stop posting.
 

Mopetar

Diamond Member
Jan 31, 2011
8,413
7,590
136
I think JF has stated somewhere that they focused more on INT throughput than FPU throughput, and talked about how server workloads are more INT than anything while the FPU usually sits around not doing much. So why is the FPU so much better?
..

Because in the absence of established fact, minds and mouths will run wild. By next week I fully expect a thread about how Bulldozer has cured cancer.
 

hamunaptra

Senior member
May 24, 2005
929
0
71
Because in the absence of established fact, minds and mouths will run wild. By next week I fully expect a thread about how Bulldozer has cured cancer.

Now that would be freakin sweet! Hey, ya never know. Maybe a cure will come from F@H, from TONS of BDs crunching away!!! MUAHAHAAH LOL!
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
Well I find that quite odd, because if BD was designed to be an INT throughput CPU (for server workloads), why would they let it shine in FPU yet be sucky in INT? ...

They really did much less work on FP than integer. Otherwise we might have seen two 128-bit FMA FPU units per core (or two 256-bit regular FPUs), rather than per module.

By that reasoning you could also say that Intel worked much more on FPUs too. AVX-optimized workloads would show a big gain.

It's like saying the CPU guys are lazy and the GPU guys have put a lot more effort into increasing performance because GPUs gain 2x performance per generation, while CPUs might gain half or even a third of that. A 50% gain in integer is big.

How did this get to an SSD vs. HDD thing? Also... 100x faster? I know SSDs have good read and decent write speeds... but 100x? That SSD is clearly going to 11 too, then.

It is in certain metrics, like 4K random write numbers and IOPS. That doesn't mean programs will suddenly be 100x faster, since random writes and IOPS are only a small part of the equation.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
AMD believes that 80%+ of all normal server workloads are purely integer operations

http://www.anandtech.com/show/2881

Well I find that quite odd, because if BD was designed to be an INT throughput CPU (for server workloads), why would they let it shine in FPU yet be sucky in INT? ...

Almost all server workloads are integer; that is absolutely correct. They focused on integer for servers, right.

They doubled the integer performance, so there is nothing wrong with integer. But for float performance they achieved even more than a doubling, nearly tripling the performance.

In the server market most customers just don't care about float performance. But in the consumer space it looks somewhat different.

Nobody at AMD ever said this CPU was designed for the server market only.

And there is also a niche which is quite important for AMD, and that is computing clusters. Those computing clusters just don't care about integer performance. As you know, AMD is quite strong in this niche.

So especially since the server department emphasized the integer performance, you could be afraid that it would suck in float. But that will obviously not happen, and that is very good news.

OR could it be that the advantage in FPU throughput is all the parts around the FPU, which have gotten so much more efficient, in which case this should translate to INT as well?
Such as FE (front-end) efficiency and so forth...
No. I guarantee you that the integer gains are significantly lower than the gains in float. That is also in line with AMD's slides.

There is a comparison of Interlagos and Magny-Cours regarding integer and float performance. There you see Interlagos much better in integer and almost twice that improvement in float. The slide matches the 50%-faster-in-integer statement, but you also see that they claim nearly 100% faster in float. And both are server-only parts.

Sure, AMD does not advertise that in the server space because those customers are not interested in it.

From the architecture you can quite easily see that they can come close to SB but cannot surpass SB in integer per core. But as already said, that does not matter, as they surpass SB in integer throughput thanks to the CMT-over-HT advantage.

I mean, the main news from this Folding@Home result is that they did float performance exceptionally well. Nothing more, nothing less, and without any implications for integer performance.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Because in the absence of established fact, minds and mouths will run wild. By next week I fully expect a thread about how Bulldozer has cured cancer.

It's not enough to simply cure cancer; if it doesn't do it with record-low power consumption then no one will be interested in the results.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
Yes, I'm well aware of the speed difference between a Velociraptor and an SSD, but when I purchased the rig SSDs were just coming onto the market and they were still having a fair amount of controller problems. Applying filters to large files is just one of the functions where Photoshop looks for a hefty CPU, and there are some functions where it can pass off processing to the GPU as well. The biggest Photoshop file I've worked with to date was just over 3GB, and I was maxing out my 12GB of RAM. I'm looking to populate the 8 slots on the LGA2011 SB board with 4GB sticks, so at least I'll be able to toss more than enough RAM at that Photoshop hog. Don't get me wrong, my i7 is plenty fast in these processes, but I'm a speed freak so I want the max.
Anyway, when you buy your LGA2011 SB rig with 32 GB, just don't forget to order SSDs as well!

I still believe that you are IO limited.

It is in certain metrics, like 4K random write numbers and IOPS. That doesn't mean programs will suddenly be 100x faster, since random writes and IOPS are only a small part of the equation.
Right, that's why I wrote "up to", so it was okay.
 

HW2050Plus

Member
Jan 12, 2011
168
0
0
It's not enough to simply cure cancer; if it doesn't do it with record-low power consumption then no one will be interested in the results.
As they only cure mad cow disease and CJD, it is not even worth mentioning, regardless of how little power it needed.
 

Mopetar

Diamond Member
Jan 31, 2011
8,413
7,590
136
It's not enough to simply cure cancer; if it doesn't do it with record-low power consumption then no one will be interested in the results.

I just heard a rumor from the brother of the guy who mows the lawn for a third cousin of a gal who moonlights at a waitress gig, where two people who may have been from AMD were discussing how Bulldozer draws negative power. If I get three or four I may be able to power my house.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
I just heard a rumor from the brother of the guy who mows the lawn for a third cousin of a gal who moonlights at a waitress gig, where two people who may have been from AMD were discussing how Bulldozer draws negative power. If I get three or four I may be able to power my house.

Ah yes, the oft overlooked potential of harnessing the Casimir effect.

Ironically, if we are ever to leverage the physics that underlie the Casimir effect, it will only happen by way of us repurposing our semiconductor manufacturing methods toward such an endeavor.
 

hamunaptra

Senior member
May 24, 2005
929
0
71
Almost all server workloads are integer; that is absolutely correct. They focused on integer for servers, right.

They doubled the integer performance, so there is nothing wrong with integer. But for float performance they achieved even more than a doubling, nearly tripling the performance.

In the server market most customers just don't care about float performance. But in the consumer space it looks somewhat different.

Nobody at AMD ever said this CPU was designed for the server market only.

And there is also a niche which is quite important for AMD, and that is computing clusters. Those computing clusters just don't care about integer performance. As you know, AMD is quite strong in this niche.

So especially since the server department emphasized the integer performance, you could be afraid that it would suck in float. But that will obviously not happen, and that is very good news.


No. I guarantee you that the integer gains are significantly lower than the gains in float. That is also in line with AMD's slides.

There is a comparison of Interlagos and Magny-Cours regarding integer and float performance. There you see Interlagos much better in integer and almost twice that improvement in float. The slide matches the 50%-faster-in-integer statement, but you also see that they claim nearly 100% faster in float. And both are server-only parts.

Sure, AMD does not advertise that in the server space because those customers are not interested in it.

From the architecture you can quite easily see that they can come close to SB but cannot surpass SB in integer per core. But as already said, that does not matter, as they surpass SB in integer throughput thanks to the CMT-over-HT advantage.

I mean, the main news from this Folding@Home result is that they did float performance exceptionally well. Nothing more, nothing less, and without any implications for integer performance.

So why then is there such a huge FP increase, since it's roughly a single 128-bit pipe for each core...? Given they both can do FADD and FMUL...
The Stars core had 3 FP pipes: FADD or FMUL or FMISC...
Is it possibly because F@H-type workloads are only one of those instruction types?
Therefore, since BD can run both in the 2 128-bit pipes?
But that still gives it the equivalent of only 1 128-bit FPU per "core"....

Plz explain...
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
Therefore, since BD can run both in the 2 128-bit pipes?
But that still gives it the equivalent of only 1 128-bit FPU per "core"....

Plz explain...

Bulldozer's FPU brings FMA instructions. The previous Stars-core FPU could do a multiply or an add, with one unit for each. FMA allows one unit to do both the multiply and the add in a single cycle.

In addition to the other parts of the architecture that bring further improvements, it's "easier" to gain more in FP code than in integer.

So say the overall gain of going from 12 cores to 16 cores is x%. Enhancing the FPU would be on top of that x% gain.
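
(To make that concrete - purely an illustrative sketch, nothing AMD published: the standard fma() from <math.h> expresses the multiply and the add as one fused operation, while the plain expression is two separate ops unless the compiler is allowed to contract them.)

/* Hedged sketch of fused multiply-add vs. separate multiply + add, using
   the standard C99 fma() from <math.h>. How this maps onto BD or SB
   execution units is up to the compiler and hardware. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.5, b = 2.0, c = 0.25;

    /* Two separate operations: a multiply, then a dependent add. */
    double separate = b * c + a;

    /* One fused operation computing b*c + a in a single step,
       with a single rounding at the end. */
    double fused = fma(b, c, a);

    printf("separate = %f, fused = %f\n", separate, fused);
    return 0;
}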
 
Last edited:

JFAMD

Senior member
May 16, 2009
565
0
0
They really did much less work on FP than integer. Otherwise we might have seen two 128-bit FMA FPU units per core (or two 256-bit regular FPUs), rather than per module.

This is not really true. We did a lot of work on both integer and FP. Just because we have a different FP arrangement doesn't mean that "if we had done more work" there would have been more FP resources.

In reality the FP:integer ratio is based on workload needs far more than on how much focus was put on each.

Think about it this way: in the same die space you could have 6 old cores with 6 FPUs, or 8 new cores and 8 FPUs. But to double the size of the FPUs you would probably have to fall back to 6 integer cores. Then you would have 256-bit FP for the 10-20% of the work that is FP-bound, and you would have the same essential resources for integer, which is 80-90% of the work. Doesn't make sense.

Think about all that we did with the FPU: AVX, SSSE3, SSE4.1, SSE4.2, FMA4, XOP, and more. This is a really optimized design.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
So why then is there such a huge FP increase, since it's roughly a single 128-bit pipe for each core...? Given they both can do FADD and FMUL...
The Stars core had 3 FP pipes: FADD or FMUL or FMISC...
Is it possibly because F@H-type workloads are only one of those instruction types?
Therefore, since BD can run both in the 2 128-bit pipes?
But that still gives it the equivalent of only 1 128-bit FPU per "core"....
Let me try:
Stars:
FUs: 1x 128b dedicated FADD, 1x 128b dedicated FMUL, 1x FMISC
L/S: 2x 128b load/cycle OR 2x 64b store/cycle (or 1x 128b)
Execution: 1 thread per FPU, suffering from "bubbles" -> FP instructions have longer latencies (x cycles), causing "bubbles": cycles where no FP op is ready for execution because it has to wait for its operands to be ready.

So while raw throughput for FMUL/FADD looks similar to a BD module, the latter's situation is more like this:

BD FPU (for 1 module):
FUs: 2x 128b FMAC (could do either FADD, FMUL or FMAC per op), 2x integer SIMD/misc pipelines
L/S: 2x 128b load/cycle AND 1x 128b store/cycle
Execution: 2 threads per FPU, with the ability to fill "bubbles" in one thread with FP instructions from the other thread - at reduced area and power cost.

So even without using FMAC instructions, there is an inherent advantage.
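
(As a rough single-threaded analogue of that bubble filling - purely an illustrative sketch, not BD-specific code: one long dependent chain of FP adds has to wait out the add latency every iteration, while two independent accumulators give the FPU work to do during those wait cycles; on a BD module the second thread plays the role of the second accumulator.)

/* Hedged illustration of latency hiding: the first loop is one dependency
   chain (each add waits for the previous result, leaving pipeline
   "bubbles"), the second performs the same number of adds split across two
   independent chains that can overlap in the pipeline. */
#include <stdio.h>

#define N 100000000L

int main(void) {
    double a = 0.0, b = 0.0;

    /* Single dependent chain: N adds, each stalled on the previous one. */
    for (long i = 0; i < N; i++)
        a += 1e-9;

    /* Two independent chains: still N adds in total, but the two
       accumulators can fill each other's latency bubbles. */
    for (long i = 0; i < N; i += 2) {
        a += 1e-9;
        b += 1e-9;
    }

    printf("%f\n", a + b);  /* keep the results live so the loops aren't removed */
    return 0;
}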
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
I never really looked at the tech talk about Phenom when it was released, but the Bulldozer talk all sounds good.

I like that they have a blog with info on the design and design decisions, where they explain why they went one way instead of another, etc. And the logic they use to argue for their design decisions all sounds good when they present it.

I really think AMD pulled a rabbit out of a hat with this one.
 

hamunaptra

Senior member
May 24, 2005
929
0
71
Bulldozer's FPU brings FMA instructions. The previous Stars-core FPU could do a multiply or an add, with one unit for each. FMA allows one unit to do both the multiply and the add in a single cycle.

In addition to the other parts of the architecture that bring further improvements, it's "easier" to gain more in FP code than in integer.

So say the overall gain of going from 12 cores to 16 cores is x%. Enhancing the FPU would be on top of that x% gain.

WHOA, ok so... yeah, now I see how it's drastically improved. The new FPU can do both a multiply and an add in a single cycle.... Is this per 128b pipe, or is this taking into account both pipes? Because 1 pipe can't do a multiply and an add per cycle, right?
Unless it's written in FMA4 or whatever?


And to Dresdenboy... thanks for further explaining it. So it looks like the FPU itself is capable of more LD/ST per clock cycle, meaning it can get data in and out faster than the Stars core, right?


So with the above, I can now understand how the FPU has increased throughput / become more efficient.
 

JFAMD

Senior member
May 16, 2009
565
0
0
FMA is not a multiply and an add, it is a single operation that does a multiply and an add together.

Multiply and add (i.e. a+b and d*c) are two different operations and would be handled as such by both.

FMA is a+b*c.

We can do that in one cycle. SB will do a+b in one cycle, then take that result and multiply it by c.

SB also has an FADD and an FMUL pipe, so it could do an FADD and an FMUL in the same cycle. We have FMACs that can do the same. But the 2 FMACs also allow us to do 2 FADDs or 2 FMULs in the same cycle. For instance, if your queue has 10 FADDs in a row, that would be 10 cycles on SB and 5 cycles on BD.

To do an FMA operation you do have to compile for FMA4; but to take advantage of running 2 FADDs or 2 FMULs you don't need anything special.
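
(A hedged sketch of what "compile for FMA4" looks like in practice - illustrative only, not AMD sample code: GCC exposes the FMA4 form as _mm_macc_ps via <x86intrin.h> when built with -mfma4, while the plain adds and multiplies below need no special flags to spread across the two FMAC pipes.)

/* Hedged sketch, assuming GCC with -mfma4 on an FMA4-capable CPU:
   _mm_macc_ps(a, b, c) computes a*b + c as one fused op per 128-bit
   vector; _mm_add_ps and _mm_mul_ps are the ordinary separate ops. */
#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set1_ps(1.5f);
    __m128 b = _mm_set1_ps(2.0f);
    __m128 c = _mm_set1_ps(0.25f);

    __m128 fused = _mm_macc_ps(a, b, c);   /* fused multiply-add (FMA4) */
    __m128 sum   = _mm_add_ps(a, c);       /* plain FADD */
    __m128 prod  = _mm_mul_ps(a, b);       /* plain FMUL */

    float out[4];
    _mm_storeu_ps(out, fused);
    printf("fused[0] = %f\n", out[0]);
    _mm_storeu_ps(out, sum);
    printf("sum[0]   = %f\n", out[0]);
    _mm_storeu_ps(out, prod);
    printf("prod[0]  = %f\n", out[0]);
    return 0;
}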
 

hamunaptra

Senior member
May 24, 2005
929
0
71
OK, awesome, BUT I see a limitation which still doesn't make sense.
BD can do 2 FADDs or 2 FMULs per Flex FP module (either combination) in the same clock, one per pipe. But it can only do one in each pipe per clock, right? Like, 1 128b pipe can't handle more than one FADD or FMUL in a clock, right?

So meaning... if there are 4 threads on a 2-module BD (1 per core) and 4 threads on an i2600.

It would take the same number of cycles for 4 FADDs on each, wouldn't it?

Or are you indeed saying each 128-bit FMAC pipe can do 2 FADDs or 2 FMULs per clock, but not a mix? (2x as much as SB)
 

gdansk

Diamond Member
Feb 8, 2011
4,156
6,913
136
SB has a PEMDAS bug?
You know what he meant... SB will do b*c in one cycle then add a in the next.

Anyway, will Haswell be Intel's first microarchitecture with FMA3? Or are they planning it for Ivy Bridge now?
 
Last edited: