Whoa, so wait, what's your source for the last 2 pipes being SSE units? And by SSE, do you mean all the SSE instruction sets (1 through 4, or whatever, I don't know all the versions), or is it just an SSE2-level set of pipes?
And as far as the FMAC thing goes... OK, so my original assumption was correct. Each 128-bit pipe can only do one FMUL or one FADD per clock cycle, unless the code is compiled using the newer FMA4 instruction set, in which case it can do both in the same clock as a fused multiply-accumulate.
So I agree that, like you said, on paper the FPU has pretty much the same throughput capability as the Stars FPU, but in practice it achieves more because of the enhancements to everything around the pipelines, such as the schedulers, the caches, and the load-store unit (LSU).
So in essence my original assumption was right: everything around the FPU has been made more efficient, allowing the FPU to do more useful work per clock (keeping the pipes fuller, with fewer bubbles).
If those extra 2 pipes are indeed 128-bit SSE integer pipes, how does that allow the FPU to do 4x 128-bit integer ops per clock? The other 2 FMAC pipes only handle FP code, right? They don't do integer code... or can SSE run integer code through an FPU pipe?
Or is it that SSE covers both FP and integer code, and the unit can run up to 2 FP instructions on the FMAC pipes while running 2 integer instructions on the 2 "FMISC" pipes... is that what you meant?
Here is my top-level understanding of the changes made in Bulldozer relative to STARS:
The front end has been completely overhauled, including the branch prediction, which is probably the most improved part of this architecture. (Branch prediction was a weakness of the STARS architecture, so how much it has improved will have a big impact on Bulldozer's performance, especially since the new architecture has deeper pipelines.) The branch target buffer now uses a two-level hierarchy, just like Intel does in Nehalem and Sandy Bridge. Plus, a mispredicted branch will no longer corrupt the entire stack, which means that the penalties for a misprediction are far smaller than in the STARS architecture. (Nehalem also has this feature, so it brings Bulldozer to parity with Nehalem with respect to branch mispredictions.)
Decoding has also improved, though not nearly as much as fetch. Bulldozer can now decode up to four instructions per cycle (vs. 3 for Istanbul). This brings Bulldozer to parity with Nehalem, which can also decode four instructions per cycle. Bulldozer also brings branch fusion to AMD, a feature Intel introduced with Core 2 Duo: certain instruction pairs (typically a compare and the conditional branch that follows it) are decoded together as one operation, saving decode slots. Again, this seems to bring Bulldozer into parity with Nehalem (although this is murkier, since both architectures have restrictions on which instructions can fuse, and since Intel has more experience with the feature they likely have a more robust implementation).
Bulldozer can now retire up to 4 macro-ops per cycle, up from 3 in the STARS architecture. It is difficult for me to compare the out-of-order engines of STARS and Bulldozer, as they are so dissimilar. I can say that a lot more has changed than just being able to retire 33% more instructions per cycle: the biggest difference is the move from dedicated lanes with dedicated ALUs and AGUs to a shared, unified approach.
Another major change is in the memory subsystem. AMD moved away from the two-level load-store queue (where different functions were performed in each level) and adopted a simple 40-entry load queue with a 24-entry store queue. This actually increases in-flight memory operations by 33% over STARS, but still keeps capacity about 20% below Nehalem. The new memory subsystem also has an out-of-order pipeline, with a predictor that determines which loads can safely pass stores (STARS had a *mostly* in-order memory pipeline). This brings Bulldozer to parity with Nehalem, as Intel has used this technique since Core 2 Duo. Another change is that the L1 cache contents are now duplicated in the L2 cache (which Intel has been doing for as long as I can remember), although the L3 cache is still exclusive.
Bulldozer now implements true power gating, although unlike Intel, which gates each core, AMD gates at the module level. This shouldn't really affect IPC, but it might affect the maximum frequency, so it is worth bringing up when discussing changes to performance. The ability to completely shut off modules should allow higher turbo frequencies than we saw in Thuban, but we won't know what they are until we see some reviews.
Well, those are the main differences I know of. Add to that the fact that this processor was actually designed for a 32nm process, versus the 65nm/45nm processes STARS shipped on, and you should see additional efficiencies. I expect a good IPC improvement along with a large clockspeed boost, although I can't say how much, and I am really looking more for parity with Nehalem-based processors than with Sandy Bridge-based processors.
References:
Butler, Mike. "Bulldozer": A New Approach to Multithreaded Compute Performance. Hot Chips XXII, August 2010.
Also, since you are mostly interested in floating point performance, here is a good write-up of the floating point changes made in Bulldozer:
http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=7
Bulldozer's floating point cluster does away with the notion of dedicated schedulers and lanes and uses a more flexible unified approach. The four pipelines (P0-P3) are fed from a shared 60-entry scheduler. This is roughly 50% larger than Istanbul's reservation stations (42 entries) and almost double Barcelona's (36 entries). The heart of the cluster is a pair of 128-bit wide floating point multiply-accumulate (FMAC) units on P0 and P1. Each FMAC unit also handles division and square root operations with variable latency.
The two FMAC units also execute plain FADD and FMUL instructions, although this obviously leaves performance on the table compared to using the combined FMAC operation. The first pipeline includes a 128-bit integer multiply-accumulate unit, primarily used for instructions in AMD's XOP extension. Additionally, the hardware for converting between integer and floating point data types is tied to pipeline 0. Pipeline 1 also serves double duty and contains the crossbar hardware (XBAR) used for permutations, shifts, shuffles, packing and unpacking.
Another question regarding Bulldozer is how 256-bit AVX instructions are handled by the execution units. One option is to treat each half as a totally independent macro-op, as the K8 did for 128-bit SSE, and let the schedulers sort everything out. However, it is possible that Bulldozer's two symmetric FMAC units could be ganged together to execute both halves of an AVX instruction simultaneously to reduce latency.
The other half of the floating point cluster's execution units actually has little to do with floating point data at all. Bulldozer has a pair of largely symmetric 128-bit integer SIMD ALUs (P2 and P3) that execute arithmetic and logical operations. P3 also includes the store unit (STO) for the floating point cluster (this was called the FMISC unit in Istanbul). Despite the name, it does not actually perform stores; rather, it passes the data for the store to the load-store unit, acting as a conduit to the actual store pipeline. In a similar fashion, there is a small floating point load buffer (not shown in the linked article's diagram) which acts as an analogous conduit for loads between the load-store units and the FP cluster. The FP cluster can execute two 128-bit loads per cycle, and one of the purposes of the FP load buffer is to smooth the bandwidth between the two cores. For example, if the two cores simultaneously send data for four 128-bit loads to the FP cluster, the buffer would release 256 bits of data in the first cycle, and then another 256 bits in the next cycle.