Whoa, so wait, what's your source for the last 2 pipes being SSE units? And by SSE, do you mean all the SSE instruction sets (1 through 4, or whatever, I don't know all the versions), or is it just an SSE2-level set of pipes?
And as far as the FMAC thing goes... OK, so my original assumption was correct. Each 128-bit pipe can only do one FMUL or one FADD per clock cycle, unless the code is compiled using the newer FMA4 instruction set, in which case it can do both in the same clock as a fused multiply-accumulate.
So I agree that, like you said, on paper the FPU has pretty much the same throughput capability as the Stars FPU, but in practice it achieves more because of the enhancements to everything around the pipelines, such as the schedulers, the caches, and the load-store unit (LSU).
So in essence my original assumption was right: everything around the FPU has been made more efficient, allowing the FPU to do more useful work per clock (keeping the pipes fuller, with fewer bubbles).
If those extra 2 pipes are indeed 128-bit SSE integer pipes, how does that allow the FPU to do 4x 128-bit integer ops per clock? The other 2 FMAC pipes only handle FP code, right? They don't do integer code... or can SSE run integer code through an FPU pipe?
Or is it that SSE covers both FP and integer code, and the unit can run up to 2 FP instructions on the FMAC pipes while running 2 integer instructions on the 2 "FMISC" pipes... is that what you meant?
Here is my top-level understanding of the changes made in Bulldozer relative to STARS:
The front end has been completely overhauled, including the branch prediction, which is probably the most improved part of this architecture. (Branch prediction was a weakness of the STARS architecture, so how much it has improved will have a big impact on Bulldozer's performance, especially since the new architecture has deeper pipelines.) The branch target buffer now uses a two-level hierarchy, just like Intel does in Nehalem and Sandy Bridge. Plus, a mispredicted branch will no longer corrupt the entire stack, which means that the penalties for a misprediction are far smaller than in the STARS architecture. (Nehalem also has this feature, so it brings Bulldozer to parity with Nehalem with respect to branch mispredictions.)
Decoding has also improved, though not nearly as much as fetch. Bulldozer can now decode up to four instructions per cycle (vs. 3 for Istanbul). This brings Bulldozer to parity with Nehalem, which can also decode four instructions per cycle. Bulldozer also brings branch fusion to AMD, a feature Intel introduced with Core 2 Duo: certain instruction pairs (typically a compare and the conditional branch that follows it) are decoded together as one operation, saving decode slots. Again, this seems to bring Bulldozer into parity with Nehalem (although this is murkier, since both architectures have restrictions on which instructions can fuse, and since Intel has more experience with the feature they likely have a more robust implementation).
Bulldozer can now retire up to 4 macro-ops per cycle, up from 3 in the STARS architecture. It is difficult for me to compare the out-of-order engines of STARS and Bulldozer, as they are so dissimilar. I can say that a lot more has changed than just being able to retire 33% more instructions per cycle: the biggest difference is the move from dedicated lanes with dedicated ALUs and AGUs to a shared, unified approach.
Another major change is in the memory subsystem. AMD moved away from the two-level load-store queue (where different functions were performed in each level) and adopted a simple 40-entry load queue with a 24-entry store queue. This actually increases in-flight memory operations by 33% over STARS, but still keeps capacity about 20% below Nehalem. The new memory subsystem also has an out-of-order pipeline, with a predictor that determines which loads can safely pass stores (STARS had a *mostly* in-order memory pipeline). This brings Bulldozer to parity with Nehalem, as Intel has used this technique since Core 2 Duo. Another change is that the L1 cache contents are now duplicated in the L2 cache (which Intel has been doing for as long as I can remember), although the L3 cache is still exclusive.
Bulldozer now implements true power gating, although unlike Intel, which gates each core, AMD gates at the module level. This shouldn't really affect IPC, but it might affect the maximum frequency, so it is worth bringing up when discussing changes to performance. The ability to completely shut off modules should allow higher turbo frequencies than we saw in Thuban, but we won't know what they are until we see some reviews.
Well, those are the main differences I know of. Add to that the fact that this processor was actually designed for a 32nm process, versus the 65nm/45nm processes STARS shipped on, and you should see additional efficiencies. I expect a good IPC improvement along with a large clockspeed boost, although I can't say how much, and I am really looking more for parity with Nehalem-based processors than with Sandy Bridge-based processors.
References:
Butler, Mike. "Bulldozer": A New Approach to Multithreaded Compute Performance. Hot Chips XXII, August 2010.
Also, since you are mostly interested in floating point performance, here is a good write-up of the floating point changes made in Bulldozer:
http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=7
Bulldozer's floating point cluster does away with the notion of dedicated schedulers and lanes and uses a more flexible unified approach. The four pipelines (P0-P3) are fed from a shared 60-entry scheduler. This is roughly 50% larger than Istanbul's reservation stations (42 entries) and almost double Barcelona's (36 entries). The heart of the cluster is a pair of 128-bit wide floating point multiply-accumulate (FMAC) units on P0 and P1. Each FMAC unit also handles division and square root operations with variable latency.
The two FMAC units also execute plain FADD and FMUL instructions, although this obviously leaves performance on the table compared to using the combined FMAC operation. The first pipeline includes a 128-bit integer multiply-accumulate unit, primarily used for instructions in AMD's XOP extension. Additionally, the hardware for converting between integer and floating point data types is tied to pipeline 0. Pipeline 1 also serves double duty and contains the crossbar hardware (XBAR) used for permutations, shifts, shuffles, packing and unpacking.
Another question regarding Bulldozer is how 256-bit AVX instructions are handled by the execution units. One option is to treat each half as a totally independent macro-op, as the K8 did for 128-bit SSE, and let the schedulers sort everything out. However, it is possible that Bulldozer's two symmetric FMAC units could be ganged together to execute both halves of an AVX instruction simultaneously to reduce latency.
The other half of the floating point cluster's execution units actually has little to do with floating point data at all. Bulldozer has a pair of largely symmetric 128-bit integer SIMD ALUs (P2 and P3) that execute arithmetic and logical operations. P3 also includes the store unit (STO) for the floating point cluster (this was called the FMISC unit in Istanbul). Despite the name, it does not actually perform stores; rather, it passes the data for the store to the load-store unit, acting as a conduit to the actual store pipeline. In a similar fashion, there is a small floating point load buffer (not shown in the linked article's diagram) which acts as an analogous conduit for loads between the load-store units and the FP cluster. The FP cluster can execute two 128-bit loads per cycle, and one of the purposes of the FP load buffer is to smooth the bandwidth between the two cores. For example, if the two cores simultaneously send data for four 128-bit loads to the FP cluster, the buffer would release 256 bits of data in the first cycle, and then another 256 bits in the next cycle.