The problem is that small units scale better. Internally, the FMA units in Bulldozer through Steamroller are all 64-bit in size. Smaller units are more efficient and less power-hungry than bigger units.
A 4x128b FMA will have 8x double-precision (64b) FMA units. A 2x256b FMA will also have 8x double-precision FMA units. So 'scaling' looks pretty equal so far. You're not going to build a new IEEE extended-precision 256-bit FP format.
- A fully connected 4x 128-bit execution unit means you need to schedule 4 uops, and any of those uops can take data from the result of any of the FMAs (including itself).
- A fully connected 2x 256-bit execution unit means you need to schedule 2 uops, and any of those uops can take data from the result of any of the FMAs (including itself).
So you see, if your workloads can readily take advantage of 256-bit vectors, 2x256b scales better than 4x128b: you get the same number of 64-bit lanes with half as many uops to schedule and fewer units to wire together.
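To put rough numbers on the two bullets above, here's a toy Python model. It is only a back-of-the-envelope sketch, not a description of any real AMD design: the function name `fma_config` and its fields are made up for illustration, and it assumes "fully connected" means one forwarding path from every unit's result to every unit's input (itself included), ignoring how many source operands each uop has.

```python
def fma_config(units, width_bits, lane_bits=64):
    """Toy model of a fully connected FMA cluster (hypothetical helper)."""
    # Each unit is built from 64-bit lanes, as described in the post.
    dp_lanes = units * (width_bits // lane_bits)
    # Assumption: every unit's result can be forwarded to every unit,
    # including itself, so the path count grows as units squared.
    bypass_paths = units * units
    return {
        "uops_per_cycle": units,          # uops the scheduler must issue
        "dp_lanes": dp_lanes,             # 64-bit FMA lanes in total
        "bypass_paths": bypass_paths,     # result-to-input forwarding paths
        "bypass_bits": bypass_paths * width_bits,  # total forwarding width
    }

narrow = fma_config(units=4, width_bits=128)  # 4x128b
wide = fma_config(units=2, width_bits=256)    # 2x256b

print("4x128b:", narrow)
print("2x256b:", wide)
```

Under those assumptions both layouts come out at 8 double-precision lanes, but 4x128b needs 4 uops and 16 forwarding paths per cycle versus 2 uops and 4 paths for 2x256b. Even after accounting for the 256-bit paths being twice as wide, the total forwarding width in this model is half that of the 4x128b layout.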
15h is already 4-wide and 30h-4Fh made it 8-wide.
I believe if you're talking about Steamroller/Bulldozer, maybe you're thinking of it as 4-wide x 2 threads, not 8-wide x 1 thread.
The 2x2x128b FMA that Bulldozer has is a pretty nice concept for dealing with the AVX to AVX2 transition (or apps that can't even use AVX2), but it is definitely not logically equivalent to a 4x128b FMA.
Sorry to everyone else for veering way off topic. If you still disagree, I can take this to PM.