2x 256b = 2 x 128b or 2 x 256b; It isn't efficient or effective.
4x 128b = 4x 128b, 2x 256b; It is efficient and effective.
HPC applications and/or widely parallel applications are more worried about core scaling than internal FPU decode/retirement/data bus.
And widening the decode/retirement/data bus limits core scaling. The way I see it, moving from 2x128b to a "true" (see below) 4x128b design costs significantly more area (front end, retirement, even the bypass between the FMAs) and more power, and probably some frequency, compared with moving from 2x128b to 2x256b. So you can fit more FP units in a much narrower core... hence better core scaling.
2x 256b = 2 x 128b or 2 x 256b
4x 128b = 4x 128b, 2x 256b; It is efficient and effective
Edit: Ah I see what you mean. So what you're calling a 4x128b, you actually mean a reconfigurable machine that tries to get the best of both worlds. I would have two points to make:
1) If you want 4x 128b to act like a TRUE 4x 128b, you will pay significantly more area and power and lose frequency. That cost would hit the design even when all you really want is 2x256b. So if you really wanted to go for that reconfigurable option without tanking your whole design, you would have to make some tradeoffs similar to what Bulldozer did in how the execution units can get their data.
2) Now my question is, what front end would you put in to feed this CPU? If you say 4 wide, then for applications that can scale to 256b, you have excess decode width. If you say 2 wide, then you don't get full utilization of a 4x 128b. Flexibility comes with some overhead. I like reconfigurable stuff, but if you want maximum perf/power and hence efficiency, you have to start targeting specific workloads.
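To put rough numbers on the front-end question, here is a toy Python sketch. It is not a model of any real core: the function name `fp_throughput` and its parameters are made up for illustration, and it assumes every decoded op is an FP op, that a 256b op gangs two 128b units together, and that nothing else (ports, bypass, memory) limits issue.

```python
def fp_throughput(decode_width, num_units, unit_bits, op_bits):
    """Toy model of peak FP issue per cycle (hypothetical, for illustration).

    Assumes every decoded op is an FP op of op_bits, and that an op wider
    than one unit occupies op_bits // unit_bits units for that cycle.
    """
    units_per_op = max(1, op_bits // unit_bits)  # e.g. a 256b op uses two 128b units
    ops_by_units = num_units // units_per_op     # ops the FP units can accept
    ops_per_cycle = min(decode_width, ops_by_units)
    return {
        "ops_per_cycle": ops_per_cycle,
        "bits_per_cycle": ops_per_cycle * op_bits,
        "idle_decode_slots": decode_width - ops_per_cycle,
        "idle_units": num_units - ops_per_cycle * units_per_op,
    }


if __name__ == "__main__":
    # Reconfigurable 4x128b FPU fed by a 2-wide or a 4-wide front end.
    for decode_width in (2, 4):
        for op_bits in (128, 256):
            result = fp_throughput(decode_width, num_units=4,
                                   unit_bits=128, op_bits=op_bits)
            print(f"decode={decode_width} op={op_bits}b -> {result}")
```

Under those assumptions, the 4-wide front end leaves two decode slots idle on 256b code, and the 2-wide front end leaves two of the four 128b units idle on 128b code. Neither width fits both cases, which is the overhead of the flexible design.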