Indeed, but what's your point? Physical register files are more about reducing power consumption than anything else. How does it relate to the discussion of increasing throughput?
What I mean is that even if we can't improve it, we should think of ways not to reduce it much...
Why would Intel reduce it in any way? They've got a good architecture to build on. I'm not expecting them to radically change anything that would jeopardize IPC. They are more likely to make incremental changes, like extending macro-op fusion to create 3-operand instructions.
Isn't Intel's AVX unit made of two 128-bit units, which means legacy code can also be supported?
Yes, there are two 128-bit "lanes" which perform the same operation, so SSE instructions use only the lower half.
But as TuxDave already indicated, that's not what I meant. Not having two FMA units will result in no performance gain for older code which only contains MUL and ADD instructions (and for numerical consistency reasons even new code may opt not to use FMA since it rounds the result differently). So there's good incentive for Intel to promote not just one unit to FMA, but both.
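To make the rounding point concrete, here is a small Python sketch (not from the thread). It emulates a fused multiply-add with exact rational arithmetic so the result is rounded only once, and compares that with a separate MUL followed by ADD, which rounds twice. The helper names and the input values are hypothetical, picked only for illustration.

```python
from fractions import Fraction

def emulated_fma(a, b, c):
    """Hypothetical helper: compute a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def separate_mul_add(a, b, c):
    """MUL followed by ADD: the product is rounded before the addition."""
    return a * b + c

# Inputs picked so the exact product, 1 - 2**-54, is not representable as a double.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

fused = emulated_fma(a, b, c)         # the exact result -2**-54 survives
separate = separate_mul_add(a, b, c)  # the product rounds to 1.0, so the sum is 0.0

print(fused, separate)
```

The two results differ even though both are "correct" under their own rounding rules, which is why code that needs bit-identical output across machines may avoid FMA.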
Also note that even with one FMA unit and 256-bit integer operations they are forced to double the cache bandwidth(s). Without it there would hardly be any gains. But it would be a shame to still only partially take advantage of that extra bandwidth with a crippled floating-point implementation. I really don't see why on a 22 nm process they would hold back and make damaging compromises like that.
Two FMA units simply result in the best cost/gain ratio.
BTW, there is another opinion:
"Anyway my opinion is that the optimal design choice is probably FMA+ADD. This is not just for power/cost reasons but also performance: a FMA unit will have very similar latency for MUL but significantly higher latency for ADD than a standalone unit. AMD had to compromise ADD latency on Bulldozer which doesn't seem ideal to me at this point in time."
Yes, FMA+ADD is probably a better alternative than FMA+MUL. I've done a bit of research and found some interesting data in Figure 3 (a) and Table 1 of this paper:
Latency Sensitive FMA Design. The average instruction mix contains 10.7% FMA, 12.5% ADD and 7.7% MUL. So if 100 instructions followed this distribution, it would take FMA+FMA units 15.45 cycles, FMA+ADD units 18.4 cycles, and FMA+MUL units 23.2 cycles to execute the floating-point instructions. Note that this assumes there are no dependencies between them, which is obviously wrong. In practice, though, with dependencies FMA+FMA units would perform better than the alternatives because there are more combinations of instructions that can execute simultaneously. Furthermore, this paper shows it's highly application-specific. For instance 173.applu, which is a physics application, would theoretically execute the floating-point instructions 50% faster on FMA+MUL versus FMA+ADD, and about the same on FMA+FMA (again, only in theory). FMA+FMA also runs ADD-heavy cases optimally while FMA+MUL would not. So whichever asymmetric design you choose, a significant number of applications won't benefit.
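For anyone who wants to check the cycle counts above, here is a rough Python sketch of the arithmetic. It assumes what the estimate assumes: no dependencies, one instruction issued per unit per cycle, and work that can be divided freely between units. It further assumes that an FMA unit can also execute plain MUL and ADD. The names CAPS, MIX and ideal_cycles are made up for this illustration.

```python
# Assumed capabilities: an FMA unit also handles standalone MUL and ADD.
CAPS = {
    "FMA": {"FMA", "MUL", "ADD"},
    "ADD": {"ADD"},
    "MUL": {"MUL"},
}

# Average mix per 100 instructions, as quoted above.
MIX = {"FMA": 10.7, "ADD": 12.5, "MUL": 7.7}

def ideal_cycles(mix, unit_a, unit_b):
    """Dependency-free lower bound on cycles for a pair of execution units."""
    only_a = only_b = shared = 0.0
    for op, count in mix.items():
        on_a = op in CAPS[unit_a]
        on_b = op in CAPS[unit_b]
        if on_a and on_b:
            shared += count      # can be balanced across both units
        elif on_a:
            only_a += count      # forced onto unit A
        elif on_b:
            only_b += count      # forced onto unit B
        else:
            raise ValueError(f"no unit can execute {op}")
    total = only_a + only_b + shared
    # The busier unit sets the time; perfect balance is total / 2.
    return max(only_a, only_b, total / 2)

for pair in [("FMA", "FMA"), ("FMA", "ADD"), ("FMA", "MUL")]:
    print("+".join(pair), f"{ideal_cycles(MIX, *pair):.2f}")
```

With the quoted mix this gives about 15.45, 18.40 and 23.20 cycles. In the asymmetric cases the FMA unit is the bottleneck because every FMA plus one of the other two instruction types is forced onto it.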
But here's the most perplexing result of all: On a MUL+ADD configuration the average instruction mix would ideally execute in 17.85 cycles. That's faster than the FMA+ADD configuration! In other words, more hardware is leading to lower performance. To understand this you need to realize that FMA instructions can't be split between the two units; you're forcing them to use a single unit, the same unit responsible for executing MUL, leaving the ADD unit idle for part of the time. So the only guarantee for higher performance is to have two FMA units.
Another vital result of this paper is their comparison of a "cascaded" FMA implementation, where executing an ADD takes a shortcut so it doesn't suffer from the high latency. The area and power consumption are practically the same, while total performance improves by 4-6% over a non-cascaded implementation. So higher latency is no argument against replacing an ADD unit with an FMA unit!
I rest my case. Haswell has to have dual 256-bit FMA units.