I think in reality what would happen is that the two data streams would have to "take turns" using the FPU, in a sense. It won't be like SMT in that there will not be a performance boost of any kind. If anything it will be detrimental.
That's pretty much what SMT is. Compared to two independent cores, there isn't any performance boost, but compared to one core there IS an increase in throughput. Keep in mind that a Bulldozer module has about as much floating-point capability as a Sandy Bridge core (both in 256-bit mode).
The clever bit is that there really isn't any 256-bit AVX code floating around in the real world, so instead of having a single FPU that handles 256-bit or 128-bit (or narrower) code (like Sandy Bridge), AMD put two 128-bit FPUs in a module and allows them to be ganged together for 256-bit instructions.
So when we have two threads, each using the FPU in 128 bit mode, they can execute at the same time, but if one of them is executing 256-bit code, they need to resort to SMT -- which will decrease performance.
In other words, for integer ops a module looks like two cores, for FP 128 bit and less a module looks like two cores, but for FP 256 bit a module looks like one core.
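Here's that mental model as a rough Python sketch. It's a toy under my own assumptions, not how the hardware actually schedules anything: I'm pretending each 128-bit pipe retires one op per cycle, that a 256-bit op ties up both pipes, and that threads simply take turns when they can't both fit. The names (`PIPES_PER_MODULE`, `pipes_needed`, `cycles_for_two_threads`) are made up for illustration.

```python
# Toy model of a two-thread module with two 128-bit FP pipes.
# Assumption: one op per pipe per cycle, no latency or pipelining effects.
PIPES_PER_MODULE = 2


def pipes_needed(width_bits: int) -> int:
    """Number of 128-bit pipes one FP op of the given width occupies."""
    if width_bits <= 128:
        return 1
    if width_bits == 256:
        return 2  # the two pipes are ganged together
    raise ValueError(f"unsupported FP width: {width_bits}")


def cycles_for_two_threads(width_a: int, width_b: int, ops_per_thread: int) -> int:
    """Cycles for two threads to each retire ops_per_thread FP ops."""
    if pipes_needed(width_a) + pipes_needed(width_b) <= PIPES_PER_MODULE:
        # Both threads fit side by side: the module looks like two cores.
        return ops_per_thread
    # Otherwise the threads take turns: the module looks like one core.
    return 2 * ops_per_thread


if __name__ == "__main__":
    for a, b in [(128, 128), (256, 128), (256, 256)]:
        print(a, b, cycles_for_two_threads(a, b, 100))
```

In this model two 128-bit threads finish in the same number of cycles as one thread would alone, while any mix involving 256-bit ops takes twice as long, which is the "two cores vs. one core" distinction above.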
Please someone correct me if I am misunderstanding how the Flex-FP works. :biggrin:
To clarify, SickBeast, I'm not disagreeing with you (I think everything you said was correct). I'm just viewing the module as less than two full-blown cores capable of 256-bit FP instructions, which is why they have to resort to something at least similar to SMT for FP. I have a feeling that SMT for FP will be more effective than SMT for integer+FP, at least for FP instructions that wouldn't benefit too much from GPGPU.