Based on what I read about HPC workloads, it seems that non-destructive operations were something to desire.
This is a common misconception about FMA4. Look at the actual FMA3 code below: extra moves are generally not required, and the occasional extra move is handled at the rename stage anyway on modern CPUs.
Code:
vfmsub231ps ymm9, ymm10, ymm14 ;470.23
vfmsub231ps ymm8, ymm15, ymm12 ;470.23
vmulps ymm4, ymm4, ymm6 ;471.7
vfmsub231ps ymm7, ymm11, ymm13 ;470.23
vfmadd231ps ymm14, ymm11, ymm5 ;471.7
vfmadd231ps ymm13, ymm12, ymm5 ;471.7
vfmadd213ps ymm5, ymm10, ymm15 ;471.7
vfmadd231ps ymm14, ymm8, ymm4 ;471.7
vfmadd231ps ymm13, ymm9, ymm4 ;471.7
vfmadd213ps ymm4, ymm7, ymm5 ;471.7
vmulps ymm9, ymm14, ymm14 ;471.7
vfmadd231ps ymm9, ymm13, ymm13 ;471.7
add r12d, 8 ;466.30
vfmadd231ps ymm9, ymm4, ymm4 ;471.7
FMA4 would just be more readable in assembly, with no real performance advantage.
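For anyone not used to reading the 213/231 suffixes, here is a rough scalar model of the operand forms that appear in the listing above. It is only a sketch: it models a single lane with ordinary Python floats (so the single rounding of a real FMA is not reproduced), and the helper names and the `sum_of_squares` function are made up for illustration. The point it tries to show is that the compiler can choose the form whose overwritten operand is a value it no longer needs, as in the last three arithmetic instructions of the listing, which accumulate into ymm9 without any extra move.

```python
# Scalar, single-lane model of the FMA3 forms used in the listing.
# Registers are plain dictionary entries; the names are only labels.

def vmulps(r, dst, a, b):
    # dst = a * b
    r[dst] = r[a] * r[b]

def vfmadd231ps(r, dst, a, b):
    # dst = a * b + dst  (the addend is the operand that gets overwritten)
    r[dst] = r[a] * r[b] + r[dst]

def vfmadd213ps(r, dst, a, b):
    # dst = a * dst + b  (one multiplicand gets overwritten)
    r[dst] = r[a] * r[dst] + r[b]

def vfmsub231ps(r, dst, a, b):
    # dst = a * b - dst
    r[dst] = r[a] * r[b] - r[dst]

def sum_of_squares(x, y, z):
    # Hypothetical helper mirroring the tail of the listing:
    #   vmulps      ymm9, ymm14, ymm14
    #   vfmadd231ps ymm9, ymm13, ymm13
    #   vfmadd231ps ymm9, ymm4,  ymm4
    r = {"ymm14": x, "ymm13": y, "ymm4": z, "ymm9": 0.0}
    vmulps(r, "ymm9", "ymm14", "ymm14")
    vfmadd231ps(r, "ymm9", "ymm13", "ymm13")
    vfmadd231ps(r, "ymm9", "ymm4", "ymm4")
    return r["ymm9"]
```

With FMA4 the same chain would name a separate destination register on each instruction, which reads more naturally but, in a case like this, does not remove any instruction.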