I am getting 250 GFLOPS (FP32) doing vector FMA on the M4's SME unit. The hardware is, in principle, capable of 2 TFLOPS, but there is not enough register file bandwidth, so you are limited to 2 of the 16 available 512-bit SIMD slices. If Apple cares about vector performance, they...
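
As a quick sanity check on the numbers above, here is a minimal sketch of the arithmetic. It assumes throughput scales linearly with the number of usable 512-bit slices, which is a simplification, and the function name `usable_gflops` is made up for illustration:

```python
def usable_gflops(peak_gflops: float, usable_slices: int, total_slices: int) -> float:
    """Scale peak throughput by the fraction of SIMD slices that can be kept fed.

    Assumes linear scaling with slice count (an assumption, not a measured law).
    """
    return peak_gflops * usable_slices / total_slices


# Figures from the comment: ~2 TFLOPS peak, 2 usable slices out of 16.
estimate = usable_gflops(2000.0, 2, 16)
print(f"{estimate:.0f} GFLOPS")  # prints "250 GFLOPS"
```

Under that assumption, 2/16 of the 2 TFLOPS peak comes out to the 250 GFLOPS I am measuring.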