Isn't this just catching up with the AI accelerators we already have in modern ARM SoCs (like Apple's Bionic)?
(But using CPU instructions instead of a separate accelerator block on the die.)
Uh, Apple's A13 already has something apparently very close to this; it's even called AMX (Apple Matrix Extensions). We know very little about it beyond Apple's claim of "one trillion 8-bit operations per second".
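For a sense of scale, here's a back-of-envelope sketch of what that claim would imply per cycle. The clock frequency and the convention of counting a multiply-accumulate as two ops are my assumptions, not anything Apple has stated:

```python
# Back-of-envelope: what would "one trillion 8-bit ops/s" imply per cycle?
# ASSUMPTIONS (mine, not Apple's): A13 big-core clock of ~2.66 GHz, and the
# usual marketing convention of counting a multiply-accumulate (MAC) as 2 ops.

CLAIMED_OPS_PER_SEC = 1e12   # "one trillion 8-bit operations per second"
CLOCK_HZ = 2.66e9            # assumed A13 Lightning core clock
OPS_PER_MAC = 2              # multiply + add

ops_per_cycle = CLAIMED_OPS_PER_SEC / CLOCK_HZ
macs_per_cycle = ops_per_cycle / OPS_PER_MAC

print(f"~{ops_per_cycle:.0f} ops/cycle, i.e. ~{macs_per_cycle:.0f} int8 MACs/cycle")
# ~376 ops/cycle, ~188 MACs/cycle: in the right ballpark for a modest
# int8 outer-product/tile engine retiring a tile every cycle or two,
# rather than a huge standalone accelerator.
```

If those assumptions are roughly right, the claimed throughput is consistent with a matrix unit sitting inside the CPU pipeline, which fits Apple's "part of the CPU" description below.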
It's interesting to note that at WWDC we were given no further details about this, even as an aside or as part of a talk on Accelerate (Apple's general framework for accelerating various types of numerical code). On the other hand, it's also interesting to note that the effort in LLVM to define a native matrix type, and to optimize various kinds of code to use that matrix type (which can then be mapped onto the matrix units of various target CPUs or GPUs), is being led by Apple folk...
Quite what AMX is remains unclear. Apple described it as part of the CPU (i.e. NOT an accelerator like the NPU). It could be a set of proprietary instructions; alternatively it could be an implementation of the matrix-multiply instructions added in ARMv8.6 -- that's my bet.
(Arm has a blog post on community.arm.com giving a high-level overview of the changes being introduced in Armv8.6-A.)
(If this is ARMv8.6, it's even possible that Apple has been withholding details as part of an agreement with ARM while ARM finalizes every aspect of 8.6 and its documentation. Presumably the final spec will mostly match what Apple built, but there may be parts -- the 32-bit behavior, or interactions with the OS, hypervisor, debugging, and performance registers -- where Apple doesn't much care about ARM's details and has already done things its own way.)
You could dismiss this as a failure on Apple's part, but I'd describe it more as "it's very difficult to get all of HW, compiler, client SW, etc. absolutely synchronized".
It would be very interesting to see the performance of various machine-learning benchmarks (both inference and training) on an A12 compared to an A13 to see what the differences are... (The NPU on the A13 is only about 15-20% faster than on the A12, and it's probably optimized for inference. The new A13 AMX blocks are apparently aimed at training, and it's possible that Apple hasn't even hooked them up to its frameworks yet.
i.e. maybe the big reveal, requiring also some OS support [context swapping] and new APIs, will come with iOS 14 in September?)
My guess is that, regardless of the details, these units will be visible and in use (certainly via Apple's frameworks, perhaps even as direct LLVM compile targets) by the time Intel ships its own AMX.