Isn't this just catching up with the AI accelerators we already have in modern ARM SoCs (like Apple's Bionic)?
(But using CPU instructions instead of a separate accelerator block on the die.)
Uh, Apple's A13 already has something apparently very close to this; it's even called AMX (Apple Matrix Extensions). We know very little about it beyond Apple's claim of "one trillion 8-bit operations per second".
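For a sense of scale, here's a back-of-envelope sketch of what that claim would imply per cycle. The clock frequency and the convention of counting a multiply-accumulate as two ops are my assumptions, not anything Apple has stated:

```python
# Back-of-envelope: what would "one trillion 8-bit ops/s" imply per cycle?
# ASSUMPTIONS (mine, not Apple's): A13 big-core clock of ~2.66 GHz, and the
# usual marketing convention of counting a multiply-accumulate (MAC) as 2 ops.

CLAIMED_OPS_PER_SEC = 1e12   # "one trillion 8-bit operations per second"
CLOCK_HZ = 2.66e9            # assumed A13 Lightning core clock
OPS_PER_MAC = 2              # multiply + add

ops_per_cycle = CLAIMED_OPS_PER_SEC / CLOCK_HZ
macs_per_cycle = ops_per_cycle / OPS_PER_MAC

print(f"~{ops_per_cycle:.0f} ops/cycle, i.e. ~{macs_per_cycle:.0f} int8 MACs/cycle")
# ~376 ops/cycle, ~188 MACs/cycle: in the right ballpark for a modest
# int8 outer-product/tile engine retiring a tile every cycle or two,
# rather than a huge standalone accelerator.
```

If those assumptions are roughly right, the claimed throughput is consistent with a matrix unit sitting inside the CPU pipeline, which fits Apple's "part of the CPU" description below.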
It's interesting to note that at WWDC we were given no further details about this, even as an aside or as part of a talk on Accelerate (Apple's general framework for accelerating various types of numerical code). On the other hand, it's also interesting to note that the effort in LLVM to define a native matrix type, and to optimize various kinds of code to use that matrix type (which can then be mapped onto the matrix units of various target CPUs or GPUs), is being led by Apple folk...
Quite what AMX is remains unclear. Apple described it as part of the CPU (i.e. NOT an accelerator like the NPU). It could be a set of proprietary instructions; alternatively it could be an implementation of the matrix-multiply instructions added in ARMv8.6 -- that's my bet.
(Arm has a blog post on community.arm.com giving a high-level overview of the changes being introduced in Armv8.6-A.)
(If this is ARMv8.6, it's even possible that Apple has been withholding details as part of an agreement with ARM while ARM finalizes every aspect of 8.6 and its documentation. Presumably the final spec will mostly match what Apple built, but there may be parts -- the 32-bit behavior, or interactions with the OS, hypervisor, debugging, and performance registers -- where Apple doesn't much care about ARM's details and has already done things its own way.)
You could dismiss this as a failure on Apple's part, but I'd describe it more as "it's very difficult to get all of HW, compiler, client SW, etc. absolutely synchronized".
It would be very interesting to see the performance of various machine-learning benchmarks (both inference and training) on an A12 compared to an A13 to see what the differences are... (The NPU on the A13 is only about 15-20% faster than on the A12, and it's probably optimized for inference. The new A13 AMX blocks are apparently aimed at training, and it's possible that Apple hasn't even hooked them up to its frameworks yet.
i.e. maybe the big reveal, requiring also some OS support [context swapping] and new APIs, will come with iOS 14 in September?)
My guess is that, regardless of the details, these units will be visible and in use (certainly via Apple's frameworks, perhaps even as direct LLVM compile targets) by the time Intel ships its own AMX.