Discussion RDNA4 + CDNA3 Architectures Thread

DisEnchantment · Mar 23, 2022

With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
AMD usually takes around three quarters to land support in LLVM and amdgpu. Since RDNA2, though, the window in which they push support for new devices has been much shorter, to prevent leaks.
Still, judging by the flurry of code in LLVM, this is a lot of commits. Maybe it is because the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up like Frontier's).

See here for the GFX940 specific commits

History for llvm/lib/Target/AMDGPU - llvm/llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies. - History for llvm/lib/Target/AMDGPU - llvm/llvm-project

github.com

Or Phoronix

More AMD "GFX940" Enablement Work Landing In LLVM - Phoronix

www.phoronix.com

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem that no host CPU capable of PCIe 5 would be available in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts, MI100/200/300 cadence is impressive.

Previous thread on CDNA2 and RDNA3 here

Question - Speculation: RDNA3 + CDNA2 Architectures Thread

Man I have been dying to make this one for a while now. First rumours for RDNA3 are here so new thread time! Just going to start off with this one for now: kopite7kimi on Twitter: "@VideoCardz Ah, I mean a simple mcm design with 10240 cores is not enough. Because the lift from RDNA2 to RDNA3...

forums.anandtech.com

randomhero · Dec 21, 2023

MI300 and next gen MIXXX just won two high profile contracts for supercomputers in Germany, Hunter and Herder. The latter is exascale.

blckgrffn · Dec 21, 2023

moinmoin said:
RDNA2 is XSX actually. PS5 is somewhere between RDNA1 and 2. Sony seems to be more eager to have its own input in the silicon (think Kraken) whereas Microsoft's console chips are closer to what AMD puts on the market as well.

Based on what I have read, it's definitely closer to RDNA2 than RDNA1. There is what, one feature discrepancy where Sony did it "their way": the standard RDNA2 "Mesh Shaders" used in the PC parts and the XSX versus the Sony implementation of "Primitive Shaders" used only in the PS5. And even then, it appears that Mesh Shaders are an abstraction built on Primitive Shaders. Mesh Shaders appear to be part of DX12U, and that's likely why we see this advertised on PC and XSX, whereas the PS5 presumably isn't using DX12U (ha).

The PS5 GPU is essentially a 6700 non-XT. Heck, it probably even has the ability to use mesh shaders, but it's not exposed via an API; you'd have to put it together on top of the Primitive Shaders, which the AMD drivers apparently do. That's likely what Alan Wake 2 did.

What other major differences are there?

Thread for reference:

Analysis - Game Dev - Hardware - AMD - Primitive Shaders vs Mesh Shaders

A pretty interesting interview with AMD over the future of gaming tech, and how Sony adopted the older Primitive Shaders and MS adopted the newer Mesh Shaders as it became the industry norm moving forward. While the XSX has Mesh Shaders, it can also utilise Primitive Shaders as well. Of...

www.neogaf.com

"Primitive shader is not the older version of Mesh Shaders. Primitive shader was proposed as the standard by AMD in 2017 while 2018 Nvidia proposed their implementation which Microsoft adopted in 2019 into DX12U in the form of Mesh Shaders. Primitive shaders still exist in AMD GPUs starting from Vega to RDNA 3.

On AMD GPUs, Primitive Shaders are what enable Mesh Shaders. How it functions depends on what API you are using it with. In DX12 it functions as Mesh Shaders, but it is the same Primitive Shaders in all AMD GPUs.

Mr. Wang
Certainly, Mesh Shader was adopted as standard in DirectX 12. However, the new geometry pipeline concept originally started with the concept of tidying up the complicated geometry pipeline, making it easier for game developers to use, and to make it easier to extract performance. In other words, it can be said that both AMD and NVIDIA had the same goal as the starting point of the idea. To put it bluntly, Primitive Shader and Mesh Shader have many similarities in terms of functionality, although there are differences in implementation.
So did AMD abandon the Primitive Shader? As for hardware, Primitive Shader still exists, and how to use Mesh Shader is realized with Primitive Shader, it corresponds to Mesh Shader with such an image.

Mr. Wang
Primitive Shader as hardware exists in everything from Radeon RX Vega to the latest RDNA 3-based GPU. When viewed from DirectX 12, Radeon GPU's Primitive Shader is designed to work as a Mesh Shader."

As for next gen implications, it will be interesting to see what's truly in the PS5 Pro. It's looking like a more custom hybrid of RDNA3 and RDNA4, pulling in some RT specific RDNA4 hardware on what otherwise appears to be a downclocked 7800XT in the dev kits. That will be significantly more custom than the GPU in the PS5, IMO. Since it might be coming out about the same time as RDNA4 is launching, it will be nice to have those bits being as advanced as possible.

I am also interested in whether there will be any IC. Based on the memory bandwidth numbers on the dev kits it doesn't seem like it, which is a bit of a mystery since it's such a great performance per watt uplift and these consoles seem pretty optimized on that front in other ways.

6700 --> 7800XT+ might not set anyone's hair on fire here, but it should really bring native 4K/30FPS/RT gaming to some sort of reality.

TESKATLIPOKA · Dec 21, 2023

Chips and Cheese

Article about CDNA3 based on available whitepapers.

DisEnchantment · Dec 21, 2023

[AMDGPU] Remove GDS and GWS for GFX12 by jayfoad · Pull Request #76148 · llvm/llvm-project

github.com

GDS finally removed in GFX12 (not in GFX11, as per rumors).
Significant changes in the ISA are incoming due to this alone. Now shaders can only export to RB+ or to Primitive Units. ROOE should finally work this time. It's mostly deactivated on GFX11 due to texture corruption.
The new cache-policy-dependent load/store is the other interesting change thus far.

branch_suggestion · Dec 22, 2023

DisEnchantment said:
[AMDGPU] Remove GDS and GWS for GFX12 by jayfoad · Pull Request #76148 · llvm/llvm-project

github.com

GDS finally removed in GFX12 (not in GFX11, as per rumors).
Significant changes in the ISA are incoming due to this alone. Now shaders can only export to RB+ or to Primitive Units. ROOE should finally work this time. It's mostly deactivated on GFX11 due to texture corruption.
The new cache-policy-dependent load/store is the other interesting change thus far.

This is one area where AMD/NV have different ideas, with regard to both client GPUs and DC accelerator GPUs. Some convergence and some divergence. AMD has gone for a very simple solution, with MALL being the point of coherence.

soresu · Dec 22, 2023

blckgrffn said:
Mesh Shaders appear to be part of DX12U

An equivalent function now exists in Vulkan too, albeit not part of any fixed Vulkan version increment as yet.

This coming January will be 2 years since the release of VK 1.3 so we may see some action on that front vis a vis DX12U equivalence in a fixed standard.

moinmoin · Dec 22, 2023

blckgrffn said:
Based on what I have read, it's definitely closer to RDNA2 than RDNA1.

It certainly is. I guess a better way to put the difference between Sony's and Microsoft's approach is that Sony appears to make itself much more involved in the development, sometimes making slightly different choices (like sticking with primitive shaders as you mentioned), whereas Microsoft gladly takes whatever the end result is (which "naturally" will be built around a DX implementation anyway).

What's telling is the GFX IDs, which are chronological. PS5 should be GFX1013, so it originally started as an RDNA1 implementation that then went through the development of all the newer, higher IDs. XSX has GFX1020 (based on RDNA2), so Microsoft likely picked a ready and done implementation instead.

DisEnchantment · Dec 22, 2023

branch_suggestion said:
This is one area where AMD/NV have different ideas, with regard to both client GPUs and DC accelerator GPUs. Some convergence and some divergence. AMD has gone for a very simple solution, with MALL being the point of coherence.

Device level coherence surely based on MALL (at least for MI300A) and system level coherence using IF/MALL.
But the regular Graphics pipeline relies on L1 and L2 crossbars mainly at SE level.
Removal of GDS is interesting because there are simply a lot of opcodes involving GDS.
There are new instructions for sync/barrier/fence in the patches though.

The dynamic VGPR allocation would be interesting if true.
VOPD hardly changed, same restrictions as GFX11. Was hoping for additional register banks for true dual issue, but that seems not to be the case.

Ajay · Dec 22, 2023

DisEnchantment said:
The dynamic VGPR allocation would be interesting if true.
VOPD hardly changed, same restrictions as GFX11. Was hoping for additional register banks for true dual issue, but that seems not to be the case.

Hmm, seems like AMD is leaving some 'free' performance on the table for some reason. I wonder what is restricting the dual issue so much. NV doesn't seem to have this problem.

soresu · Dec 22, 2023

DisEnchantment said:
Device level coherence surely based on MALL (at least for MI300A) and system level coherence using IF/MALL.

Actual discussion in the AMD driver team group Zoom:

😂

Saylick · Dec 22, 2023

DisEnchantment said:
Device level coherence surely based on MALL (at least for MI300A) and system level coherence using IF/MALL.
But the regular Graphics pipeline relies on L1 and L2 crossbars mainly at SE level.
Removal of GDS is interesting because there are simply a lot of opcodes involving GDS.
There are new instructions for sync/barrier/fence in the patches though.

The dynamic VGPR allocation would be interesting if true.
VOPD hardly changed, same restrictions as GFX11. Was hoping for additional register banks for true dual issue, but that seems not to be the case.

This is a bummer. I was reading through C&C's RDNA 3 microbenchmarking article and they state that VOPD instructions are not used as often as they could be due to an unoptimized compiler. In the example code they presented, there were some obvious situations where a human would easily see the dual-issue opportunity that the compiler missed. Knowing that AMD's software team is generally not comparable to the competition, I am not hopeful for future VOPD optimizations.

Tuna-Fish · Dec 22, 2023

DisEnchantment said:
The dynamic VGPR allocation would be interesting if true.
VOPD hardly changed, same restrictions as GFX11. Was hoping for additional register banks for true dual issue, but that seems not to be the case.

I don't think we can derive info from what's unchanged yet. It seems to me that they started by copying the GFX11 stuff and are now gradually making changes to it. Any part that is as yet unchanged might just be something they haven't gotten to, while anything that has been changed probably reflects an actual real change.

PJVol · Dec 25, 2023

Meanwhile...

So, assuming this isn't just someone's hopeful fantasy, could it be that the ASIC known as "Navi4C" (or whatever it's called now) was originally planned for the RDNA5 launch schedule?

TESKATLIPOKA · Dec 25, 2023

PJVol said:
Meanwhile...

Is it even worth it to post it here? It's pure speculation based on N4C and not even a good one.

Just looking at the second one, it's clear he is talking nonsense. There is no reason for AMD to have two different SEDs, or rather for that second one to be using 9 smaller SEDs instead of 7 bigger ones, in my opinion.

Regardless, if AMD releases such a product, I have to wonder what the other models would look like. 3-6-9 SEDs? Other combinations? Based on performance or?
BTW, even 3 SEDs would have 25% more WGPs compared to N31 in case it's 20 WGPs per SED.
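
For anyone who wants to check the 25% figure, here is a quick sketch of the arithmetic. It assumes N31 has 48 WGPs (96 CUs) and takes the 20 WGPs per SED from the rumour as given; the function name is just made up for illustration.

```python
# Assumptions: Navi 31 has 48 WGPs (96 CUs); 20 WGPs per SED is the rumoured figure.
N31_WGPS = 48
WGPS_PER_SED = 20


def uplift_vs_n31(sed_count: int) -> float:
    """Return the fractional WGP increase over N31 for a given SED count."""
    return sed_count * WGPS_PER_SED / N31_WGPS - 1.0


# Print the WGP count and uplift for the 3-6-9 SED options mentioned above.
for seds in (3, 6, 9):
    total = seds * WGPS_PER_SED
    print(f"{seds} SEDs: {total} WGPs, {uplift_vs_n31(seds):+.0%} vs N31")
```

With those assumptions, 3 SEDs come to 60 WGPs against 48, which is where the 25% comes from.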

PJVol · Dec 25, 2023

TESKATLIPOKA said:
Is it even worth it to post it here? It's pure speculation based on N4C and not even a good one.

Why not? Isn't that what this thread is for?

TESKATLIPOKA said:
Just looking at the second one, it's clear he is talking nonsense. There is no reason for AMD to have two different SEDs, or rather for that second one to be using 9 smaller SEDs instead of 7 bigger ones, in my opinion.

Are you sure that your opinion is backed up enough tech-wise so as not to look plain stupid later?

Tuna-Fish · Dec 25, 2023

PJVol said:
Are you sure that your opinion is backed up enough tech-wise so as not to look plain stupid later?

I'd back him up on this. The big win of having a separate SED that you tile to make products is that on the leading edge nodes, chip design is freaking expensive. By only having a single such design and then duplicating it in products, you save tons of money. There's no way they are making two fairly similar ones.

If they are unsure about how high to scale the high-end product at this point, what they are uncertain about is how many of the SEDs they are going to pack in, not how many resources each one should have.

TESKATLIPOKA · Dec 25, 2023

PJVol said:
Why not? Isn't that what this thread is for?

Is RTG even worth our time? How many times was he correct? BTW, he is already backpedaling in what you posted in case he is wrong.

PJVol said:
Are you sure that your opinion is backed up enough tech-wise so as not to look just stupid later?

Are you saying it makes more sense to use 9 smaller SEDs than 7 bigger SEDs for basically the same number of WGPs (135 vs 140)? And AMD would also need to design another SED for that, which is not cheap and doesn't look like it's really needed.
So is what I wrote really so stupid? To me it looks more realistic than what he wrote.

PJVol · Dec 25, 2023

Tuna-Fish said:
There's no way they are making two fairly similar ones.

TESKATLIPOKA said:
And AMD would also need to design another SED for that, which is not cheap and doesn't look like it's really needed.

If you both bothered to watch the video, it says that the info comes from different sources, or rather, one of them is more up-to-date.

TESKATLIPOKA said:
Is RTG even worth our time? How many times was he correct?

Our? Anyway, it's your own business. Personally, I don't see how his info is less relevant than most people's guesswork here.

TESKATLIPOKA · Dec 25, 2023

PJVol said:
If you both bothered to watch the video, it says that the info comes from different sources, or rather, one of them is more up-to-date.

RTG has sources?

I am very skeptical. I wouldn't be surprised if his sources are actually tech forums like this one.

PJVol said:
Our? Anyway, it's your own business

Didn't mean exactly you but others. I don't think I am the only one who doesn't believe in him having any real sources.

You can believe him if you want, it's your own business. I said what I think about him and this info of his.

Glo. · Dec 25, 2023

TESKATLIPOKA said:
So is what I wrote really so stupid? To me it looks more realistic than what he wrote.

There are plenty of possibilities from an engineering perspective.

9 SEDs give you a 3x3 array. What do 7 SEDs give you? How would you arrange them?

PJVol · Dec 25, 2023

TESKATLIPOKA said:
RTG has sources?

Of course not, didn't you hear how he invented the term "infinity cache" which AMD stole and used in RDNA 2?

TESKATLIPOKA · Dec 25, 2023

Glo. said:
There are plenty of possibilities from an engineering perspective.

9 SEDs give you a 3x3 array. What do 7 SEDs give you? How would you arrange them?

With 2 different SEDs (20 and 15 WGPs each), it would look like this, with each percentage being the WGP increase over the next line down:
9 big SEDs in a 3x3 array -> 180 WGPs (+33.3%)
9 small SEDs in a 3x3 array -> 135 WGPs (+12.5%)
6 big SEDs in a 3x2 or 2x3 array -> 120 WGPs (+33.3%)
6 small SEDs in a 3x2 or 2x3 array -> 90 WGPs (+12.5%)
4 big SEDs in a 2x2 array -> 80 WGPs (+33.3%)
4 small SEDs in a 2x2 array -> 60 WGPs (duplicate)
3 big SEDs in a 3x1 or 1x3 array -> 60 WGPs (+33.3%)
3 small SEDs in a 3x1 or 1x3 array -> 45 WGPs (+12.5%)
2 big SEDs in a 2x1 or 1x2 array -> 40 WGPs (+33.3%)
2 small SEDs in a 2x1 or 1x2 array -> 30 WGPs (baseline, 100%)
What SKUs would you make out of this? The performance jumps would be uneven, either too big or pretty small, and there is even a duplicate in there. I personally wouldn't bother designing another SED for this.
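
If anyone wants to play with other SED sizes, the list above can be reproduced with a short script. This is only a sketch: the 20 and 15 WGPs per SED are inferred from the 180 and 135 WGP totals, and the function names are made up for illustration.

```python
BIG, SMALL = 20, 15  # assumed WGPs per SED, inferred from the 180 and 135 WGP totals
COUNTS = (9, 6, 4, 3, 2)  # SED counts that fit the 3x3, 3x2, 2x2, 3x1 and 2x1 arrays


def configs():
    """Return (wgps, label) pairs for every SED count, largest WGP total first."""
    rows = []
    for n in COUNTS:
        rows.append((n * BIG, f"{n} big"))
        rows.append((n * SMALL, f"{n} small"))
    # sorted() is stable, so equal WGP totals keep their insertion order
    return sorted(rows, key=lambda r: -r[0])


def steps(rows):
    """Pair each config with its WGP increase over the next smaller config."""
    out = []
    for (wgps, label), (nxt, _) in zip(rows, rows[1:]):
        out.append((label, wgps, wgps / nxt - 1.0))
    return out


# The smallest config (2 small SEDs) is the baseline and has no step of its own.
for label, wgps, gain in steps(configs()):
    note = "duplicate" if gain == 0 else f"{gain:+.1%}"
    print(f"{label:>8} SEDs -> {wgps:3d} WGPs ({note})")
```

Changing `BIG` and `SMALL` shows quickly how uneven the steps get for other size pairs. The same constants also give the 7 big vs 9 small comparison: `7 * BIG` is 140 and `9 * SMALL` is 135.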

BTW, why should we be limited to 3x3, 2x3, 1x3, 2x2, 2x1 arrays?
Nowhere is it written that you can't have 1, 2, 3, 4, 5, 6, 7, 8 or 9 SEDs.
Why can't I use 7 SEDs, for example? Because it doesn't look pretty? 7900XT and 7700XT also have an odd number of MCDs and no one cares.
It's not like you can't have a single SED in the last row (column), so that you are forced to design another SED instead.

Glo. · Dec 25, 2023

TESKATLIPOKA said:
BTW, why should we be limited to 3x3, 2x3, 1x3, 2x2, 2x1 arrays?
Nowhere is it written that you can't have 1, 2, 3, 4, 5, 6, 7, 8 or 9 SEDs.
Why can't I use 7 SEDs, for example? Because it doesn't look pretty? 7900XT and 7700XT also have an odd number of MCDs and no one cares.
It's not like you can't have a single SED in the last row (column), so that you are forced to design another SED instead.

Most likely - interconnects, and how they are placed on the dies.

If I understand this correctly, it's possible to reuse the SEDs in different types of products, so AMD would want to aim for the best possible interconnect for the dies.

A single die could then be used in an APU, the same way it could be used for dGPUs. So AMD has to design ONE die to scale from APUs to dGPUs. But it will require a stupidly complex interconnect.

TESKATLIPOKA · Dec 25, 2023

Glo. said:
Most likely - interconnects, and how they are placed on the dies.

A 7 SED GPU would have the same placement as a 9 SED one, but 2 chips would be missing in one row or column.
Basically the same as we already have with RDNA3 MCDs.
If the problem is with the AID under those SEDs, that would mean you are basically limited to 3-6-9 SEDs. Then why not just make SEDs with 3x more WGPs, for 1-2-3 SEDs in total?

For any different combination you would need a different AID (active interposer die), and if you add a different SED, then one for that as well.

Glo. · Dec 25, 2023

TESKATLIPOKA said:
A 7 SED GPU would have the same placement as a 9 SED one, but 2 chips would be missing in one row or column.
Basically the same as we already have with RDNA3 MCDs.
If the problem is with the AID under those SEDs, that would mean you are basically limited to 3-6-9 SEDs. Then why not just make SEDs with 3x more WGPs, for 1-2-3 SEDs in total?
For any different combination you would need a different AID (active interposer die), and if you add a different SED, then one for that as well.

I presume you need coherency of data, so the layout has to be symmetric.

Potentially this may be a requirement for scalable geometry, if you think about it.
