Discussion RDNA4 + CDNA3 Architectures Thread

DisEnchantment · Mar 23, 2022

With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not that far off!
Usually AMD takes around 3Qs to get support into LLVM and amdgpu. Lately, since RDNA2, the window in which they push support for new devices has been much shorter, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe it is because the US Govt is starting to prepare the SW environment for El Capitan (maybe to avoid a slow bring-up situation like Frontier, for example).

See here for the GFX940 specific commits

History for llvm/lib/Target/AMDGPU - llvm/llvm-project

github.com

Or Phoronix

More AMD "GFX940" Enablement Work Landing In LLVM - Phoronix

www.phoronix.com

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem that no host CPU capable of PCIe 5 would be available in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts, MI100/200/300 cadence is impressive.

Previous thread on CDNA2 and RDNA3 here

Question - Speculation: RDNA3 + CDNA2 Architectures Thread

Man I have been dying to make this one for a while now. First rumours for RDNA3 are here so new thread time! Just going to start off with this one for now: kopite7kimi on Twitter: "@VideoCardz Ah, I mean a simple mcm design with 10240 cores is not enough. Because the lift from RDNA2 to RDNA3...

forums.anandtech.com

MoogleW · Dec 4, 2023

TESKATLIPOKA said:
Read It once more. It's about WGP/CU/SP count for a single chip.
Can you guess It correctly? That's what is challenging.

Navi 48 = 32 WGP, 64 CU, 48MB L3, GDDR7 memory, 192-bit memory bus, PCIe 5.0 x16?

Navi 44 = 16 WGP, 32 CU, 32MB L3, GDDR7 memory, 128-bit memory bus + PCIe 5.0 x8?

TESKATLIPOKA · Dec 4, 2023

MoogleW said:
Navi 48 = 32 WGP, 64 CU, 48MB L3, GDDR7 memory, 192-bit memory bus, PCIe 5.0 x16?

Navi 44 = 16 WGP, 32 CU, 32MB L3, GDDR7 memory, 128-bit memory bus + PCIe 5.0 x8?

And this is from where?
I won't say It can't be true, but then 3 shader arrays per shader engine can't be true. It would need to be 2SA/SE or 4SA/SE.

BTW, bandwidth for N48 could be a problem even with 32gbps GDDR7.
RX 7800XT: 64MB IC and 256-bit 19.5gbps -> 624GB/s BW
N48: 48MB IC and 192-bit 32gbps -> 768GB/s BW
That's only 23% more and If you include the smaller IC then even less effective BW.
Yet the GPU has 6.7% more CU and I personally expect at least 3GHz. So this BW doesn't look enough to me.
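
For anyone who wants to redo the arithmetic above: peak GDDR bandwidth is bus width in bytes times the per-pin data rate. Here is a minimal Python sketch. The helper name `peak_bandwidth_gbs` is made up for illustration, the N48 figures are the rumoured ones from this thread (not confirmed specs), and the 60 CU baseline is the RX 7800 XT's CU count.

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width in bytes times per-pin rate in Gbps."""
    return bus_width_bits / 8 * data_rate_gbps


# RX 7800 XT: 256-bit bus, 19.5 Gbps GDDR6
rx7800xt = peak_bandwidth_gbs(256, 19.5)
# Rumoured N48 config from this thread: 192-bit bus, 32 Gbps GDDR7
n48_rumour = peak_bandwidth_gbs(192, 32.0)

# Relative uplift in raw bandwidth and in CU count (64 rumoured vs 60 on RX 7800 XT)
uplift_pct = (n48_rumour / rx7800xt - 1) * 100
cu_uplift_pct = (64 / 60 - 1) * 100

print(f"RX 7800 XT:   {rx7800xt:.0f} GB/s")
print(f"N48 (rumour): {n48_rumour:.0f} GB/s")
print(f"Raw bandwidth uplift: {uplift_pct:.1f}%")
print(f"CU count uplift:      {cu_uplift_pct:.1f}%")
```

This only covers raw DRAM bandwidth. The effect of the smaller Infinity Cache (48MB vs 64MB) on effective bandwidth is not modelled here.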

SolidQ · Dec 4, 2023

TESKATLIPOKA said:
And this is from where?

Red tech gaming, this one saying ~7900GRE
Mlid saying mostly between 7900XT... 7900XTX
My personal opinion 7900XT+ can be with 256 bit and gddr7
That screen

MrTeal · Dec 4, 2023

You guys aren't hyping right.
It's pretty obvious he's saying N48 is two GCD, each 44WGP/88CU. 512-bit GDDR7 BUS, 32GB with 192MB Infinity Cache. Performance 2x 4090.

tajoh111 · Dec 4, 2023

TESKATLIPOKA said:
And this is from where?
I won't say It can't be true, but then 3 shader arrays per shader engine can't be true. It would need to be 2SA/SE or 4SA/SE.

BTW, bandwidth for N48 could be a problem even with 32gbps GDDR7.
RX 7800XT: 64MB IC and 256-bit 19.5gbps -> 624GB/s BW
N48: 48MB IC and 192-bit 32gbps -> 768GB/s BW

That's only 23% more and If you include the smaller IC then even less effective BW.
Yet the GPU has 6.7% more CU and I personally expect at least 3GHz. So this BW doesn't look enough to me.

If Navi 48 is indeed less than 250 mm2, along with it being monolithic and made on 4nm, don't expect a big CU bump and clock speed bump. I think a similar CU count and 10% higher clocks along with architecture improvements would be more realistic. Something like this might yield a 20% increase in performance.

4nm is only 6% denser than 5nm and the performance characteristics aren't that much better either.

A 20% performance increase vs a 7800xt would actually be a good result considering the modest die size regression, and modest manufacturing node improvement.

If you want to look at something with a similar increase in specs and a modest die shrink, look at navi 23 vs navi 33.

7nm to 6nm yields a greater transistor density improvement vs 5nm to 4nm and considering the regression in die size will likely be similar, a 20% increase in performance would be great since the increase in performance of navi 33 was like 7% or 8% vs navi 23.
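
To put a number on what "20% from similar CUs and 10% higher clocks" would leave for the architecture itself, here is a rough Python sketch. It assumes performance scales as CU count times clock times per-clock throughput, which is a simplification (real GPUs are also bandwidth and power limited), and `implied_per_clock_gain` is a hypothetical helper, not anything from AMD.

```python
def implied_per_clock_gain(target_perf: float, cu_ratio: float, clock_ratio: float) -> float:
    """Per-clock (architectural) speedup needed to hit target_perf,
    assuming perf ~ CU count * clock * per-clock throughput (a simplification)."""
    return target_perf / (cu_ratio * clock_ratio)


# 20% more performance, same CU count, 10% higher clocks
gain = implied_per_clock_gain(target_perf=1.20, cu_ratio=1.00, clock_ratio=1.10)
print(f"Implied per-clock gain: {(gain - 1) * 100:.1f}%")
```

Under that assumption, roughly 9% would have to come from architecture improvements rather than clocks.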

TESKATLIPOKA · Dec 4, 2023

tajoh111 said:
If Navi 48 is indeed less than 250 mm2, along with it being monolithic and made on 4nm, don't expect a big CU bump and clock speed bump. I think a similar CU count and 10% higher clocks along with architecture improvements would be more realistic. Something like this might yield a 20% increase in performance.

I wasn't talking about a big CU bump, but a significant clock bump.
It was already shown that RDNA3 was supposed to clock higher. So I don't think 3GHz is such an overkill for RDNA4, when the cutdown N32 can already be OC-ed past 3GHz. The full N32 has worse OC, I think It's due to power limit.

Let's be honest, as @PJVol showed, RDNA3 has a problem with power draw for their chiplet GPUs, so RDNA4 by being monolith will save some power, which can be used for higher clocks.

You don't really need a big die increase for higher clocks. Just look at RDNA1 vs RDNA2, clock speeds despite the same process were raised significantly and If you exclude Infinity cache, then die size didn't increase by much, which can be attributed to added RT support.

BTW, do you know which RDNA3 GPU is most efficient when you limit It to 60FPS in cyberpunk at Full HD? RX 7600

Aapje · Dec 4, 2023

TESKATLIPOKA said:
You don't really need a big die increase for higher clocks. Just look at RDNA1 vs RDNA2, clock speeds despite the same process were raised significantly and If you exclude Infinity cache, then die size didn't increase by much, which can be attributed to added RT support.

It's the opposite. Smaller dies are easier to clock higher. They often reduce clocks of higher tier video cards. The 4080 runs at higher clocks than the 4090.

Anyway, I'm hoping that AMD at least is able to match the power efficiency improvements in Ada to a decent extent.

TESKATLIPOKA · Dec 4, 2023

Aapje said:
It's the opposite. Smaller dies are easier to clock higher. They often reduce clocks of higher tier video cards. The 4080 runs at higher clocks than the 4090.

It's the opposite? What are you talking about?

Btw, the reason why 4090 has lower clock is due to power limit.

Aapje · Dec 5, 2023

TESKATLIPOKA said:
It's the opposite? What are you talking about?

Btw, the reason why 4090 has lower clock is due to power limit.

The bigger the chip, the more weak spots you will have on the chip, that have to either be disabled or require the chip to run with a lower clock.

TESKATLIPOKA · Dec 5, 2023

Aapje said:
The bigger the chip, the more weak spots you will have on the chip, that have to either be disabled or require the chip to run with a lower clock.

What I was actually asking about in my previous post was your response.
Because what you quoted from my post and your response are not about the same thing.

PJVol · Dec 5, 2023

Aapje said:
The bigger the chip, the more weak spots you will have on the chip, that have to either be disabled or require the chip to run with a lower clock

The chip frequency (on a given process) is purely a design thing and depends only on the worst-case timing path.
And there are so many factors affecting it that it makes no sense to draw conclusions from the previous uarch or generation.

DisEnchantment · Dec 5, 2023

[AMDGPU] Update hardware registers for GFX12 by jayfoad · Pull Request #74445 · llvm/llvm-project

github.com

[AMDGPU] Add new 64-bit SALU instructions by jayfoad · Pull Request #74449 · llvm/llvm-project

github.com

DVGPR = Dynamic VGPR allocation ?
Seems some folks are right: SALU in GFX12 is similar to GFX11.5, with the addition of new 64-bit ops to handle new masks etc.

[AMDGPU][MC] Add GFX12 VIMAGE and VSAMPLE encodings by mbrkusanin · Pull Request #74062 · llvm/llvm-project

github.com

New VIMAGE ops, which are MALL aware.
Looks like MALL will be doing a lot of heavy lifting.

interesting cache policies

C:

  // Below are GFX12+ cache policy bits

  // Temporal hint
  TH = 0x7,      // All TH bits
  TH_RT = 0,     // regular
  TH_NT = 1,     // non-temporal
  TH_HT = 2,     // high-temporal
  TH_LU = 3,     // last use
  TH_RT_WB = 3,  // regular (CU, SE), high-temporal with write-back (MALL)
  TH_NT_RT = 4,  // non-temporal (CU, SE), regular (MALL)
  TH_RT_NT = 5,  // regular (CU, SE), non-temporal (MALL)
  TH_NT_HT = 6,  // non-temporal (CU, SE), high-temporal (MALL)
  TH_NT_WB = 7,  // non-temporal (CU, SE), high-temporal with write-back (MALL)
  TH_BYPASS = 3, // only to be used with scope = 3
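
A quick way to play with these values is a small Python sketch. The names and values are copied from the snippet above; the helper `th_candidates` is hypothetical, and treating the TH field as the low three bits follows only from the `TH = 0x7` mask shown. Note that value 3 carries three names (`TH_LU`, `TH_RT_WB`, `TH_BYPASS`). The snippet does not say how the first two are told apart, so the sketch just lists every candidate.

```python
from typing import List

TH_MASK = 0x7  # "All TH bits" in the snippet above

# (name, value) pairs copied from the snippet, in the same order.
TH_NAMES = [
    ("TH_RT", 0), ("TH_NT", 1), ("TH_HT", 2), ("TH_LU", 3),
    ("TH_RT_WB", 3), ("TH_NT_RT", 4), ("TH_RT_NT", 5),
    ("TH_NT_HT", 6), ("TH_NT_WB", 7), ("TH_BYPASS", 3),
]


def th_candidates(cpol: int) -> List[str]:
    """Return every temporal-hint name whose value matches the TH bits of cpol."""
    th = cpol & TH_MASK
    return [name for name, value in TH_NAMES if value == th]


print(th_candidates(3))  # the aliased value: three possible names
print(th_candidates(6))  # an unambiguous value
```

Per the snippet's own comment, `TH_BYPASS` only applies together with scope = 3, so a real decoder would need more context than the TH bits alone.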

adroc_thurston · Dec 5, 2023

DisEnchantment said:
DVGPR = Dynamic VGPR allocation ?

Yea for ughhh interesting reasons.

DisEnchantment said:
Seems some folks are right: SALU in GFX12 is similar to GFX11.5, with the addition of new 64-bit ops to handle new masks etc.

Kinda the opposite, GFX1150 loaned it from 1200/1201.

Saylick · Dec 5, 2023

Hmm, looks like RDNA 4 is going more and more towards compiler-based scheduling, which mirrors Nvidia's pivot away from hw-based scheduling since Kepler. It's ironic because AMD itself pivoted away from a compiler-based scheduler in Terascale towards a hw-scheduler in GCN. I guess what's new is old, and what's old is new.

adroc_thurston · Dec 5, 2023

Saylick said:
Hmm, looks like RDNA 4 is going more and more towards compiler-based scheduling, which mirrors Nvidia's pivot away from hw-based scheduling since Kepler.

No it mirrors Terascale 3.

Saylick · Dec 5, 2023

adroc_thurston said:
No it mirrors Terascale 3.

Back to VLIW? If so, blehhh.

adroc_thurston · Dec 5, 2023

Saylick said:
Back to VLIW?

I mean they already kinda sorta did with RDNA3 which needs to pack workitems for max occupancy.

Abwx · Dec 6, 2023

A 32 CU GPU is unlikely. That's the same CU count as the RX 7600, and that GPU was undersized relative to what the market was looking for.

A 40 CU GPU would have been far more relevant, since it would have commanded a higher price while still allowing a cut-down 32-36 CU version for harvesting purposes.

Keeping the same CU count for a next gen that is supposed to perform significantly better would be quite a blunder, technically and financially.

adroc_thurston · Dec 6, 2023

Abwx said:
Keeping the same CU count for a next gen that is supposed to perform significantly better would be quite a blunder, technically and financially.

Those are tiny mainstream parts.

Abwx · Dec 6, 2023

adroc_thurston said:
Those are tiny mainstream parts.

Too tiny to make sense. There's a perf threshold that makes your product either successful or just a marginal solution, and I'm not even talking about sales, where it is overwhelmed by the competing cards that are just a little above it, or even by the RX 6600, which apparently sells much better, so it wasn't worth the effort.

jpiniero · Dec 6, 2023

Abwx said:
Too tiny to make sense. There's a perf threshold that makes your product either successful or just a marginal solution

MLID tho. Selling the old stuff while marketing the new stuff is what AMD has to do basically.

soresu · Dec 6, 2023

Abwx said:
Too tiny to make sense

Given it matches the max rumoured CU count of Strix Halo it does sound odd.

Unless perhaps they will also make a semi custom chiplet APU with RDNA4 instead of RDNA3.5 at some point in 2025.

Basically Strix Halo with a GPU µArch upgrade.

Abwx · Dec 6, 2023

soresu said:
Given it matches the max rumoured CU count of Strix Halo it does sound odd.

Unless perhaps they will also make a semi custom chiplet APU with RDNA4 instead of RDNA3.5 at some point in 2025.

Basically Strix Halo with a GPU µArch upgrade.

That's not even enough relative to a 12 CU APU that is 178mm2, let alone one that has 32 CUs as well. Indeed, German e-tailer Mindfactory's sales numbers indicate that the 7600 lacks the performance to be competitive; at 40 CUs it would have sold much better, while at 32 it's buried in a crowded field.

In this segment people get either an RX 6600, which is cheaper and doesn't perform that much worse, or an RTX 3060/4060/4060 Ti, which are either price competitive (the weakest) or somewhat faster (the others).

https://twitter.com/x/status/1731450376677622246

adroc_thurston · Dec 6, 2023

Abwx said:
Too tiny to make sense

No.
Tiny is good.

TESKATLIPOKA · Dec 6, 2023

Abwx said:
Too tiny to make sense. There's a perf threshold that makes your product either successful or just a marginal solution, and I'm not even talking about sales, where it is overwhelmed by the competing cards that are just a little above it, or even by the RX 6600, which apparently sells much better, so it wasn't worth the effort.

The amount of CU is not a problem per se.
It's only a problem If clocks or IPC stays the same, that was the problem with RX 7600 along with the small benefit from dual-issue.

If this 32CU RDNA GPU allows higher clockspeed by 25%, 2655MHz * 1.25 = ~3320MHz, would you care that It has only 32CU?
I wouldn't If It is sold for a reasonable price and not just 8GB Vram.

P.S. But considering Strix Halo has 40CU, I would expect at least the same amount for the weaker chip.

edit:
Maybe It's not impossible to pack 40CU in a 160mm2 chip and the bigger one will be 250mm2.
I mean 40CU,2560SP,160TMU,64ROPs,32MB,128-bit GDDR7 for the smaller one.
Clock It to 3.5GHz and you should be close to 7700XT, of course with 128-bit 30-32gbps GDDR7.
BTW, I got ~220mm2 for RX 7600 with 40CU using 6nm.
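
The unit counts and the clock uplift in this post can be re-derived with a few lines of Python. This is only a sketch: the per-CU ratios are the ones implied by the numbers above (2560 SP and 160 TMU for 40 CU), and the helper names are hypothetical.

```python
SP_PER_CU = 64   # implied by 2560 SP / 40 CU
TMU_PER_CU = 4   # implied by 160 TMU / 40 CU


def shader_counts(cu: int) -> dict:
    """Derive SP and TMU counts from a CU count using the ratios above."""
    return {"CU": cu, "SP": cu * SP_PER_CU, "TMU": cu * TMU_PER_CU}


def scaled_clock_mhz(base_mhz: float, uplift: float) -> float:
    """Clock after a fractional uplift, e.g. uplift=0.25 for +25%."""
    return base_mhz * (1 + uplift)


print(shader_counts(40))              # the speculated 40 CU part
print(shader_counts(32))              # the rumoured 32 CU part
print(scaled_clock_mhz(2655, 0.25))   # 2655 MHz with a 25% uplift
```

The ROP count, cache size and die-area estimates are not derived here, since the post gives no formula for them.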
