Discussion RDNA4 + CDNA3 Architectures Thread

DisEnchantment · Mar 23, 2022

With the GFX940 patches in full swing since first week of March, it is looking like MI300 is not far in the distant future!
Usually AMD takes around 3Qs to get the support in LLVM and amdgpu. Lately, since RDNA2 the window they push to add support for new devices is much reduced to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe because US Govt is starting to prepare the SW environment for El Capitan (Maybe to avoid slow bring up situation like Frontier for example)

See here for the GFX940 specific commits

History for llvm/lib/Target/AMDGPU - llvm/llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies. - History for llvm/lib/Target/AMDGPU - llvm/llvm-project

github.com

Or Phoronix

More AMD "GFX940" Enablement Work Landing In LLVM - Phoronix

www.phoronix.com

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time like Hopper probably only a couple of months later!
Although I believe Hopper had problems not having a host CPU capable of doing PCIe 5 in the very near future therefore it might have gotten pushed back a bit until SPR and Genoa arrives later in 2022.
If PVC slips again I believe MI300 could launch before it

This is nuts, MI100/200/300 cadence is impressive.

Previous thread on CDNA2 and RDNA3 here

Question - Speculation: RDNA3 + CDNA2 Architectures Thread

Man I have been dying to make this one for a while now. First rumours for RDNA3 are here so new thread time! Just going to start off with this one for now: kopite7kimi on Twitter: "@VideoCardz Ah, I mean a simple mcm design with 10240 cores is not enough. Because the lift from RDNA2 to RDNA3...

forums.anandtech.com

Vikv1918 · Apr 15, 2025

Saylick said:
C&C has an article on RDNA4’s RT improvements:

RDNA 4’s Raytracing Improvements

Raytraced effects have gained increasing adoption in AAA titles, adding an extra graphics quality tier beyond traditional “ultra” settings.

chipsandcheese.com

Elden Ring is a strange game to test for this article. Its one of the worst RT implementations, the performance tanks a lot for only minimal RT effects. It runs just as bad on RDNA4 as it does on 3 and 2, or even on nvidia for that matter. In fact, if we look at Techpowerup benchmarks RDNA4 runs worse than RDNA3 lol.

marees · Apr 15, 2025

Saylick said:
C&C has an article on RDNA4’s RT improvements:

RDNA 4’s Raytracing Improvements

Raytraced effects have gained increasing adoption in AAA titles, adding an extra graphics quality tier beyond traditional “ultra” settings.

chipsandcheese.com

RDNA 4’s Raytracing Improvements

GPUs aren’t latency optimized, so trading latency-bound pointer chasing steps for more parallel compute requirements is a good strategy.

https://chipsandcheese.com/p/rdna-4s-raytracing-improvements

In a frame captured from 3DMark’s DXR feature test, which raytraces an entire scene with minimal rasterization, the Radeon RX 9070 sustained 111.76G and 19.61G box and triangle tests per second, respectively. For comparison the RDNA 2 based Radeon RX 6900XT did 38.8G and 10.76G box and triangle tests per second. Ballparking Ray Accelerator utilization is difficult due to variable clock speeds on both cards. But assuming 2.5 GHz gives 24% and 10.23% utilization figures for RDNA 4 and RDNA 2’s Ray Accelerators. RDNA 4 is therefore able to feed its bigger Ray Accelerator better than RDNA 2 could. AMD has done a lot since their first generation raytracing implementation, and the cumulative progress is impressive.
Click to expand...

Still, RDNA 4 has room for improvement. OBBs could be more flexible, and first level caches could be larger. Intel and Nvidia are obvious competitors too. Intel has revealed a lot about their raytracing implementation, and no raytracing discussion would be complete without keeping them in context. Intel’s Raytracing Accelerator (RTA) takes ownership of the traversal process and is tightly optimized for it, with a dedicated BVH cache and short stack kept in internal registers. It’s a larger hardware investment that doesn’t benefit general workloads, but does let Intel even more closely fit fixed function hardware to raytracing demands. Besides the obvious advantage from using dedicated caches/registers instead of RDNA 4’s general purpose caches and local data share, Intel can keep traversal off Xe Core thread slots, leaving them free for ray generation or result handling.

AMD’s approach has advantages of its own. Avoiding thread launches between raytracing pipeline steps can reduce latency. And raytracing code running on the programmable shader pipelines naturally takes advantage of their ability to track massive thread-level parallelism. As RDNA 4 and Intel’s Battlemage have shown, there’s plenty of room to improve within both strategies.

eek2121 · Apr 15, 2025

marees said:
RDNA 4’s Raytracing Improvements

GPUs aren’t latency optimized, so trading latency-bound pointer chasing steps for more parallel compute requirements is a good strategy.

https://chipsandcheese.com/p/rdna-4s-raytracing-improvements

Still, RDNA 4 has room for improvement. OBBs could be more flexible, and first level caches could be larger. Intel and Nvidia are obvious competitors too. Intel has revealed a lot about their raytracing implementation, and no raytracing discussion would be complete without keeping them in context. Intel’s Raytracing Accelerator (RTA) takes ownership of the traversal process and is tightly optimized for it, with a dedicated BVH cache and short stack kept in internal registers. It’s a larger hardware investment that doesn’t benefit general workloads, but does let Intel even more closely fit fixed function hardware to raytracing demands. Besides the obvious advantage from using dedicated caches/registers instead of RDNA 4’s general purpose caches and local data share, Intel can keep traversal off Xe Core thread slots, leaving them free for ray generation or result handling.

AMD’s approach has advantages of its own. Avoiding thread launches between raytracing pipeline steps can reduce latency. And raytracing code running on the programmable shader pipelines naturally takes advantage of their ability to track massive thread-level parallelism. As RDNA 4 and Intel’s Battlemage have shown, there’s plenty of room to improve within both strategies.

I’m actually a huge fan of the way AMD is approaching both RT and FSR. Rather than throwing tensor cores/fixed function hardware at the issue, they are simply expanding the capabilities of the architecture itself.

I haven’t paid close attention to Intel’s implementation of RT, but I think NVIDIA is the one that is doing it wrong. It is going to bite them in the rear end at some point. Quite a few devs want a fully programmable RT pipeline, and NVIDIA will be forced to do that in a very suboptimal way, or perhaps, not support it al all with older hardware.

Regarding FSR4, The same hardware that powers it can be used for other things as well. We probably won’t see much until a PS6 release, however, I expect we will see some stuff in the future.

The real issue is, of course, Microsoft. They should be launching new versions of DirectX with new features on a regular basis and then using that as a carrot on a stick to help accelerate GPU development. If they had been leading the way, FSR4, DLSS, etc would not exist, and RT implementation would be significantly improved.

GodisanAtheist · Apr 15, 2025

Explain it to me like I'm 5: what would a fully programmable RT pipeline do? I'm guessing the usual answers more efficient, more performant RT calculations, but would it allow for more effects as well?

Programmable shaders sort of made sense since shaders are used for basically every visual element in the scene, but a programmable RT pipeline... lighting is lighting right?

Seems like a very specific task to make fully programable.

igor_kavinski · Apr 15, 2025

GodisanAtheist said:
Seems like a very specific task to make fully programable.

Maybe because full RT is insanely hard on compute so they want to control light rays per object as you may not want 100 light rays falling on something unimportant to gameplay and even the scene itself.

DisEnchantment · Apr 15, 2025

igor_kavinski said:
Maybe because full RT is insanely hard on compute so they want to control light rays per object as you may not want 100 light rays falling on something unimportant to gameplay and even the scene itself.

It is insanely hard on the memory and cache subsystem rather. Memory and cache subsystem has not been evolving at the same rate as on the DC for Client graphics.
Need a lot of Investment in all levels of the cache hierarchy. The stalls during BVH traversal are all memory bound.

poke01 · Apr 16, 2025

eek2121 said:
I’m actually a huge fan of the way AMD is approaching both RT and FSR. Rather than throwing tensor cores/fixed function hardware at the issue, they are simply expanding the capabilities of the architecture itself.

I haven’t paid close attention to Intel’s implementation of RT, but I think NVIDIA is the one that is doing it wrong. It is going to bite them in the rear end at some point. Quite a few devs want a fully programmable RT pipeline, and NVIDIA will be forced to do that in a very suboptimal way, or perhaps, not support it al all with older hardware.

Regarding FSR4, The same hardware that powers it can be used for other things as well. We probably won’t see much until a PS6 release, however, I expect we will see some stuff in the future.

The real issue is, of course, Microsoft. They should be launching new versions of DirectX with new features on a regular basis and then using that as a carrot on a stick to help accelerate GPU development. If they had been leading the way, FSR4, DLSS, etc would not exist, and RT implementation would be significantly improved.

Fixed function hardware does have its place but it’s often the easy solution.

What AMD is doing is a broader approach. AMD is unique here and that’s perfectly fine.

TESKATLIPOKA · Apr 16, 2025

techjunkie123 said:
3.2-3.4 GHz is right about where I'd expect it to land. Pretty impressive clocks when pushed hard.

And It will be pushed hard. I wouldn't be surprised about 190-200W TBP to be honest.
Perf/W will be pretty bad, especially against RTX 9070 unless you play with underclocking/undervolting or power limit.

But why is there still no info about launch? 5060Ti's reviews will be out today.

marees · Apr 16, 2025

TESKATLIPOKA said:
And It will be pushed hard. I wouldn't be surprised about 190-200W TBP to be honest.
Perf/W will be pretty bad, especially against RTX 9070 unless you play with underclocking/undervolting or power limit.

But why is there still no info about launch? 5060Ti's reviews will be out today.

AMD is probably waiting to get a 5060 & 5060 ti on hand so that they know how to price n44

My guess:

N48 12gb = $400
N44 16gb = $350
N44 8gb = $300
N44 cut down? = $250??

Vikv1918 · Apr 16, 2025

GodisanAtheist said:
Explain it to me like I'm 5: what would a fully programmable RT pipeline do? I'm guessing the usual answers more efficient, more performant RT calculations, but would it allow for more effects as well?

Programmable shaders sort of made sense since shaders are used for basically every visual element in the scene, but a programmable RT pipeline... lighting is lighting right?

Seems like a very specific task to make fully programable.

Maybe to make it future proof? Next-gen consoles released in 2027 will need to run ray traced games that are released as far in the future as 2034. A closed box solution will prevent devs from optimizing for the console architecture.

TESKATLIPOKA · Apr 16, 2025

marees said:
AMD is provably waiting to get a 5060 & 5060 ti on hand so that they know how to price n44

My guess:

N48 12gb = $400
N44 16gb = $350
N44 8gb = $300
N44 cut down? = $250??

That's today. And It was known for weeks Nvidia will release It today, yet AMD si still silent about the date of presenting N44 to the public. They can adjust the price at the last minute, so this wasn't really an issue to begin with.

I don't expect a cutdown N48 with 12GB anytime soon.
N44 with 16GB for $349 would be a very good price, 7600XT 16GB was sold for $329 MRSP.
But I think It will cost $379.

tsamolotoff · Apr 16, 2025

eek2121 said:
It is going to bite them in the rear end at some point

How so, it's actually a good thing for Mr Leather Jacket - old GPUs will become obsolete, so their owners would have to upgrade to a shiny new RTX 9060 with 8gb of blazing fast gddr9xtxt vram

marees · Apr 16, 2025

TESKATLIPOKA said:
I don't expect a cutdown N48 with 12GB anytime soon.

It (9070 gre) will have a double release. Just the 7900 gre

When it (eventually) gets a worldwide release the prices should have settled down. So its final msrp should be $400

dacostafilipe · Apr 16, 2025

GodisanAtheist said:
Explain it to me like I'm 5: what would a fully programmable RT pipeline do? I'm guessing the usual answers more efficient, more performant RT calculations, but would it allow for more effects as well?

Programmable shaders sort of made sense since shaders are used for basically every visual element in the scene, but a programmable RT pipeline... lighting is lighting right?

Seems like a very specific task to make fully programable.

Optimisation. If you read the "RDNA 4’s Raytracing Improvements" article above you will see that AMD tries to find "tricks" to improve RT performance, but those have different impacts depending on the game. If the game-dev can themself select how the RT is handled, it could improve performance. On RDN4 they can already choose the bounding boxes angle, but what if it could go even further?

TESKATLIPOKA · Apr 16, 2025

marees said:
It (9070 gre) will have a double release. Just the 7900 gre

When it (eventually) gets a worldwide release the prices should have settled down. So its final msrp should be $400

But this is still months away even in China let alone global release.

Not like It would be particularly interesting one with only 12GB, although performance would be pretty nice, a bit better than 7800XT in raster or 20-25% over 9060XT.

@marees then It's much sooner than previously

zorilted · Apr 16, 2025

Kepler_L2 said:
Octobre 2026 à ma connaissance.

la date de sortie of rdna 5

marees · Apr 16, 2025

TESKATLIPOKA said:
But this is still months away even in China let alone global release.

Not like It would be particularly interesting one with only 12GB, although performance would be pretty nice, a bit better than 7800XT in raster or 20-25% over 9060XT.

Next month in China as per Chinese forums (but msrp would be high likely close to $500)

There should be a $100 cut when it gets released worldwide eventually

Bryo4321 · Apr 16, 2025

Yall seeing what I’m seeing? Is this a mistake?

jpiniero · Apr 16, 2025

marees said:
AMD is probably waiting to get a 5060 & 5060 ti on hand so that they know how to price n44

My guess:

N48 12gb = $400
N44 16gb = $350
N44 8gb = $300
N44 cut down? = $250??

Keep in mind it seems that the Real MSRP for the 5060 Ti is $479 for 16 and $419 for 8. So unless N44 is a lot slower than the 5060 Ti, they could go higher.

jpiniero · Apr 16, 2025

Bryo4321 said:
Yall seeing what I’m seeing? Is this a mistake?

VRAM size matters a lot with AI.

Mopetar · Apr 16, 2025

jpiniero said:
VRAM size matters a lot with AI.

I think he's referring to the 32 GB on the 9070 XT. Unless I missed something AMD hasn't announced anything beyond a 16 GB model.

Bryo4321 · Apr 16, 2025

Mopetar said:
I think he's referring to the 32 GB on the 9070 XT. Unless I missed something AMD hasn't announced anything beyond a 16 GB model.

Yeah, it looks like they are referring to system RAM based off the AMD blog. The chart stability put up just wasn’t very clear.

PJVol · Apr 16, 2025

Just glanced at the hwub 5060ti review and was "surprised" by their final graphs.
Is the 9070XT really only 2% faster than the 7900xt @1440p ?

SolidQ · Apr 16, 2025

PJVol said:
Is the 9070XT really only 2% faster than the 7900xt @1440p ?

not 1440, but some games with FSR quality

Don't know why Hub results lower

GTracing · Apr 16, 2025

PJVol said:
Just glanced at the hwub 5060ti review and was "surprised" by their final graphs.
Is the 9070XT really only 2% faster than the 7900xt @1440p ?

Hardware Unboxed is the only reviewer who has them that close. Most reviewers have the 9070XT more like 8-10% faster.

Some users here have postulated that Hardware Unboxed got worse 9070 XT results because they intentionally find demanding scenes to test. In-game benchmarks and opening levels tend to be less demanding.

Discussion RDNA4 + CDNA3 Architectures Thread

Golden Member

Junior Member

Golden Member

RDNA 4’s Raytracing Improvements​

Diamond Member

RDNA 4’s Raytracing Improvements​

Diamond Member

Lifer

Golden Member

Diamond Member

Platinum Member

Golden Member

Junior Member

Platinum Member

Senior member

Golden Member

Senior member

Platinum Member

Junior Member

Golden Member

Member

Lifer

Lifer

Diamond Member

Member

Senior member

Golden Member

Senior member

RDNA 4’s Raytracing Improvements

RDNA 4’s Raytracing Improvements