Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
1,774
6,757
136





With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
AMD usually takes around three quarters to get support into LLVM and amdgpu. Since RDNA2, though, they have pushed support for new devices in a much narrower window to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up situation like Frontier's, for example).

See here for the GFX940-specific commits
Or Phoronix

There is a lot more if you know whom to follow in the LLVM review chains (before things get merged to GitHub), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of no host CPU capable of PCIe 5 arriving in the very near future, so it might have gotten pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts; the MI100/200/300 cadence is impressive.



Previous thread on CDNA2 and RDNA3 here

 

Vikv1918

Junior Member
Mar 12, 2025
12
26
46
C&C has an article on RDNA4’s RT improvements:
Elden Ring is a strange game to test for this article. It's one of the worst RT implementations; performance tanks a lot for only minimal RT effects. It runs just as badly on RDNA4 as it does on RDNA3 and RDNA2, or even on Nvidia for that matter. In fact, if we look at TechPowerUp benchmarks, RDNA4 runs worse than RDNA3 lol.
 

marees

Golden Member
Apr 28, 2024
1,001
1,340
96
C&C has an article on RDNA4’s RT improvements:

RDNA 4’s Raytracing Improvements


GPUs aren’t latency optimized, so trading latency-bound pointer chasing steps for more parallel compute requirements is a good strategy.

https://chipsandcheese.com/p/rdna-4s-raytracing-improvements

In a frame captured from 3DMark’s DXR feature test, which raytraces an entire scene with minimal rasterization, the Radeon RX 9070 sustained 111.76G and 19.61G box and triangle tests per second, respectively. For comparison the RDNA 2 based Radeon RX 6900XT did 38.8G and 10.76G box and triangle tests per second. Ballparking Ray Accelerator utilization is difficult due to variable clock speeds on both cards. But assuming 2.5 GHz gives 24% and 10.23% utilization figures for RDNA 4 and RDNA 2’s Ray Accelerators. RDNA 4 is therefore able to feed its bigger Ray Accelerator better than RDNA 2 could. AMD has done a lot since their first generation raytracing implementation, and the cumulative progress is impressive.

Still, RDNA 4 has room for improvement. OBBs could be more flexible, and first level caches could be larger. Intel and Nvidia are obvious competitors too. Intel has revealed a lot about their raytracing implementation, and no raytracing discussion would be complete without keeping them in context. Intel’s Raytracing Accelerator (RTA) takes ownership of the traversal process and is tightly optimized for it, with a dedicated BVH cache and short stack kept in internal registers. It’s a larger hardware investment that doesn’t benefit general workloads, but does let Intel even more closely fit fixed function hardware to raytracing demands. Besides the obvious advantage from using dedicated caches/registers instead of RDNA 4’s general purpose caches and local data share, Intel can keep traversal off Xe Core thread slots, leaving them free for ray generation or result handling.

AMD’s approach has advantages of its own. Avoiding thread launches between raytracing pipeline steps can reduce latency. And raytracing code running on the programmable shader pipelines naturally takes advantage of their ability to track massive thread-level parallelism. As RDNA 4 and Intel’s Battlemage have shown, there’s plenty of room to improve within both strategies.
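
For the curious, the utilization ballpark quoted above can be reproduced in a few lines of Python. The per-cycle test rates (4 box tests/cycle for an RDNA 2 Ray Accelerator vs 8 for RDNA 4, 1 triangle test/cycle for both) and Ray Accelerator counts (80 on the 6900XT, 56 on the 9070) are my assumptions from public specs, not figures the excerpt spells out:

[CODE]
# Back-of-the-envelope reproduction of the article's utilization ballpark.
# Assumed (not stated outright in the excerpt): 80 Ray Accelerators on the
# 6900XT and 56 on the 9070, 4 box tests/cycle on RDNA 2 vs 8 on RDNA 4,
# 1 triangle test/cycle on both, and the article's 2.5 GHz clock guess.
def ra_utilization(box_gps, tri_gps, box_per_clk, tri_per_clk, n_ra, clk_ghz=2.5):
    cycles_used = box_gps / box_per_clk + tri_gps / tri_per_clk   # G cycles/s
    cycles_avail = n_ra * clk_ghz                                 # G cycles/s
    return cycles_used / cycles_avail

print(f"RDNA 4 (RX 9070):   {ra_utilization(111.76, 19.61, 8, 1, 56):.1%}")  # ~24%
print(f"RDNA 2 (RX 6900XT): {ra_utilization(38.8, 10.76, 4, 1, 80):.2%}")    # ~10.23%
[/CODE]

With those inputs the script lands on the same ~24% and ~10.23% figures as the article.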
 

eek2121

Diamond Member
Aug 2, 2005
3,318
4,880
136

RDNA 4’s Raytracing Improvements


GPUs aren’t latency optimized, so trading latency-bound pointer chasing steps for more parallel compute requirements is a good strategy.

https://chipsandcheese.com/p/rdna-4s-raytracing-improvements



Still, RDNA 4 has room for improvement. OBBs could be more flexible, and first level caches could be larger. Intel and Nvidia are obvious competitors too. Intel has revealed a lot about their raytracing implementation, and no raytracing discussion would be complete without keeping them in context. Intel’s Raytracing Accelerator (RTA) takes ownership of the traversal process and is tightly optimized for it, with a dedicated BVH cache and short stack kept in internal registers. It’s a larger hardware investment that doesn’t benefit general workloads, but does let Intel even more closely fit fixed function hardware to raytracing demands. Besides the obvious advantage from using dedicated caches/registers instead of RDNA 4’s general purpose caches and local data share, Intel can keep traversal off Xe Core thread slots, leaving them free for ray generation or result handling.

AMD’s approach has advantages of its own. Avoiding thread launches between raytracing pipeline steps can reduce latency. And raytracing code running on the programmable shader pipelines naturally takes advantage of their ability to track massive thread-level parallelism. As RDNA 4 and Intel’s Battlemage have shown, there’s plenty of room to improve within both strategies.
I’m actually a huge fan of the way AMD is approaching both RT and FSR. Rather than throwing tensor cores/fixed-function hardware at the issue, they are simply expanding the capabilities of the architecture itself.

I haven’t paid close attention to Intel’s implementation of RT, but I think NVIDIA is the one that is doing it wrong. It is going to bite them in the rear end at some point. Quite a few devs want a fully programmable RT pipeline, and NVIDIA will be forced to do that in a very suboptimal way, or perhaps not support it at all on older hardware.

Regarding FSR4, the same hardware that powers it can be used for other things as well. We probably won’t see much until a PS6 release; however, I expect we will see some stuff in the future.

The real issue is, of course, Microsoft. They should be launching new versions of DirectX with new features on a regular basis and then using that as a carrot on a stick to help accelerate GPU development. If they had been leading the way, FSR4, DLSS, etc. would not exist, and RT implementations would be significantly improved.
 

GodisanAtheist

Diamond Member
Nov 16, 2006
7,902
9,004
136
Explain it to me like I'm 5: what would a fully programmable RT pipeline do? I'm guessing the usual answers, more efficient and more performant RT calculations, but would it allow for more effects as well?

Programmable shaders sort of made sense, since shaders are used for basically every visual element in the scene, but a programmable RT pipeline... lighting is lighting, right?

Seems like a very specific task to make fully programmable.
 
Reactions: Tlh97 and marees

DisEnchantment

Golden Member
Mar 3, 2017
1,774
6,757
136
Maybe because full RT is insanely hard on compute, so they want to control light rays per object; you may not want 100 light rays falling on something unimportant to gameplay, or even to the scene itself.

It is insanely hard on the memory and cache subsystem, rather. The memory and cache subsystem has not been evolving at the same rate for client graphics as it has in the DC space.
It needs a lot of investment at all levels of the cache hierarchy. The stalls during BVH traversal are all memory bound.
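
A minimal sketch of that pointer chase (a hypothetical toy BVH, nothing specific to AMD's hardware): the address of each node fetch is only known after the previous node's box tests finish, so every traversal step pays a full memory latency, which is why wider nodes that test more boxes per fetch help.

[CODE]
from dataclasses import dataclass, field

# Hypothetical 1D BVH, just to show the access pattern, not AMD's layout.
@dataclass
class Node:
    box: tuple                                     # (lo, hi) interval
    children: list = field(default_factory=list)   # inner node
    triangles: list = field(default_factory=list)  # leaf payload

def hits(x, box):
    lo, hi = box
    return lo <= x <= hi

def trace(x, root):
    stack, found = [root], []
    while stack:
        node = stack.pop()       # dependent load: which node to fetch is only
                                 # known after the parent's box tests finish,
                                 # so each step eats a full memory latency
        if node.triangles:
            found += [t for t in node.triangles if hits(x, t)]
        else:
            # the box tests within one node are independent, so hardware can
            # run them in parallel; wider nodes (RDNA 4 tests more boxes per
            # step) mean fewer serial, latency-bound fetches overall
            stack += [c for c in node.children if hits(x, c.box)]
    return found

leaf = Node((2.0, 3.0), triangles=[(2.0, 2.5)])
root = Node((0.0, 10.0), children=[leaf, Node((5.0, 8.0), triangles=[(6.0, 7.0)])])
print(trace(2.2, root))          # -> [(2.0, 2.5)]
[/CODE]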
 

poke01

Diamond Member
Mar 8, 2022
3,427
4,702
106
I’m actually a huge fan of the way AMD is approaching both RT and FSR. Rather than throwing tensor cores/fixed-function hardware at the issue, they are simply expanding the capabilities of the architecture itself.

I haven’t paid close attention to Intel’s implementation of RT, but I think NVIDIA is the one that is doing it wrong. It is going to bite them in the rear end at some point. Quite a few devs want a fully programmable RT pipeline, and NVIDIA will be forced to do that in a very suboptimal way, or perhaps not support it at all on older hardware.

Regarding FSR4, the same hardware that powers it can be used for other things as well. We probably won’t see much until a PS6 release; however, I expect we will see some stuff in the future.

The real issue is, of course, Microsoft. They should be launching new versions of DirectX with new features on a regular basis and then using that as a carrot on a stick to help accelerate GPU development. If they had been leading the way, FSR4, DLSS, etc. would not exist, and RT implementations would be significantly improved.
Fixed-function hardware does have its place, but it's often the easy solution.

What AMD is doing is a broader approach. AMD is unique here, and that's perfectly fine.
 
Reactions: Tlh97 and marees

TESKATLIPOKA

Platinum Member
May 1, 2020
2,696
3,259
136
3.2-3.4 GHz is right about where I'd expect it to land. Pretty impressive clocks when pushed hard.
And it will be pushed hard. I wouldn't be surprised by 190-200W TBP, to be honest.
Perf/W will be pretty bad, especially against the RX 9070, unless you play with underclocking/undervolting or the power limit.

But why is there still no info about the launch? The 5060 Ti's reviews will be out today.
 

marees

Golden Member
Apr 28, 2024
1,001
1,340
96
And it will be pushed hard. I wouldn't be surprised by 190-200W TBP, to be honest.
Perf/W will be pretty bad, especially against the RX 9070, unless you play with underclocking/undervolting or the power limit.

But why is there still no info about the launch? The 5060 Ti's reviews will be out today.
AMD is probably waiting to get a 5060 & 5060 Ti on hand so that they know how to price N44.

My guess:

N48 12GB = $400
N44 16GB = $350
N44 8GB = $300
N44 cut down? = $250??
 
Reactions: GodisanAtheist

Vikv1918

Junior Member
Mar 12, 2025
12
26
46
Explain it to me like I'm 5: what would a fully programmable RT pipeline do? I'm guessing the usual answers, more efficient and more performant RT calculations, but would it allow for more effects as well?

Programmable shaders sort of made sense, since shaders are used for basically every visual element in the scene, but a programmable RT pipeline... lighting is lighting, right?

Seems like a very specific task to make fully programmable.
Maybe to make it future-proof? Next-gen consoles released in 2027 will need to run ray-traced games released as far in the future as 2034. A closed-box solution would prevent devs from optimizing for the console architecture.
 
Reactions: marees

TESKATLIPOKA

Platinum Member
May 1, 2020
2,696
3,259
136
AMD is probably waiting to get a 5060 & 5060 Ti on hand so that they know how to price N44.

My guess:

N48 12GB = $400
N44 16GB = $350
N44 8GB = $300
N44 cut down? = $250??
That's today. And it was known for weeks that Nvidia would release it today, yet AMD is still silent about the date for presenting N44 to the public. They can adjust the price at the last minute, so this wasn't really an issue to begin with.

I don't expect a cut-down N48 with 12GB anytime soon.
N44 with 16GB for $349 would be a very good price; the 7600XT 16GB sold for a $329 MSRP.
But I think it will cost $379.
 

dacostafilipe

Senior member
Oct 10, 2013
792
274
136
Explain it to me like I'm 5: what would a fully programmable RT pipeline do? I'm guessing the usual answers, more efficient and more performant RT calculations, but would it allow for more effects as well?

Programmable shaders sort of made sense, since shaders are used for basically every visual element in the scene, but a programmable RT pipeline... lighting is lighting, right?

Seems like a very specific task to make fully programmable.

Optimisation. If you read the "RDNA 4’s Raytracing Improvements" article above, you will see that AMD tries to find "tricks" to improve RT performance, but those have different impacts depending on the game. If game devs could select themselves how the RT is handled, it could improve performance. On RDNA4 they can already choose the bounding boxes' angle, but what if it could go even further?
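
A toy 2D illustration of that last point (purely hypothetical numbers, not AMD's actual math): an axis-aligned box around a thin diagonal sliver of geometry is mostly empty space, so rays keep "hitting" the box and wasting traversal steps, while an oriented box rotated to fit leaves far less slack.

[CODE]
import math

# Toy 2D numbers (hypothetical, not AMD's implementation): a thin sliver of
# geometry at 45 degrees, and the area a ray test has to treat as "occupied".
def aabb_area(pts):
    xs, ys = zip(*pts)
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def obb_area(pts, angle):
    # rotate the points into the box's frame, then bound them axis-aligned
    c, s = math.cos(-angle), math.sin(-angle)
    return aabb_area([(x * c - y * s, x * s + y * c) for x, y in pts])

sliver = [(i, i + 0.1 * (i % 2)) for i in range(10)]   # hugs the line y = x
print(f"axis-aligned box area: {aabb_area(sliver):6.2f}")           # ~81.90, loose
print(f"45-degree OBB area:    {obb_area(sliver, math.pi/4):6.2f}") # ~0.91, tight
[/CODE]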
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,696
3,259
136
It (the 9070 GRE) will have a double release, just like the 7900 GRE.

When it (eventually) gets a worldwide release, prices should have settled down, so its final MSRP should be $400.
But this is still months away even in China, let alone a global release.

Not like it would be a particularly interesting one with only 12GB, although performance would be pretty nice, a bit better than the 7800XT in raster or 20-25% over the 9060XT.

@marees then it's much sooner than previously expected
 

marees

Golden Member
Apr 28, 2024
1,001
1,340
96
But this is still months away even in China, let alone a global release.

Not like it would be a particularly interesting one with only 12GB, although performance would be pretty nice, a bit better than the 7800XT in raster or 20-25% over the 9060XT.
Next month in China, as per Chinese forums (but the MSRP would likely be high, close to $500).

There should be a $100 cut when it eventually gets released worldwide.
 

jpiniero

Lifer
Oct 1, 2010
16,121
6,578
136
AMD is probably waiting to get a 5060 & 5060 Ti on hand so that they know how to price N44.

My guess:

N48 12GB = $400
N44 16GB = $350
N44 8GB = $300
N44 cut down? = $250??

Keep in mind that the real MSRP for the 5060 Ti seems to be $479 for the 16GB and $419 for the 8GB. So unless N44 is a lot slower than the 5060 Ti, they could go higher.
 
Reactions: Tlh97 and marees

PJVol

Senior member
May 25, 2020
849
826
136
Just glanced at the HWUB 5060 Ti review and was "surprised" by their final graphs.
Is the 9070XT really only 2% faster than the 7900XT @ 1440p?
 

GTracing

Senior member
Aug 6, 2021
478
1,109
106
Just glanced at the HWUB 5060 Ti review and was "surprised" by their final graphs.
Is the 9070XT really only 2% faster than the 7900XT @ 1440p?
Hardware Unboxed is the only reviewer who has them that close. Most reviewers have the 9070XT more like 8-10% faster.

Some users here have postulated that Hardware Unboxed got worse 9070 XT results because they intentionally find demanding scenes to test. In-game benchmarks and opening levels tend to be less demanding.
 