Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136





With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around three quarters to get support into LLVM and amdgpu. Lately, since RDNA2, the window in which they push support for new devices has been much shorter, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe that is because the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up like Frontier's, for example).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
That said, I believe Hopper had the problem that no host CPU capable of PCIe 5 would be available in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts, MI100/200/300 cadence is impressive.



Previous thread on CDNA2 and RDNA3 here

 
Last edited:
Jul 27, 2020
16,340
10,352
106
Here: https://www.tweaktown.com/news/9499...on-tech-specs-not-finalized-report/index.html

"The big feature that this system will support is Sony's own proprietary DLSS-like solution, where they use their own machine learning to improve images so they can run things at a really high resolution and really high frame rate. They would include their own hardware in the PS5 Pro to do this," Grubb said in a recent episode of Game Mess Decides.

"That's where the 2x hardware ray tracing acceleration comes into place, but they would be able to do even more than just better hardware raytracing."

So their custom RT algo combined with their upscaling tech would provide a boost higher than what's possible with RDNA3.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,362
2,854
106
It's not DirectX 12 RT with PS5. They must've implemented something proprietary and optimized to boost RT only for their platform.
Some form of (fake) frame generation is not really interesting and says nothing about RDNA4 capability.
Here: https://www.tweaktown.com/news/9499...on-tech-specs-not-finalized-report/index.html



So their custom RT algo combined with their upscaling tech would provide a boost higher than what's possible with RDNA3.
I don't think FPS would be much better than what AMD is offering with FSR3, but quality could (should) be better.
 
Jul 27, 2020
16,340
10,352
106

MoogleW

Member
May 1, 2022
50
27
51
Let me get this right. We are moving from 36 to 60CU. The highest RT boost is 4X. So if PS5 Pro is using RDNA 4 RT, then I guess that means we can expect 'up to':

4X/(60/36) = 2.4X RT performance per CU in RDNA 4.
On average 1.8X in practice.

More optimistically for AMD, let's say that due to clock speeds the actual change resembles more of a move to a 7700XT GPU spec, so 2.7X RT in special cases.

2X on average in demanding scenarios
1.5X in lighter scenarios
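For anyone who wants to redo the arithmetic above with other numbers, here is a minimal sketch. The function name is made up for illustration, and reading "a 7700XT GPU spec" as 54 CUs is my assumption, not something stated in the post.

```python
def rt_uplift_per_cu(total_uplift, old_cus, new_cus):
    """Split a total RT speedup into a per-CU factor by dividing out the CU count ratio."""
    return total_uplift / (new_cus / old_cus)

# 4X headline figure, PS5 (36 CU) -> PS5 Pro (60 CU)
per_cu_60 = rt_uplift_per_cu(4.0, 36, 60)

# Same headline figure if the effective change is closer to 54 CUs
# (assumption: "7700XT GPU spec" taken to mean 54 CUs)
per_cu_54 = rt_uplift_per_cu(4.0, 36, 54)

print(f"per-CU uplift at 60 CU: {per_cu_60:.1f}X")
print(f"per-CU uplift at 54 CU: {per_cu_54:.1f}X")
```

This treats the 4X as scaling linearly with CU count, which is the same simplification the estimate above makes.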
 
Last edited:
Reactions: Tlh97

SteinFG

Senior member
Dec 29, 2021
423
475
106
Let me get this right. We are moving from 36 to 60CU. The highest RT boost is 4X.
That's against the regular PS5, which doesn't have Spectral SR, I assume. The "RT boost" they're talking about is most likely a combination of RT per CU uplift, CU count uplift, and new upscaling software.
 
Jul 27, 2020
16,340
10,352
106
New upscaling isn't faster.
There isn't much information available on BVH8, but from what little I've been able to find, it lowers memory requirements and scales better but is more CPU-intensive? That might explain why Sony is able to implement it only in PS5 Pro with its faster CPU. How much of the RT perf increase is coming from switching to BVH8?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,355
1,550
136
There isn't much information available on BVH8, but from what little I've been able to find, it lowers memory requirements and scales better but is more CPU-intensive? That might explain why Sony is able to implement it only in PS5 Pro with its faster CPU. How much of the RT perf increase is coming from switching to BVH8?

Why would it be more CPU intensive? I guess there could be slightly more work building the tree when things move, but that's offset by the average walk becoming shorter? I would assume the acceleration structure depth mostly depends on what is cheap to implement on the GPU. You'd want each level of the structure to fit neatly into a cache line on the GPU, and if you don't have a special instruction to quickly pack bits, you are probably limited to a fairly inefficient representation. I don't know exactly what information needs to fit in the structure, but at least there need to be pointers for each child and an f32 for every plane that divides the bounding box. IIRC AMD GPUs use 64 byte lines, so 8 8-byte pointers are right out because they would take up the whole line. AMD's choice of BVH4 and Intel's choice of BVH6 probably reflect this. BVH6 sounds silly, but you can pack 6 pointers, 3 f32s and 32 bits of additional data (some of it needed to select the direction in which the planes cut) neatly in 64 bytes.
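To make the byte budget above concrete, here is a quick sketch. The field list (8-byte child pointers, three f32 planes, 32 bits of extra data) is the guess from this post, not a documented node layout, and the function name is made up.

```python
CACHE_LINE_BYTES = 64  # line size recalled in the post for AMD GPUs

def node_bytes(children, pointer_bytes=8, planes=3, extra_bits=32):
    """Size of an unpacked node: child pointers + f32 split planes + extra data."""
    return children * pointer_bytes + planes * 4 + extra_bits // 8

for width in (4, 6, 8):
    size = node_bytes(width)
    verdict = "fits" if size <= CACHE_LINE_BYTES else "does not fit"
    print(f"BVH{width}: {size} bytes, {verdict} in one {CACHE_LINE_BYTES}-byte line")
```

With full 8-byte pointers, six children land exactly on 64 bytes and eight children overshoot, which is the point being made about BVH6 versus BVH8.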

If you do have some kind of special instruction that lets you very quickly pack and unpack bits for this purpose, you can probably do better than that. Since all the nodes are cache line aligned, you can cut 6 bits of each pointer from the low end, and if you are willing to limit address space to 57 bits (128PB, plenty enough for the foreseeable future), you can cut 7 bits from the high end, giving you enough space for 8 pointers, 3 f32s and one byte left over for metadata.
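Here is a rough sketch of that packing in software, just to check that the numbers add up: 8 pointers of 51 bits each is 51 bytes, plus 12 bytes of f32 planes, plus 1 metadata byte, for 64 bytes total. The field order and the function names are hypothetical; this is not any real hardware node format.

```python
import struct

LINE_BYTES = 64
PTR_BITS = 64 - 6 - 7            # 51 bits: drop 6 alignment bits and 7 high bits
PTR_MASK = (1 << PTR_BITS) - 1
PTR_FIELD_BYTES = 8 * PTR_BITS // 8  # 8 pointers * 51 bits = 408 bits = 51 bytes

def pack_node(children, planes, meta):
    """Pack 8 line-aligned child addresses, 3 f32 planes and 1 metadata byte into 64 bytes."""
    assert len(children) == 8 and len(planes) == 3 and 0 <= meta < 256
    bits = 0
    for i, addr in enumerate(children):
        assert addr % LINE_BYTES == 0, "child nodes must be cache-line aligned"
        assert addr < (1 << 57), "address must fit in a 57-bit address space"
        bits |= (addr >> 6) << (i * PTR_BITS)  # store only the 51 significant bits
    return bits.to_bytes(PTR_FIELD_BYTES, "little") + struct.pack("<3f", *planes) + bytes([meta])

def child_pointer(node, n):
    """Extract the n-th child address: shift, mask, then restore the 6 alignment bits."""
    bits = int.from_bytes(node[:PTR_FIELD_BYTES], "little")
    return ((bits >> (n * PTR_BITS)) & PTR_MASK) << 6

children = [i * 64 * 12345 for i in range(7)] + [(1 << 57) - 64]
node = pack_node(children, (0.5, -1.25, 3.0), meta=0x2A)
print(len(node), [hex(child_pointer(node, i)) for i in range(8)])
```

The shift-and-mask in `child_pointer` is the part that would want a dedicated instruction if it sits on the traversal hot path.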

I suspect the reason BVH8 is only available in later GPUs is that they provided hardware support for it, not with full walkers but with instructions that quickly extract the nth pointer from such a packed structure. Node traversal needs to be really fast, and IIRC shifts generally aren't fast on GPUs, so you wouldn't want to try to implement that without help.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,355
1,550
136

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
Thanks. Then I got nothing, I don't understand why you'd use such a deep and narrow structure. Ray/box checks are cheap compared to the cost of cache misses.
This patent below could help you with the box node compression


This patent below contains handling both BVH4 and BVH8.


I think for prebuilt BVH trees the performance gain won't be much. For runtime-generated BVH trees it could be a big boost.
Cache and memory subsystem will be key.
For BVH4 it fetches two cache lines; for BVH8 it fetches a single cache line.

7. The method of claim 1, further comprising fetching the small bounding box nodes from a cache by fetching two cache lines and discarding unused portions of the cache lines.
8. The method of claim 1, further comprising fetching the large bounding box nodes from a cache by fetching a cache line as the large bounding box node.

https://twitter.com/NIV_Anteru , who also presented the amazing workgraphs demo recently, is the inventor of the patents below for generating a new kind of BVH tree

ACCELERATION STRUCTURES WITH DELTA INSTANCES
Reduction of data to be stored for a BVH node by representing the data of a box or triangle as a delta from a base node using smaller data types. This reduces the amount of data that has to be fetched from memory. The generation of the modified BVH can be done outside of the GPU
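As a generic illustration of the "delta from a base node using smaller data types" idea, here is a sketch that stores a child box as six 8-bit offsets inside its base box instead of six f32 values. This is ordinary conservative quantization and only my assumption of what such a scheme could look like; it is not the encoding described in the patent, and all names are made up.

```python
import math

LEVELS = 255  # one byte per coordinate

def encode_delta(base_min, base_max, box_min, box_max):
    """Quantize a box to 6 bytes relative to its base box (min rounded down, max rounded up)."""
    lo, hi = [], []
    for b0, b1, c0, c1 in zip(base_min, base_max, box_min, box_max):
        step = (b1 - b0) / LEVELS or 1.0  # avoid dividing by zero for flat base boxes
        lo.append(max(0, min(LEVELS, math.floor((c0 - b0) / step))))
        hi.append(max(0, min(LEVELS, math.ceil((c1 - b0) / step))))
    return bytes(lo + hi)

def decode_delta(base_min, base_max, payload):
    """Rebuild an approximate box that still encloses the original one."""
    steps = [((b1 - b0) / LEVELS or 1.0) for b0, b1 in zip(base_min, base_max)]
    box_min = tuple(b0 + q * s for b0, q, s in zip(base_min, payload[:3], steps))
    box_max = tuple(b0 + q * s for b0, q, s in zip(base_min, payload[3:], steps))
    return box_min, box_max

base = ((0.0, 0.0, 0.0), (10.0, 10.0, 10.0))
child = ((1.3, 2.7, 0.0), (4.2, 9.9, 10.0))
payload = encode_delta(*base, *child)
print(len(payload), "bytes instead of", 6 * 4)
print(decode_delta(*base, payload))
```

The decoded box is slightly larger than the original, so traversal stays correct at the cost of a few extra false-positive box hits.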

OVERLAY TREES FOR RAY TRACING

It looks to me like this will only help runtime-generated trees


BOUNDING VOLUME HIERARCHY HAVING ORIENTED BOUNDING BOXES WITH QUANTIZED ROTATIONS
From <https://www.freepatentsonline.com/y2023/0099806.html>
TECHNIQUES FOR INTRODUCING ORIENTED BOUNDING BOXES INTO BOUNDING VOLUME HIERARCHY
From <https://www.freepatentsonline.com/y2023/0027725.html>
COMMON CIRCUITRY FOR TRIANGLE INTERSECTION AND INSTANCE TRANSFORMATION FOR RAY TRACING
From <https://www.freepatentsonline.com/y2023/0206541.html>
VOLUME INTERSECTION USING ROTATED BOUNDING VOLUMES
https://www.freepatentsonline.com/y2023/0410426.html

Updated way to perform intersection tests too with rotated boxes
Rotated Bounding Box
  • Rotate Bounding Box during BVH building and then rotate Ray to match the bounding box
  • Rotated bounding boxes improve performance because with a poorly rotated bounding box there is a higher chance of the ray hitting the box but not the triangle inside
Poorly-fit bounding boxes can negatively impact performance because hits within poorly-fit bounding boxes that do not hit any underlying triangles are more common than hits within well-fit bounding boxes that do not hit any underlying triangles. The chance of a hit is directly related to the ratio of box volume to triangle surface area. Note that the two-dimensional diagram provided does not illustrate how much empty space there can be in a three-dimensional bounding box—there can be a much greater amount of such empty space in a three-dimensional bounding box. In addition, with a relatively large number of poorly fit bounding boxes, there is a greater chance that bounding boxes overlap (since bounding boxes must bound the interior geometry), which represents a degree of inefficiency. Hits within bounding boxes that do not hit any underlying triangles result in inefficiencies—it would be advantageous to stop traversal down a branch of a bounding volume hierarchy as early as possible if there are no triangles in that branch that are hit by the ray.

From <https://www.freepatentsonline.com/y2023/0099806.html>
  • The ray is rotated before intersection test on a bounding box and the triangles inside within the intersection engine. The generation of the modified BVH can be done outside the GPU
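The bullets above can be sketched with the classic slab test: keep the box axis-aligned in its own local frame and rotate the ray into that frame before testing. This is a toy version restricted to rotation about the z axis, with made-up names and a made-up scene; it is not how the patent or any hardware implements it.

```python
import math

def slab_hit(origin, direction, box_min, box_max):
    """Classic slab test: does the ray origin + t*direction (t >= 0) hit the axis-aligned box?"""
    t_near, t_far = 0.0, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            if o < lo or o > hi:  # parallel to this slab and outside it
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def rotate_z(v, angle):
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)

def rotated_box_hit(origin, direction, box_min, box_max, angle):
    """The box is axis-aligned in a frame rotated by `angle` about z, so rotate the ray by -angle."""
    return slab_hit(rotate_z(origin, -angle), rotate_z(direction, -angle), box_min, box_max)

# A thin sliver lying along the x = y diagonal, and the loose world-space AABB around it.
sliver = ((-5.0, -0.1, -0.1), (5.0, 0.1, 0.1))
loose = ((-3.61, -3.61, -0.1), (3.61, 3.61, 0.1))
ray = ((2.0, -2.0, -5.0), (0.0, 0.0, 1.0))  # passes well off the diagonal

print("loose AABB hit:", slab_hit(*ray, *loose))
print("rotated box hit:", rotated_box_hit(*ray, *sliver, math.pi / 4))
```

The ray hits the loose box but not the rotated one, which is the wasted traversal step the patent text is describing.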

I doubt PS5 will get all these with the weak CPU (if the BVH generation is not done on GPU)
 
Last edited:

gdansk

Platinum Member
Feb 8, 2011
2,123
2,629
136
I assume the comparison was for 7800 XT where in many games (e.g. Cyberpunk) it's averaging 2350MHz which puts RDNA4 on the 3500 MHz hype train, not 4GHz.
 

jpiniero

Lifer
Oct 1, 2010
14,629
5,247
136
I assume the comparison was for 7800 XT where in many games (e.g. Cyberpunk) it's averaging 2350MHz which puts RDNA4 on the 3500 MHz hype train, not 4GHz.

I was looking at the 3dmark data (of the 7900 XT), which while there is a huge spread... there were plenty that were hitting 2.8 if not a bit more. Plus the APUs can hit that high.

The issue might be more power consumption/AMD overvolting to maximize yield.
 

MoogleW

Member
May 1, 2022
50
27
51
I don't think the clock issue is as big as you think, unless by 'fixed clocks' you mean 4+ GHz.
If 'fixed' means AMD clocking that high on a node that promises FAR smaller clock improvements, the AMD design team would be light-years ahead of anyone else, and I'm not convinced.

If there even is such a 'bug', then I would expect at MOST a 3.2GHz clock, 30% higher than the boost clock of the 6700XT and over 40% higher than the boost clock of the 6800XT in games. At least according to this.

 
Last edited:

gdansk

Platinum Member
Feb 8, 2011
2,123
2,629
136
Even if RDNA4 or RDNA3.5 can clock at 3.2->3.5GHz, you still need to feed it.
Just look at the difference between 7700XT vs 7800XT. 6-7% in raw power, but thanks to the crippled memory subsystem you lose 15->19->22% depending on resolution.
Conversely, 7800XT has 44% more bandwidth but is only 15-22% faster.
Upgrade from 19.5gbps to 24gbps could be good enough.
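For reference, the bandwidth figures above can be reproduced from bus width and per-pin data rate. The 256-bit / 19.5gbps (7800XT) and 192-bit / 18gbps (7700XT) values are taken from public spec sheets and are assumptions here, since the posts only mention the 19.5gbps and 24gbps rates.

```python
def bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s = bus width (bits) * per-pin rate (Gbps) / 8."""
    return bus_width_bits * data_rate_gbps / 8

bw_7700xt = bandwidth_gbs(192, 18.0)    # assumed 7700XT configuration
bw_7800xt = bandwidth_gbs(256, 19.5)    # assumed 7800XT configuration
bw_24gbps = bandwidth_gbs(256, 24.0)    # same bus with faster GDDR6

print(f"7800XT vs 7700XT: +{(bw_7800xt / bw_7700xt - 1) * 100:.0f}% bandwidth")
print(f"19.5gbps -> 24gbps on 256-bit: +{(bw_24gbps / bw_7800xt - 1) * 100:.0f}% bandwidth")
```

So the memory speed bump alone would add a bit under a quarter more bandwidth on the same bus.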
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,362
2,854
106
Conversely, 7800XT has 44% more bandwidth but is only 15-22% faster.
Upgrade from 19.5gbps to 24gbps could be good enough.
I disagree with you.
7800XT has higher BW than needed; increasing it from 19.5gbps to 24gbps wouldn't help unless you increase GPU clocks significantly.

P.S. This BW should be enough for ~3GHz, and the extra 16MB for another couple of 100 MHz.
 
Last edited:

gdansk

Platinum Member
Feb 8, 2011
2,123
2,629
136
I'm not following. If the successor to 7800 XT is 3500MHz in games, then a higher speed GDDR6 will likely be sufficient. You think it won't be?
 
Reactions: Tlh97