Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
1,615
5,869
136





With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around 3 quarters to get support into LLVM and amdgpu. Since RDNA2, the window in which they push support for new devices has been much reduced, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up situation like Frontier's).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of no host CPU capable of PCIe 5 in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts; the MI100/200/300 cadence is impressive.



Previous thread on CDNA2 and RDNA3 here

 
Last edited:

Mahboi

Senior member
Apr 4, 2024
341
574
91
My guess is AMD was counting on the N31 and N32 GCDs clocking >20% higher than they ended up doing. That's why there were initial plans to put additional Infinity Cache via V-Cache over the MCDs: higher effective bandwidth through extra LLC to properly feed GCDs that were supposed to be much more demanding at higher clocks.
All of them. N33 too.

branch_suggestion said it, the advanced plans were canned when they realised it was a dog.
My only gripe is that Navi 40 got cut off at the head, shoulders, and chest. All that's left is below the crotch. N48 is the size of a 6600 XT. N44 is a tiny little 130mm² thing meant to go into laptops.

Meanwhile we're left waiting another 2 years or so for another promise of RDNA hype. It's just a little tiring to see so little effort going into costly/risky products. NVidia has no problem dishing out massive monolithic 600+mm² dies every single gen. They aren't nearly as obsessed with penny-pinching, and it shows. Nvidia goes big, then tries to win big. AMD goes small, then tries to win big anyway. RDNA 4 is just another entry in the long list of "meh" moments.

I do have great hopes for its RT to be closer to Lovelace, and for the PS5 Pro to show some serious chops both in upscaling and in raytracing. It's time for AMD & partners to at least ring the bell of the latter half of the 2020s, rather than wallow in token RT and no-ML upscaling.
 

Tigerick

Senior member
Apr 1, 2022
676
555
106
Hmm, from GDDR7 to 20 Gbps GDDR6 and now 18 Gbps GDDR6. I can't believe people said memory bandwidth is not important. Let's see:

N44 (32 CU): same BW as the 7600 XT (32 CU). Makes sense.

N48 (64 CU): 576 GB/s, compared to the 7800 XT (60 CU, 624 GB/s) and the 7700 XT (54 CU, 432 GB/s). Hmm, we might see another change in the N48 specs...
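Those bandwidth figures fall straight out of bus width times per-pin data rate; a quick sanity check in Python (the bus widths, 256-bit for N48 and the 7800 XT and 192-bit for the 7700 XT, are my assumption, not stated in the post):

```python
# Peak GDDR bandwidth: (bus width in bits / 8 bits per byte) * per-pin data rate in Gbps.
def gddr_bandwidth(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s."""
    return bus_width_bits / 8 * data_rate_gbps

# Figures quoted above; bus widths are assumptions on my part.
print(gddr_bandwidth(256, 18.0))   # 576.0 (rumored N48)
print(gddr_bandwidth(256, 19.5))   # 624.0 (7800 XT)
print(gddr_bandwidth(192, 18.0))   # 432.0 (7700 XT)
```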
 

Saylick

Diamond Member
Sep 10, 2012
3,194
6,492
136
I do have great hopes for its RT to be closer to Lovelace, and for the PS5 Pro to show some serious chops both in upscaling and in raytracing. It's time for AMD & partners to at least ring the bell of the latter half of the 2020s, rather than wallow in token RT and no-ML upscaling.
FWIW, @Kepler_L2 compiled a quick comparison of the box and ray intersection capabilities of AMD's and Nvidia's GPU architectures (see below). If I'm not mistaken, this is an RT-unit-vs-RT-unit comparison: RDNA 2/3 has two RT units per WGP while Nvidia has one RT unit per SM (WGP has 128 FP units while SM has 256 FP units). What AMD is missing is a hardware BVH walker, or traversal unit, which will finally be added in RDNA 4.


RDNA 2 WGP:


Ada Lovelace SM:


EDIT: Dang, seems like I need to do more homework on how RT works, but I found an informative Reddit post which sheds a little more light.
 
Last edited:
Aug 4, 2023
177
374
96
NVidia has no problem dishing out massive monolithic 600+mm² dies every single gen.
They have their 2-year cadence and they execute it, along with their 2x Blender (et al.) perf-per-gen goal.
The only time they took over 2 years was Pascal > Turing, due to Volta, but they had plenty of slack thanks to the lead they built starting with Maxwell.
Looking at uArch changes per gen, AMD has been more ambitious recently, but that makes execution risky; NV keeps some things very much the same and iterates on a bunch of blocks.
Ampere's big change over Turing was the dual-issue thing; Ada vs Ampere did the L2 embiggening.
NV doesn't want to overhaul things too much at once due to keeping CUDA compatibility as consistent as possible.
 

Saylick

Diamond Member
Sep 10, 2012
3,194
6,492
136
It is funny how box-oriented AMD actually is.
Wonder what got them to do it this way...
The more I read, the more I believe that ray-box intersection throughput is what speeds up BVH traversal. As I understand it, the BVH is composed mostly of boxes within boxes, with the triangles sitting in the last boxes at the bottom of the tree, although it largely depends on how the BVH is built.

You can have a relatively wide but shallow BVH, or a narrower but deeper one. Wider structures have more boxes in each level, thus requiring more ray-box intersection tests to get to the next level, but a deeper BVH requires more jumps, because there are simply more levels, and more jumps means more demand on the cache/memory subsystem (every time you drill down a level, you jump to a new memory address, as far as I understand). So there's a balance between the two approaches.

The goal is to identify as quickly as possible which triangle gets hit by the ray. More ray-box intersection capability gets you through the levels of boxes quicker, and more ray-triangle intersection capability helps you identify the final triangle in the last box quicker.
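For the curious, the per-node ray-box check these units accelerate is usually the classic "slab" test; a minimal software sketch for illustration only, not any vendor's actual hardware implementation (it assumes the ray direction has no zero components, so the precomputed reciprocals are finite):

```python
# Illustrative "slab" ray-AABB intersection test, the kind of check an RT unit
# runs per BVH node in hardware. Software sketch only, not a vendor design.
def ray_hits_box(origin, inv_dir, box_min, box_max):
    """True if the ray origin + t*dir (t >= 0) passes through the box.

    inv_dir holds 1/dir per axis, precomputed once per ray.
    """
    t_near, t_far = 0.0, float("inf")
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t1, t2 = (lo - o) * inv, (hi - o) * inv
        t_near = max(t_near, min(t1, t2))  # latest entry across all slabs
        t_far = min(t_far, max(t1, t2))    # earliest exit across all slabs
    return t_near <= t_far

# Ray from the origin along (1, 1, 1): hits a box spanning (1..2) on every
# axis, misses a box that sits far out along x but not along y/z.
print(ray_hits_box((0, 0, 0), (1.0, 1.0, 1.0), (1, 1, 1), (2, 2, 2)))    # True
print(ray_hits_box((0, 0, 0), (1.0, 1.0, 1.0), (5, -1, -1), (6, 1, 1)))  # False
```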

Seems like AMD prefers a deeper BVH, which puts more pressure on the cache and memory subsystem, while Nvidia prefers a wider but shallower one. The former is a very CPU-like approach (RDNA 2 has a souped-up cache system, thanks to AMD's CPU team's involvement in its design) while the latter is a very GPU-like approach, given that GPUs have a ton of compute and memory latency is high.

Some helpful additional reading:
 
Last edited:

Mahboi

Senior member
Apr 4, 2024
341
574
91
The more I read, the more I am now believing that ray-box intersection tests is what speeds up BVH traversal. As I understand it, the BVH structure is comprised of mostly boxes within boxes, where triangles are in the last boxes at the bottom of the tree, although it largely depends on how the BVH structure is built.
You can have a relatively wide but shallow BVH structure, or a narrower but deeper BVH structure. Wider structures have more boxes in each level, thus requiring more ray-box intersection tests to get to the next level, but a deeper BVH requires more jumps because there's simply more levels and more jumps means more demand on the cache/memory subsystem (every time you drill down a level, you need to jump to a new memory address as far as I understand). It's apparent that there's a balance between the two approaches.
Can't speak about RT, but I know BVHs from back when I was studying game physics.

It's essentially a 3D space divided into 8 (yes, this can be divided into more than that; I couldn't find a clean picture).
You divide the top-level bounding box, which contains your world (for physics, not RT), by its center point. Divide it in 3D and you get X, Y, and Z axis dividers. These create 2 boxes top-forward, 2 boxes top-back, 2 boxes bottom-forward, 2 boxes bottom-back, if you can visualise that.

These boxes have effectively divided your world into 8, which is necessary for physics, since you can't have every single dynamic thing in the world collide with every single other thing. You divide them into usable areas.

A BVH is a structure that holds these "divisions" of the world and tests them (i.e. stores which elements are within each bounding box) when a ray passes through. I'm not sure if it's based on the camera's viewport or the world; I imagine the world, but I don't know.

I'm guessing that "8 Boxes" means that the world gets divided 8 times, and the Hierarchy thus can divide by 2^8, so 256 times more precision. Which means that you greatly lower the number of needed calculations because you can get a much more precise idea of where the ray will collide. However, I'm not too sure on 2 triangles. It returns 2 triangles hit by the ray, I guess? So it lights them up together?

The goal is to identify as quick as possible which triangle gets hit by the ray. Having more ray-box intersection lets you get through the levels of boxes quicker, and having more ray-triangle intersection lets you identify the final triangle in the last box.
I'd say the opposite: more ray-box intersections means more boxes, which means longer times to get through. That would possibly explain why they waited until they had a BVH walker to go up to 8 boxes, but then again NV doesn't seem to give a damn about going beyond 4.
I'm gonna need a hand guessing what ray-triangle is, if it's not one triangle and its nearest one, or going through one triangle. But I really don't see why NV would go through three triangles, so probably it's just a triangle and its neighbour.

Edit: it would be quite consistent with the two companies' general philosophies for AMD's BVH to be more precise but also heavier on the HW, while NV goes for more triangles and loses precision. No proof of it or anything, but if you look at how much AMD cares about FP64 and how much NV cares about FP8, I'd say I'm probably on the right track.

Edit 2: now that I think about it, an 8-deep BVH would have an insane number of boxes to store. Forget walking through it; it would already take a crazy amount of VRAM.
If every box division yields 8 times more: 8^8 = 16,777,216 boxes. I suppose they can have a smart structure that just erases/ignores the empty boxes, but still. 8^4 = 4,096 volumes, for comparison.
 
Last edited:

Aapje

Golden Member
Mar 21, 2022
1,395
1,883
106
Meanwhile we're left waiting another 2 years or so for another promise of RDNA hype.

Rumor is that RDNA4 is a stopgap exactly so they can accelerate RDNA5 for a release late next year. So please continue the hype.

It's just a little tiring to see so little effort into making costly/risky products. NVidia has no problem dishing out massive monolithic 600+mm² dies every single gen.

A rather weird take, when Nvidia just keeps shoving more compute into chips with the same architecture while AMD took a gamble on chiplets for GPUs.

They aren't nearly as obsessed with penny pinching, and it shows. Nvidia goes big then tries to win big. AMD goes small then tries to win big anyway.

You completely ignore that AMD can't win big anyway with just hardware, as their software stack is lagging behind Nvidia, primarily due to CUDA.

AMD can't beat that unless they manage to gradually make inroads with ROCm, they wait for the big players to abandon CUDA and pick/create an open standard (which would be the sensible move for Google/MS/Meta/Amazon/etc), or they have such strong hardware that the market will make big short-term investments to support ROCm, because the hardware is so much better than Nvidia's.

The last option is the least likely, not least because Jensen is not a slacker like the previous Intel CEOs, and Lisa would be very dumb to gamble on it and risk huge losses on expensive dies that don't sell for enough money.

RDNA 4 is just another one in the long list of "meh" moments.

Perhaps not for gamers who just want to play their games at high quality for a relatively modest price. The 7800 XT seems to sell very well for a card that is less meh than most cards of this gen, so if the 8800 XT is a much better kind of meh, that would be pretty nice for gamers with that kind of budget.
 
Reactions: Tlh97 and moinmoin

Saylick

Diamond Member
Sep 10, 2012
3,194
6,492
136
I'm guessing that "8 Boxes" means that the world gets divided 8 times, and the Hierarchy thus can divide by 2^8, so 256 times more precision. Which means that you greatly lower the number of needed calculations because you can get a much more precise idea of where the ray will collide. However, I'm not too sure on 2 triangles. It returns 2 triangles hit by the ray, I guess? So it lights them up together?
As far as I understand it, that is correct. A binary BVH has 2 branches per box (aka BVH2), BVH4 has 4 branches per box, and BVH8 has 8. I don't necessarily agree that a higher branching factor is "more accurate", because whether you split things into 2, 4, or 8 boxes, you still have to traverse down to the final "leaf" node to find the triangle the ray intersects (or doesn't).
I'd say the opposite, more ray-box intersections means more boxes, means longer times to get through. Which would possibly explain why they waited to have a BVH walker to go up to 8 Box, but then again NV doesn't seem to give a damn going beyond 4.
I'm gonna need a hand on guessing what ray-triangle is, if it's not one triangle and its nearest one, or going through one triangle. But I really don't see why NV would go through three triangles, so probably it's just a triangle and its neighbour.
More ray-box intersection capability just means you can get through more box intersection tests per cycle. That's it.
Edit: it would be quite consistent with the general company philosophies that AMD's BVH is more precise but also heavier on the HW, while NV goes for more triangles and loses precision. No proof of it or anything, but if you look at how much AMD cares about FP64 and how much NV cares about FP8, I'd say I'm probably on the right guess.
Again, I'm not sure "more precise" is the correct way to see it.
Edit 2: now that I think about it, a 8-deep BVH would have an insane amount of boxes to store. Forget walking through it, it would already take a crazy amount of VRAM space.
If every box division yields 8 times more: 8^8 = 16,777,216 boxes. I suppose they can have a smart structure that just erases/ignores the empty boxes, but still. 8^4 == 4096 volumes for comparison.
It's a matter of width vs. depth. In theory, if you have a higher branching factor, meaning each layer is wider, you need fewer layers in general, because of how finely you are chopping up the space.

To use your example, you can get to 4,096 volumes using BVH2 (12 layers: 2^12 = 4096), BVH4 (6 layers: 4^6 = 4096), or BVH8 (4 layers: 8^4 = 4096).
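That width-vs-depth arithmetic can be checked with a couple of lines; a throwaway sketch using the 4096-volume example:

```python
# Levels needed for a complete BVH with a given branching factor, using exact
# integer arithmetic (math.log can give off-by-one answers from float error).
def bvh_depth(num_leaves: int, branching: int) -> int:
    depth, nodes = 0, 1
    while nodes < num_leaves:
        nodes *= branching
        depth += 1
    return depth

# 4096 leaf volumes, as in the example above:
for b in (2, 4, 8):
    print(f"BVH{b}: {bvh_depth(4096, b)} levels")   # 12, 6, and 4 levels
```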

I was doing more reading and came across this article. It helped me understand some of the theory.
 
Reactions: Tlh97 and cherullo

blckgrffn

Diamond Member
May 1, 2003
9,131
3,072
136
www.teamjuchems.com
What's the source on that? It would be a bit shocking to me if that were true: GDDR7 uses PAM3 signaling and is significantly different from GDDR6, so a GDDR6-only memory controller would be much simpler and smaller.
If RDNA5 is really 2026, a bump in mid-'25 with GDDR7 would be... something, at least. Not saying that proves anything, but for a SKU to last ~3 years and bridge memory technologies during that time, it's at least plausible from a strategy standpoint.
 

Saylick

Diamond Member
Sep 10, 2012
3,194
6,492
136
What's the source on that? It would be a bit shocking to me for that to be true, GDDR7 is PAM3 and significantly different from GDDR6, a GDDR6-only memory controller would be much simpler and smaller.
It's almost as if the higher-end RDNA 4 parts that got canned shared the same memory controller and were designed to use GDDR7, while the lower-end parts, the ones that didn't get canned, were intended to use GDDR6. Hmm....
 