Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
1,615
5,869
136





With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around 3 quarters to get support into LLVM and amdgpu. Since RDNA2, the window in which they push support for new devices has been much reduced, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up situation like Frontier's).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of no host CPU capable of PCIe 5 in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts; the MI100/200/300 cadence is impressive.



Previous thread on CDNA2 and RDNA3 here

 
Last edited:

Mahboi

Senior member
Apr 4, 2024
341
574
91
My guess is AMD was counting on the N31 and N32 GCDs clocking >20% higher than they ended up doing. That's why there were initial plans to put additional Infinity Cache via V-Cache over the MCDs: higher effective bandwidth through extra LLC to properly feed GCDs that were supposed to be much more demanding at higher clocks.
All of them. N33 too.

branch_suggestion said it, the advanced plans were canned when they realised it was a dog.
My only gripe is that Navi 40 got cut off at the head, shoulders, and chest. All that's left is below the crotch. N48 is the size of a 6600 XT. N44 is a tiny little 130mm² thing meant to go into laptops.

Meanwhile we're left waiting another 2 years or so for another promise of RDNA hype. It's just a little tiring to see so little effort going into costly/risky products. NVidia has no problem dishing out massive monolithic 600+mm² dies every single gen. They aren't nearly as obsessed with penny-pinching, and it shows. Nvidia goes big, then tries to win big. AMD goes small, then tries to win big anyway. RDNA 4 is just another entry in the long list of "meh" moments.

I do have great hopes for its RT to be closer to Lovelace, and for the PS5 Pro to show some serious chops both in upscaling and in raytracing. It's time for AMD & partners to at least ring the bell of the latter half of the 2020s, rather than wallow in token RT and no-ML upscaling.
 

Tigerick

Senior member
Apr 1, 2022
676
555
106
Hmm, from GDDR7 to 20 Gbps GDDR6 and now 18 Gbps GDDR6. I can't believe people said memory bandwidth is not important. Let's see:

N44 (32 CU): same BW as the 7600 XT (32 CU). Makes sense.

N48 (64 CU): 576 GB/s, compared to the 7800 XT (60 CU, 624 GB/s) and the 7700 XT (54 CU, 432 GB/s). Hmm, we might see another change in the N48 specs...
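Those bandwidth figures fall straight out of bus width times per-pin data rate; a quick sanity check in Python (the bus widths, 256-bit for N48 and the 7800 XT and 192-bit for the 7700 XT, are my assumption, not stated in the post):

```python
# Peak GDDR bandwidth: (bus width in bits / 8 bits per byte) * per-pin data rate in Gbps.
def gddr_bandwidth(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s."""
    return bus_width_bits / 8 * data_rate_gbps

# Figures quoted above; bus widths are assumptions on my part.
print(gddr_bandwidth(256, 18.0))   # 576.0 (rumored N48)
print(gddr_bandwidth(256, 19.5))   # 624.0 (7800 XT)
print(gddr_bandwidth(192, 18.0))   # 432.0 (7700 XT)
```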
 

Saylick

Diamond Member
Sep 10, 2012
3,194
6,492
136
I do have great hopes for its RT to be closer to Lovelace, and for the PS5 Pro to show some serious chops both in upscaling and in raytracing. It's time for AMD & partners to at least ring the bell of the latter half of the 2020s, rather than wallow in token RT and no-ML upscaling.
FWIW, @Kepler_L2 compiled a quick comparison of the box and ray intersection capabilities of AMD's and Nvidia's GPU architectures (see below). If I'm not mistaken, this is an RT-unit-vs-RT-unit comparison: RDNA 2/3 has two RT units per WGP while Nvidia has one RT unit per SM (WGP has 128 FP units while SM has 256 FP units). What AMD is missing is a hardware BVH walker, or traversal unit, which will finally be added in RDNA 4.


RDNA 2 WGP:


Ada Lovelace SM:


EDIT: Dang, seems like I need to do more homework on how RT works, but I found an informative Reddit post which sheds a little more light.
 
Last edited:
Aug 4, 2023
177
374
96
NVidia has no problem dishing out massive monolithic 600+mm² dies every single gen.
They have their 2-year cadence and they execute it, along with their 2x Blender (et al.) perf-per-gen goal.
The only time they took over 2 years was Pascal > Turing, due to Volta, but they had plenty of slack thanks to the lead they built starting with Maxwell.
Looking at uArch changes per gen, AMD has been more ambitious recently, but that makes execution risky; NV keeps some things very much the same and iterates on a bunch of blocks.
Ampere's big change over Turing was the dual-issue thing; Ada vs Ampere did the L2 embiggening.
NV doesn't want to overhaul things too much at once due to keeping CUDA compatibility as consistent as possible.
 

Saylick

Diamond Member
Sep 10, 2012
3,194
6,492
136
It is funny how box-oriented AMD actually is.
Wonder what got them to do it this way...
The more I read, the more I believe that ray-box intersection throughput is what speeds up BVH traversal. As I understand it, the BVH is composed mostly of boxes within boxes, with the triangles sitting in the last boxes at the bottom of the tree, although it largely depends on how the BVH is built.

You can have a relatively wide but shallow BVH, or a narrower but deeper one. Wider structures have more boxes in each level, thus requiring more ray-box intersection tests to get to the next level, but a deeper BVH requires more jumps, because there are simply more levels, and more jumps means more demand on the cache/memory subsystem (every time you drill down a level, you jump to a new memory address, as far as I understand). So there's a balance between the two approaches.

The goal is to identify as quickly as possible which triangle gets hit by the ray. More ray-box intersection capability gets you through the levels of boxes quicker, and more ray-triangle intersection capability helps you identify the final triangle in the last box quicker.
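For the curious, the per-node ray-box check these units accelerate is usually the classic "slab" test; a minimal software sketch for illustration only, not any vendor's actual hardware implementation (it assumes the ray direction has no zero components, so the precomputed reciprocals are finite):

```python
# Illustrative "slab" ray-AABB intersection test, the kind of check an RT unit
# runs per BVH node in hardware. Software sketch only, not a vendor design.
def ray_hits_box(origin, inv_dir, box_min, box_max):
    """True if the ray origin + t*dir (t >= 0) passes through the box.

    inv_dir holds 1/dir per axis, precomputed once per ray.
    """
    t_near, t_far = 0.0, float("inf")
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t1, t2 = (lo - o) * inv, (hi - o) * inv
        t_near = max(t_near, min(t1, t2))  # latest entry across all slabs
        t_far = min(t_far, max(t1, t2))    # earliest exit across all slabs
    return t_near <= t_far

# Ray from the origin along (1, 1, 1): hits a box spanning (1..2) on every
# axis, misses a box that sits far out along x but not along y/z.
print(ray_hits_box((0, 0, 0), (1.0, 1.0, 1.0), (1, 1, 1), (2, 2, 2)))    # True
print(ray_hits_box((0, 0, 0), (1.0, 1.0, 1.0), (5, -1, -1), (6, 1, 1)))  # False
```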

Seems like AMD prefers a deeper BVH, which puts more pressure on the cache and memory subsystem, while Nvidia prefers a wider but shallower one. The former is a very CPU-like approach (RDNA 2 has a souped-up cache system, thanks to AMD's CPU team's involvement in its design) while the latter is a very GPU-like approach, given that GPUs have a ton of compute and memory latency is high.

Some helpful additional reading:
 
Last edited:

Mahboi

Senior member
Apr 4, 2024
341
574
91
The more I read, the more I am now believing that ray-box intersection tests is what speeds up BVH traversal. As I understand it, the BVH structure is comprised of mostly boxes within boxes, where triangles are in the last boxes at the bottom of the tree, although it largely depends on how the BVH structure is built.
You can have a relatively wide but shallow BVH structure, or a narrower but deeper BVH structure. Wider structures have more boxes in each level, thus requiring more ray-box intersection tests to get to the next level, but a deeper BVH requires more jumps because there's simply more levels and more jumps means more demand on the cache/memory subsystem (every time you drill down a level, you need to jump to a new memory address as far as I understand). It's apparent that there's a balance between the two approaches.
Can't speak about RT, but I know BVHs from back when I was studying game physics.

It's essentially a 3D space divided into 8 (yes, this can be divided into more than that; I couldn't find a clean picture).
You divide the top-level bounding box, which contains your world (for physics, not RT), by its center point. Divide it in 3D and you get X, Y, and Z axis dividers. These create 2 boxes top-forward, 2 boxes top-back, 2 boxes bottom-forward, 2 boxes bottom-back, if you can visualise that.

These boxes have effectively divided your world into 8, which is necessary for physics, since you can't have every single dynamic thing in the world collide with every single other thing. You divide them into usable areas.

A BVH is a structure that holds these "divisions" of the world and tests them (i.e. stores which elements are within each bounding box) when a ray passes through. I'm not sure if it's based on the camera's viewport or the world; I imagine the world, but I don't know.

I'm guessing that "8 Boxes" means that the world gets divided 8 times, and the Hierarchy thus can divide by 2^8, so 256 times more precision. Which means that you greatly lower the number of needed calculations because you can get a much more precise idea of where the ray will collide. However, I'm not too sure on 2 triangles. It returns 2 triangles hit by the ray, I guess? So it lights them up together?

The goal is to identify as quick as possible which triangle gets hit by the ray. Having more ray-box intersection lets you get through the levels of boxes quicker, and having more ray-triangle intersection lets you identify the final triangle in the last box.
I'd say the opposite: more ray-box intersections means more boxes, which means longer times to get through. That would possibly explain why they waited until they had a BVH walker to go up to 8 boxes, but then again NV doesn't seem to give a damn about going beyond 4.
I'm gonna need a hand guessing what ray-triangle is, if it's not one triangle and its nearest one, or going through one triangle. But I really don't see why NV would go through three triangles, so probably it's just a triangle and its neighbour.

Edit: it would be quite consistent with the two companies' general philosophies for AMD's BVH to be more precise but also heavier on the HW, while NV goes for more triangles and loses precision. No proof of it or anything, but if you look at how much AMD cares about FP64 and how much NV cares about FP8, I'd say I'm probably on the right track.

Edit 2: now that I think about it, an 8-deep BVH would have an insane number of boxes to store. Forget walking through it; it would already take a crazy amount of VRAM.
If every box division yields 8 times more: 8^8 = 16,777,216 boxes. I suppose they can have a smart structure that just erases/ignores the empty boxes, but still. 8^4 = 4,096 volumes, for comparison.
 
Last edited:

Aapje

Golden Member
Mar 21, 2022
1,395
1,883
106
Meanwhile we're left waiting another 2 years or so for another promise of RDNA hype.

Rumor is that RDNA4 is a stopgap exactly so they can accelerate RDNA5 for a release late next year. So please continue the hype.

It's just a little tiring to see so little effort into making costly/risky products. NVidia has no problem dishing out massive monolithic 600+mm² dies every single gen.

A rather weird take, when Nvidia just keeps shoving more compute into chips with the same architecture while AMD took a gamble on chiplets for GPUs.

They aren't nearly as obsessed with penny pinching, and it shows. Nvidia goes big then tries to win big. AMD goes small then tries to win big anyway.

You completely ignore that AMD can't win big anyway with just hardware, as their software stack is lagging behind Nvidia, primarily due to CUDA.

AMD can't beat that unless they manage to gradually make inroads with ROCm, they wait for the big players to abandon CUDA and pick/create an open standard (which would be the sensible move for Google/MS/Meta/Amazon/etc), or they have such strong hardware that the market will make big short-term investments to support ROCm, because the hardware is so much better than Nvidia's.

The last option is the least likely, not least because Jensen is not a slacker like the previous Intel CEOs, and Lisa would be very dumb to gamble on it and risk huge losses on expensive dies that don't sell for enough money.

RDNA 4 is just another one in the long list of "meh" moments.

Perhaps not for gamers who just want to play their games at high quality for a relatively modest price. The 7800 XT seems to sell very well for a card that is less meh than most cards of this gen, so if the 8800 XT is a much better kind of meh, that would be pretty nice for gamers with that kind of budget.
 
Reactions: Tlh97 and moinmoin

Saylick

Diamond Member
Sep 10, 2012
3,194
6,492
136
I'm guessing that "8 Boxes" means that the world gets divided 8 times, and the Hierarchy thus can divide by 2^8, so 256 times more precision. Which means that you greatly lower the number of needed calculations because you can get a much more precise idea of where the ray will collide. However, I'm not too sure on 2 triangles. It returns 2 triangles hit by the ray, I guess? So it lights them up together?
As far as I understand it, that is correct. A binary BVH has 2 branches per box (aka BVH2), BVH4 has 4 branches per box, and BVH8 has 8. I don't necessarily agree that a higher branching factor is "more accurate", because whether you split things into 2, 4, or 8 boxes, you still have to traverse down to the final "leaf" node to find the triangle the ray intersects (or doesn't).
I'd say the opposite, more ray-box intersections means more boxes, means longer times to get through. Which would possibly explain why they waited to have a BVH walker to go up to 8 Box, but then again NV doesn't seem to give a damn going beyond 4.
I'm gonna need a hand on guessing what ray-triangle is, if it's not one triangle and its nearest one, or going through one triangle. But I really don't see why NV would go through three triangles, so probably it's just a triangle and its neighbour.
More ray-box intersection capability just means you can get through more box intersection tests per cycle. That's it.
Edit: it would be quite consistent with the general company philosophies that AMD's BVH is more precise but also heavier on the HW, while NV goes for more triangles and loses precision. No proof of it or anything, but if you look at how much AMD cares about FP64 and how much NV cares about FP8, I'd say I'm probably on the right guess.
Again, I'm not sure "more precise" is the correct way to see it.
Edit 2: now that I think about it, a 8-deep BVH would have an insane amount of boxes to store. Forget walking through it, it would already take a crazy amount of VRAM space.
If every box division yields 8 times more: 8^8 = 16,777,216 boxes. I suppose they can have a smart structure that just erases/ignores the empty boxes, but still. 8^4 == 4096 volumes for comparison.
It's a matter of width vs. depth. In theory, if you have a higher branching factor, meaning each layer is wider, you need fewer layers in general, because of how finely you are chopping up the space.

To use your example, you can get to 4,096 volumes using BVH2 (12 layers: 2^12 = 4096), BVH4 (6 layers: 4^6 = 4096), or BVH8 (4 layers: 8^4 = 4096).
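That width-vs-depth arithmetic can be checked with a couple of lines; a throwaway sketch using the 4096-volume example:

```python
# Levels needed for a complete BVH with a given branching factor, using exact
# integer arithmetic (math.log can give off-by-one answers from float error).
def bvh_depth(num_leaves: int, branching: int) -> int:
    depth, nodes = 0, 1
    while nodes < num_leaves:
        nodes *= branching
        depth += 1
    return depth

# 4096 leaf volumes, as in the example above:
for b in (2, 4, 8):
    print(f"BVH{b}: {bvh_depth(4096, b)} levels")   # 12, 6, and 4 levels
```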

I was doing more reading and came across this article. It helped me understand some of the theory.
 
Reactions: Tlh97 and cherullo

blckgrffn

Diamond Member
May 1, 2003
9,131
3,072
136
www.teamjuchems.com
What's the source on that? It would be a bit shocking to me if that were true: GDDR7 uses PAM3 signaling and is significantly different from GDDR6, so a GDDR6-only memory controller would be much simpler and smaller.
If RDNA5 is really 2026, a bump in mid-'25 with GDDR7 would be... something, at least. Not saying that proves anything, but for a SKU to last ~3 years and bridge memory technologies during that time, it's at least plausible from a strategy standpoint.
 

Saylick

Diamond Member
Sep 10, 2012
3,194
6,492
136
What's the source on that? It would be a bit shocking to me for that to be true, GDDR7 is PAM3 and significantly different from GDDR6, a GDDR6-only memory controller would be much simpler and smaller.
It's almost as if the higher-end RDNA 4 parts that got canned shared the same memory controller and were designed to use GDDR7, while the lower-end parts, the ones that didn't get canned, were intended to use GDDR6. Hmm....
 