Discussion RDNA 5 / UDNA (CDNA Next) speculation

jpiniero · Sunday at 8:32 PM

eek2121 said:
This is incorrect. What you were seeing is a different front end for Windows. They never stated you could play Xbox games.

Figure this is part of how MS is going to transition the XBox user base away from console console to... well... Windows... running as a console. It sounds like it would only be made available to machines with the MS semi custom chip(s).

The other part of this is making Sony look bad since you could run it as an XBox, or use it as a PC and not have to pay for MP.

adroc_thurston · Sunday at 9:50 PM

jpiniero said:
Figure this is part of how MS is going to transition the XBox user base away from console console to... well... Windows... running as a console

They're not doing that at all.

511 · Sunday at 11:04 PM

Kepler_L2 said:
That should be a favourable comparison if anything, DXR 1.2 SER/OMM is supported with SW emulation on RDNA4 while RDNA5 has HW accel, and NRC/Ray denoising should be much faster on RDNA5 thanks to FP4 support.

So the 2X AI is comparing FP4 vs FP8 🤣🤣

basix · Monday at 2:48 AM

This is very likely, yes. I do not see huge benefits in ramping up ML/AI FLOPS any more than already implemented. RDNA4 features the same matrices to FP32 FLOPS ratio like Nvidia since Ampere. And per SM it stayed constant since Turing if just looking at FP16-Tensor.

Kepler_L2 · Monday at 6:05 AM

adroc_thurston said:
I don't think they can run denoiser at FP4.
Even FP8 is stretching it.

NVIDIA does Ray denoising with FP8 though

del42sa · Monday at 7:26 AM

511 said:
Kepler your message originated from another forum went to reddit twitter and came here just to get confirmed lmao.
If you have info on Xe3/4 as well don't hesitate to share on Intel thread.

My guess is that the next generation of graphic cards will be called "AI-DNA" and that it will have much better perf/watt per CU plus much greater performance in RT with twice as much memory and really huge AI capability, but please don´t tell anyone, cos it´s secret.

basix · Monday at 3:10 PM

Kepler_L2 said:
NVIDIA does Ray denoising with FP8 though

I am pretty convinced, that FP4 or FP6 will be utilized wherever possible (advancements in model research, weight pruning, etc.). This allows for multiple things:

Reduced memory and bandwidth usage per weight
More weights = Enhanced model = Enhanced quality
More FLOPS = Faster execution

But you could probably not switch to FP4 for everything, just because Blackwell was released. With that you wouldn't make your user base happy. Some sort this happened with Ray Reconstruction on Turing and Ampere (DLSS4). It still runs, but with FP16 and much slower than the CNN version of RR.

Anyways, for future iterations the next logical step would be FP4 for Nvidia (Blackwell and Blackwell Next support it with more FLOPS) and FP4/FP6 for AMD (RDNA5). DLSS4 Super Resolution still runs on FP16, but I suspect they left it like that because it is rather light on the Tensor Cores (I traced that by myself) and due to DLSS4 SR compatiblity back to Turing. About Frame Generation I am not certain, could be FP8 (because only supported on Ada and Blackwell). But the Tensor Core load is minimal as well.

Note: SR and FG use regular Vector/FP32 math and this for the vast majority of their execution times (roughly 80...90% of it).

The only thing which really eats frametime on the Tensor Cores is Ray Reconstruction. See my Nsight trace below (Zorah NVRTX sample, look at the bottom at "SM Tensor Pipe Active").

RR takes a significant chunk of the frametime
SR is tiny (also with RR there is an additional tensor core pass there)
FG is tiny

If I could propose optimization goals for DLSS5 / FSR5:

SR and FG: Move to FP8
- FP16 fallback on older architectures -> Turing, Ampere, maybe RDNA3
SR and FG: Optimize everything around it with vector FP32 math to use Tensor BF16 instead
- That would quadruple the FLOPS and saves bandwidth
- Use FP16 Tensor as FP16 vector replacement (doubled FLOPS)
RR: Move to FP4
- FP8 fallback on older architectures -> Ada, RDNA4

AMD / Nvidia / Game devs / Engine devs:

Fullscreen effects: Move to Tensor FP16 or BF16 (get rid of fullscreen effects with vector FP16/FP32)
- That would double/quadruple the FLOPS and makes high resolutions less taxing (e.g. Super Resolution and any post processing would scale better)
- DX12 updates like cooperative vectors could be used as API basis.

Why BF16 Tensor as FP32 vector replacement?

BF16 has the same value range like FP32, but with reduced precision
BF16 works on all WMMA capable GPUs and will stay / stick around
- Except Turing, which has only limited BF16 support. But hey, Turing would be 9 years old by ~2027
- TF32 would be another option, but was recently thrown out of CDNA4 hardware (TF32 gets emulated in software with help of BF16)
Some existing FP16 vector based effects should be portable to FP16 Tensor rather easily
- If an API like cooperative vectors is available

adroc_thurston · Monday at 6:20 PM

basix said:
Some existing FP16 vector based effects should be portable to FP16 Tensor rather easily

uh, no?
meme cores only ever do GEMM, you know.
They're not programmable.

Io Magnesso · Tuesday at 7:13 AM

To be honest, I still don't know about the effectiveness of FP4...

511 · Tuesday at 7:29 AM

Io Magnesso said:
To be honest, I still don't know about the effectiveness of FP4...

FP4 would be good for inference but for training I doubt so but manufacturers have to Quote FP4 Sparse waiting for FP2 Sparse 🙂.

jpiniero · Tuesday at 7:59 AM

Tyler. said:
Will MS lock these chips down like consoles?

We'll have to see how it gets implemented.

RnR_au · Tuesday at 8:21 AM

511 said:
FP4 would be good for inference but for training I doubt so but manufacturers have to Quote FP4 Sparse waiting for FP2 Sparse 🙂.

QAT (Quantization-Aware Training) is a thing nowadays, but not sure what levels of quants they can target. I have only heard of Q8 in QAT context, but can't see why lower quants can't be used.

ToTTenTranz · Tuesday at 10:33 AM

Io Magnesso said:
To be honest, I still don't know about the effectiveness of FP4...

When we're getting LLM with an average 2.3bits per weight, FP4 is very effective.

moinmoin · Tuesday at 11:13 AM

ToTTenTranz said:
When we're getting LLM with an average 2.3bits per weight, FP4 is very effective.

FP3 incoming?

jpiniero · Tuesday at 11:37 AM

https://www.bloomberg.com/news/articles/2025-06-24/microsoft-plans-major-job-cuts-at-xbox-gaming-division

Case in point, MS is doing another round of layoffs at XBox... Part of the problem I think is that the Series consoles are very uncompetitive on cost structure compared to PS5.

I think we could easily see Series production ending soon, and if they do that they are going to need a replacement. Hence the new Consoles/PC OEM approach.

adroc_thurston · Tuesday at 12:02 PM

jpiniero said:
Part of the problem I think is that the Series consoles are very uncompetitive on cost structure compared to PS5.

Nope.

jpiniero said:
I think we could easily see Series production ending soon, and if they do that they are going to need a replacement. Hence the new Consoles/PC OEM approach

nope.
They're just harvesting studios for the pound of flesh.

igor_kavinski · Tuesday at 12:03 PM

jpiniero said:
Part of the problem I think is that the Series consoles are very uncompetitive on cost structure compared to PS5.

Proprietary NVMe upgrade solution didn't help matters.

jpiniero · Tuesday at 12:19 PM

adroc_thurston said:
Nope.

Die size is bigger, the extra ram... but also that I think a big issue is that the consoles are pretty much entirely assembled in China so tariffs. There may be other manufacturing advantages to the PS5 beyond that too.

adroc_thurston · Tuesday at 1:31 PM

jpiniero said:
Die size is bigger, the extra ram... but also that I think a big issue is that the consoles are pretty much entirely assembled in China so tariffs.

Tariffs are an amerimutt-only thing.
DRAM's not any bigger either.
Also there's a whole other world to sell to.

basix · Wednesday at 2:44 AM

adroc_thurston said:
uh, no?
meme cores only ever do GEMM, you know.
They're not programmable.

That's the reason why we need an API like cooperative vectors. It is programmable like vectors but you can offload math operations to matrix accelerators.

Cooperative Vector DXIL operations for vector-matrix operations that can be accelerated by the underlying hardware

Introduce new DXIL operations to accelerate matrix-vector operations. In this specification we add four operations:

Matrix-Vector Multiply: Multiply a matrix in memory and a vectorparameter.

Matrix-Vector Multiply-Add: Multiply a matrix in memory and a vectorparameter and add a vector from memory.

Vector-Vector Outer Product and Accumulate: Compute the outer product oftwo vectors and accumulate the result matrix atomically-elementwise inmemory.

Vector Accumulate: Accumulate elements of a vectoratomically-elementwise to corresponding elements in memory.

In addition, the Tensor Cores provide matrix-matrix multiplication, but each ray tracing thread only needs vector-matrix multiplication, which would under-utilize the Tensor Cores.

This is what cooperative means: threads band together to turn several vector-matrix operations into matrix-matrix operations.

Cooperative vectors are used to implement RTX Neural Shaders and RTX Neural Texture Compression

If you don't believe me, read the docs and learn the math:

adroc_thurston · Wednesday at 2:51 AM

basix said:
That's the reason why we need an API like cooperative vectors.

Yeah they just expose HMMA/V_MFMA_F32 and friends to D3D12.

basix said:
It is programmable like vectors but you can offload math operations to matrix accelerators.

Well no they're very much very very fixed-function GEMM accelerators.
They do one thing, various precision GEMM into (usually) FP32 accumulator.

basix · Wednesday at 2:57 AM

The matrix accelerators by itself may not be programmable. But that is not the point of this whole stuff. It can integrate into programmable workflows.
But it seems you just want to be right with something ("matrix cores are not programmable blablablubb") and neglect the effective use cases and their benefits / outcomes.

That is the difference between being pedant and practical oriented

adroc_thurston · Wednesday at 3:03 AM

basix said:
It can integrate into programmable workflows.

Well, no.
We already do with with vendor APIs, dat thing just moves it to d3d12.

basix said:
and neglect the effective use cases and their benefits / outcomes.

Such as?

basix · Wednesday at 6:30 AM

Then enlighten me, why it should not be programmable (or what you understand under that term).

In the end, cooperative vectors allow regular shader (vector) code to be interweaved / extended with matrix math acceleration. Not more, not less. And as shader code is programmable, the matrix acceleration gets programmable as well if you want to look at it in that way. You can transform vectors into matrices as well, if you want. And much math in graphics deals with matrices anyways (but were split into vectors due to lack of native matrix instruction support). So matrix acceleration can be used in various and flexible ways, very similar as vector math today. In the end you have more or less the same shader code, but push the math through matrix accelerators instead of vector engines. With one prerequisite, that FP16/BF16 or less can be used.

Cooperative vectors work differently than e.g. CUDA. It is a different programming model. Sure, Nvidia could implement cooperative vectors via NVAPI and there is a cooperative vectors extension being introduced now. But I do not know if anything similar preceded that in NVAPI.

One example of one use case (based on Nvidia Optix Raytracing): Better hardware utilization / increased throughput.

Neural Rendering in NVIDIA OptiX Using Cooperative Vectors | NVIDIA Technical Blog

The release of NVIDIA OptiX 9.0 introduces a new feature called cooperative vectors that enables AI workflows as part of ray tracing kernels. The feature leverages NVIDIA RTX Tensor Cores for hardware…

developer.nvidia.com

Cooperative vectors address these limitations by providing an API that:

Allows matrix operations with warps that have some inactive threads

Provides cross-architecture forward and backward compatibility

Enables users specify vector data in a single thread while remapping the operation to more efficiently utilize the Tensor Cores

itsmydamnation · Wednesday at 5:26 PM

basix said:
Then enlighten me, why it should not be programmable (or what you understand under that term).

How about this one back at you , what's the difference between a programmable GEMM unit and a regular FMAC.

GEMM/Tensor unit is really register bandwidth/power optimisation, if your not doing super dense GEMM in it them all your doing is complicating data transfer and state. If you want to make them more flexible you will eat into their advantages in their key workloads over FMAC.

So the question is how many game FMAC workloads are dense GEMM that warrant direct programmability/ interaction.

Discussion RDNA 5 / UDNA (CDNA Next) speculation

Lifer

Diamond Member

Platinum Member

Member

Senior member

Member

Member

Diamond Member

Member

Platinum Member

Lifer

Platinum Member

Senior member

Diamond Member

Lifer

Diamond Member

Lifer

Lifer

Diamond Member

Member

Diamond Member

Member

Diamond Member

Member

Diamond Member