Discussion RDNA 5 / UDNA (CDNA Next) speculation


jpiniero

Lifer
Oct 1, 2010
16,414
6,879
136
This is incorrect. What you were seeing is a different front end for Windows. They never stated you could play Xbox games.

Figure this is part of how MS is going to transition the XBox user base away from consoles to... well... Windows... running as a console. It sounds like it would only be made available to machines with the MS semi-custom chip(s).

The other part of this is making Sony look bad since you could run it as an XBox, or use it as a PC and not have to pay for MP.
 

511

Platinum Member
Jul 12, 2024
2,534
2,372
106
That should be a favourable comparison if anything: DXR 1.2 SER/OMM is supported with SW emulation on RDNA4 while RDNA5 has HW accel, and NRC/ray denoising should be much faster on RDNA5 thanks to FP4 support.
So the 2X AI figure is comparing FP4 vs FP8 🤣🤣
 

basix

Member
Oct 4, 2024
131
271
96
This is very likely, yes. I do not see huge benefits in ramping up ML/AI FLOPS beyond what is already implemented. RDNA4 has the same matrix-to-FP32 FLOPS ratio as Nvidia has had since Ampere, and per SM it has stayed constant since Turing if you look only at FP16 Tensor.
 
Reactions: Tlh97 and 511

del42sa

Member
May 28, 2013
178
304
136
Kepler, your message originated from another forum, went to Reddit and Twitter, and came here just to get confirmed lmao.
If you have info on Xe3/4 as well, don't hesitate to share it in the Intel thread.
My guess is that the next generation of graphics cards will be called "AI-DNA" and that it will have much better perf/watt per CU plus much greater performance in RT, with twice as much memory and really huge AI capability, but please don't tell anyone, cos it's a secret.
 

basix

Member
Oct 4, 2024
131
271
96
NVIDIA does Ray denoising with FP8 though
I am pretty convinced that FP4 or FP6 will be utilized wherever possible (advancements in model research, weight pruning, etc.). This allows for multiple things (a rough back-of-the-envelope sketch follows the list):
  • Reduced memory and bandwidth usage per weight
  • More weights = Enhanced model = Enhanced quality
  • More FLOPS = Faster execution
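A minimal sketch in plain Python of the per-weight storage/bandwidth point above; the 1,000,000-weight network is a made-up size purely for illustration, not any real DLSS/FSR model:

```python
# Storage/bandwidth per weight at different precisions.
# The 1,000,000-parameter count is an arbitrary illustrative number.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "FP6": 6, "FP4": 4}

params = 1_000_000
for fmt, bits in BITS_PER_WEIGHT.items():
    mib = params * bits / 8 / 2**20
    print(f"{fmt}: {bits} bits/weight -> {mib:.2f} MiB for {params:,} weights")
# FP16 -> FP4 cuts weight storage and bandwidth by 4x, or equivalently lets
# roughly 4x more weights fit into the same memory/bandwidth budget.
```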

But you probably could not switch everything to FP4 just because Blackwell was released; that would not make your user base happy. Something of this sort happened with Ray Reconstruction on Turing and Ampere (DLSS4): it still runs, but in FP16 and much slower than the CNN version of RR.

Anyways, for future iterations the next logical step would be FP4 for Nvidia (Blackwell and Blackwell Next support it with more FLOPS) and FP4/FP6 for AMD (RDNA5). DLSS4 Super Resolution still runs on FP16, but I suspect they left it like that because it is rather light on the Tensor Cores (I traced that myself) and for DLSS4 SR compatibility back to Turing. About Frame Generation I am not certain; it could be FP8 (because it is only supported on Ada and Blackwell), but its Tensor Core load is minimal as well.
  • Note: SR and FG use regular vector/FP32 math for the vast majority of their execution time (roughly 80-90% of it).

The only thing that really eats frametime on the Tensor Cores is Ray Reconstruction. See my Nsight trace below (Zorah NVRTX sample; look at the bottom at "SM Tensor Pipe Active").
  • RR takes a significant chunk of the frametime
  • SR is tiny (with RR enabled there is an additional Tensor Core pass there as well)
  • FG is tiny

If I could propose optimization goals for DLSS5 / FSR5:
  • SR and FG: Move to FP8
    • FP16 fallback on older architectures -> Turing, Ampere, maybe RDNA3
  • SR and FG: Optimize everything around them that currently uses vector FP32 math to use Tensor BF16 instead
    • That would quadruple the FLOPS and save bandwidth
    • Use FP16 Tensor as FP16 vector replacement (doubled FLOPS)
  • RR: Move to FP4
    • FP8 fallback on older architectures -> Ada, RDNA4

AMD / Nvidia / Game devs / Engine devs:
  • Fullscreen effects: Move to Tensor FP16 or BF16 (get rid of fullscreen effects that use vector FP16/FP32)
    • That would double/quadruple the FLOPS and make high resolutions less taxing (e.g. Super Resolution and any post processing would scale better)
    • DX12 updates like cooperative vectors could be used as the API basis.

Why BF16 Tensor as FP32 vector replacement?
  • BF16 has the same value range as FP32, but with reduced precision (see the sketch after this list)
  • BF16 works on all WMMA capable GPUs and will stay / stick around
    • Except Turing, which has only limited BF16 support. But hey, Turing would be 9 years old by ~2027
    • TF32 would be another option, but it was recently thrown out of CDNA4 hardware (TF32 gets emulated in software with the help of BF16)
  • Some existing FP16 vector-based effects should be portable to FP16 Tensor rather easily
    • If an API like cooperative vectors is available
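A small numpy sketch of the BF16 point above (my own illustration, not from any vendor doc): BF16 keeps FP32's sign and 8 exponent bits and drops 16 mantissa bits, so truncating the low half of the FP32 bit pattern is a close approximation of the conversion:

```python
import numpy as np

def fp32_to_bf16(x):
    """Truncate FP32 values to BF16 precision by zeroing the low 16 mantissa bits.
    (Real hardware usually rounds to nearest even; truncation keeps the sketch short.)"""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

vals = np.array([3.14159265, 1e30, 1e-30, 65504.0], dtype=np.float32)
print(fp32_to_bf16(vals))
# Very large and very small magnitudes survive (same exponent range as FP32),
# but only about 2-3 decimal digits of precision remain per value.
```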
 

RnR_au

Platinum Member
Jun 6, 2021
2,512
5,897
136
FP4 would be good for inference, but I doubt it for training; manufacturers have to quote FP4 sparse though. Waiting for FP2 sparse 🙂.
QAT (Quantization-Aware Training) is a thing nowadays, but I'm not sure what quantization levels they can target. I have only heard of Q8 in a QAT context, but I can't see why lower quants couldn't be used.
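For what it's worth, the standard QAT trick (fake quantization plus a straight-through estimator) is not tied to any particular bit width. A minimal PyTorch-style sketch with a hypothetical 4-bit symmetric quantizer, my own toy example rather than any shipping implementation:

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Fake-quantize a tensor onto a symmetric integer grid while keeping FP32 storage.
    The (w_q - w).detach() + w trick is the straight-through estimator:
    the forward pass sees quantized values, the backward pass sees identity."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 levels each side for 4-bit
    scale = w.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return (w_q - w).detach() + w

w = torch.randn(8, requires_grad=True)
fake_quant(w, bits=4).sum().backward()
print(w.grad)  # all ones: gradients flow through as if no quantization happened
```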
 

jpiniero

Lifer
Oct 1, 2010
16,414
6,879
136
Reactions: Tlh97 and Saylick

adroc_thurston

Diamond Member
Jul 2, 2023
5,888
8,251
96
Die size is bigger, there's the extra RAM... but also, I think a big issue is that the consoles are pretty much entirely assembled in China, so tariffs.
Tariffs are an amerimutt-only thing.
DRAM's not any bigger either.
Also there's a whole other world to sell to.
 

basix

Member
Oct 4, 2024
131
271
96
uh, no?
meme cores only ever do GEMM, you know.
They're not programmable.
That's the reason why we need an API like cooperative vectors: it is programmable like vectors, but you can offload math operations to matrix accelerators.

Cooperative Vector: DXIL operations for vector-matrix operations that can be accelerated by the underlying hardware.
Introduce new DXIL operations to accelerate matrix-vector operations. In this specification we add four operations:
  • Matrix-Vector Multiply: Multiply a matrix in memory and a vector parameter.
  • Matrix-Vector Multiply-Add: Multiply a matrix in memory and a vector parameter and add a vector from memory.
  • Vector-Vector Outer Product and Accumulate: Compute the outer product of two vectors and accumulate the result matrix atomically-elementwise in memory.
  • Vector Accumulate: Accumulate elements of a vector atomically-elementwise to corresponding elements in memory.
In addition, the Tensor Cores provide matrix-matrix multiplication, but each ray tracing thread only needs vector-matrix multiplication, which would under-utilize the Tensor Cores.
This is what cooperative means: threads band together to turn several vector-matrix operations into matrix-matrix operations.
Cooperative vectors are used to implement RTX Neural Shaders and RTX Neural Texture Compression

If you don't believe me, read the docs and learn the math:
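The math is indeed not exotic. Here is a plain numpy sketch of roughly what each of those four DXIL operations computes, ignoring precision formats and the atomic accumulation details; all sizes are arbitrary illustrative numbers:

```python
import numpy as np

K, N = 32, 16                                    # arbitrary layer sizes
W = np.random.randn(N, K).astype(np.float32)     # matrix in memory
v = np.random.randn(K).astype(np.float32)        # vector parameter (per thread)
b = np.random.randn(N).astype(np.float32)        # vector in memory (e.g. bias)
g = np.random.randn(N).astype(np.float32)        # another vector (e.g. a gradient)

acc_matrix = np.zeros((K, N), dtype=np.float32)  # accumulation matrix in memory
acc_vector = np.zeros(N, dtype=np.float32)       # accumulation vector in memory

y1 = W @ v                       # 1) Matrix-Vector Multiply
y2 = W @ v + b                   # 2) Matrix-Vector Multiply-Add
acc_matrix += np.outer(v, g)     # 3) Vector-Vector Outer Product and Accumulate
acc_vector += g                  # 4) Vector Accumulate
```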
 
Last edited:

adroc_thurston

Diamond Member
Jul 2, 2023
5,888
8,251
96
That's the reason why we need an API like cooperative vectors.
Yeah they just expose HMMA/V_MFMA_F32 and friends to D3D12.
It is programmable like vectors but you can offload math operations to matrix accelerators.
Well no, they're very much fixed-function GEMM accelerators.
They do one thing: various-precision GEMM into a (usually) FP32 accumulator.
 

basix

Member
Oct 4, 2024
131
271
96
The matrix accelerators by themselves may not be programmable, but that is not the point of all this: they can integrate into programmable workflows.
But it seems you just want to be right about something ("matrix cores are not programmable blablablubb") and neglect the effective use cases and their benefits / outcomes.

That is the difference between being pedantic and being practically oriented.
 

basix

Member
Oct 4, 2024
131
271
96
Then enlighten me why it should not be programmable (or what you understand by that term).

In the end, cooperative vectors allow regular shader (vector) code to be interwoven with / extended by matrix math acceleration. Not more, not less. And as shader code is programmable, the matrix acceleration becomes programmable as well, if you want to look at it that way. You can transform vectors into matrices too, if you want, and much of the math in graphics deals with matrices anyway (it was just split into vectors due to the lack of native matrix instruction support). So matrix acceleration can be used in various and flexible ways, very similar to vector math today. In the end you have more or less the same shader code, but you push the math through matrix accelerators instead of vector engines, with one prerequisite: that FP16/BF16 or lower precision can be used.

Cooperative vectors work differently from e.g. CUDA; it is a different programming model. Sure, Nvidia could implement cooperative vectors via NVAPI, and a cooperative vectors extension is being introduced now, but I do not know whether anything similar preceded it in NVAPI.

One example of a use case (based on NVIDIA OptiX ray tracing): better hardware utilization / increased throughput (a toy sketch follows the list below).
Cooperative vectors address these limitations by providing an API that:
  • Allows matrix operations with warps that have some inactive threads
  • Provides cross-architecture forward and backward compatibility
  • Enables users to specify vector data in a single thread while remapping the operation to more efficiently utilize the Tensor Cores
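A toy numpy sketch of the "cooperative" remapping described above: 32 "threads" each hold their own input vector, but batching them turns 32 vector-matrix products into a single matrix-matrix product, which is the shape matrix units are built for. The warp size and layer sizes here are illustrative only:

```python
import numpy as np

WARP, K, N = 32, 16, 16
W = np.random.randn(K, N).astype(np.float32)          # shared network weights

# Naive view: each "thread" issues its own vector-matrix product (32 GEMVs).
thread_vecs = [np.random.randn(K).astype(np.float32) for _ in range(WARP)]
per_thread = np.stack([v @ W for v in thread_vecs])

# Cooperative view: stack the 32 vectors and issue one (32 x K) @ (K x N) GEMM.
batched = np.stack(thread_vecs) @ W

assert np.allclose(per_thread, batched, atol=1e-5)    # same result, one dense GEMM
```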
 
Last edited:

itsmydamnation

Diamond Member
Feb 6, 2011
3,037
3,811
136
Then enlighten me why it should not be programmable (or what you understand by that term).
How about this one back at you: what's the difference between a programmable GEMM unit and a regular FMAC?

A GEMM/Tensor unit is really a register bandwidth/power optimisation; if you're not doing super-dense GEMM on it, then all you're doing is complicating data transfer and state. If you want to make them more flexible, you will eat into their advantages in their key workloads over FMAC.

So the question is how many game FMAC workloads are dense enough GEMM to warrant direct programmability/interaction.
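To put a rough number on the register-bandwidth point: a dense GEMM reuses each fetched operand many times, while an elementwise FMA stream touches each operand once. A quick flops-per-byte estimate in Python, with FP16 operands and dimensions chosen arbitrarily for illustration:

```python
# Rough arithmetic-intensity comparison (FP16 operands, 2 bytes each).
# Matrix/vector sizes are arbitrary illustrative numbers, not a specific GPU workload.
BYTES = 2

def gemm_intensity(m, n, k):
    flops = 2 * m * n * k                    # one multiply + one add per MAC
    data = (m * k + k * n + m * n) * BYTES   # read A and B, write C
    return flops / data

def fma_stream_intensity(n):
    flops = 2 * n                            # a = a + b * c, elementwise
    data = 4 * n * BYTES                     # read a, b, c and write a
    return flops / data

print("dense 1024^3 GEMM:", round(gemm_intensity(1024, 1024, 1024)), "FLOP/byte")
print("elementwise FMA:  ", round(fma_stream_intensity(1024), 2), "FLOP/byte")
# The GEMM does hundreds of FLOPs per byte moved, which is exactly what a
# fixed-function tensor unit can exploit; a non-GEMM FMA stream cannot.
```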
 