NVIDIA does ray denoising with FP8, though.
I am pretty convinced that FP4 or FP6 will be utilized wherever possible (enabled by advancements in model research, weight pruning, etc.). This allows for multiple things (back-of-envelope numbers in the sketch after the list):
- Reduced memory and bandwidth usage per weight
- More weights in the same footprint = enhanced model = enhanced quality
- More FLOPS = Faster execution
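To put rough numbers on the memory point, here is a quick back-of-envelope calculation; the 100M weight count is just an assumption for illustration, not taken from any real DLSS model:

```cuda
// Illustrative only: memory footprint per precision for a hypothetical
// 100M-weight model. bytes = weights * bits_per_weight / 8.
#include <cstdio>

int main() {
    const double weights = 100e6;        // assumed weight count, not a real model
    const int bits[] = {16, 8, 6, 4};    // FP16, FP8, FP6, FP4
    for (int b : bits)
        printf("FP%-2d: %6.1f MB\n", b, weights * b / 8.0 / 1e6);
    return 0;
}
```

At FP16 that hypothetical model is 200 MB of weights; FP4 fits the same model in 50 MB, or a 4x bigger model in the same footprint.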
But you probably could not switch everything to FP4 just because Blackwell was released; that would not make your user base happy. Something of the sort already happened with Ray Reconstruction on Turing and Ampere (DLSS4): it still runs, but in FP16 and much slower than the CNN version of RR.
Anyway, for future iterations the next logical step would be FP4 for Nvidia (Blackwell and Blackwell Next support it with more FLOPS) and FP4/FP6 for AMD (RDNA5). DLSS4 Super Resolution still runs in FP16, but I suspect they left it that way because it is rather light on the Tensor Cores (I traced that myself) and because DLSS4 SR has to stay compatible back to Turing. About Frame Generation I am not certain; it could be FP8 (since FG is only supported on Ada and Blackwell), but its Tensor Core load is minimal as well.
- Note: SR and FG use regular vector/FP32 math for the vast majority of their execution time (roughly 80-90% of it).
The only thing that really eats frametime on the Tensor Cores is Ray Reconstruction. See my Nsight trace below (Zorah NVRTX sample; look at "SM Tensor Pipe Active" at the bottom).
- RR takes a significant chunk of the frametime
- SR is tiny (with RR enabled there is an additional Tensor Core pass on top)
- FG is tiny
If I may propose optimization goals for DLSS5 / FSR5:
- SR and FG: Move to FP8
  - FP16 fallback on older architectures -> Turing, Ampere, maybe RDNA3
- SR and FG: Optimize the surrounding vector FP32 math to use Tensor BF16 instead
  - That would quadruple the FLOPS and save bandwidth
  - Use FP16 Tensor as an FP16 vector replacement (doubled FLOPS; see the WMMA sketch after this list)
- RR: Move to FP4
  - FP8 fallback on older architectures -> Ada, RDNA4
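For the FP16-Tensor-as-vector-replacement point, a minimal CUDA sketch of what such a port boils down to: one warp computes a 16x16x16 matrix-multiply tile on the Tensor Cores instead of looping over FMAs on the vector units. This is the generic WMMA pattern, not actual DLSS code:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16 output tile on the Tensor Cores (FP16 inputs,
// FP32 accumulation) -- work that would otherwise run as vector FMAs.
// Launch with one warp per tile, e.g. tile_matmul_fp16<<<1, 32>>>(dA, dB, dC);
// build with -arch=sm_70 or newer.
__global__ void tile_matmul_fp16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);      // leading dimension = 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);        // the actual Tensor Core op
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```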
AMD / Nvidia / Game devs / Engine devs:
- Fullscreen effects: Move to Tensor FP16 or BF16, i.e. get rid of fullscreen passes built on vector FP16/FP32 (see the BF16 sketch after this list)
  - That would double/quadruple the FLOPS and make high resolutions less taxing (e.g. Super Resolution and any post-processing would scale better)
  - DX12 additions like cooperative vectors could serve as the API basis.
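The BF16 variant is the same WMMA pattern with __nv_bfloat16 fragments (Ampere/sm_80 and newer). Framing a fullscreen filter as a per-tile matrix multiply is my own illustration of the idea, not how any shipping effect is implemented:

```cuda
#include <mma.h>
#include <cuda_bf16.h>
using namespace nvcuda;

// Hypothetical fullscreen-pass building block: each warp multiplies a 16x16
// tile of pixel data by a 16x16 weight matrix in BF16 on the Tensor Cores,
// accumulating in FP32. Requires sm_80 (Ampere) or newer.
__global__ void tile_filter_bf16(const __nv_bfloat16* pixels,
                                 const __nv_bfloat16* weights,
                                 float* out) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16, wmma::row_major> p;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __nv_bfloat16, wmma::col_major> w;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(p, pixels, 16);
    wmma::load_matrix_sync(w, weights, 16);
    wmma::mma_sync(acc, p, w, acc);
    wmma::store_matrix_sync(out, acc, 16, wmma::mem_row_major);
}
```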
Why BF16 Tensor as FP32 vector replacement?
- BF16 has the same value range as FP32, just with reduced precision (bit-layout demo at the end of this list)
- BF16 works on all WMMA-capable GPUs and is here to stay
  - Except Turing, which has only limited BF16 support. But hey, Turing would be 9 years old by ~2027
- TF32 would be another option, but it was recently dropped from CDNA4 hardware (there, TF32 gets emulated in software via BF16)
- Some existing FP16 vector-based effects should be portable to FP16 Tensor rather easily
  - If an API like cooperative vectors is available
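To make the "same range, less precision" point concrete: BF16 is simply the top 16 bits of an FP32 value (1 sign + 8 exponent + 7 mantissa bits), which is why the exponent range matches. A small host-side demo; the conversion helpers are written out by hand so the bit layout is visible (CUDA ships equivalents like __float2bfloat16), and NaN handling is omitted:

```cuda
#include <cstdint>
#include <cstdio>
#include <cstring>

// BF16 = upper 16 bits of FP32: 1 sign + 8 exponent + 7 mantissa bits.
static uint16_t f32_to_bf16(float f) {
    uint32_t u; memcpy(&u, &f, 4);
    u += 0x7FFF + ((u >> 16) & 1);     // round to nearest even
    return (uint16_t)(u >> 16);
}

static float bf16_to_f32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;    // lower mantissa bits become zero
    float f; memcpy(&f, &u, 4);
    return f;
}

int main() {
    // Precision drops: only ~3 decimal digits survive the round trip ...
    printf("pi:   %.7f -> %.7f\n", 3.1415927f,
           bf16_to_f32(f32_to_bf16(3.1415927f)));
    // ... but the FP32 exponent range carries over. FP16, with only 5
    // exponent bits, would overflow to infinity above ~65504.
    printf("1e38: %g -> %g\n", 1e38f, bf16_to_f32(f32_to_bf16(1e38f)));
    return 0;
}
```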