Discussion RDNA4 + CDNA3 Architectures Thread

DisEnchantment · Mar 23, 2022

With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around 3 quarters to get support into LLVM and amdgpu. Lately, since RDNA2, the window in which they push support for new devices has been much shorter, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe that is because the US Govt is starting to prepare the SW environment for El Capitan (maybe to avoid a slow bring-up like Frontier's, for example).

See here for the GFX940 specific commits

History for llvm/lib/Target/AMDGPU - llvm/llvm-project

github.com

Or Phoronix

More AMD "GFX940" Enablement Work Landing In LLVM - Phoronix

www.phoronix.com

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of not having a host CPU capable of PCIe 5 in the very near future, so it might have gotten pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again I believe MI300 could launch before it.

This is nuts, MI100/200/300 cadence is impressive.

Previous thread on CDNA2 and RDNA3 here

Question - Speculation: RDNA3 + CDNA2 Architectures Thread

Man I have been dying to make this one for a while now. First rumours for RDNA3 are here so new thread time! Just going to start off with this one for now: kopite7kimi on Twitter: "@VideoCardz Ah, I mean a simple mcm design with 10240 cores is not enough. Because the lift from RDNA2 to RDNA3...

forums.anandtech.com

TESKATLIPOKA · Aug 19, 2023

jpiniero said:
You have to remember that it's pretty likely that Blackwell "midrange" is not going to be substantially faster than Ada. 64 CUs, especially if they fix it, would be too much even.

4096SP wouldn't be too much even if it could clock at 3.5GHz.
It would be somewhere between 7900XT and XTX level of performance at best.

The bigger question is if they could put 4096 RDNA4 SPs in 200-250mm2 using the N3P process.
Let's say it's possible to put 4096SPs in that die size, then I doubt the smaller chip would have only 2048SPs. 2560SPs looks more likely to me.

P.S. I used stream processors (dual issue shaders) instead, because who knows if CUs (WGPs) will be changed compared to RDNA3.
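
For anyone who wants to check the "between 7900XT and XTX" claim on paper, here is a rough sketch of the peak FP32 arithmetic. It assumes RDNA3-style counting (one FMA = 2 FLOPs per SP per clock, doubled again by dual issue) and uses the listed boost clocks of the 7900 XT (5376 SPs, 2.4GHz) and 7900 XTX (6144 SPs, 2.5GHz). The function name is made up for illustration, and the 4096SP part is pure speculation from the post above.

```python
def peak_fp32_tflops(stream_processors, clock_ghz, dual_issue=True):
    """Peak FP32 TFLOPs = SPs x 2 FLOPs per FMA x (2 if dual issue) x GHz / 1000."""
    flops_per_sp_per_clock = 2 * (2 if dual_issue else 1)
    return stream_processors * flops_per_sp_per_clock * clock_ghz / 1000.0


# Hypothetical 4096SP RDNA4 part at 3.5GHz (speculative, from the post above).
hypothetical = peak_fp32_tflops(4096, 3.5)
# Reference points: 7900 XT (5376 SPs, 2.4GHz) and 7900 XTX (6144 SPs, 2.5GHz).
rx_7900_xt = peak_fp32_tflops(5376, 2.4)
rx_7900_xtx = peak_fp32_tflops(6144, 2.5)

for name, tf in [("4096SP @ 3.5GHz", hypothetical),
                 ("7900 XT", rx_7900_xt),
                 ("7900 XTX", rx_7900_xtx)]:
    print(f"{name}: {tf:.1f} TF")
```

This only bounds raw ALU throughput. Real game performance also depends on how often dual issue is actually used, on memory bandwidth, and on clocks under load.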

jpiniero · Aug 19, 2023

TESKATLIPOKA said:
It would be somewhere between 7900XT and XTX level of performance at best.

I would say that would be decently faster than GB206.

branch_suggestion · Aug 19, 2023

Reading back through everything, GFX12 is clearly going to be the biggest shift in GPUs since Unified Shaders, a near complete IP overhaul with a revolutionary design that is a massive performance bump.
Unfortunately RDNA4 is only a taste of what RDNA5 will finish. But hey, we still have 2 nice <250mm^2 mono dies which should still be a nice return to form.

To speculate on said dies:
The smaller of the 2 is N23v3, same spec, N4P and GDDR7 support. Probably uses a lite RDNA4. Replaces N22.
The bigger I'd say is N3E and uses the full RDNA4, 2SE/6SA/60CU, 192-bit GDDR7, 48MB MALL. I see no reason why that can't fit inside 250mm^2 and it should replace N21/N32. It could be a 3SE config but you probably get more outta ALU spam than another raster chunk.
That leaves N31 as the flagship until RDNA5. Considering that is ~2.5 years from release, I would hope AMD fixes it in the meantime.

jpiniero · Aug 19, 2023

branch_suggestion said:
The bigger I'd say is N3E and uses the full RDNA4, 2SE/6SA/60CU, 192-bit GDDR7, 48MB MALL. I see no reason why that can't fit inside 250mm^2

SRAM/IO scaling blows on N3E and it also seems likely that they will spend a ton of transistors on RT.

branch_suggestion · Aug 19, 2023

jpiniero said:
SRAM/IO scaling blows on N3E and it also seems likely that they will spend a ton of transistors on RT.

N4P would be more economical, but you are giving up a little bit of PPA.
I think the ASP is just large enough that N3E is justifiable. Also RT accel really doesn't take up many xtors, but I guess the revamped system is no longer part of the TMUs. Regardless, it is like a few % more area per CU. AMD needs to make the strongest midrange part possible to hold things over until the thing arrives.

tajoh111 · Aug 19, 2023

I think one potential strategy for AMD is to use 5nm for Navi 43.

This has been a poor year for AMD in general, and with AMD likely using less 5nm than they originally intended throughout 2023, I think that, taking into account cost and the low performance bar set by Navi 33, a last-gen node will make sense for AMD. Particularly if Nvidia keeps their RTX xx60 series under 150mm2.

If they chose this approach, my guess is something with between 2560 and 3072 shaders and a 128-bit bus again, but using the very fastest GDDR6. The result would be something 10-20% faster than an RTX 4060 Ti but sold for under $300.

adroc_thurston · Aug 19, 2023

branch_suggestion said:
GFX12 is clearly going to be the biggest shift in GPUs since Unified Shaders

No.

branch_suggestion said:
a near complete IP overhaul with a revolutionary design that is a massive performance bump.

No.
But should be p good.

branch_suggestion · Aug 20, 2023

adroc_thurston said:
No.

No.
But should be p good.

Lemme inhale the RTG hopium.
But for real, if N50 isn't the biggest overall leap since G80 then what is? Technically speaking the GPU hasn't moved on much from Xenos in regards to rendering. Then again APIs haven't either. We still rely on ALU SIMT/SIMD spam with additional accelerators and instruction/precision support glued on over time.
So I think making the first true chiplet GPU as a singular graphics device is the biggest innovation since USA (unified shader architecture).

adroc_thurston · Aug 20, 2023

branch_suggestion said:
So I think making the first true chiplet GPU as a singular graphics device is the biggest innovation since USA

Eh it's just win more stuff.
Maybe programmable RT cores or something will be an actual step-change.
Maybe.

branch_suggestion · Aug 20, 2023

adroc_thurston said:
Eh it's just win more stuff.
Maybe programmable RT cores or something will be an actual step-change.
Maybe.

Yeah it is only a step-change in scalability, a step-change in PPA hasn't happened in 17 years.
It really all goes back to the software this stuff is built to run. SM 4.0 unlocked so much hardware potential and NV nailed it first time, ATI/AMD's failure to release a big Terascale 2 to crush Fermi has set a trend that has never truly been broken, all thanks to G80.
Programmable RT cores/pipeline should be the next goal, unfortunately DXR is currently a complete meme and the alternatives don't really exist.
A good RTRT framework is required before such a big shift in hardware priorities is viable.

TESKATLIPOKA · Aug 20, 2023

branch_suggestion said:
To speculate on said dies:
The smaller of the 2 is N23v3, same spec, N4P and GDDR7 support. Probably uses a lite RDNA4. Replaces N22.
The bigger I'd say is N3E and uses the full RDNA4, 2SE/6SA/60CU, 192-bit GDDR7, 48MB MALL. I see no reason why that can't fit inside 250mm^2 and it should replace N21/N32. It could be a 3SE config but you probably get more outta ALU spam than another raster chunk.
That leaves N31 as the flagship until RDNA5. Considering that is ~2.5 years from release, I would hope AMD fixes it in the meantime.

I don't see why the smaller one should be a half generation compared to the bigger one, or why it should have only comparable specs to N33 (N23) except clockspeed.
I also wonder if GDDR7 will be available at the time and at what speeds.
BTW 48MB MALL is kinda low.

branch_suggestion · Aug 20, 2023

TESKATLIPOKA said:
I don't see why the smaller one should be a half generation compared to the bigger one, or why it should have only comparable specs to N33 (N23) except clockspeed.

You could be right, but the larger vGPR might not be worth the area cost on a small config. I was referring to N33 vs N31/32 btw in regards to a lite RDNA4.

TESKATLIPOKA said:
I also wonder if GDDR7 will be available at the time and at what speeds.

H2 2024 seems like a safe enough bet, but the 24Gbit parts might take longer, only Micron is confirmed to have 3GB GDDR7 dies right now. 128-bit cards need to move to 12GB capacity to remain viable. 192-bit could use the 18GB bump too, the bandwidth never hurts either.

TESKATLIPOKA said:
BTW 48MB MALL is kinda low.

48MB MALL/192-bit 24-32Gbps GDDR7 is plenty for a 2-3 SE part.

Tigerick · Aug 20, 2023

I think the real reason for AMD to create N43 and N44, which replace N32 and N33 without much of a generational leap, is not that AMD needs a solution for the desktop platform but for the notebook platform.

Both N32 and N33 suffer from power inefficiency due to process node and chiplet design, and power efficiency is crucial for the notebook platform. With the Fire Range Zen5 notebook platform coming at the end of next year, AMD has to replace N32 and N33 with monolithic GPUs in order to avoid the embarrassment of using the RTX 4000 series, hence N43 and N44 are created.

TESKATLIPOKA · Aug 20, 2023

branch_suggestion said:
You could be right, but the larger vGPR might not be worth the area cost on a small config. I was referring to N33 vs N31/32 btw in regards to a lite RDNA4.

So that was it.
Honestly, to me it never made much sense to use different WGPs. That vGPR is not that big from the whole chip perspective and having more registers should be more power efficient if I am not wrong.

branch_suggestion said:
H2 2024 seems like a safe enough bet, but the 24Gbit parts might take longer, only Micron is confirmed to have 3GB GDDR7 dies right now. 128-bit cards need to move to 12GB capacity to remain viable. 192-bit could use the 18GB bump too, the bandwidth never hurts either.

If there won't be 24Gbit chips, then with 192-bit width you are limited to 12GB VRAM or 24GB in clamshell, which would be costly.
I think 16GB 256-bit GDDR6 is a cheaper solution.
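
The capacity options being argued about fall out of simple chip counting. Below is a minimal sketch, assuming one GDDR chip per 32-bit slice of the bus and two chips per slice in clamshell mode. The helper name is hypothetical and the densities are just the 16Gbit (2GB) and 24Gbit (3GB) parts mentioned in the thread.

```python
def vram_capacity_gb(bus_width_bits, chip_density_gbit, clamshell=False):
    """VRAM capacity in GB, assuming one chip per 32 bits of bus width."""
    chips = bus_width_bits // 32           # one chip per 32-bit slice
    if clamshell:
        chips *= 2                         # two chips share each slice
    return chips * chip_density_gbit // 8  # Gbit -> GB


configs = [
    ("192-bit, 16Gbit", 192, 16, False),
    ("192-bit, 16Gbit, clamshell", 192, 16, True),
    ("192-bit, 24Gbit", 192, 24, False),
    ("128-bit, 24Gbit", 128, 24, False),
    ("256-bit, 16Gbit", 256, 16, False),
]
for label, bus, density, clam in configs:
    print(f"{label}: {vram_capacity_gb(bus, density, clam)} GB")
```

Cost is the part this does not capture: clamshell doubles the chip count and needs memory on both sides of the board, which is the "costly" bit.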

branch_suggestion said:
48MB MALL/192-bit 24-32Gbps GDDR7 is plenty for a 2-3 SE part.

If it's 64 CU at ~40% higher clockspeed then that's ~49% more TFLOPs than N32, comparable to 7900XTX.
N31 has 96MB IC and 384-bit 20Gbps GDDR6.
48MB IC + 192-bit 32Gbps GDDR7 would be 1/2 of the Infinity Cache and 80% of the VRAM BW.

branch_suggestion · Aug 20, 2023

Tigerick said:
I think the real reason for AMD to create N43 and N44, which replace N32 and N33 without much of a generational leap, is not that AMD needs a solution for the desktop platform but for the notebook platform.

Both N32 and N33 suffer from power inefficiency due to process node and chiplet design, and power efficiency is crucial for the notebook platform. With the Fire Range Zen5 notebook platform coming at the end of next year, AMD has to replace N32 and N33 with monolithic GPUs in order to avoid the embarrassment of using the RTX 4000 series, hence N43 and N44 are created.

Well yeah, 2024 is a big laptop push year, though Sarlak might take until 2025 due to OEM integration time.

TESKATLIPOKA said:
So that was it.
Honestly, to me it never made much sense to use different WGPs. That vGPR is not that big from the whole chip perspective and having more registers should be more power efficient if I am not wrong.

Clearly it was beneficial to AMD in some way.

TESKATLIPOKA said:
I think 16GB 256-bit GDDR6 is a cheaper solution.
If there won't be 24Gbit chips, then with 192-bit width you are limited to 12GB VRAM or 24GB in clamshell, which would be costly.

Possibly, next gen is probably a mix of GDDR6/7 variants. N32's solution is massive overkill due to the clock miss though.

TESKATLIPOKA said:
N31 has 96MB IC and 384-bit 20Gbps GDDR6.
48MB IC + 192-bit 32Gbps GDDR7 would be 1/2 of the Infinity Cache and 80% of the VRAM BW.
If it's 64 CU at ~40% higher clockspeed then that's ~49% more TFLOPs than N32, comparable to 7900XTX.

64CU only makes sense with 4SE, I can't see 4SE fitting in the budget.
My solution is 60CU and is like 53.7TF @3.5GHz with dual issue.
Remember it is 2SE vs 6, that simplifies things a lot, also RDNA4 MALL could be higher clocked and have some other refinements.
Also remember higher clocks means faster cache, and L2 is getting bigger which takes pressure off MALL and VRAM. RDNA is all about keeping data as close to the SIMDs as possible and maximising utilisation at all times. I see no issues with being bandwidth limited with my design.

TESKATLIPOKA · Aug 20, 2023

branch_suggestion said:
N32's solution is massive overkill due to the clock miss though.

Why? N31 is not any better.

branch_suggestion said:
64CU only makes sense with 4SE, I can't see 4SE fitting in the budget.
My solution is 60CU and is like 53.7TF @3.5GHz with dual issue.
Remember it is 2SE vs 6, that simplifies things a lot, also RDNA4 MALL could be higher clocked and have some other refinements.
Also remember higher clocks means faster cache, and L2 is getting bigger which takes pressure off MALL and VRAM. RDNA is all about keeping data as close to the SIMDs as possible and maximising utilisation at all times. I see no issues with being bandwidth limited with my design.

How much do you even save by having fewer SEs?
Let's say it's 60CU, but why only 2SE? You would lose RBs, prims and rasterizer units this way.
The shader engine would need to be changed a lot.

53.7TF vs 61.4TF, that's 12.5% less TF compared to N31.
N31's 96MB IC has 5.3TB/s peak BW. That's 2.3GHz * 2304B/clock.
A 48MB IC would have 4TB/s peak BW at 3.5GHz.
Hitrate for 96MB at 4K is ~53%, so effective IC BW is 0.53 * 5.3TB/s = 2.8TB/s.
Hitrate for 48MB at 4K is ~35%, so effective IC BW is 0.35 * 4TB/s = 1.4TB/s, so 1/2 of N31.
Add the BW from VRAM to each and you end up with total effective BW.
N31: 2.8TB/s + 0.96TB/s = 3.76TB/s
N43?: 1.4TB/s + 0.77TB/s = 2.17TB/s
That gives N31 ~73% more effective BW.
I don't think a few more MB of L2 is enough to compensate, and I already added higher frequency for MALL.
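
The estimate above can be reproduced with a few lines. This is only a sketch that mirrors the post's simple additive model (hit rate times Infinity Cache peak bandwidth, plus full VRAM bandwidth), not a real memory-system model. The 1152B/clock width for the 48MB cache is an assumption (half of N31's 2304B/clock, which matches the quoted ~4TB/s at 3.5GHz), the hit rates are the post's figures, and the function names are made up.

```python
def vram_bw_tbs(bus_width_bits, rate_gbps):
    """Peak VRAM bandwidth in TB/s from bus width and per-pin data rate."""
    return bus_width_bits * rate_gbps / 8 / 1000


def effective_bw_tbs(ic_bytes_per_clk, ic_clock_ghz, hit_rate, vram_tbs):
    """Post's model: hit_rate * IC peak bandwidth + VRAM bandwidth (TB/s)."""
    ic_peak_tbs = ic_bytes_per_clk * ic_clock_ghz / 1000
    return hit_rate * ic_peak_tbs + vram_tbs


# N31: 96MB IC, 2304B/clock at 2.3GHz, ~53% hit rate at 4K, 384-bit 20Gbps.
n31 = effective_bw_tbs(2304, 2.3, 0.53, vram_bw_tbs(384, 20))
# Speculative N43: 48MB IC, assumed 1152B/clock at 3.5GHz, ~35% hit rate,
# 192-bit 32Gbps GDDR7.
n43 = effective_bw_tbs(1152, 3.5, 0.35, vram_bw_tbs(192, 32))

print(f"N31: {n31:.2f} TB/s, N43?: {n43:.2f} TB/s, ratio: {n31 / n43:.2f}x")
```

Without the intermediate rounding the totals come out a hair higher than the hand-worked 3.76 and 2.17TB/s, but the ratio still lands at roughly 1.73x.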

branch_suggestion · Aug 20, 2023

TESKATLIPOKA said:
Why? N31 is not any better.

How much do you even save by having fewer SEs?
Let's say it's 60CU, but why only 2SE? You would lose RBs, prims and rasterizer units this way.
The shader engine would need to be changed a lot.

53.7TF vs 61.4TF, that's 12.5% less TF compared to N31.
N31's 96MB IC has 5.3TB/s peak BW. That's 2.3GHz * 2304B/clock.
A 48MB IC would have 4TB/s peak BW at 3.5GHz.
Hitrate for 96MB at 4K is ~53%, so effective IC BW is 0.53 * 5.3TB/s = 2.8TB/s.
Hitrate for 48MB at 4K is ~35%, so effective IC BW is 0.35 * 4TB/s = 1.4TB/s, so 1/2 of N31.
Add the BW from VRAM to each and you end up with total effective BW.
N31: 2.8TB/s + 0.96TB/s = 3.76TB/s
N43?: 1.4TB/s + 0.77TB/s = 2.17TB/s
That gives N31 ~73% more effective BW.
I don't think a few more MB of L2 is enough to compensate, and I already added higher frequency for MALL.

Well considering SEs can have 3 SA on RDNA4, it is also possible that other SE resources could bump up by 50% or more.
N31's memory was built for an 85TF compute engine, with even the prospect of doubled MALL for something pushed even harder.
There is certainly more to think about with memory pressure than raw FLOPS. The new top RDNA4 is not targeted for 4K, it is to usurp N21/32 at 1440p max. Still fine for 2160p, but it will need to be kept a tad below top settings for cutting edge games.

TESKATLIPOKA · Aug 20, 2023

branch_suggestion said:
Well considering SEs can have 3 SA on RDNA4, it is also possible that other SE resources could bump up by 50% or more.
N31's memory was built for an 85TF compute engine, with even the prospect of doubled MALL for something pushed even harder.
There is certainly more to think about with memory pressure than raw FLOPS. The new top RDNA4 is not targeted for 4K, it is to usurp N21/32 at 1440p max. Still fine for 2160p, but it will need to be kept a tad below top settings for cutting edge games.

Then this 60CU 3.5GHz RDNA4 should basically have a memory subsystem comparable to N32's.
N32: 64MB IC at 2.5GHz + 256-bit 18Gbps GDDR6
N4*: 48MB IC at 4GHz + 192-bit 24Gbps GDDR*
Or the Infinity Cache in RDNA4 could be wider by 1/3 and clocked at 3GHz. Infinity Cache in N3* is also 3x wider per MB than Infinity Cache in N2*.
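
To see why those two RDNA4 options are interchangeable on paper, here is a rough sketch. It assumes Infinity Cache width scales linearly with capacity at N31's ratio (2304B/clock over 96MB, so 24B/clock per MB) and that N32 uses the same per-MB width; both are assumptions for illustration, as are the helper names.

```python
# Assumption: N31's 2304 B/clock over 96MB gives 24 B/clock per MB of IC.
IC_BYTES_PER_CLK_PER_MB = 2304 / 96


def ic_peak_tbs(size_mb, clock_ghz, width_scale=1.0):
    """Peak Infinity Cache bandwidth in TB/s under the per-MB width assumption."""
    bytes_per_clk = size_mb * IC_BYTES_PER_CLK_PER_MB * width_scale
    return bytes_per_clk * clock_ghz / 1000


def vram_gbs(bus_width_bits, rate_gbps):
    """Peak VRAM bandwidth in GB/s."""
    return bus_width_bits * rate_gbps / 8


n32_ic = ic_peak_tbs(64, 2.5)                         # N32: 64MB at 2.5GHz
n4x_ic_fast = ic_peak_tbs(48, 4.0)                    # 48MB at 4GHz
n4x_ic_wide = ic_peak_tbs(48, 3.0, width_scale=4 / 3) # 1/3 wider at 3GHz

print(f"IC peak  N32: {n32_ic:.2f}  N4* fast: {n4x_ic_fast:.2f}  "
      f"N4* wide: {n4x_ic_wide:.2f} TB/s")
print(f"VRAM     N32: {vram_gbs(256, 18):.0f}  N4*: {vram_gbs(192, 24):.0f} GB/s")
```

Under these assumptions both RDNA4 variants give the same cache peak, somewhat above N32's, and the VRAM bandwidth is identical, which is the sense in which the two subsystems are "comparable". Hit rate would still be lower with 48MB than with 64MB.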

jpiniero · Aug 29, 2023

branch_suggestion said:
There is certainly more to think about with memory pressure than raw FLOPS. The new top RDNA4 is not targeted for 4K, it is to usurp N21/32 at 1440p max.

This is where I point out that (from ComputerBase's 7900GRE review) the 6950XT is 80% faster than the 7600. That seems too much to overcome unless the top RDNA4 die gets reshaped to be decently bigger than 200mm2.

tajoh111 · Aug 29, 2023

I think when you consider the rumors of H100 being sold out for 2024 (2 million x $30,000-40,000), we are looking at an 80 to 100 billion dollar market in 2024.

This makes you realize how trash the gaming market is for AMD and Nvidia. A 10 billion dollar annual market with low margins and almost no future growth is something that neither Nvidia nor AMD would want to invest in beyond what they are already doing. Even console sales are stagnating in terms of units compared to last generation.

With this in mind, plus the software cost of maintaining a good driver team, RTG is likely to be underfunded, with any possible cashflow used to accelerate AMD's data center roadmap or keep it on track. Hence the cut to AMD's high end and pulling out of the market.

Considering AMD's likely 100-150 million in discrete graphics sales, they just need to capture 5% of the AI data center market using 2023 numbers. In 2024, AI data center is going to be a 20 billion dollar quarterly market, so 5% is a billion dollars.

As people who likely game more than we are involved in the AI data center market, we have to remove our biases and be pragmatic. AMD is going to do what is best for AMD, not consumers. Building 220mm2 and 350mm2 monoliths on 3nm with 192-bit and 256-bit buses on GDDR7 is unlikely to happen given consumers' reluctance to pay.

adroc_thurston · Aug 29, 2023

tajoh111 said:
I think when you consider the rumors of H100 being sold out for 2024 (2 million x $30,000-40,000), we are looking at an 80 to 100 billion dollar market in 2024.

That's a bubble.

tajoh111 said:
This makes you realize how trash the gaming market is for AMD and Nvidia. A 10 billion dollar annual market with low margins and almost no future growth is something that neither Nvidia nor AMD would want to invest in beyond what they are already doing. Even console sales are stagnating in terms of units compared to last generation.

It's a perfectly fine market still.

tajoh111 said:
RTG is likely to be underfunded

They're not.
What's with the meme.

tajoh111 said:
Hence the cut to AMD high end and pulling out of the market.

They aren't.
The ultra-halo tiled parts just got moved over to RDNA5.

Ajay · Aug 29, 2023

adroc_thurston said:
That's a bubble.

There's no way to know this yet. Too early. Nonetheless, it will be a great line of business for Nvidia for the next few years (or more, depending on how well the coming competition does on their software stacks).

Saylick · Aug 29, 2023

Ajay said:
There's no way to know this yet. Too early. Nonetheless, it will be a great line of business for Nvidia for the next few years (or more, depending on how well the coming competition does on their software stacks).

Yeah, even if it were a bubble, they'll take the profit and use it for R&D in preparation for the next hype wave, be it self-driving cars or quantum computing. Rinse and repeat.

adroc_thurston · Aug 29, 2023

Ajay said:
There's no way to know this yet

Oh it is.
I'll give it 4-6Q before startups go mass extinction/exit scam.

Joe NYC · Aug 29, 2023

adroc_thurston said:
Oh it is.
I'll give it 4-6Q before startups go mass extinction/exit scam.

That's precisely how I see it. I expect revenue to top out around the end of 2024. Units may still sell, even grow, but with supply meeting demand, the astronomical (and inflated) margins will start to descend to "normal" territory.
