Question Speculation: RDNA3 + CDNA2 Architectures Thread


uzzi38

Platinum Member
Oct 16, 2019
2,687
6,329
146

DownTheSky

Senior member
Apr 7, 2013
787
156
106
I was thinking about something: is it possible that chiplet cache is not as efficient as a single large pool of cache? By this I mean that on a single die the CUs have direct access to all the cache through very fast interconnects. By splitting the cache across 6 chiplets you split the lanes in 6, so access to each chiplet is slower. Also, if a CU needs some data from a chiplet, there's a 1/6 chance of finding it on the first try, unless they make some kind of ID or sorting mechanism for all the data. Just ramblings from a guy who knows nothing about graphics cards.

Same for the memory dies. Before, they were linked to all the cache; now just to a part of it.
 
Reactions: Tlh97 and Leeea

Kepler_L2

Senior member
Sep 6, 2020
424
1,721
106
I was thinking about something: is it possible that chiplet cache is not as efficient as a single large pool of cache? By this I mean that on a single die the CUs have direct access to all the cache through very fast interconnects. By splitting the cache across 6 chiplets you split the lanes in 6, so access to each chiplet is slower. Also, if a CU needs some data from a chiplet, there's a 1/6 chance of finding it on the first try, unless they make some kind of ID or sorting mechanism for all the data. Just ramblings from a guy who knows nothing about graphics cards.

Same for the memory dies. Before, they were linked to all the cache; now just to a part of it.
That split already happens in RDNA2. The L2 cache is global, but the L3/Infinity Cache is tied to each memory controller.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,386
1,652
136
Also, if a CU needs some data from a chiplet, there's a 1/6 chance of finding it on the first try, unless they make some kind of ID or sorting mechanism for all the data.

The data is sorted. Each cache slice caches accesses from a single memory controller. Memory locations are distributed to memory controllers based on their address, so the moment there is an access, you instantly know which cache slice it could possibly be in.
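To make the "sorted by address" point concrete, here's a minimal sketch of address-interleaved slice selection; the slice count and interleaving granularity are made-up values, not AMD's actual hashing scheme:

```python
# Illustrative only: which cache slice / memory controller owns an address.
# The constants below are assumptions for the example, not real RDNA values.
NUM_SLICES = 6          # e.g. one Infinity Cache slice per memory channel/MCD
INTERLEAVE_BYTES = 256  # hypothetical interleaving granularity

def slice_for_address(phys_addr: int) -> int:
    """Return the only slice that can possibly hold this address."""
    return (phys_addr // INTERLEAVE_BYTES) % NUM_SLICES

# A CU never has to search all slices: the address alone picks the slice.
assert slice_for_address(0x0000) == 0
assert slice_for_address(0x0100) == 1
assert slice_for_address(0x0600) == 0
```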
 

DisEnchantment

Golden Member
Mar 3, 2017
1,626
5,909
136
The N31 version with 3D V-Cache has 384MB of Infinity Cache, not 576MB.
Obviously I am not privy to any insider information, but based on openly available information I am not able to see the advantages of the 3D V-Cache version of the MCD. Also, at FAD22 I recollect AMD specifically saying 3D chiplet for MI300 but not mentioning it for RDNA3.

Additional cache needs additional real estate, be it a stacked chiplet or a regular base die/chiplet, so nothing changes here. And in terms of packaging cost, why stack dies using SoIC when you could just make a bigger MCD?
These chips are tiny anyway. Is overall chip packaging space so critical that it is worth the cost of SoIC stacking? The chip is going to be way smaller than what CoWoS could handle (~2500 mm2), and if EFB is used this limitation does not arise at all.

On the other hand, if they are doing 3D V-Cache at all, why skimp on the GCD (at less than 400 mm2)? Having 3D V-Cache MCDs with 32 MiB stacked on top of the 32 MiB + 64-bit-wide G6 bus base die would take the die area of each MCD to 65-70 mm2. Six of them would take the total area to 400+ mm2.
In this case the total die area of the MCDs (including the die area of the stacked chiplets) would exceed the die area of the GCD (albeit the GCD is on N5).
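A quick back-of-envelope check of those figures (the per-MCD areas are the estimate above, not confirmed numbers):

```python
# Rough sanity check of the MCD area estimate above (all values are
# estimates / assumptions, not confirmed specs).
mcd_area_mm2 = (65, 70)   # per MCD with 32 MiB base + 32 MiB stacked V-Cache
num_mcds = 6

total_mm2 = tuple(a * num_mcds for a in mcd_area_mm2)
print(total_mm2)          # (390, 420) -> "400+ mm2" vs a sub-400 mm2 GCD
```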
 
Reactions: Tlh97

HurleyBird

Platinum Member
Apr 22, 2003
2,697
1,293
136
Obviously I am not privy to any insider information, but based on openly available information I am not able to see the advantages of the 3D V-Cache version of the MCD. Also, at FAD22 I recollect AMD specifically saying 3D chiplet for MI300 but not mentioning it for RDNA3.

Additional cache needs additional real estate, be it a stacked chiplet or a regular base die/chiplet, so nothing changes here. And in terms of packaging cost, why stack dies using SoIC when you could just make a bigger MCD?
These chips are tiny anyway. Is overall chip packaging space so critical that it is worth the cost of SoIC stacking? The chip is going to be way smaller than what CoWoS could handle (~2500 mm2), and if EFB is used this limitation does not arise at all.

On the other hand, if they are doing 3D V-Cache at all, why skimp on the GCD (at less than 400 mm2)? Having 3D V-Cache MCDs with 32 MiB stacked on top of the 32 MiB + 64-bit-wide G6 bus base die would take the die area of each MCD to 65-70 mm2. Six of them would take the total area to 400+ mm2.
In this case the total die area of the MCDs (including the die area of the stacked chiplets) would exceed the die area of the GCD (albeit the GCD is on N5).

A pure cache chip may be more density-optimized than one with both cache and logic. There's also placement: if you make the MCDs too large, you may not be able to fit enough of them around the compute die to enable wide buses. For the most part, though, I agree. Something doesn't quite add up, in that based on the leaks it doesn't look like AMD is anywhere near maximizing the potential of the architecture. I see five possibilities:

1) AMD is feeding leakers disinformation, like when we thought RV770 was 640 SPs instead of 800.
2) Multiple GCDs were planned, but couldn't be made to work.
3) The leaks are right, but there's a higher performance part coming later. AMD is copying what Nvidia did with Kepler, Maxwell, and Pascal. Given the lack of leaks, this may not be a product for board partners at all (a "Titan" or "Fury").
4) Our performance expectations are about right, but the leaked die sizes are wrong.
5) An apple from the stupid tree fell off and hit an AMD big wig in the head, who decided it would be a good idea to try "sweet spot" (a.k.a. we can easily take the performance crown, but we won't) again.
 
Reactions: Kaluan and Tlh97

Frenetic Pony

Senior member
May 1, 2012
218
179
116
Obviously I am not privy to any insider information, but based on openly available information I am not able to see the advantages of the 3D V-Cache version of the MCD. Also, at FAD22 I recollect AMD specifically saying 3D chiplet for MI300 but not mentioning it for RDNA3.

Additional cache needs additional real estate, be it a stacked chiplet or a regular base die/chiplet, so nothing changes here. And in terms of packaging cost, why stack dies using SoIC when you could just make a bigger MCD?
These chips are tiny anyway. Is overall chip packaging space so critical that it is worth the cost of SoIC stacking? The chip is going to be way smaller than what CoWoS could handle (~2500 mm2), and if EFB is used this limitation does not arise at all.

On the other hand, if they are doing 3D V-Cache at all, why skimp on the GCD (at less than 400 mm2)? Having 3D V-Cache MCDs with 32 MiB stacked on top of the 32 MiB + 64-bit-wide G6 bus base die would take the die area of each MCD to 65-70 mm2. Six of them would take the total area to 400+ mm2.
In this case the total die area of the MCDs (including the die area of the stacked chiplets) would exceed the die area of the GCD (albeit the GCD is on N5).

The idea behind the large LLCs and stacking was for AMD to beat out the ever-growing need for more cache while getting ever smaller gains from node shrinks. I suspect they want(ed) to switch entirely to stacked cache, as producing just SRAM as a chiplet is cheaper than producing it in a mixed-logic chip, since SRAM needs a lot fewer metal layers than logic. It should also save on area in the base die overall, meaning that other than packaging costs it should be a win overall (cheaper to produce and design).

However, TSMC's choice of packaging tech isn't fast/reliable enough to make that "cheaper overall" happen, at least not yet. That means the only thing they can do with stacked SRAM right now is produce specialty SKUs that can use a much higher than average amount of SRAM. Hypothetically a very top-end GPU could fit in there: it takes ~256MB just for an upscaled 8K accumulation/target buffer, let alone all the other working buffers you might need even when working at 4K. Thus a 384MB-SRAM GPU could see use in the right cases, and would be an excellent target for stacked cache (maybe doubling up on the normal cache size of 192MB?)
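For reference, the ~256MB figure roughly checks out if you assume an FP16 RGBA target (8 bytes per pixel); the format choice here is my assumption:

```python
# Rough check of the "~256MB for an 8K target buffer" figure, assuming an
# FP16 RGBA render target (8 bytes per pixel) - the format is an assumption.
width, height = 7680, 4320        # 8K UHD
bytes_per_pixel = 8               # 4 channels x 16-bit float

buffer_mib = width * height * bytes_per_pixel / 2**20
print(round(buffer_mib))          # ~253 MiB, in line with the quoted ~256MB
```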
 
Reactions: Tlh97

tomatosummit

Member
Mar 21, 2019
184
177
116
A pure cache chip may be more density-optimized than one with both cache and logic. There's also placement: if you make the MCDs too large, you may not be able to fit enough of them around the compute die to enable wide buses. For the most part, though, I agree. Something doesn't quite add up, in that based on the leaks it doesn't look like AMD is anywhere near maximizing the potential of the architecture. I see five possibilities:

1) AMD is feeding leakers disinformation, like when we thought RV770 was 640 SPs instead of 800.
2) Multiple GCDs were planned, but couldn't be made to work.
3) The leaks are right, but there's a higher performance part coming later. AMD is copying what Nvidia did with Kepler, Maxwell, and Pascal. Given the lack of leaks, this may not be a product for board partners at all (a "Titan" or "Fury").
4) Our performance expectations are about right, but the leaked die sizes are wrong.
5) An apple from the stupid tree fell off and hit an AMD big wig in the head, who decided it would be a good idea to try "sweet spot" (a.k.a. we can easily take the performance crown, but we won't) again.
With both the shader engine change and chiplets I can see this is risky, but with AMD's performance over the past few years I don't expect them to launch a dud. If the potential of the arch is missed, then it's probably like Ampere, where the FP potential is doubled but only in synthetic environments. I like to think of it as a kind of CMT in the new shader engines.

The performance target has been very vague. The ball is in their court, so it's probably to beat Nvidia in as many fields as they can without blowing the power budget, unless AMD also has a 1kW card planned just for fanfare and benchmarks, given the amount of seemingly daily changes to Nvidia's specs. We saw similar leaks of every number possible before Ampere, and the change to the 3080 being GA102-based instead of GA103/104-based.

I don't know what to think about the original multi-GCD leaks. They look horribly similar to what we're hearing as dual Navi32 now, even with a 256-bit memory bus (although doubled), but I never thought a single 256-bit bus was ever in the realm of possibility for high-end performance, even with cache; maybe with GDDR7 in the future.
Then there was the patent with the stacked cache crossbar throwing everyone off, and someone correctly heard that Navi31 was 7 chiplets but then applied it to the old dual-GCD design, making 2 GCDs and 4 MCDs and inventing a control or IO chiplet as well.

I'd bet against stacked cache on the MCD, or at least against seeing it put into a released product. It seems overly complicated for a small chiplet that should be a stock part.
And do they need the doubled cache to win on performance? Probably something they'd be thinking about very hard right now, up until the last moment they have, if they put the TSVs on-chip and keep it as an option.
 

randomhero

Member
Apr 28, 2020
183
249
116
What if, and that is a big if, stacked cache is for dual-GCD models?
You use up half of your connections to memory (MCDs) to connect your GCDs (two, to be exact). Now you are bandwidth starved. Stack additional cache on the MCDs. Voila!
 

Karnak

Senior member
Jan 5, 2017
399
767
136
Again there shouldn't be any reason to do more than one cut per line... especially that far for N32 and N33.
If cut-downs are more effective than having 5 dies in total rather than only the alleged 3, then there is.

That's more plausible than calling it a fake, and I'm sure it'll be similar to this. The 6800 already is a 25% cut-down from the full chip. Cut down the bigger chip by a decent amount, clock the one below it to the moon (I think you'll get what that's supposed to mean), and the gap isn't that huge anymore - and you don't need a third one in between these two.
 

jpiniero

Lifer
Oct 1, 2010
14,739
5,368
136
If cut-downs are more effective than having 5 dies in total rather than only the alleged 3, then there is.

That's more plausible than calling it a fake, and I'm sure it'll be similar to this. The 6800 already is a 25% cut-down from the full chip. Cut down the bigger chip by a decent amount, clock the one below it to the moon (I think you'll get what that's supposed to mean), and the gap isn't that huge anymore - and you don't need a third one in between these two.

N21 was 520 mm2. Presumably, with the IF$ and memory controllers split off into separate chips, N31's main die won't be anywhere near that big.
 

Kepler_L2

Senior member
Sep 6, 2020
424
1,721
106
Oh god. Youtubers gonna Youtube. Again there shouldn't be any reason to do more than one cut per line... especially that far for N32 and N33. That's not even a decent fake.
It's not fake but a little wrong. 7950XT is 84 CUs, 7800 is 256-bit, 7600XT is 32 CUs, 7600 is 28 CUs and 7500XT does not exist for now.
 
Reactions: Kaluan

Saylick

Diamond Member
Sep 10, 2012
3,269
6,752
136
Are we sure N33 is 40 CUs again? That implies we're back at 5120 shaders, no? I thought the rumors were that N33 topped out at 4096 shaders? The rumor mill seems to constantly flip-flop these days. I don't really want to believe anything at this point, simply because we're only a few months out from the reveal.
 

Frenetic Pony

Senior member
May 1, 2012
218
179
116
Oh god. Youtubers gonna Youtube. Again there shouldn't be any reason to do more than one cut per line... especially that far for N32 and N33. That's not even a decent fake.

I can see disabled SEs being a thing, they have been for AMD for a while now (RX 6800, etc.), but those should be rare enough not to do anything with while inventory builds up. It's also quite rare to see AMD cut bus width; I guess yields at TSMC (with GDDR6) are just too high to make it worth it, whereas they were bad enough on Samsung (the GDDR6X parts) to provide plenty of cut dies for Nvidia. I don't see this suddenly changing.

Also, the bandwidth-to-CU figures are... odd.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,697
1,293
136
It's not fake but a little wrong. 7950XT is 84 CUs, 7800 is 256-bit, 7600XT is 32 CUs, 7600 is 28 CUs and 7500XT does not exist for now.

The 6800XT is a 10% cut of a 520mm2 die. The 3080 Ti is a 5% cut of a 628mm2 die. Color me skeptical that the 7950XT is a 12.5% cut of an allegedly 350mm2 die. The naming scheme makes it seem even less likely. The 7975 XT name brings to mind a slightly juiced 7950XT that was only released because the performance crown is so closely contested that every drop of performance matters.
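For context, those cut percentages follow from the unit counts (a minimal sketch; the N31 full-die figure of 96 CUs is the rumoured number, not confirmed):

```python
# Cut fraction = disabled units / full-die units.
def cut_pct(enabled, full):
    return 100 * (full - enabled) / full

print(cut_pct(72, 80))   # 6800 XT: 72 of 80 CUs enabled  -> 10.0%
print(cut_pct(80, 84))   # 3080 Ti: 80 of 84 SMs enabled  -> ~4.8%
print(cut_pct(84, 96))   # 7950XT?: 84 of 96 CUs (rumour) -> 12.5%
```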

This doesn't add up unless there's something really weird going on behind the scenes. There are many indications that yields are decent enough. If AMD felt so far ahead of Nvidia that they could make sizeable cuts without worry, they wouldn't be pushing TDPs. Either AMD marketing has regressed by about a decade, or there's something really odd happening, or you're being fed bad info.

Now, I'm not saying it's 86 CUs like RGT claims. I'm just saying this naming scheme + this alleged die size + these alleged cuts = big red flag.
 
Last edited:
Reactions: Tlh97 and Elfear

maddie

Diamond Member
Jul 18, 2010
4,783
4,759
136
If AMD is pushing clocks as much as is rumored, then besides defect yield, parametric yield should have more influence than previously.

In any case, what is the crossover point in sales volume at which doing a new design is cheaper than cutting a bigger die? Without knowing this, it's all fantasy. You cut because it's cheaper, not always because you have to due to defects, as some claim. And, you know what, sometimes you estimate wrong: you sell more than you thought, but this has to be balanced against the additional staff you would have employed to do more designs rather than cut an existing die. These are multi-dimensional problems.
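As a toy illustration of that crossover point (every number below is invented for the example, not a real cost figure):

```python
# Toy crossover model: a dedicated smaller die pays off once the silicon cost
# saved over the product's lifetime volume exceeds the extra design (NRE) cost.
# All numbers are invented for illustration.
nre_new_design = 50e6      # one-off cost of taping out a dedicated die, $
cost_cut_big_die = 120.0   # per-unit cost when harvesting a cut big die, $
cost_small_die = 80.0      # per-unit cost with a dedicated smaller die, $

crossover_units = nre_new_design / (cost_cut_big_die - cost_small_die)
print(f"New design pays off above ~{crossover_units:,.0f} units")  # ~1,250,000
```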

The 320-bit & 192-bit products can easily be explained by them being multi-chip products: an additional failure node appears at the chip assembly step, and some packages will have cache/memory controller bonding failures.
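A minimal sketch of that extra failure node, assuming some per-bond success rate (the numbers are made up):

```python
# Compound package yield when every MCD bond must succeed (illustrative only).
bond_yield = 0.99            # assumed probability that one MCD bonds correctly
num_mcds = 6

package_yield = bond_yield ** num_mcds
print(round(package_yield, 3))   # ~0.941 - the failures feed 320/192-bit SKUs
```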
 

HurleyBird

Platinum Member
Apr 22, 2003
2,697
1,293
136
If AMD is pushing clocks as much as is rumored, then besides defect yield, parametric yield should have more influence than previously.

Outside of the top, fully enabled and cherry picked parts, pushing clocks to that extent doesn't make much sense. If pushing clocks requires significant cuts, you're going two or three steps back for every step you take forward. The cuts eat into your performance gains while TDP balloons and board costs increase.
 

maddie

Diamond Member
Jul 18, 2010
4,783
4,759
136
Outside of the top, fully enabled and cherry picked parts, pushing clocks to that extent doesn't make much sense. If pushing clocks requires significant cuts, you're going two or three steps back for every step you take forward. The cuts eat into your performance gains while TDP balloons and board costs increase.
I always felt that they designed for a slightly flawed chip as their main product. E.g., the 6800XT, not the 6900XT, was the target. In my mind there was only one cut, the 6800. The 6900XT was a sort of golden product. Of course, as the process improves you get more upper-end products from the fab.
 
Reactions: Tlh97 and Leeea

TESKATLIPOKA

Platinum Member
May 1, 2020
2,381
2,879
136
It's not fake but a little wrong. 7950XT is 84 CUs, 7800 is 256-bit, 7600XT is 32 CUs, 7600 is 28 CUs and 7500XT does not exist for now.
A 7950XT with 84 CUs would be very close to the 7900XT with 80 CUs, even if that one has only a 320-bit bus.
The 7700XT is a pretty big cut-down (75% of everything) from N32, and it has the wrong number of WGPs.
 

beginner99

Diamond Member
Jun 2, 2009
5,216
1,589
136
The 6800 already is a 25% cut down from the full chip

Hence why that SKU simply did not exist during the shortage. Even the 6800XT barely did, as every chip seems to have been used for a 6900XT.
If we see 3 SKUs from, say, N31, I would expect the top one to have more RAM and higher clocks (or just higher clocks), just like the 6800XT vs 6900XT. The second part then isn't cut down at all, just lower clocked, and the 3rd SKU is the only cut-down part.
 

Leeea

Diamond Member
Apr 3, 2020
3,649
5,382
136
Hence why that SKU simply did not exist during the shortage. Even the 6800XT barely did, as every chip seems to have been used for a 6900XT.
If we see 3 SKUs from, say, N31, I would expect the top one to have more RAM and higher clocks (or just higher clocks), just like the 6800XT vs 6900XT. The second part then isn't cut down at all, just lower clocked, and the 3rd SKU is the only cut-down part.
The RX 6800 XT is different from the RX 6900 XT. The RX 6900 XT has more ray accelerators, texture units, and stream processors.

The RX 6800 XT and RX 6900 XT have the same clocks.
 

Timorous

Golden Member
Oct 27, 2008
1,705
3,040
136
A 7950XT with 84 CUs would be very close to the 7900XT with 80 CUs, even if that one has only a 320-bit bus.
The 7700XT is a pretty big cut-down (75% of everything) from N32, and it has the wrong number of WGPs.

80 CUs, reduced clocks, less cache, and less RAM is a reasonable differentiator vs the 84 CU version. Unless AMD can cut in a lopsided way, though, the cut N31 has to be 84 CUs, because that cuts 1 WGP per SE, and that is how the maths works (see the sketch below).
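A quick sketch of that maths, assuming the rumoured N31 layout of 6 shader engines with 8 WGPs each and 2 CUs per WGP (rumoured figures, not confirmed):

```python
# CU count when one WGP per shader engine is disabled (rumoured N31 layout).
SHADER_ENGINES = 6
WGPS_PER_SE = 8
CUS_PER_WGP = 2

full_die = SHADER_ENGINES * WGPS_PER_SE * CUS_PER_WGP            # 96 CUs
one_wgp_per_se_cut = SHADER_ENGINES * (WGPS_PER_SE - 1) * CUS_PER_WGP  # 84 CUs
print(full_die, one_wgp_per_se_cut)
```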

The 7700XT is a pretty big cut-down, but the base N32 die is only around 230mm2, which is pretty small for the x700 tier, let alone the x800 tier, so I think the margin is there.

With MCDs and a mix-and-match strategy between them, it brings in a new paradigm for the best way to manage your product stack, which means the old "you only need 1 cut because yields are so good" mantra may not be true any longer.
 
Reactions: Kaluan