Discussion Modern GPU designs

biostud · Nov 4, 2022

I'm no hardware engineer so it is just based on a layman's observations.

SLI/CF is now virtually dead, and we have returned to monolithic (and now chiplet) designs. SLI/CF gave os the possibility to increase our performance usually around 70-80% (if it worked), but at double cost and power consumption. But with it, problems with microstutter and not all games supporting it also existed.

With the 4090 and 7900 series presented we see nearly a triple number of transistors used compared to last generation, but raster performance is not even doubled. As I understand it, we run into the problem with total power budget for the GPU, as it cannot increase in the same amount as the number of transistors.

So what I'm wondering is why is the performance/transistor decreasing so drastically with each generation? It doesn't seem like an economical design. Wouldn't it be better (if possible) to make the die smaller and run with a higher frequency?

If most of the performance/watt increase comes from change in proces nodes, then can we ever expect better performance/watt gen. to gen. than it allows for?

Mopetar · Nov 4, 2022

Transistors are being used for things like RT and Tensor cores or other instructions/features that don't improve performance in the general sense.

I suspect we'll circle round on SLI/Xfire again if only because the economics of moving off of a monolithic die are so favorable.

The technology to go that route will probably be mature enough by the next console cycle and they have even more reason to go that route. That pushes developers to design for the approach.

Performance per watt is likely to remain constrained by node advances. If there were a way to see massive architectural gains, designers would be doing that already because at a certain point, the power wall is a limiting factor for top end GPUs.

biostud · Nov 4, 2022

Yeah, if I remember correctly there has been power someone measuring power draw of the 4090 to 375W when RT/tensor cores where not in use.

But if we look at navi31, then it isn't carrying tensor cores or rt cores, but the transistor count is -2.5 times greater than navi21.

The red spirit · Nov 4, 2022

In short breakdown of Dennard scaling (https://en.wikipedia.org/wiki/Dennard_scaling). More advanced explanation:
Due to size of transistors being shrunken down and down, current leakage became more and more significant. We can make them smaller, but we can't do that due to them leaking. Some tech has been developed to counter that like high K dialectrics, FinFET or strained silicon. That helps a bit, but we are still hitting limits of that. All of it means that transistors aren't perfectly insulated and even if transistor is "off", some electrons leak and that ends up being heat, extra power consumption. The smaller and denser chip gets, the more "noise" there is. Leakage also increases with voltage and to achieve twice as high frequency we need squared voltage. Besides that, manufacturers like AMD or nVidia, clcok cards way past their highest perf/watt point. How much depends on each model, but 4090 has been very absurd. You can cut clock speed by 5% to get 25% power savings.

biostud · Nov 4, 2022

The red spirit said:
In short breakdown of Dennard scaling (https://en.wikipedia.org/wiki/Dennard_scaling). More advanced explanation:
Due to size of transistors being shrunken down and down, current leakage became more and more significant. We can make them smaller, but we can't do that due to them leaking. Some tech has been developed to counter that like high K dialectrics, FinFET or strained silicon. That helps a bit, but we are still hitting limits of that. All of it means that transistors aren't perfectly insulated and even if transistor is "off", some electrons leak and that ends up being heat, extra power consumption. The smaller and denser chip gets, the more "noise" there is. Leakage also increases with voltage and to achieve twice as high frequency we need squared voltage. Besides that, manufacturers like AMD or nVidia, clcok cards way past their highest perf/watt point. How much depends on each model, but 4090 has been very absurd. You can cut clock speed by 5% to get 25% power savings.

So by increasing transistor count by >150%, and shrinking the transistors you only get up to 70% more performance in the same power budget.
I can understand the 4090 where they also increase the power budget, but I would have thought that N32 with the same power budget as the N31, would have enough transistors to get the performance of the N31.

Mopetar · Nov 4, 2022

biostud said:
Yeah, if I remember correctly there has been power someone measuring power draw of the 4090 to 375W when RT/tensor cores where not in use.

But if we look at navi31, then it isn't carrying tensor cores or rt cores, but the transistor count is -2.5 times greater than navi21.

The doubled cores that both AMD and NVidia have implemented in recent architectures, can theoretically double the performance, but there are plenty of cases where only one of the pair can be utilized, which is why neither company saw the performance double, even though the shader count effectively did. It does work a lot better for compute workloads, however.

All the extra cache has added more transistors not just for the cache itself, but the controllers to support it as well. Some of that means fewer transistors spent on memory interconnects, but more transistors will be used on the cache than the memory interconnects replaced by them. Even if that seems wasteful, it's more power efficient and possibly more area efficient as well.

Almost anything that either AMD or NVidia does will lower the traditional efficiency per transistor. But that's okay, as long as we have a next node the trade off is generally worth it.

The red spirit · Nov 4, 2022

biostud said:
So by increasing transistor count by >150%, and shrinking the transistors you only get up to 70% more performance in the same power budget.
I can understand the 4090 where they also increase the power budget, but I would have thought that N32 with the same power budget as the N31, would have enough transistors to get the performance of the N31.

Here's a fun part, nowadays new "node" can also mean no shrinking. You can gain some density if you figure out how to avoid quantum tunneling and current leakage. Der8auer made a video about how manufacturers "lie" about node sizes (they don't lie, they ran out of Moore's Law like scaling, improved manufacturing in many other ways and don't know how to best define their advances without doing bad PR job).
As to why those Navi chips perform differently it's more complex answer and frankly not determined by process node as it is the same. Thing is is bigger chip, which is clocked lower will not need as much voltage to achieve same level of performance as smaller die clocked higher. The reason for this is the rule of power usage (https://en.wikipedia.org/wiki/Processor_power_dissipation):

Here P is power usage in watts, C is current in amps, V is voltage in volts and f is frequencies in Hertz. And another rule is that for each unit of frequency to stabilize you more or less need square of voltage. However, it's true that due to current leakage big chips use more amps while idle and due to quantum leakage for their are size they also have more leakage of current, but unlike voltage amperage scale linearly, meanwhile voltage squares. So that's why smaller chip can't crank clock speeds high enough to compensate reduction in die size, well at least not at some power consumption or long term durability.

igor_kavinski · Nov 4, 2022

I wonder why rasterization is still a thing with 3D graphics. Wouldn't a complete vector based pipeline be the way to go, so the textures are infinitely scalable to any resolution? Is it because these companies are just trying to improve the basic technology they started with in the 90s and don't want to disturb the status quo because it would be too risky?

Mopetar · Nov 4, 2022

igor_kavinski said:
I wonder why rasterization is still a thing with 3D graphics. Wouldn't a complete vector based pipeline be the way to go, so the textures are infinitely scalable to any resolution? Is it because these companies are just trying to improve the basic technology they started with in the 90s and don't want to disturb the status quo because it would be too risky?

Screens are a fixed number of pixels so any image, regardless of how it's represented or worked on internally ultimately needs to be rasterized.

It also turns out that we have a lot of people and tools that are really good at creating and mapping textures onto polygon meshes, so the current mode of graphics we're using isn't going away anytime soon.

Sure we have the hardware now to fully ray trace a game and do so at playable frame rates. The only problem is that it has to be a game that's a few decades old because we don't have the horse power for anything more complex unless you're okay measuring your slide show in SPF.

Leeea · Nov 4, 2022

biostud said:
But if we look at navi31, then it isn't carrying tensor cores or rt cores, but the transistor count is -2.5 times greater than navi21.

This is not true.

Rdna3 has both rt and ai units.

IntelUser2000 · Nov 5, 2022

biostud said:
With the 4090 and 7900 series presented we see nearly a triple number of transistors used compared to last generation, but raster performance is not even doubled. As I understand it, we run into the problem with total power budget for the GPU, as it cannot increase in the same amount as the number of transistors.

Well, this is too simple of an explanation, and when we look at complex objects, simple explanations have limits.

High-K and FinFET has pretty much addressed the leakage issues, but still hasn't stopped the slowing down of scaling, and I mean in terms of performance/watt. So rather than just shrinking offering power reduction per transistor, to get the same effect as before you need to change materials and/or structures, which gets harder, longer, and more expensive.

In addition to that we have performance being limited by data transfer, because that's taking an increasing amount of power, and we're limited by power.

So the transistor count increases disproportionately to compute improvement, because things like Tensor Cores and caches are area and power efficient.

The actual transistor density has not improved 3x across the board, but because they are increasing the area that's easier to scale, you have more transistors than before.

If Ada and RDNA3 just got more of what predecessors had, I bet you it would be 2x the transistors and roughly 2x the performance gain. Well, you are probably losing 10-15% performance by not focusing entirely on raster, but you get entirely new things like DLSS(not DLSS3 LOL!) and Ray Tracing.

biostud · Nov 5, 2022

IntelUser2000 said:
Well, this is too simple of an explanation, and when we look at complex objects, simple explanations have limits.

High-K and FinFET has pretty much addressed the leakage issues, but still hasn't stopped the slowing down of scaling, and I mean in terms of performance/watt. So rather than just shrinking offering power reduction per transistor, to get the same effect as before you need to change materials and/or structures, which gets harder, longer, and more expensive.

In addition to that we have performance being limited by data transfer, because that's taking an increasing amount of power, and we're limited by power.

So the transistor count increases disproportionately to compute improvement, because things like Tensor Cores and caches are area and power efficient.

The actual transistor density has not improved 3x across the board, but because they are increasing the area that's easier to scale, you have more transistors than before.

If Ada and RDNA3 just got more of what predecessors had, I bet you it would be 2x the transistors and roughly 2x the performance gain. Well, you are probably losing 10-15% performance by not focusing entirely on raster, but you get entirely new things like DLSS(not DLSS3 LOL!) and Ray Tracing.

So they are using a lot of extra transistors in the new design, without any performance benefits?

IntelUser2000 · Nov 5, 2022

biostud said:
So they are using a lot of extra transistors in the new design, without any performance benefits?

They are, but it's being used for things like the new blocks(Ray Tracing/Deep Learning Upscaling), and power efficiency such as caches.

That's why transistor count is a misleading metric. Because you can often decouple the actual density gain from transistor improvement. Plus you have different way of counting them. If X chip has 3 billion transistors, and Y chip has 6 billion transistors, but X = Y, then who cares?

If you look at things in total such as Shader count, TMU increase then you get a better picture, though still not complete. Like recent architectures(starting from Turing with Nvidia and RDNA3 with AMD) the amount of shaders increase dramatically, but rest of them stay same.

So the overall amount seems a lot less, and it confuses people because people think shader = performance. But in reality if you look at many games in general, it's bit of shaders, bit of fillrate(TMU/ROP), that uses a lot of memory(bandwidth), and Ray Tracing depends on performance of that particular unit. You have 4 major sectors, so what happens when you improve just 1?

biostud · Nov 5, 2022

IntelUser2000 said:
They are, but it's being used for things like the new blocks(Ray Tracing/Deep Learning Upscaling), and power efficiency such as caches.

That's why transistor count is a misleading metric. Because you can often decouple the actual density gain from transistor improvement. Plus you have different way of counting them. If X chip has 3 billion transistors, and Y chip has 6 billion transistors, but X = Y, then who cares?

If you look at things in total such as Shader count, TMU increase then you get a better picture, though still not complete.

As a consumer I don't care. All that matters is price/performance.

As a tech enthusiast I'm just trying to understand the logic behind it. (Without having any technical background.)

IntelUser2000 · Nov 5, 2022

biostud said:
As a tech enthusiast I'm just trying to understand the logic behind it. (Without having any technical background.)

That's awesome. Keep at it, that's how you learn!

Reread the post again. I was updating on the way lol.

The people and teams working on these can and be genuinely called architects. Just like those working on skyscrapers and on ships, the amount of work and experience needed to bring it to fruition is not a trivial thing at all.

An architect cannot consider one area when looking for improvements. Everything has to improve. And then you are doing all dynamically, meaning you have to adjust to market conditions.

Games are probably going in the direction of being shader-heavy hence the bigger focus on improving shader performance without increasing others as much. In addition, for RDNA3, in order to double the amount of ALU throughput, they are compromising somewhere. So 1 RDNA3 shader doesn't equal 1 RDNA2 shader. Pretty close, but not quite.

biostud · Nov 5, 2022

I know it is not "simple" but if RDNA2 was simply ported to 5nm what kind of performance increase would you think possible?

The red spirit · Nov 5, 2022

biostud said:
I know it is not "simple" but if RDNA2 was simply ported to 5nm what kind of performance increase would you think possible?

Same performance. You get performance only if you improve IPC, therefore expanding current architecture or by clocking it higher.

Anyway, I think you might want to check out rDNA whitepaper:

https://www.techpowerup.com/gpu-specs/docs/amd-rdna-whitepaper.pdf

And then official rDNA 2 documentation:

AMD RDNA™ Performance Guide

Our one-stop resource for getting great AMD RDNA™ performance on Vulkan® and DirectX®12 APIs!

gpuopen.com

igor_kavinski · Nov 5, 2022

IntelUser2000 said:
In addition, for RDNA3, in order to double the amount of ALU throughput, they are compromising somewhere.

The AT article mentioned that they chose the cheap way out by using dual issue and performance will be more dependent on the drivers ensuring that the hardware doesn't go under-utilized. They are pulling the fine wine crap again.

The red spirit · Nov 5, 2022

igor_kavinski said:
The AT article mentioned that they chose the cheap way out by using dual issue and performance will be more dependent on the drivers ensuring that the hardware doesn't go under-utilized. They are pulling the fine wine crap again.

Meh, it might work out just fine. AMD works closely with Microsoft. Microsoft owns xBox that relies on Radeons and creates DirectX API. Neither of them want to fail and therefore are ready to put in extra effort for each other. This relationship helped nVidia a lot back in around 2008, when it rolled out CUDA and driver optimizations.

biostud · Nov 5, 2022

The red spirit said:
Same performance. You get performance only if you improve IPC, therefore expanding current architecture or by clocking it higher.

Anyway, I think you might want to check out rDNA whitepaper:

https://www.techpowerup.com/gpu-specs/docs/amd-rdna-whitepaper.pdf

And then official rDNA 2 documentation:

AMD RDNA™ Performance Guide

Our one-stop resource for getting great AMD RDNA™ performance on Vulkan® and DirectX®12 APIs!

gpuopen.com

Yes, I know if it runs the same clock it would be same performance, what I meant was how much higher would it clock in same power budget going from 7nm to 5nm.

The red spirit · Nov 5, 2022

biostud said:
Yes, I know if it runs the same clock it would be same performance, what I meant was how much higher would it clock in same power budget going from 7nm to 5nm.

Hard to say, because it's process specific (more accurately fab specific due to what VLSI block rules they have and how where exactly leaks are). I only know that you might need less voltage for smaller chip, because of lower capacitance. It's probably not a whole lot, maybe 20%. It goes back to Dennard's Law that I mentioned, which broke down around 2006 and now scaling has been worse and more volatile.

Edit: I found some data. TSMC's N7 process has density of 95 million transistors per square millimeter, N5 has 127. We can take Navi 21 die (RX 6900 XT) as example and it is 520 square millimeters big and has 26800 millions transistors. For some reason Navi 21 die has density of just 51.5 million transistors per millimeter, so if we wanted to make it with N5 process, we need to increase Navi 21's density from 51.5 milT/mm2 to 69 milT/mm2 (51.5 * 1.34). We end up with roughly 388 mm2 die. Leakage current tend to increase the denser chips get, but it decreases if we use lower voltage. I have no idea how to calculate leakage current, so that is thrown out. So if we only get voltage reduction for same clock speed only, we may get squared reduction in voltage requirements, but then again we need squared voltage increases for linear clock speed increases. We end up at same 34% improvement, but it's also optimistic, because I mentioned that leakage increases with higher voltage. And I don't know how much, perhaps by root of our "gains" and we only end up with measly 16% of clock speed gain. But I'm probably wrong here, so it would be nice to be corrected by someone who knows more.

IntelUser2000 · Nov 5, 2022

@biostud Significant, but not a lot in the big picture. 10-20% maybe?

igor_kavinski said:
The AT article mentioned that they chose the cheap way out by using dual issue and performance will be more dependent on the drivers ensuring that the hardware doesn't go under-utilized. They are pulling the fine wine crap again.

It's just an engineering tradeoff. The doubled ALUs with the new design come with almost no area increase. Being more dependent on drivers and scaling less with shader firepower is a worthy tradeoff when it comes for "free".

A/// · Nov 6, 2022

Sli will hold a special place in my heart. my very cold heart. I had a badboy setup of 7900 GTX in sli. I had one of the best rigs on a game forum. I SLI'd nearly every generation until I grew tired of gaming. only for it to snap back to me in delight years later with me wanting a 7900xt or xtx.

biostud · Nov 11, 2022

So the idea is that it is better to do a design that uses a lot of transistors, if that prevents the power budget to be blown?
And is it the nature of a GPU, because it is often under full load that, if you have to many watt/mm^2 it is difficult to keep cool and thermal resistance also goes up? So a larger design with fewer watt/mm^2 is preferable?

The red spirit · Nov 11, 2022

biostud said:
So the idea is that it is better to do a design that uses a lot of transistors, if that prevents the power budget to be blown?

For GPUs, generally yes, because they aren't exactly general purpose processors and their performance scales extremely well with core count, due to the nature of their tasks. And since we ignore reality, I suppose we can ignore higher silicon failure of bigger die chips and higher activation voltage required for bigger chips, as well as higher idle wattage as result of higher amperage and leakage of bigger dies.

biostud said:
And is it the nature of a GPU, because it is often under full load that, if you have to many watt/mm^2 it is difficult to keep cool and thermal resistance also goes up? So a larger design with fewer watt/mm^2 is preferable?

Yes, it's difficult to cool down high heat density chips, however it's usually preffereable to have higher heat density than bigger die, because bigger dies suck more power at idle and bigger dies may clock lower due to their inherent higher voltage requirements, which arise from having to maintain same signal integrity across bigger area. As to which is preferable depends on particular design and needs

Discussion Modern GPU designs

Lifer

Diamond Member

Lifer

Member

Lifer

Diamond Member

Member

Lifer

Diamond Member

Diamond Member

Elite Member

Lifer

Elite Member

Lifer

Elite Member

Lifer

Member

Lifer

Member

Lifer

Member

Elite Member

Diamond Member

Lifer

Member