5&6. In order to understand what I mean by ROp to cache or ROp to Memory Controller ratio, we need to look at a schematic of GM107.
GM20x differs from GM107 in that NVIDIA increased the ROp to Memory Controller ratio from 8:1 to 16:1. So let's look at both GM204 and GM200.
GM204
- 64 ROps divided by 16 = 4 groupings.
- 2MB of L2 cache divided by 4 = 512KB per grouping.
- 256 bits / 4 = 64 bits per grouping.
- Each grouping of 16 ROps has 512KB L2 cache and a 64-bit memory controller at its disposal (aside from the color cache).
GM200
- 96 ROps divided by 16 = 6 groupings.
- 3MB of L2 cache divided by 6 = 512KB per grouping.
- 384 bits / 6 = 64 bits per grouping.
- Each grouping of 16 ROps has 512KB L2 cache and a 64-bit memory controller at its disposal (aside from the color cache).
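The arithmetic for both chips can be checked with a short Python sketch. It only restates the figures listed above (ROp count, L2 size, bus width, and the 16:1 ratio). The `Gpu` class and `per_group` helper are names made up for this illustration, and the even split of L2 and bus width across groupings is the same assumption the lists above make.

```python
from dataclasses import dataclass

ROPS_PER_GROUP = 16  # GM20x ratio quoted in the post


@dataclass(frozen=True)
class Gpu:
    """Illustrative container for the figures quoted in the post."""
    name: str
    rops: int
    l2_kb: int
    bus_bits: int


def per_group(gpu: Gpu) -> dict:
    """Split a chip's L2 cache and memory bus evenly across its ROp groupings."""
    groups, remainder = divmod(gpu.rops, ROPS_PER_GROUP)
    if remainder:
        raise ValueError("ROp count is not a multiple of the grouping size")
    return {
        "groups": groups,
        "l2_kb_per_group": gpu.l2_kb // groups,
        "bus_bits_per_group": gpu.bus_bits // groups,
    }


# Figures taken directly from the lists above.
GM204 = Gpu("GM204", rops=64, l2_kb=2 * 1024, bus_bits=256)
GM200 = Gpu("GM200", rops=96, l2_kb=3 * 1024, bus_bits=384)

for gpu in (GM204, GM200):
    print(gpu.name, per_group(gpu))
```

Both chips come out at 512KB of L2 and a 64-bit memory controller per grouping of 16 ROps, so GM200 adds more groupings without giving each one any more cache or bus width.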
The result is that there isn't enough bandwidth to feed these ROps, and they're consistently 10 GPixel/s behind their theoretical throughput. This is without any other work straining the memory controller or L2 cache, as seen here:
Knowing this was a limitation, NVIDIA invested heavily in color compression algorithms in order to reach parity, or near parity, with Fiji and its 64 ROps, as seen here:
This issue is further compounded by the inefficient memory controllers used by GM20x. NVIDIA had to sacrifice efficiency in order to keep die size down and power usage low as seen here: