Slower than L3 for sure, but a damn sight lower latency than DDR mounted on the motherboard, not to mention lower power draw, since the signals have so much less distance to travel - if they ever mount HBM on the CCX chiplets it would make a killer L4.
I don’t think it actually would make a good L4; it is still DRAM, and DRAM is quite slow to read. It is pretty unclear where AMD is going next. I don’t know if it is really worth it to put cpu chiplets on an interposer. The cpu chiplets in Rome are around 600 square mm by themselves, and the IO die adds over 400 more. That would be a very big and expensive interposer, possibly without much benefit.
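A quick tally of the die areas quoted above (these are the rough Rome figures from this thread, not official numbers):

```python
# Rough interposer-size tally from the figures quoted above.
# Both areas are this thread's estimates, not confirmed specs.

cpu_chiplet_area_mm2 = 600   # all eight cpu chiplets combined, roughly
io_die_area_mm2 = 400        # IO die is "over 400" square mm

# A passive interposer would have to span at least the sum of the dies
# sitting on it, plus some spacing between them.
min_interposer_mm2 = cpu_chiplet_area_mm2 + io_die_area_mm2
print(min_interposer_mm2)   # 1000 -> well over 1000 mm^2 with spacing
```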
I could see them going with an interposer for the IO die though; an active interposer would make a huge amount of sense there. They could place all of the larger, higher-power transistors needed to drive external interfaces into the interposer and then stack 7 or 5 nm chiplets on top for the logic. Interposers add a lot of flexibility, so it is hard to predict what configuration they would choose. The IO interposer may actually be smaller than the current IO die, since stacking gives it twice the effective die area. They may be limited by the number of pads required for all of their interfaces though; the sheer number of signal pins may set a floor on how small it can be.
I am assuming that the current IO die arranges the memory controllers into 4 128-bit controllers. The infinity fabric is now 256-bit read and 128-bit write, so they need a 128-bit memory controller operating at DDR rate (effectively 256-bit per clock) to supply it. I could see them using four separate 7 or 5 nm memory controller chiplets with a huge amount of SRAM cache on each one. They could connect to each other with very wide paths at low clocks to save power. Cache scales very well with process, so it could be a very large cache.
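The width arithmetic above checks out as a back-of-the-envelope model (assuming a 1:1 memory-to-fabric clock ratio, which is my assumption, not a confirmed spec):

```python
# Back-of-the-envelope check of the fabric vs. memory-controller widths
# discussed above. All figures are from this thread's speculation.

fabric_read_bits = 256       # Infinity Fabric read width per clock
controller_width_bits = 128  # one 128-bit DDR memory controller
ddr_transfers_per_clock = 2  # DDR signaling: two transfers per clock

# Effective bits per fabric clock delivered by one controller,
# assuming the memory clock runs 1:1 with the fabric clock.
effective_bits = controller_width_bits * ddr_transfers_per_clock

print(effective_bits)                      # 256
print(effective_bits == fabric_read_bits)  # True: one controller can feed it
```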
They would also need another few chiplets for the IO and fabric logic, but that path isn’t as latency-sensitive as memory, so it doesn’t need to be as tightly coupled. This may be where the 15-chiplet rumors come from. They may have room on the Epyc package for 2 more cpu die, bringing the total up to 80 cores.
This would require a huge amount of redesign of just about everything though, so something like this may not be coming until Zen 4 or 5; I would guess at least Zen 4. They may do something like this with the switch to DDR5. I suspect Zen 3 will keep a similar IO die; it is already massive overkill on the IO side. I think they will probably focus on core improvements. They may tweak the IO die design for better latency. They could even shrink it and add some L4 cache, although I don’t know if 7 nm makes sense for an IO die. A shrink could also make room for more cpu die, although IO doesn’t scale well.
For core improvements, I don’t think SMT4 is that outlandish. Zen was designed with servers in mind: it had a giant infinity fabric switch and 4 IFOP links that were completely wasted in the consumer space. Zen 2 gets rid of that, so the consumer design doesn’t have all of the extra server components wasting die area. You generally wouldn’t want SMT4 in the consumer space, so if they do implement it, it may not be supported on consumer parts or may be disabled by default.
What enthusiasts don’t seem to get is that for a lot of servers, all of that AVX hardware is completely wasted. A lot of servers perform almost zero FP operations. A lot of server code is very branch-intensive with hard-to-predict branches. Servers also have large memory footprints that reduce the effectiveness of caches. Server code often achieves an IPC of much less than one; I have profiled code with Processor Counter Monitor, and even seemingly compute-heavy code often only achieves an IPC of around 1.
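For those who haven’t used a profiler like Processor Counter Monitor: IPC is just retired instructions divided by core cycles, both read from hardware counters. The counter values below are made up purely for illustration:

```python
# How IPC is derived from the two hardware counters a tool like
# Processor Counter Monitor reports. Counter values are hypothetical,
# picked to illustrate "compute heavy" code barely reaching IPC ~1.

instructions_retired = 2_500_000_000   # hypothetical sample interval
core_cycles          = 2_400_000_000   # hypothetical sample interval

ipc = instructions_retired / core_cycles
print(f"IPC = {ipc:.2f}")   # IPC = 1.04, on a core that can retire 4+/cycle
```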
Certain types of server code can make good use of SMT since it can’t achieve very high IPC anyway, so you can run some extra threads and get much more throughput. It is a good way of sharing things like the FP units that go mostly unused in many servers. Earlier processors have done 4- and 8-way SMT; the SPARC T-series went up to 8-way a long time ago. The limitation back then was transistor budget: the T2 was made on 65 nm with 8 cores and 8 threads per core. That is maybe a couple hundred million transistors, but it performed well for some specific applications. Zen 2 is already close to 4 billion transistors just on the cpu die. With 7 nm+ or 5 nm, they can afford to duplicate or intelligently share a lot of resources that would not have been possible in earlier implementations. Also, for those that don’t know, SMT is often not useful for HPC applications. It depends on the application, but many HPC applications are compute-intensive enough that SMT can hurt performance, and it is often disabled on HPC machines.
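The intuition for why SMT pays off on low-IPC code can be sketched with a toy saturation model (the numbers are illustrative assumptions, not measurements of any real core):

```python
# Toy model of why SMT helps branchy, cache-missing server code:
# each thread stalls often, so extra threads fill issue slots the
# first thread leaves empty. All numbers here are assumptions.

issue_width = 4          # max instructions the core can issue per cycle
single_thread_ipc = 0.8  # typical low-IPC server workload (assumed)

def aggregate_ipc(threads):
    # Optimistic model: per-thread demand adds up until the core's
    # issue width saturates. Real cores share more than issue slots.
    return min(issue_width, threads * single_thread_ipc)

for t in (1, 2, 4, 8):
    print(t, aggregate_ipc(t))   # 0.8, 1.6, 3.2, then capped at 4
```

HPC code is the opposite case: a single thread already sits near the saturation point, so extra threads just fight over the same units and caches.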
If they don’t increase the actual core count with Zen 3, then doubling the number of logical cores could be a good substitute if they throw enough hardware at it. I wouldn’t mind having a 512-thread machine for compiling code, even if it is *only* 128 physical cores.
I kind of doubt that we will see AVX512. I always saw AVX512 as Intel’s kludge to try to make their cpus perform like gpus. From what I have heard about Xeon Phi from HPC people, it didn’t really work that well, and Intel is now designing its own gpu anyway. If you have something that can really take advantage of 512-bit vectors, then you probably should look at running it on a gpu. They may go up to 4 full 256-bit FMA units, but increasing the width again also requires widening all of the interconnect that feeds the new units. Die area may not be the issue; the interconnect may just burn too much power.
I guess I kind of expect some bigger core updates with Zen 3 and the IO not changing that much. The big IO changes may come with Zen 4, alongside DDR5 and, I guess, possibly pci-express 5.0. They could certainly tweak the IO die with Zen 3, but I don’t think there will be radical changes. Moving to an interposer requires completely redesigning everything to take advantage of wide internal paths; it seems too soon for that, and they will need to redesign for DDR5 anyway. I don’t expect HBM because it would probably require an interposer and therefore a complete redesign. It also just isn’t a good cache.