Well, I don't understand the last bit about OS scheduling, and you definitely want to keep the front end close to the L1 and L2 and the execution units. (The retire stage also needs to be close to the front end and L1.) As far as software scheduling goes, I can't see vast changes, only improvements in making the OS more aware of the topology (especially with big+little hybrid cores).
One can add another layer of abstraction so that the OS only sees virtual cores, and let the CPU (with all its branch predictors, prefetchers, internal statistics, and awareness of shared data in the L3) decide how to match the virtual cores to the physical cores. There are pros and cons; as long as there is a way to disable such a mode in the BIOS setup I'd be OK with it.
I agree with moinmoin and Vatilla that widening the core without going beyond SMT2 would cause underutilization of the core. I think they've already gone well beyond the low-hanging fruit for branch prediction, prefetch, and other bottlenecks; I think they're near the point where it's a choice between refraining from widening the core, accepting low utilization, or going beyond SMT2. It wouldn't surprise me if utilization is already on the low side despite all the progress made with the two-tier branch prediction, the prefetchers, the doubled op cache, and the wider reorder window.
3. You don’t *need* more threads because you widened the core; you may want wider decode/dispatch units and better prefetch and speculative execution to keep the execution units busy. [....]
AMD would be better off putting 3 CCXs on a die - there is no reticle limit with chiplets, there are just package size and power limits. More cores is the most efficient way to gain significant MT performance for now.
If there were no need to increase IPC or efficiency, yes, that would be a perfectly fine (and low-cost) way to do it.
The problem is the market is demanding more IPC as well as lower wattage for mobile PCs (the biggest market) and servers. So for servers and high-end ULP laptops, just throwing more cores at it is often not a good solution. In fact, for ULP laptops you want to keep the core count down to a very reasonable number (for power consumption's sake). You might go to the high side on mobile core count if you're able to dynamically power down physical cores.
This ability to power down cores is something that SMT4 could enable for some consumer products like ULP mobile. If a hypothetical future AMD APU (with SMT4-capable cores) senses that it's been at its lowest p-state with low utilization for a while, it can power down half its cores and transition on the fly from SMT2 to SMT4 mode, then power the cores back up and transition back to SMT2 mode once it senses the p-state increasing or higher utilization. (All the while it would appear as an ordinary SMT2 processor to the consumer, e.g. 4c/8t.)
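To make that concrete, here's a minimal user-space-flavoured sketch of such a governor loop. In reality this policy would live in firmware/PM code; read_pstate, read_utilization, enter_smt4_mode, enter_smt2_mode, and all the thresholds are invented placeholders, not anything AMD has shown.

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* --- hypothetical platform hooks, stubbed so this compiles --- */
static int  read_pstate(void)      { return 0; }   /* 0 = lowest p-state */
static int  read_utilization(void) { return 10; }  /* percent, made up   */
static void enter_smt4_mode(void)  { puts("SMT4: half the cores powered down"); }
static void enter_smt2_mode(void)  { puts("SMT2: all cores powered up"); }

int main(void)
{
    bool smt4 = false;
    int  quiet_seconds = 0;

    for (;;) {
        /* "idle" = lowest p-state AND low utilization, per the post */
        bool idle = (read_pstate() == 0 && read_utilization() < 20);
        quiet_seconds = idle ? quiet_seconds + 1 : 0;

        if (!smt4 && quiet_seconds >= 30) {
            /* quiet for a while: power down half the cores and fold the
             * orphaned logical CPUs onto the survivors as SMT4 siblings */
            enter_smt4_mode();
            smt4 = true;
        } else if (smt4 && !idle) {
            /* load is back: power cores up, return to SMT2 */
            enter_smt2_mode();
            smt4 = false;
        }
        sleep(1);
    }
}
```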
As for going SMT4, I think it will be the case for Zen 4. For Zen 3 my hope is that it's SMT2 plus background thread(s).
These hypothetical background threads would have their own dispatch and a primitive scheduler without register renaming (which makes them simple and very close to an in-order execution core). Minimal (if any) L1 and L2 prefetch. No speculative execution, which eliminates the need for branch prediction and checkpoints. Because they now pause at every branch until it resolves, a pair of background threads (4-way MT) is far better than a single background thread (3-way MT): while one is waiting on a branch, the other can keep issuing.
The background thread instructions enter the µop queue just like all other instructions. They're dispatched from the front end, in order, to the µop queue and then to the scheduling pool. From that point they're treated just like SMT threads in the pipeline, so I'll quote it right out of the textbook: in the scheduling pool the instruction is checked by the scheduler against the ready flags in the physical register file (PRF), and when all the operands are ready, the scheduler forwards the instruction to an available execution unit.
The only minor difference is that "available" is redefined for background threads: AGU reads are shared democratically, AGU writes are prioritized to the main threads (except on, say, every fourth clock), and ALUs are prioritized to the main threads (except on a very long interval, say every 32nd clock).
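Here's a rough sketch of that redefined availability check, using the clock intervals above. bg_available and the unit enum are my own illustrative names, not real Zen internals, and the "democratic" alternation for AGU reads is just one plausible interpretation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum unit { AGU_READ, AGU_WRITE, ALU };

/* When a main thread wants the same free unit, a background thread only
 * wins on its reserved clocks; an uncontested free unit is fair game. */
static bool bg_available(enum unit u, bool unit_free, bool main_wants_it,
                         uint64_t clock)
{
    if (!unit_free)      return false;
    if (!main_wants_it)  return true;         /* no contention */
    switch (u) {
    case AGU_READ:  return (clock & 1) == 0;  /* "democratic": alternate */
    case AGU_WRITE: return clock % 4 == 0;    /* every fourth clock */
    case ALU:       return clock % 32 == 0;   /* every 32nd clock */
    }
    return false;
}

int main(void)
{
    /* demo: which of the first 36 clocks a background op would win a
     * contested AGU write vs. a contested ALU */
    for (uint64_t c = 0; c < 36; c++)
        printf("clk %2llu: agu_write=%d alu=%d\n", (unsigned long long)c,
               bg_available(AGU_WRITE, true, true, c),
               bg_available(ALU, true, true, c));
    return 0;
}
```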
Having a pair of such background threads doesn't significantly impact SMT2 performance, but it does improve utilization, MT throughput, and perf/watt. You get a pair of low-IPC threads (think Pentium 4 class) that handle certain loads very well: FPU-heavy and latency-bound (e.g. waiting on RAM or HDD) threads. Remember that under such loads even big cores have low IPC, because waiting doesn't improve all that much when you make a core larger.
The software side needed to make use of most of these capabilities is terribly simple. The OS simply sets the core affinity of any process without a positive nice value to the main logical cores. So for example, in our hypothetical 4c/16t APU, cores 0-7 are the SMT2 threads and 8-15 the background threads; all non-niced tasks default to core affinity 0-7. Tasks with a nice value of 19 are automatically pinned (taskset) to the background threads (affinity 8-15). The remaining tasks (with positive nice values 1-18) would have no affinity restriction and could jump to any free logical core.
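To illustrate, here's roughly what that policy looks like with the standard Linux affinity calls (sched_setaffinity / CPU_SET). pin_by_nice is a hypothetical helper and the CPU numbering is just the 4c/16t layout from above; a real implementation would sit inside the scheduler, not in a per-process demo like this.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical layout from the post: logical CPUs 0-7 are the main SMT2
 * threads, 8-15 the background threads. */
static void pin_by_nice(pid_t pid)
{
    int nice_val = getpriority(PRIO_PROCESS, pid);  /* sketch: no errno check */
    cpu_set_t set;
    CPU_ZERO(&set);

    if (nice_val <= 0) {
        /* non-niced (and negative-nice) tasks: main SMT2 threads only */
        for (int cpu = 0; cpu <= 7; cpu++)  CPU_SET(cpu, &set);
    } else if (nice_val == 19) {
        /* fully niced tasks: background threads only */
        for (int cpu = 8; cpu <= 15; cpu++) CPU_SET(cpu, &set);
    } else {
        /* nice 1-18: free to land on any logical CPU */
        for (int cpu = 0; cpu <= 15; cpu++) CPU_SET(cpu, &set);
    }
    sched_setaffinity(pid, sizeof(set), &set);
}

int main(void)
{
    pin_by_nice(getpid());  /* demo: apply the policy to this process */
    return 0;
}
```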
The next level of software sophistication would be only slightly more complex and would improve performance for nice values 1-18 under low-load conditions. The next level of hardware sophistication would be the ability for a background thread to swap places with an SMT thread once the SMT thread is known to be idle or stalled for some time (swapping back when load resumes on the real SMT thread).