Well, I don't understand the last bit about OS scheduling, and you definitely want to keep the front end close to the L1 and L2 and the execution units. (The retire stage also needs to be close to the front end and L1.) As far as software scheduling goes, I can't see vast changes, only improvements in making the OS more aware of the topology (especially with big+little hybrid cores).
One can add another layer of abstraction so that the OS only sees virtual cores, and let the CPU (with all its branch predictors, prefetchers, internal statistics, and awareness of shared data in the L3) decide how to match the virtual cores to the physical cores. There are pros and cons; as long as there is a way to disable such a mode in the BIOS setup I'd be OK with it.
I agree with moinmoin and Vatilla that widening the core without going beyond SMT2 would cause underutilization of the core. I think they've already gone well beyond the low-hanging fruit for branch prediction, prefetch, and other bottlenecks; I think they're near the point where it's a choice between refraining from widening the core, accepting low utilization, or going beyond SMT2. It wouldn't surprise me if utilization is already on the low side despite all the progress made with the two-tier branch prediction, the prefetchers, the doubled op cache, and the wider reorder window.
3. You don’t *need* more threads because you widened the core; you may want wider decode/dispatch units and better prefetch and speculative execution to keep the execution units busy. [....]
AMD would be better off putting 3 CCXs on a die - there is no reticle limit with chiplets, there are just package size and power limits. More cores is the most efficient way to gain significant MT performance for now.
If there were no need to increase IPC or efficiency, yes, that would be a perfectly fine (and low-cost) way to do it.
The problem is the market is demanding more IPC as well as lower wattage for mobile PCs (the biggest market) and servers. So for servers and high-end ULP laptops, just throwing more cores at it is often not a good solution. In fact, for ULP laptops you want to keep the core count down to a very reasonable number (for power consumption's sake). You might go to the high side on mobile core count if you're able to dynamically power down physical cores.
This ability to power down cores is something that SMT4 could enable for some consumer products like ULP mobile. If a hypothetical future AMD APU (with SMT4-capable cores) senses that it's been at its lowest p-state with low utilization for a while, it can power down half its cores and transition on the fly from SMT2 to SMT4 mode, then power the cores back up and transition back to SMT2 mode once it senses the p-state increasing or higher utilization. (All the while it would appear as an ordinary SMT2 processor to the consumer, e.g. 4c/8t.)
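To make that concrete, here's a minimal user-space-flavoured sketch of such a governor loop. In reality this policy would live in firmware/PM code; read_pstate, read_utilization, enter_smt4_mode, enter_smt2_mode, and all the thresholds are invented placeholders, not anything AMD has shown.

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* --- hypothetical platform hooks, stubbed so this compiles --- */
static int  read_pstate(void)      { return 0; }   /* 0 = lowest p-state */
static int  read_utilization(void) { return 10; }  /* percent, made up   */
static void enter_smt4_mode(void)  { puts("SMT4: half the cores powered down"); }
static void enter_smt2_mode(void)  { puts("SMT2: all cores powered up"); }

int main(void)
{
    bool smt4 = false;
    int  quiet_seconds = 0;

    for (;;) {
        /* "idle" = lowest p-state AND low utilization, per the post */
        bool idle = (read_pstate() == 0 && read_utilization() < 20);
        quiet_seconds = idle ? quiet_seconds + 1 : 0;

        if (!smt4 && quiet_seconds >= 30) {
            /* quiet for a while: power down half the cores and fold the
             * orphaned logical CPUs onto the survivors as SMT4 siblings */
            enter_smt4_mode();
            smt4 = true;
        } else if (smt4 && !idle) {
            /* load is back: power cores up, return to SMT2 */
            enter_smt2_mode();
            smt4 = false;
        }
        sleep(1);
    }
}
```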
As for going SMT4, I think it will be the case for Zen 4. For Zen 3 my hope is that it's SMT2 plus background thread(s).
These hypothetical background threads would have their own dispatch and a primitive scheduler without register renaming (which makes them simple and very close to an in-order execution core). Minimal (if any) L1 and L2 prefetch. No speculative execution, which eliminates the need for branch prediction and checkpoints. Because they now pause at every branch until it resolves, a pair of background threads (4-way MT) is far better than a single background thread (3-way MT): while one is waiting on a branch, the other can keep issuing.
The background thread instructions enter the µop queue just like all other instructions. They're dispatched from the front end, in order, to the µop queue and then to the scheduling pool. From that point they're treated just like SMT threads in the pipeline, so I'll quote it right out of the textbook: in the scheduling pool the instruction is checked by the scheduler against the ready flags in the physical register file (PRF), and when all the operands are ready, the scheduler forwards the instruction to an available execution unit.
The only minor difference is that "available" is redefined for background threads: AGU reads are shared democratically, AGU writes are prioritized to the main threads (except on, say, every fourth clock), and ALUs are prioritized to the main threads (except on a very long interval, say every 32nd clock).
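Here's a rough sketch of that redefined availability check, using the clock intervals above. bg_available and the unit enum are my own illustrative names, not real Zen internals, and the "democratic" alternation for AGU reads is just one plausible interpretation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum unit { AGU_READ, AGU_WRITE, ALU };

/* When a main thread wants the same free unit, a background thread only
 * wins on its reserved clocks; an uncontested free unit is fair game. */
static bool bg_available(enum unit u, bool unit_free, bool main_wants_it,
                         uint64_t clock)
{
    if (!unit_free)      return false;
    if (!main_wants_it)  return true;         /* no contention */
    switch (u) {
    case AGU_READ:  return (clock & 1) == 0;  /* "democratic": alternate */
    case AGU_WRITE: return clock % 4 == 0;    /* every fourth clock */
    case ALU:       return clock % 32 == 0;   /* every 32nd clock */
    }
    return false;
}

int main(void)
{
    /* demo: which of the first 36 clocks a background op would win a
     * contested AGU write vs. a contested ALU */
    for (uint64_t c = 0; c < 36; c++)
        printf("clk %2llu: agu_write=%d alu=%d\n", (unsigned long long)c,
               bg_available(AGU_WRITE, true, true, c),
               bg_available(ALU, true, true, c));
    return 0;
}
```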
Having a pair of such background threads doesn't significantly impact SMT2 performance, but it does improve utilization, MT throughput, and perf/watt. You get a pair of low-IPC threads (think Pentium 4 class) that handle certain loads very well: FPU-heavy and latency-bound (e.g. waiting on RAM or HDD) threads. Remember that under such loads even big cores have low IPC, because waiting doesn't improve all that much when you make a core larger.
The software side needed to make use of most of these capabilities is terribly simple. The OS simply sets the core affinity of any process without a positive nice value to the main logical cores. So for example, in our hypothetical 4c/16t APU, cores 0-7 are the SMT2 threads and 8-15 the background threads; all non-niced tasks default to core affinity 0-7. Tasks with a nice value of 19 are automatically pinned (taskset) to the background threads (affinity 8-15). The remaining tasks (with positive nice values 1-18) would have no affinity restriction and could jump to any free logical core.
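To illustrate, here's roughly what that policy looks like with the standard Linux affinity calls (sched_setaffinity / CPU_SET). pin_by_nice is a hypothetical helper and the CPU numbering is just the 4c/16t layout from above; a real implementation would sit inside the scheduler, not in a per-process demo like this.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical layout from the post: logical CPUs 0-7 are the main SMT2
 * threads, 8-15 the background threads. */
static void pin_by_nice(pid_t pid)
{
    int nice_val = getpriority(PRIO_PROCESS, pid);  /* sketch: no errno check */
    cpu_set_t set;
    CPU_ZERO(&set);

    if (nice_val <= 0) {
        /* non-niced (and negative-nice) tasks: main SMT2 threads only */
        for (int cpu = 0; cpu <= 7; cpu++)  CPU_SET(cpu, &set);
    } else if (nice_val == 19) {
        /* fully niced tasks: background threads only */
        for (int cpu = 8; cpu <= 15; cpu++) CPU_SET(cpu, &set);
    } else {
        /* nice 1-18: free to land on any logical CPU */
        for (int cpu = 0; cpu <= 15; cpu++) CPU_SET(cpu, &set);
    }
    sched_setaffinity(pid, sizeof(set), &set);
}

int main(void)
{
    pin_by_nice(getpid());  /* demo: apply the policy to this process */
    return 0;
}
```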
The next level of software sophistication would be only slightly more complex and would improve performance for nice values 1-18 under low-load conditions. The next level of hardware sophistication would be the ability for a background thread to swap places with an SMT thread once the SMT thread is known to be idle or stalled for some time (swapping back when load resumes on the real SMT thread).