Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 227

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
I think it was working fine in Cinebench. Anywhere else ran the risk of hitting crazy concurrency problems.

But I guess it was AMD's way of preparing us all for a very parallel future: release as broken a product as possible to expose such weaknesses before launching 64C chips.

On the topic of this new chip: I don't have any problems with it, and it will perform great beyond the typical scheduling woes. Zen 5c, even at reduced clocks, is a lot of performance.

Still, let's not underestimate AMD's stupidity like @naukkis did; they can certainly release lazy, ill-conceived products.

Did you read the article you posted? It was basically System Restore and the linker fighting over a bitmap. This was a code/compiler/OS problem that had nothing to do with the hardware. The only thing the 32-core processor did was probably mask the problem enough for it not to be noticed by the programmer initially.
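For readers following along, the failure mode being described (two contexts doing an unsynchronized read-modify-write on a shared bitmap) can be sketched deterministically. The bit values below are purely illustrative, not taken from the article:

```python
# Deterministic sketch of the lost-update race described above.
# Two "threads" each try to set their own bit in a shared bitmap
# via an unsynchronized read-modify-write; one update is lost.

bitmap = 0b00

t1_read = bitmap            # thread 1 reads 0b00
t2_read = bitmap            # thread 2 also reads 0b00, before thread 1 writes back

bitmap = t1_read | 0b01     # thread 1 writes 0b01
bitmap = t2_read | 0b10     # thread 2's stale value clobbers it: 0b10

print(bin(bitmap))          # 0b10 -- thread 1's bit silently vanished
```

With a lock around each read-modify-write, the second writer would have observed 0b01 and produced 0b11; more cores don't cause this bug, they just widen the window in which such interleavings actually happen.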
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Did you read the article you posted? It was basically System Restore and the linker fighting over a bitmap.

And we can very reasonably infer that the fighting was made worse by that Threadripper's bad cache latencies, making it scale worse than it otherwise would have.
So in an era when typical L2-to-L2 latency was 10-20 ns, an abomination with 100+ ns latencies arrived, making all lock contention worse and letting things like false L2 sharing bite much harder.

There were plenty of benchmarks where this chip fell over, lagging behind the 16C Threadripper, but I guess this is the wrong thread to point that out.

Anyway, I stand by my opinion that AMD was completely stupid to unleash such a chip on the workstation market, and sorry for overestimating the signal-to-noise ratio of this forum. It is no longer tolerable to me.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,753
3,975
136
And we can very reasonably infer that the fighting was made worse by that Threadripper's bad cache latencies, making it scale worse than it otherwise would have.
So in an era when typical L2-to-L2 latency was 10-20 ns, an abomination with 100+ ns latencies arrived, making all lock contention worse and letting things like false L2 sharing bite much harder.

There were plenty of benchmarks where this chip fell over, lagging behind the 16C Threadripper, but I guess this is the wrong thread to point that out.

Anyway, I stand by my opinion that AMD was completely stupid to unleash such a chip on the workstation market, and sorry for overestimating the signal-to-noise ratio of this forum. It is no longer tolerable to me.

Bhahaha! He just rage quit! Reminds me of others. But this one was dumb. It was over a trivial argument.
 

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
And we can very reasonably infer that the fighting was made worse by that Threadripper's bad cache latencies, making it scale worse than it otherwise would have.
So in an era when typical L2-to-L2 latency was 10-20 ns, an abomination with 100+ ns latencies arrived, making all lock contention worse and letting things like false L2 sharing bite much harder.

There were plenty of benchmarks where this chip fell over, lagging behind the 16C Threadripper, but I guess this is the wrong thread to point that out.

Anyway, I stand by my opinion that AMD was completely stupid to unleash such a chip on the workstation market, and sorry for overestimating the signal-to-noise ratio of this forum. It is no longer tolerable to me.

Still going to reply.

Yeah, those latencies you state are not a thing. L2 and above is local to the core. If another core's process needs that data, it has to be flushed down the hierarchy. The problem with NUMA domains is that you can have more than one location to evict data to, so extra synchronization has to take place. It's basically like having multiple sockets in the same package with some glue.

I don't think you understand locks either. The article you linked to described a file lock, which is different from a synchronization lock. They work on the same concept but in totally different areas, although you do have to take a synchronization lock to ensure exclusive access to the file lock itself.
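A minimal sketch of the distinction, in Python; the counter, thread counts, and file name are made up for illustration, and the file-lock half assumes a POSIX system where `fcntl` is available:

```python
import fcntl
import os
import tempfile
import threading

# --- synchronization lock: serializes threads within one process ---
counter = 0
counter_lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        with counter_lock:      # without this, the read-modify-write can race
            counter += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                  # 40000: the lock made every increment atomic

# --- file lock: serializes access to a file across *processes* ---
path = os.path.join(tempfile.mkdtemp(), "shared.dat")
with open(path, "w") as f:
    fcntl.flock(f, fcntl.LOCK_EX)   # exclusive advisory lock, visible to other processes
    f.write("exclusive write\n")
    fcntl.flock(f, fcntl.LOCK_UN)
```

The two mechanisms compose, mirroring the post's point: a multithreaded process typically takes its own in-process mutex before touching the cross-process file lock.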

As bad as any of the previous generations were, they were a step in the right direction to where we are today.
 

Noid

Platinum Member
Sep 20, 2000
2,381
190
106
MSI released an undocumented chipset driver, 5.11.02.217, internally
(it is not on the official AMD X570 chipset release download page).

MSI is a late release, I guess... Tom's Hardware says it added support for the 8000 series long ago.
Odd that it has been around for many months, yet it is not noted at AMD.
(I have not found the release notes yet.)

Driver station has this info (not the release notes):

Driver Name | OS | Version
AMD Processor Power Management Support - AMD Ryzen Power Plan | Windows 10/11 (64-bit) | 8.0.0.13
AMD PCI Device Driver | Windows 10/11 (64-bit) | 1.0.0.90
AMD I2C Driver | Windows 10/11 (64-bit) | 1.2.0.124
AMD UART Driver | Windows 10/11 (64-bit) | 1.2.0.116
AMD GPIO2 Driver | Windows 10/11 (64-bit) | 2.2.0.130
PT GPIO Driver | Windows 10 (64-bit) | 3.0.0.0
AMD PSP Driver | Windows 10/11 (64-bit) | 5.25.0.0
AMD IOV Driver | Windows 10/11 (64-bit) | 1.2.0.52
AMD SMBUS Driver | Windows 10/11 (64-bit) | 5.12.0.38
AMD AS4 ACPI Driver | Windows 11 (64-bit) | 1.2.0.46
AMD SFH I2C Driver | Windows 10/11 (64-bit) | 1.0.0.86
AMD SFH Driver | Windows 10/11 (64-bit) | 1.0.0.336
AMD MicroPEP Driver | Windows 10/11 (64-bit) | 1.0.41.0
AMD Wireless Button Driver | Windows 10/11 (64-bit) | 1.0.0.2
AMD PMF-6000Series Driver | Windows 10/11 (64-bit) | 22.0.3.0
AMD PPM Provisioning File Driver | Windows 10/11 (64-bit) | 8.0.0.26
AMD 3D V-Cache Performance Optimizer Driver | Windows 10/11 (64-bit) | 1.0.0.7
AMD AMS Mailbox Driver | Windows 10/11 (64-bit) | 3.0.0.635
AMD S0i3 Filter Driver | Windows 10/11 (64-bit) | 1.0.0.17
AMD CIR Driver | Windows 10 (64-bit) | 3.2.4.135
AMD USB Filter Driver | Windows 11 (64-bit) | 2.1.11.304
AMD USB4 CM Driver | Windows 10 (64-bit) | 1.0.0.37
AMD SFH1.1 Driver | Windows 10/11 (64-bit) | 1.1.0.12
AMD PMF-7040Series Driver | Windows 10/11 (64-bit) | 23.2.3.0
AMD PMF-8000Series Driver | Windows 10/11 (64-bit) | 23.5.9.0
AMD PMF-7736Series Driver | Windows 11 (64-bit) | 23.1.17.0 (no Windows 10 version)
AMD Interface Driver | Windows 10/11 (64-bit) | 2.0.0.14
AMD DRTM Driver | Windows 11 (64-bit) | 1.0.16.4


Submitted by: Fdrsoft (admin) | Submitted on: 09 Nov 2023 | File size: 62.1 MB | Downloads: 2449 | File version: 5.11.02.217 WHQL | File author: Fdrsoft
 
Last edited:

naukkis

Senior member
Jun 5, 2002
726
610
136
imho, I am hoping for a PHX2 CCX * 2 configuration. That way it scales up from PHX2, rather than being a new paradigm.

CCX0: 2x Zen5 + 4x Zen5c
CCX1: 2x Zen5 + 4x Zen5c

I didn't think of that possibility. It's a more sensible configuration than pairing the normal and c-cores in different CCXs. I wonder how today's MT applications scale - will such a configuration hurt applications with 3-4 equally weighted threads (do they exist?), or would it do fine in most applications?
 
Reactions: Tlh97

Hitman928

Diamond Member
Apr 15, 2012
5,423
8,333
136
I didn't think of that possibility. It's a more sensible configuration than pairing the normal and c-cores in different CCXs. I wonder how today's MT applications scale - will such a configuration hurt applications with 3-4 equally weighted threads (do they exist?), or would it do fine in most applications?

That configuration only makes sense if the 4 core frequency is within the 'c' core reach and in the 'c' core part of the curve that is more efficient than the 'p' core. However, if that is the case, then it doesn't make sense to have more than 2 'p' cores to begin with. They would take up additional die space for no added value.
 
Reactions: Tlh97 and StefanR5R

naukkis

Senior member
Jun 5, 2002
726
610
136
That configuration only makes sense if the 4 core frequency is within the 'c' core reach and in the 'c' core part of the curve that is more efficient than the 'p' core. However, if that is the case, then it doesn't make sense to have more than 2 'p' cores to begin with. They would take up additional die space for no added value.

Actually, that's not how today's CPUs boost. Even with only P-cores, the cores differ in their boost frequencies: the prime core is the fastest, and so on. There's probably at least a 1 GHz speed difference between P and C-cores under a 4-thread load. The question I raised was whether that matters - or are just two P-cores sufficient if there aren't many workloads that rely on 3-4 evenly fast threads?
 

coercitiv

Diamond Member
Jan 24, 2014
6,285
12,325
136
It's more sensible configuration than pairing normal and c-cores to different CCX's.
While pairing p and c-cores to different CCXs will surely introduce penalties of sorts, I fail to see how splitting the p cores in two CCXs would be better. From my understanding the whole idea of using smaller/denser cores is to cater to MT workloads, which care more about throughput than latency. Meanwhile most consumer workloads are still comparatively lightly threaded and much more latency sensitive.

With that in mind, why split P cores and wreak havoc in the speed sensitive workloads? Think about browsers or games, which can definitely make use of 4 P cores.

The only scenario where the dense cores should stay close to perf cores would be in power saving, but I really doubt that's the goal of this generation. (power savings are more of an indirect goal here, obtained through better MT efficiency and probably better idle behavior)

PS: also I'm confused about @JoeRambo , that looked like a benign contradictory discussion... almost perfectly normal for the forum. Hope he reconsiders.
 
Last edited:

Hitman928

Diamond Member
Apr 15, 2012
5,423
8,333
136
Actually, that's not how today's CPUs boost. Even with only P-cores, the cores differ in their boost frequencies: the prime core is the fastest, and so on. There's probably at least a 1 GHz speed difference between P and C-cores under a 4-thread load. The question I raised was whether that matters - or are just two P-cores sufficient if there aren't many workloads that rely on 3-4 evenly fast threads?

P-core boost differences are typically very small, especially if all of the cores are in close proximity on the same piece of silicon. That's just max boost, though, which doesn't come into play in this context because you're not hitting max boost clocks past 1-2 cores being loaded.

Then the question becomes: what is the 3-4 core boost frequency of the P-cores? If the C-cores can't hit that same frequency, it makes no sense to have a 2P4C+2P4C split, because you'd have a significant drop-off in performance past 2 cores being loaded. If the C-cores can hit that frequency but are at the end of their frequency range, and thus less efficient than the P-cores in that range, then it makes no sense to have a 2P4C+2P4C split, because you are using more power for no performance improvement.

Additionally, once you move to the 2nd CCX, you are bringing in 2 P-cores that will never boost above a 7-8 core loaded frequency, which the C-cores could easily achieve, so why make them P-cores at all? You are then using more space for no performance or efficiency gain. The proposed configuration makes no sense.
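The arithmetic here can be made concrete with a toy throughput model. All clock numbers below are invented for illustration (they are not AMD specs), IPC is assumed equal across core types, and the split configuration is assumed to keep a job on one CCX to avoid cross-CCX penalties:

```python
# Toy model: aggregate throughput (cores x GHz) as thread count grows,
# comparing a 4P+8C layout against one CCX of a (2P+4C)+(2P+4C) split.
# ASSUMED clocks: P-cores at 4.6 GHz under 3-4 loaded threads, C-cores flat at 3.3 GHz.

P_CLOCK = 4.6
C_CLOCK = 3.3

def throughput(threads, p_cores):
    """Fill P-cores first, then spill the remaining threads onto C-cores."""
    p_used = min(threads, p_cores)
    c_used = threads - p_used
    return p_used * P_CLOCK + c_used * C_CLOCK

for n in range(1, 5):
    big = throughput(n, p_cores=4)      # 4P+8C CCX: all four threads land on P-cores
    split = throughput(n, p_cores=2)    # 2P+4C CCX: only two P-cores before spilling to C
    print(n, round(big, 1), round(split, 1))
```

Under these assumed clocks the two layouts tie at 1-2 threads, and the 4P CCX pulls ahead at 3-4 threads (18.4 vs 15.8 "GHz of throughput" at four threads), which is exactly the drop-off past two loaded cores described above.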
 
Last edited:

naukkis

Senior member
Jun 5, 2002
726
610
136
With that in mind, why split P cores and wreak havoc in the speed sensitive workloads? Think about browsers or games, which can definitely make use of 4 P cores.
Because with such a split, what they get is a 4-core CPU, which would probably suck in games. With 6-core hybrid CCXs they basically get a 12-core CPU divided into dual CCXs - not so far from the 7900X.
 

naukkis

Senior member
Jun 5, 2002
726
610
136
P-core boost differences are typically very small, especially if all of the cores are in close proximity on the same piece of silicon. That's just max boost, though, which doesn't come into play in this context because you're not hitting max boost clocks past 1-2 cores being loaded.

Then the question becomes: what is the 3-4 core boost frequency of the P-cores? If the C-cores can't hit that same frequency, it makes no sense to have a 2P4C+2P4C split, because you'd have a significant drop-off in performance past 2 cores being loaded. If the C-cores can hit that frequency but are at the end of their frequency range, and thus less efficient than the P-cores in that range, then it makes no sense to have a 2P4C+2P4C split, because you are using more power for no performance improvement.

Additionally, once you move to the 2nd CCX, you are bringing in 2 P-cores that will never boost above a 5-6 core loaded frequency, which the C-cores could easily achieve, so why make them P-cores at all? You are then using more space for no performance or efficiency gain. The proposed configuration makes no sense.

You can still split your 4-core load across 4 P-cores even if they are in different CCXs. Actually, you get twice the L3 with split CCXs. The thing is that multithreaded shared-job threads will almost never scale to equally strong threads - there's usually a need for one prime core to run the main thread that spawns the child threads. By splitting the P-cores, both CCXs have equal capability to handle such jobs. An asymmetric CCX configuration is something AMD hasn't done and should never do - non-symmetrical configurations should be avoided because they are really hard for the scheduler to optimize workloads on.
 
Reactions: Schmide

adroc_thurston

Platinum Member
Jul 2, 2023
2,818
4,147
96
is something AMD hasn't done and should never do - non-symmetrical configurations should be avoided because they are really hard for the scheduler to optimize workloads on.
It's very easy to schedule everything there, since anything with a QoS priority gets scheduled onto the big cores and everything else lives in the dense ghetto.
 

naukkis

Senior member
Jun 5, 2002
726
610
136
It's very easy to schedule everything there, since anything with a QoS priority gets scheduled onto the big cores and everything else lives in the dense ghetto.

That scheduling downgrades the "12-core" CPU pretty much to a 4-core CPU. That's the main point against such a braindead splitting of cores.
 

moinmoin

Diamond Member
Jun 1, 2017
4,989
7,758
136
I personally expect the CCX setup still to be balanced.
  • A 4x Zen 5 CCX + 8x Zen 5c CCX would be unbalanced.
  • A 4x Zen 5 CCX + 4x Zen 5c CCX + 4x Zen 5c CCX would be unprecedented with its odd number of CCXs.
  • That leaves only the 2x PHX2-like configuration of (2x Zen 5 + 4x Zen 5c CCX) + (2x Zen 5 + 4x Zen 5c CCX).
One thing to keep in mind is that AMD already indicated that Zen 5c contains more optimizations than Zen 4c does compared to the respective standard core. That may (or may not) make the PHX2 like configuration more suitable for high performance use cases Strix Halo is supposed to excel in.
 

StefanR5R

Elite Member
Dec 10, 2016
5,633
8,107
136
With that in mind, why split P cores and wreak havoc in the speed sensitive workloads?
Because with such a split, what they get is a 4-core CPU, which would probably suck [...]
Maybe there are 8 cores in one CCX, of which half are normal and half are dense (leaving 4 more dense cores for another CCX). Still, a single CCX would be preferable in multithreaded applications with notable inter-thread communication.

You can still split your 4-core load across 4 P-cores even if they are in different CCXs.
And get RAM level latency (and respective energy cost) for inter-thread synchronization. :-(

Actually, you get twice the L3 with split CCXs.
In some ideal workloads, split caches perform almost the same as a unified cache with a size = the sum of the sizes of the split caches.
In other workloads, the effect of split caches, each of size X, is not much better than one unified cache of the same size X.
And then there are workloads which perform worse with split caches of size X than with a unified cache of size X.
Finally, there are workloads which may profit from split caches under lucky circumstances, if these workloads would suffer from the noisy neighbor problem otherwise.

Existing popular operating-system kernels can't figure out which type a given workload is, so they make no attempt to schedule multithreaded processes aligned with cache boundaries.
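On Linux you can do this cache-aware placement manually via CPU affinity. The core IDs below are an assumption for illustration; the real CCX boundaries on a given part should be checked with `lscpu -e` or `/sys/devices/system/cpu/cpu*/cache`:

```python
# Sketch: pin the current process to one CCX's cores so all its threads
# share one L3, since the kernel won't align scheduling to cache boundaries.
import os

CCX0 = {0, 1, 2, 3}   # ASSUMED core IDs of the first CCX; verify on real hardware

if hasattr(os, "sched_setaffinity"):          # Linux-only API
    allowed = CCX0 & os.sched_getaffinity(0)  # intersect with cores we may actually use
    if allowed:
        os.sched_setaffinity(0, allowed)      # 0 = the calling process
        print(sorted(os.sched_getaffinity(0)))
```

This is a blunt instrument (it pins the whole process, threads and all), but it is essentially what cache-topology-aware runtimes do under the hood.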

(PS: Of course smaller split caches are better than bigger unified caches if the workloads aren't sensitive to the splits and if the split helps the CPU designers make these caches very fast. Cf. Zen vs. Skylake-SP and successors.)

--------

AMD are allegedly working on a 16-core dense CCX for Zen 6 server. Maybe they'll have a nice surprise for us with a 12-core CCX for Zen 5 mobile? (OK, sounds too good to be likely.)
 

naukkis

Senior member
Jun 5, 2002
726
610
136
And get RAM level latency (and respective energy cost) for inter-thread synchronization. :-(

This wasn't the point I was replying to. With split CCXs it's still possible to load the 4 P-cores first, rather than being limited to populating one CCX before the other. Cinebench would do just as well with dual 2+4 CCXs as with all the P-cores paired in one CCX. And because of that thread synchronization, a symmetrical configuration with P-cores in both CCXs will be vastly better - just imagine how to split two equal-priority 4-thread jobs across such an asymmetric CCX configuration, since the scheduler should split those jobs so both are given equal resources. If not, something like gaming would be utterly painful when the engine expects similar latencies from both jobs and what it gets is something unpredictable.
 
Last edited:

naukkis

Senior member
Jun 5, 2002
726
610
136
No, it works, and works really well.
Everything not at upper QoS priority lives in the ghetto, and the moment stuff gets relevant it gets promoted to the premium quad.

C-cores should give something like 80% of the P-cores' speed. Dual 2+4 CCXs should be faster than a 4 P-core CCX in multithreaded jobs. With reasonably fast C-cores, the P-cores are there just to give the priority core a speed boost - the way all Android systems have been built for years.

An additional point is cost. Why would AMD spend additional money and resources designing a 4-core CCX just for Strix Point instead of using the CCXs they have already designed?
 

Hitman928

Diamond Member
Apr 15, 2012
5,423
8,333
136
You can still split your 4-core load across 4 P-cores even if they are in different CCXs. Actually, you get twice the L3 with split CCXs. The thing is that multithreaded shared-job threads will almost never scale to equally strong threads - there's usually a need for one prime core to run the main thread that spawns the child threads. By splitting the P-cores, both CCXs have equal capability to handle such jobs. An asymmetric CCX configuration is something AMD hasn't done and should never do - non-symmetrical configurations should be avoided because they are really hard for the scheduler to optimize workloads on.

You get a performance penalty for splitting a job across CCXs in many cases, especially anything latency sensitive. By doing a 2P+4C split, you are only going to be facing performance penalties with no efficiency or density gains.
That scheduling downgrades the "12-core" CPU pretty much to a 4-core CPU. That's the main point against such a braindead splitting of cores.

This makes no sense either. How does a 12 core become a 4 core by doing 4P+8C? You are getting the most possible performance for 1-4 cores. You will only get into performance penalties once you reach >4 core usage but those loads will typically be the least sensitive to multi-CCX environments to begin with. With the 2p+4c CCX configuration, you are guaranteeing performance penalties for anything beyond 2 cores and are wasting die area on 2 additional P-cores.
 
Reactions: Tlh97

adroc_thurston

Platinum Member
Jul 2, 2023
2,818
4,147
96
C-cores should give something like 80% of the P-cores' speed.
No, they're specifically clocked low.
It's a low-power island (sort of; not a real one - a real one comes later).
Dual 2+4 CCXs should be faster than a 4 P-core CCX in multithreaded jobs.
No one cares about that stuff in a premium laptop chip. What are you on?
Why would AMD spend additional money and resources designing a 4-core CCX just for Strix Point instead of using the CCXs they have already designed?
That may sound weird, but the Zen 5 cluster interconnect is designed to be explicitly modular; it goes from 4 to 16 cores.
 

coercitiv

Diamond Member
Jan 24, 2014
6,285
12,325
136
That's the main point against such a braindead splitting of cores.
Language aside, the solution you prefer would result in a permanent performance penalty (to be read as "more often than not"), albeit a relatively smaller one. The asymmetrical solution would result in no penalties for a number of workloads, and rather nasty ones when the software and scheduler have no idea what they're doing. I happen to think the second option is more likely, because AMD believes the cases with optimal resource allocation will outweigh the others.

Just so you know where I stand, I'm no fan of heavy use of efficiency/dense cores in 15W+ consumer-focused products. I don't believe they are worth it outside cost-oriented products (Phoenix 2) or focused usage like the LP E-cores in MTL or the few small cores in Apple products. The idea of asymmetrical CCXs isn't enticing to me, but I really doubt AMD would opt for a performance regression in latency-sensitive workloads just so they can include more cores for MT processing in laptop chips.

That's just my uninformed opinion, though, leaning more on observation of AMD's recent behavior than on engineering insight.
 