Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 227

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
I think it was working fine in Cinebench. Anywhere else ran the risk of hitting crazy concurrency problems.

But I guess it was AMD's way of preparing us all for a very parallel future: release as broken a product as possible to expose such weaknesses before launching 64C chips.

On the topic of this new chip: I don't have any problems with it, and it will perform great beyond the typical scheduling woes. Zen 5c, even at reduced clocks, is a lot of performance.

Still, let's not underestimate AMD's stupidity like @naukkis did; they can certainly release lazy, ill-conceived products.

Did you read the article you posted? It was basically System Restore and the linker fighting over a bitmap. This was a code/compiler/OS problem that had nothing to do with the hardware. The only thing the 32-core processor did was probably mask the problem enough for it not to be noticed by the programmer initially.
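For readers following along, the failure mode being described (two contexts doing an unsynchronized read-modify-write on a shared bitmap) can be sketched deterministically. The bit values below are purely illustrative, not taken from the article:

```python
# Deterministic sketch of the lost-update race described above.
# Two "threads" each try to set their own bit in a shared bitmap
# via an unsynchronized read-modify-write; one update is lost.

bitmap = 0b00

t1_read = bitmap            # thread 1 reads 0b00
t2_read = bitmap            # thread 2 also reads 0b00, before thread 1 writes back

bitmap = t1_read | 0b01     # thread 1 writes 0b01
bitmap = t2_read | 0b10     # thread 2's stale value clobbers it: 0b10

print(bin(bitmap))          # 0b10 -- thread 1's bit silently vanished
```

With a lock around each read-modify-write, the second writer would have observed 0b01 and produced 0b11; more cores don't cause this bug, they just widen the window in which such interleavings actually happen.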
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Did you read the article you posted? It was basically System Restore and the linker fighting over a bitmap.

And we can very reasonably infer that the fighting was made worse by that Threadripper's bad cache latencies, making it scale worse than it otherwise would have.
So in an era when typical L2-to-L2 latency was 10-20 ns, an abomination with 100+ ns latencies arrived, making all lock contention worse and letting things like false L2 sharing bite much harder.

There were plenty of benchmarks where this chip fell over, lagging behind the 16C Threadripper, but I guess this is the wrong thread to point that out.

Anyway, I stand by my opinion that AMD was completely stupid to unleash such a chip on the workstation market, and sorry for overestimating the signal-to-noise ratio of this forum. It is no longer tolerable to me.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,753
3,975
136
And we can very reasonably infer that the fighting was made worse by that Threadripper's bad cache latencies, making it scale worse than it otherwise would have.
So in an era when typical L2-to-L2 latency was 10-20 ns, an abomination with 100+ ns latencies arrived, making all lock contention worse and letting things like false L2 sharing bite much harder.

There were plenty of benchmarks where this chip fell over, lagging behind the 16C Threadripper, but I guess this is the wrong thread to point that out.

Anyway, I stand by my opinion that AMD was completely stupid to unleash such a chip on the workstation market, and sorry for overestimating the signal-to-noise ratio of this forum. It is no longer tolerable to me.

Bhahaha! He just rage quit! Reminds me of others. But this one was dumb. It was over a trivial argument.
 

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
And we can very reasonably infer that the fighting was made worse by that Threadripper's bad cache latencies, making it scale worse than it otherwise would have.
So in an era when typical L2-to-L2 latency was 10-20 ns, an abomination with 100+ ns latencies arrived, making all lock contention worse and letting things like false L2 sharing bite much harder.

There were plenty of benchmarks where this chip fell over, lagging behind the 16C Threadripper, but I guess this is the wrong thread to point that out.

Anyway, I stand by my opinion that AMD was completely stupid to unleash such a chip on the workstation market, and sorry for overestimating the signal-to-noise ratio of this forum. It is no longer tolerable to me.

Still going to reply.

Yeah, those latencies you state are not a thing. L2 and above is local to the core. If another core's process needs that data, it has to be flushed down the hierarchy. The problem with NUMA domains is that you can have more than one location to evict data to, so extra synchronization has to take place. It's basically like having multiple sockets in the same package with some glue.

I don't think you understand locks either. The article you linked to described a file lock, which is different from a synchronization lock. They work on the same concept but in totally different areas, although you do have to take a synchronization lock to ensure exclusive access to the file lock itself.
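A minimal sketch of the distinction, in Python; the counter, thread counts, and file name are made up for illustration, and the file-lock half assumes a POSIX system where `fcntl` is available:

```python
import fcntl
import os
import tempfile
import threading

# --- synchronization lock: serializes threads within one process ---
counter = 0
counter_lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        with counter_lock:      # without this, the read-modify-write can race
            counter += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                  # 40000: the lock made every increment atomic

# --- file lock: serializes access to a file across *processes* ---
path = os.path.join(tempfile.mkdtemp(), "shared.dat")
with open(path, "w") as f:
    fcntl.flock(f, fcntl.LOCK_EX)   # exclusive advisory lock, visible to other processes
    f.write("exclusive write\n")
    fcntl.flock(f, fcntl.LOCK_UN)
```

The two mechanisms compose, mirroring the post's point: a multithreaded process typically takes its own in-process mutex before touching the cross-process file lock.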

As bad as any of the previous generations were, they were a step in the right direction to where we are today.
 

Noid

Platinum Member
Sep 20, 2000
2,381
190
106
MSI released an undocumented chipset driver, 5.11.02.217, internally
(it is not on the official AMD X570 chipset release download page).

MSI is a late release, I guess... Tom's Hardware says it added support for the 8000 series long ago.
Odd that it has been around for many months, yet it is not noted at AMD.
(I have not found the release notes yet.)

Driver station has this info (not the release notes):

Driver Name | OS | Version
AMD Processor Power Management Support - AMD Ryzen Power Plan | Windows 10/11 (64-bit) | 8.0.0.13
AMD PCI Device Driver | Windows 10/11 (64-bit) | 1.0.0.90
AMD I2C Driver | Windows 10/11 (64-bit) | 1.2.0.124
AMD UART Driver | Windows 10/11 (64-bit) | 1.2.0.116
AMD GPIO2 Driver | Windows 10/11 (64-bit) | 2.2.0.130
PT GPIO Driver | Windows 10 (64-bit) | 3.0.0.0
AMD PSP Driver | Windows 10/11 (64-bit) | 5.25.0.0
AMD IOV Driver | Windows 10/11 (64-bit) | 1.2.0.52
AMD SMBUS Driver | Windows 10/11 (64-bit) | 5.12.0.38
AMD AS4 ACPI Driver | Windows 11 (64-bit) | 1.2.0.46
AMD SFH I2C Driver | Windows 10/11 (64-bit) | 1.0.0.86
AMD SFH Driver | Windows 10/11 (64-bit) | 1.0.0.336
AMD MicroPEP Driver | Windows 10/11 (64-bit) | 1.0.41.0
AMD Wireless Button Driver | Windows 10/11 (64-bit) | 1.0.0.2
AMD PMF-6000Series Driver | Windows 10/11 (64-bit) | 22.0.3.0
AMD PPM Provisioning File Driver | Windows 10/11 (64-bit) | 8.0.0.26
AMD 3D V-Cache Performance Optimizer Driver | Windows 10/11 (64-bit) | 1.0.0.7
AMD AMS Mailbox Driver | Windows 10/11 (64-bit) | 3.0.0.635
AMD S0i3 Filter Driver | Windows 10/11 (64-bit) | 1.0.0.17
AMD CIR Driver | Windows 10 (64-bit) | 3.2.4.135
AMD USB Filter Driver | Windows 11 (64-bit) | 2.1.11.304
AMD USB4 CM Driver | Windows 10 (64-bit) | 1.0.0.37
AMD SFH1.1 Driver | Windows 10/11 (64-bit) | 1.1.0.12
AMD PMF-7040Series Driver | Windows 10/11 (64-bit) | 23.2.3.0
AMD PMF-8000Series Driver | Windows 10/11 (64-bit) | 23.5.9.0
AMD PMF-7736Series Driver | Windows 11 (64-bit) | 23.1.17.0 (no Windows 10 version)
AMD Interface Driver | Windows 10/11 (64-bit) | 2.0.0.14
AMD DRTM Driver | Windows 11 (64-bit) | 1.0.16.4


Submitted by: Fdrsoft (admin) | Submitted on: 09 Nov 2023 | File size: 62.1 MB | Downloads: 2449 | File version: 5.11.02.217 WHQL | File author: Fdrsoft
 
Last edited:

naukkis

Senior member
Jun 5, 2002
726
610
136
imho, I am hoping for a PHX2 CCX * 2 configuration. That way it scales up from PHX2, rather than being a new paradigm.

CCX0: 2x Zen5 + 4x Zen5c
CCX1: 2x Zen5 + 4x Zen5c

I didn't think of that possibility. It's a more sensible configuration than pairing the normal and c-cores in different CCXs. I wonder how today's MT applications scale - will such a configuration hurt applications with 3-4 equally weighted threads (do they exist?), or would it do fine in most applications?
 
Reactions: Tlh97

Hitman928

Diamond Member
Apr 15, 2012
5,423
8,333
136
I didn't think of that possibility. It's a more sensible configuration than pairing the normal and c-cores in different CCXs. I wonder how today's MT applications scale - will such a configuration hurt applications with 3-4 equally weighted threads (do they exist?), or would it do fine in most applications?

That configuration only makes sense if the 4 core frequency is within the 'c' core reach and in the 'c' core part of the curve that is more efficient than the 'p' core. However, if that is the case, then it doesn't make sense to have more than 2 'p' cores to begin with. They would take up additional die space for no added value.
 
Reactions: Tlh97 and StefanR5R

naukkis

Senior member
Jun 5, 2002
726
610
136
That configuration only makes sense if the 4 core frequency is within the 'c' core reach and in the 'c' core part of the curve that is more efficient than the 'p' core. However, if that is the case, then it doesn't make sense to have more than 2 'p' cores to begin with. They would take up additional die space for no added value.

Actually, that's not how today's CPUs boost. Even with only P-cores, the cores differ in their boost frequencies: the prime core is the fastest, and so on. There's probably at least a 1 GHz speed difference between P and C-cores under a 4-thread load. The question I raised was whether that matters - or are just two P-cores sufficient if there aren't many workloads that rely on 3-4 evenly fast threads?
 

coercitiv

Diamond Member
Jan 24, 2014
6,285
12,325
136
It's more sensible configuration than pairing normal and c-cores to different CCX's.
While pairing p and c-cores to different CCXs will surely introduce penalties of sorts, I fail to see how splitting the p cores in two CCXs would be better. From my understanding the whole idea of using smaller/denser cores is to cater to MT workloads, which care more about throughput than latency. Meanwhile most consumer workloads are still comparatively lightly threaded and much more latency sensitive.

With that in mind, why split P cores and wreak havoc in the speed sensitive workloads? Think about browsers or games, which can definitely make use of 4 P cores.

The only scenario where the dense cores should stay close to perf cores would be in power saving, but I really doubt that's the goal of this generation. (power savings are more of an indirect goal here, obtained through better MT efficiency and probably better idle behavior)

PS: also I'm confused about @JoeRambo , that looked like a benign contradictory discussion... almost perfectly normal for the forum. Hope he reconsiders.
 
Last edited:

Hitman928

Diamond Member
Apr 15, 2012
5,423
8,333
136
Actually, that's not how today's CPUs boost. Even with only P-cores, the cores differ in their boost frequencies: the prime core is the fastest, and so on. There's probably at least a 1 GHz speed difference between P and C-cores under a 4-thread load. The question I raised was whether that matters - or are just two P-cores sufficient if there aren't many workloads that rely on 3-4 evenly fast threads?

P-core boost differences are typically very small, especially if all of the cores are in close proximity on the same piece of silicon. That's just max boost, though, which doesn't come into play in this context because you're not hitting max boost clocks past 1-2 cores being loaded.

Then the question becomes: what is the 3-4 core boost frequency of the P-cores? If the C-cores can't hit that same frequency, it makes no sense to have a 2P4C+2P4C split, because you'd have a significant drop-off in performance past 2 cores being loaded. If the C-cores can hit that frequency but are at the end of their frequency range, and thus less efficient than the P-cores in that range, then it makes no sense to have a 2P4C+2P4C split, because you are using more power for no performance improvement.

Additionally, once you move to the 2nd CCX, you are bringing in 2 P-cores that will never boost above a 7-8 core loaded frequency, which the C-cores could easily achieve, so why make them P-cores at all? You are then using more space for no performance or efficiency gain. The proposed configuration makes no sense.
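The arithmetic here can be made concrete with a toy throughput model. All clock numbers below are invented for illustration (they are not AMD specs), IPC is assumed equal across core types, and the split configuration is assumed to keep a job on one CCX to avoid cross-CCX penalties:

```python
# Toy model: aggregate throughput (cores x GHz) as thread count grows,
# comparing a 4P+8C layout against one CCX of a (2P+4C)+(2P+4C) split.
# ASSUMED clocks: P-cores at 4.6 GHz under 3-4 loaded threads, C-cores flat at 3.3 GHz.

P_CLOCK = 4.6
C_CLOCK = 3.3

def throughput(threads, p_cores):
    """Fill P-cores first, then spill the remaining threads onto C-cores."""
    p_used = min(threads, p_cores)
    c_used = threads - p_used
    return p_used * P_CLOCK + c_used * C_CLOCK

for n in range(1, 5):
    big = throughput(n, p_cores=4)      # 4P+8C CCX: all four threads land on P-cores
    split = throughput(n, p_cores=2)    # 2P+4C CCX: only two P-cores before spilling to C
    print(n, round(big, 1), round(split, 1))
```

Under these assumed clocks the two layouts tie at 1-2 threads, and the 4P CCX pulls ahead at 3-4 threads (18.4 vs 15.8 "GHz of throughput" at four threads), which is exactly the drop-off past two loaded cores described above.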
 
Last edited:

naukkis

Senior member
Jun 5, 2002
726
610
136
With that in mind, why split P cores and wreak havoc in the speed sensitive workloads? Think about browsers or games, which can definitely make use of 4 P cores.
Because with such a split, what they get is a 4-core CPU, which would probably suck in games. With 6-core hybrid CCXs they basically get a 12-core CPU divided into dual CCXs - not so far from the 7900X.
 

naukkis

Senior member
Jun 5, 2002
726
610
136
P-core boost differences are typically very small, especially if all of the cores are in close proximity on the same piece of silicon. That's just max boost, though, which doesn't come into play in this context because you're not hitting max boost clocks past 1-2 cores being loaded.

Then the question becomes: what is the 3-4 core boost frequency of the P-cores? If the C-cores can't hit that same frequency, it makes no sense to have a 2P4C+2P4C split, because you'd have a significant drop-off in performance past 2 cores being loaded. If the C-cores can hit that frequency but are at the end of their frequency range, and thus less efficient than the P-cores in that range, then it makes no sense to have a 2P4C+2P4C split, because you are using more power for no performance improvement.

Additionally, once you move to the 2nd CCX, you are bringing in 2 P-cores that will never boost above a 5-6 core loaded frequency, which the C-cores could easily achieve, so why make them P-cores at all? You are then using more space for no performance or efficiency gain. The proposed configuration makes no sense.

You can still split your 4-core load across 4 P-cores even if they are in different CCXs. Actually, you get twice the L3 with split CCXs. The thing is that multithreaded shared-job threads will almost never scale to equally strong threads - there's usually a need for one prime core to run the main thread that spawns the child threads. By splitting the P-cores, both CCXs have equal capability to handle such jobs. An asymmetric CCX configuration is something AMD hasn't done and should never do - non-symmetrical configurations should be avoided because they are really hard for the scheduler to optimize workloads on.
 
Reactions: Schmide

adroc_thurston

Platinum Member
Jul 2, 2023
2,818
4,147
96
is something AMD hasn't done and should never do - non-symmetrical configurations should be avoided because they are really hard for the scheduler to optimize workloads on.
It's very easy to schedule everything there, since anything with a QoS priority gets scheduled onto the big cores and everything else lives in the dense ghetto.
 

naukkis

Senior member
Jun 5, 2002
726
610
136
It's very easy to schedule everything there, since anything with a QoS priority gets scheduled onto the big cores and everything else lives in the dense ghetto.

That scheduling downgrades the "12-core" CPU pretty much to a 4-core CPU. That's the main point against such a braindead splitting of cores.
 

moinmoin

Diamond Member
Jun 1, 2017
4,989
7,758
136
I personally expect the CCX setup still to be balanced.
  • A 4x Zen 5 CCX + 8x Zen 5c CCX would be unbalanced.
  • A 4x Zen 5 CCX + 4x Zen 5c CCX + 4x Zen 5c CCX would be unprecedented with its odd number of CCXs.
  • That leaves only the 2x PHX2-like configuration of (2x Zen 5 + 4x Zen 5c CCX) + (2x Zen 5 + 4x Zen 5c CCX).
One thing to keep in mind is that AMD already indicated that Zen 5c contains more optimizations than Zen 4c does compared to the respective standard core. That may (or may not) make the PHX2 like configuration more suitable for high performance use cases Strix Halo is supposed to excel in.
 

StefanR5R

Elite Member
Dec 10, 2016
5,633
8,107
136
With that in mind, why split P cores and wreak havoc in the speed sensitive workloads?
Because with such a split, what they get is a 4-core CPU, which would probably suck [...]
Maybe there are 8 cores in one CCX, of which half are normal and half are dense (leaving 4 more dense cores for another CCX). Still, a single CCX would be preferable in multithreaded applications with notable inter-thread communication.

You can still split your 4-core load across 4 P-cores even if they are in different CCXs.
And get RAM level latency (and respective energy cost) for inter-thread synchronization. :-(

Actually, you get twice the L3 with split CCXs.
In some ideal workloads, split caches perform almost the same as a unified cache with a size = the sum of the sizes of the split caches.
In other workloads, the effect of split caches, each of size X, is not much better than one unified cache of the same size X.
And then there are workloads which perform worse with split caches of size X than with a unified cache of size X.
Finally, there are workloads which may profit from split caches under lucky circumstances, if these workloads would suffer from the noisy neighbor problem otherwise.

Existing popular operating-system kernels can't figure out which type a given workload is, so they make no attempt to schedule multithreaded processes aligned with cache boundaries.
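On Linux you can do this cache-aware placement manually via CPU affinity. The core IDs below are an assumption for illustration; the real CCX boundaries on a given part should be checked with `lscpu -e` or `/sys/devices/system/cpu/cpu*/cache`:

```python
# Sketch: pin the current process to one CCX's cores so all its threads
# share one L3, since the kernel won't align scheduling to cache boundaries.
import os

CCX0 = {0, 1, 2, 3}   # ASSUMED core IDs of the first CCX; verify on real hardware

if hasattr(os, "sched_setaffinity"):          # Linux-only API
    allowed = CCX0 & os.sched_getaffinity(0)  # intersect with cores we may actually use
    if allowed:
        os.sched_setaffinity(0, allowed)      # 0 = the calling process
        print(sorted(os.sched_getaffinity(0)))
```

This is a blunt instrument (it pins the whole process, threads and all), but it is essentially what cache-topology-aware runtimes do under the hood.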

(PS: Of course smaller split caches are better than bigger unified caches if the workloads aren't sensitive to the splits and if the split helps the CPU designers make these caches very fast. Cf. Zen vs. Skylake-SP and successors.)

--------

AMD are allegedly working on a 16-core dense CCX for Zen 6 server. Maybe they'll have a nice surprise for us with a 12-core CCX for Zen 5 mobile? (OK, sounds too good to be likely.)
 

naukkis

Senior member
Jun 5, 2002
726
610
136
And get RAM level latency (and respective energy cost) for inter-thread synchronization. :-(

This wasn't the point I was replying to. With split CCXs it's still possible to load the 4 P-cores first, rather than being limited to populating one CCX before the other. Cinebench would do just as well with dual 2+4 CCXs as with all the P-cores paired in one CCX. And because of that thread synchronization, a symmetrical configuration with P-cores in both CCXs will be vastly better - just imagine how to split two equal-priority 4-thread jobs across such an asymmetric CCX configuration, since the scheduler should split those jobs so both are given equal resources. If not, something like gaming would be utterly painful when the engine expects similar latencies from both jobs and what it gets is something unpredictable.
 
Last edited:

naukkis

Senior member
Jun 5, 2002
726
610
136
No, it works, and works really well.
Everything not at upper QoS priority lives in the ghetto, and the moment stuff gets relevant it gets promoted to the premium quad.

C-cores should give something like 80% of the P-cores' speed. Dual 2+4 CCXs should be faster than a 4 P-core CCX in multithreaded jobs. With reasonably fast C-cores, the P-cores are there just to give the priority core a speed boost - the way all Android systems have been built for years.

An additional point is cost. Why would AMD spend additional money and resources designing a 4-core CCX just for Strix Point instead of using the CCXs they have already designed?
 

Hitman928

Diamond Member
Apr 15, 2012
5,423
8,333
136
You can still split your 4-core load across 4 P-cores even if they are in different CCXs. Actually, you get twice the L3 with split CCXs. The thing is that multithreaded shared-job threads will almost never scale to equally strong threads - there's usually a need for one prime core to run the main thread that spawns the child threads. By splitting the P-cores, both CCXs have equal capability to handle such jobs. An asymmetric CCX configuration is something AMD hasn't done and should never do - non-symmetrical configurations should be avoided because they are really hard for the scheduler to optimize workloads on.

You get a performance penalty for splitting a job across CCXs in many cases, especially anything latency sensitive. By doing a 2P+4C split, you are only going to be facing performance penalties with no efficiency or density gains.
That scheduling downgrades the "12-core" CPU pretty much to a 4-core CPU. That's the main point against such a braindead splitting of cores.

This makes no sense either. How does a 12 core become a 4 core by doing 4P+8C? You are getting the most possible performance for 1-4 cores. You will only get into performance penalties once you reach >4 core usage but those loads will typically be the least sensitive to multi-CCX environments to begin with. With the 2p+4c CCX configuration, you are guaranteeing performance penalties for anything beyond 2 cores and are wasting die area on 2 additional P-cores.
 
Reactions: Tlh97

adroc_thurston

Platinum Member
Jul 2, 2023
2,818
4,147
96
C-cores should give something like 80% of the P-cores' speed.
No, they're specifically clocked low.
It's a low-power island (sort of; not a real one - a real one comes later).
Dual 2+4 CCXs should be faster than a 4 P-core CCX in multithreaded jobs.
No one cares about that stuff in a premium laptop chip. What are you on?
Why would AMD spend additional money and resources designing a 4-core CCX just for Strix Point instead of using the CCXs they have already designed?
That may sound weird, but the Zen 5 cluster interconnect is designed to be explicitly modular; it goes from 4 to 16 cores.
 

coercitiv

Diamond Member
Jan 24, 2014
6,285
12,325
136
That's the main point against such a braindead splitting of cores.
Language aside, the solution you prefer would result in a permanent performance penalty (to be read as "more often than not"), albeit a relatively smaller one. The asymmetrical solution would result in no penalties for a number of workloads, and rather nasty ones when the software and scheduler have no idea what they're doing. I happen to think the second option is more likely, because AMD believes the cases with optimal resource allocation will outweigh the others.

Just so you know where I stand, I'm no fan of heavy use of efficiency/dense cores in 15W+ consumer-focused products. I don't believe they are worth it outside cost-oriented products (Phoenix 2) or focused usage like the LP E-cores in MTL or the few small cores in Apple products. The idea of asymmetrical CCXs isn't enticing to me, but I really doubt AMD would opt for a performance regression in latency-sensitive workloads just so they can include more cores for MT processing in laptop chips.

That's just my uninformed opinion, though, leaning more on observation of AMD's recent behavior than on engineering insight.
 