Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


SteinFG

Senior member
Dec 29, 2021
458
521
106

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
How blurry the shot was should have been a dead giveaway: it doesn't have the noisy fine grain a slipped shot picks up when a partner is testing an ES. What's with the yellow animal?
 

BorisTheBlade82

Senior member
May 1, 2020
667
1,022
136
Look at the picture in your own link. https://i0.wp.com/chipsandcheese.co...23/04/zen4_ring_vs_broadwell_drawio.png?ssl=1

And you just calculated total bandwidth - what matters is the individual bandwidth between ring stops. IFoP bandwidth is in the same league as any other ring traffic. Ring link speed is 32 B per cycle, so at a 4 GHz ring clock that equals 128 GB/s. Zen 4 could also use a double-link IFoP from the CCD to provide that bandwidth from the server IOD.
I see where you are coming from and, as stated above, I am in no way saying that this is not a likely option. Just to point this out: CnC is making educated guesses on that pic as well.
Making the IFoP part of the ring surely makes the topology and therefore the whole layout simpler.
But it also has some disadvantages:
  • One more stop on the ring (or even two because of the wide mode), increasing average hops (see the sketch below).
  • Congestion of the ring. From a topology PoV the IOD traffic sits below the L3 traffic, because data only gets loaded from memory when there is a cache miss. So it does not seem beneficial to have both kinds of traffic on the same level. But as can be seen, that has been done before.
  • Non-uniform latency to memory (almost negligible).
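For what it's worth, a quick back-of-the-envelope sketch of these numbers (Python; the 32 B link width, 4 GHz ring clock, and stop counts are taken from the discussion above, not from any AMD disclosure):

```python
# Per-link bandwidth: 32 B per cycle at an assumed 4 GHz ring clock.
link_bytes_per_cycle = 32
ring_clock_hz = 4e9
print(f"per-link bandwidth: {link_bytes_per_cycle * ring_clock_hz / 1e9:.0f} GB/s")

def avg_hops(num_stops: int) -> float:
    """Average shortest-path hop count between two distinct stops
    on a bidirectional ring."""
    dists = [min(d, num_stops - d) for d in range(1, num_stops)]
    return sum(dists) / len(dists)

# 8 core/L3 stops, plus one or two extra stops for the IFoP.
for stops in (8, 9, 10):
    print(f"{stops} stops -> {avg_hops(stops):.2f} average hops")
```

This prints 128 GB/s per link and shows the average hop count creeping up from ~2.29 (8 stops) to ~2.78 (10 stops) as IFoP stops are added.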
 
Reactions: lightmanek

moinmoin

Diamond Member
Jun 1, 2017
4,975
7,736
136
Several notes:
  • The interesting part about AMD's ring (if it is one) is that unlike Intel's ring bus it is not visible on die shots but appears to be an integral part of the L3$ itself.
  • Being a victim cache, with all cores having their own slices, likely means that for writes the cores write their victim data to the L3$ more directly (without taking the ring bus? Though it is unclear how the slices work when cores are disabled, as that increases the size of the slices each remaining core has write access to).
  • Aside from the slice owned by the core, the whole of the L3$ is only available to read accesses. Accesses go through the L3$ tags, which also cover the L2$ of all cores. Only if that is a miss is RAM accessed. So it would make sense to split the paths there: if it is a hit, use the ring bus to get the data; if it is a miss, forward the request to the IOD/IMC. Note, though, that those L3$ (and L2$ shadow) tags themselves are not centralized but interwoven with the L3$.
Slides for Zen 3:



Source: https://www.slideshare.net/AMD/zen-3-amd-2nd-generation-7nm-x8664-microprocessor-core
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,804
3,268
136
Where are you getting your third point from? I don't believe that to be true, and have said before (Zen 1) that they have buffers facing each slice for handling bursty periods. That would be a significant downgrade.

Also, 1-2 cores per CCD with full-L3 server dies would be completely pointless.
 

naukkis

Senior member
Jun 5, 2002
722
610
136
Several notes:
  • The interesting part about AMD's ring (if it is one) is that unlike Intel's ring bus it is not visible on die shots but appears to be an integral part of the L3$ itself.
  • Being a victim cache, with all cores having their own slices, likely means that for writes the cores write their victim data to the L3$ more directly (without taking the ring bus? Though it is unclear how the slices work when cores are disabled, as that increases the size of the slices each remaining core has write access to).
Both Intel and AMD slice their L3 by lower address bits, so only the 1/8 of L3 accesses whose address bits match will hit the local L3 - the other 7/8 have to go through the interconnect to the L3 slice that owns that address. As the local slice is a bit faster, Intel has put some optimization effort into it - but today they probably more likely want to slow that local L3 access down to match the remote slices, to prevent possible yet-undiscovered side channels.
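A minimal sketch of that slicing, assuming 8 slices selected by the low address bits just above the 64 B line offset (the actual hash function is undisclosed and likely more involved):

```python
import random

CACHE_LINE_BITS = 6  # 64 B cache lines
NUM_SLICES = 8

def home_slice(addr: int) -> int:
    # Pick the slice from the low address bits above the line offset.
    return (addr >> CACHE_LINE_BITS) % NUM_SLICES

# With evenly spread addresses, a core's local slice serves only
# 1/NUM_SLICES of its L3 accesses; the rest cross the interconnect.
samples = 100_000
local = sum(home_slice(random.randrange(1 << 40)) == 0 for _ in range(samples))
print(f"local-slice fraction: {local / samples:.3f} (expected {1 / NUM_SLICES:.3f})")
```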
 

naukkis

Senior member
Jun 5, 2002
722
610
136
I see where you are coming from and, as stated above, I am in no way saying that this is not a likely option. Just to point this out: CnC is making educated guesses on that pic as well.
Making the IFoP part of the ring surely makes the topology and therefore the whole layout simpler.
But it also has some disadvantages:
  • One more stop on the ring (or even two because of the wide mode), increasing average hops.
  • Congestion of the ring. From a topology PoV the IOD traffic sits below the L3 traffic, because data only gets loaded from memory when there is a cache miss. So it does not seem beneficial to have both kinds of traffic on the same level. But as can be seen, that has been done before.
  • Non-uniform latency to memory (almost negligible).
There's only need for one IFoP stop on the ring, because that can saturate even a dual-link IFoP. And a one-link connection makes it possible to feed the whole IFoP bandwidth to one core, which surely has been one design point (anyone benchmarked that yet?).

And there's absolutely no reason to have two simultaneous interconnects to the cores and L3 slices, because those aren't dual-ported. So if a slice/core port isn't used for an L3 access, it's free for an IFoP access. The ring can be saturated by other cores' traffic, but from the core's perspective one interconnect is all that is needed. As the main traffic is L3, separating IOD accesses from L3 traffic won't speed things up at all. And as for memory latency uniformity: a memory request is made only after it has missed L3, so there's nothing to gain here - the ring has to be accessed anyway.
 

BorisTheBlade82

Senior member
May 1, 2020
667
1,022
136
There's only need for one IFoP stop on the ring, because that can saturate even a dual-link IFoP. And a one-link connection makes it possible to feed the whole IFoP bandwidth to one core, which surely has been one design point (anyone benchmarked that yet?).

And there's absolutely no reason to have two simultaneous interconnects to the cores and L3 slices, because those aren't dual-ported. So if a slice/core port isn't used for an L3 access, it's free for an IFoP access. The ring can be saturated by other cores' traffic, but from the core's perspective one interconnect is all that is needed. As the main traffic is L3, separating IOD accesses from L3 traffic won't speed things up at all. And as for memory latency uniformity: a memory request is made only after it has missed L3, so there's nothing to gain here - the ring has to be accessed anyway.
Yep, one core can almost saturate the whole IFoP link; that was tested by C'n'C IIRC. So yep, good point 😃
Regarding latency: yes, the ring needs to be accessed regardless. But if the IFoP were connected to each slice in a hub-and-spoke manner, then RAM latency would be uniform. But as stated before, this would be almost negligible - like 55 ns vs. 60 ns at worst (a rough sketch of that estimate follows below).
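A rough sketch of where that 55 ns vs. 60 ns figure could come from; the per-hop cycle count is a guess for illustration, not a measured value:

```python
ring_clock_hz = 4e9
cycles_per_hop = 5      # assumed ring-hop cost, not a measured value
worst_extra_hops = 4    # farthest vs. nearest stop on an 8-stop ring

extra_ns = worst_extra_hops * cycles_per_hop / ring_clock_hz * 1e9
print(f"worst-case extra ring latency: {extra_ns:.1f} ns "
      f"on top of a ~55 ns DRAM round trip")
```

At a 4 GHz ring clock each cycle is 0.25 ns, so even a handful of cycles per extra hop only adds single-digit nanoseconds to a ~55 ns memory access.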
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Aside from the slice owned by the core, the whole of the L3$ is only available to read accesses. Accesses go through the L3$ tags, which also cover the L2$ of all cores. Only if that is a miss is RAM accessed. So it would make sense to split the paths there: if it is a hit, use the ring bus to get the data; if it is a miss, forward the request to the IOD/IMC. Note, though, that those L3$ (and L2$ shadow) tags themselves are not centralized but interwoven with the L3$.

Sounds complicated, but I am certain that Intel and AMD do L3 caching in a very simple and logically identical way: each core has an L2 miss-handling controller that hashes by address and sends a query to the L3 cache slice that serves that address; if that is a miss, the same controller sends the request to the MC or IOD router on the same ring (see the sketch below).
Where they differ is implementation:
AMD is running a tight ring with few stops, its clock is synchronous with the cores, and it enjoys a great latency/BW advantage as a result.
Intel has more ring stops for IO and the MC, the physical size is larger, and it runs asynchronous clocks that are also negatively impacted by things like E-core clocks. The end result is anemic bandwidth and a latency penalty that negates the advantages of having the MC on the ring.
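A pseudocode-style sketch of that miss flow; all names here (Slice, IOD, handle_l2_miss) are made up for illustration and don't correspond to any real AMD or Intel interface:

```python
NUM_SLICES = 8

class Slice:
    """One L3 slice: holds lines whose address hash maps here."""
    def __init__(self):
        self.lines = {}
    def lookup(self, addr):
        return self.lines.get(addr >> 6)  # tag lookup by line address

class IOD:
    """Stand-in for the memory-controller / IOD stop on the ring."""
    def load_from_memory(self, addr):
        return f"line {addr >> 6:#x} filled from DRAM"

def handle_l2_miss(addr, slices, iod):
    # The core's miss-handling controller hashes by address...
    home = (addr >> 6) % NUM_SLICES
    line = slices[home].lookup(addr)  # ...and queries that slice over the ring.
    if line is not None:
        return line                   # L3 hit: data returns on the ring.
    # L3 miss: the same controller forwards the request to the MC/IOD
    # router sitting on the same ring.
    return iod.load_from_memory(addr)

print(handle_l2_miss(0x12345, [Slice() for _ in range(NUM_SLICES)], IOD()))
```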
 
Reactions: Vattila and Joe NYC

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
Any ideas what Ryzen AI will bring to the table other than what was outlined in today's press release? In future it seems AMD will bring it to more client parts than their initial planning allows for. I await another 'what about Excel' post from Igor.
 

eek2121

Platinum Member
Aug 2, 2005
2,934
4,035
136
Mark Papermaster finally confirmed Hybrid for Zen. I guess PHX2 is immediately incoming.

Didn't read like a confirmation to me. (And PHX2 was not mentioned?)

Please don't misunderstand me: AMD would make me very happy if they rolled out such a design on desktop and laptop.

However, from leaks thus far, it sounds like at least some Zen 5 parts will be the same as Zen 4, with the same core counts and such.
 

CakeMonster

Golden Member
Nov 22, 2012
1,394
503
136
Because it's really now where one size doesn't fit all; we're not even remotely close to that. You're going to have a set of applications that actually are just fine with today's core count configurations because certain software and applications are not rapidly changing. But what you're going to see is that you might need, in some cases, static CPU core counts, but additional acceleration.

Maybe this is a defense of still doing 16c on Zen5 since those cores will no doubt be bigger.

But what you'll also see is more variations of the cores themselves, you'll see high-performance cores mixed with power-efficient cores mixed with acceleration. So where, Paul, we're moving to now is not just variations in core density, but variations in the type of core, and how you configure the cores. It's not only how you've optimized for either performance or energy efficiency, but stacked cache for applications that can take advantage of it, and accelerators that you put around it.

I interpret this as possibly CPUs with both V-cache and E-cores (probably the rumored 'c' cores for AMD). I hope we won't have to deal with a scheduler that has to take three types of cores into account at the same time: high cache, high clock, and high energy efficiency. Probably not though, as we're likely not to see more than two CCXes, at least not in the consumer market.

I think for Zen 5 I'd ideally still get the regular Zen 5 with 16c to avoid the scheduler issues with V-cache (and the iffiness with voltages we're currently seeing). If there's a version with big.LITTLE using Zen 5c small cores, that might be tempting if there are use cases for more cores, since I have much more faith in the scheduling for big.LITTLE working properly after two generations of Intel CPUs with it. However, I've yet to see a case with my current 7950X where I'd have preferred one Zen 4c CCX with more cores for my current use cases.
 

CakeMonster

Golden Member
Nov 22, 2012
1,394
503
136
Any ideas what Ryzen AI will bring to the table other than what was outlined in today's press release? In future it seems AMD will bring it to more client parts than their initial planning allows for. I await another 'what about Excel' post from Igor.
All I'm reading seems to be specialized cases that were probably planned several years ago, not really addressing the big AI boom of the last few months. I suspect Intel and AMD are scrambling to add AI features to their CPU lines right now. Well, by 'right now' I mean they probably spotted the trend long before us, but it still takes many years until it's in a mass-market CPU. I keep thinking back to that interview Ian did with Mike Clark, where he hinted at them working on Z8 in 2021, which is kind of depressing with regard to adding new features.
 