Only if it loads 1c/1t pairs
> Doesn't it?

It doesn't, it loads all threads on one socket and only then moves on, and kinda stops scaling after 160t altogether.
> It would be a combination in that case, right?

No, it'll load all 128t on one socket then move on, in 1c/2t increments.
> Explain the benefit of scheduling tasks this way?

It's a single-task, silly-parallel rendering workload, aka the thing you do NOT want to spill out of the socket.
Please just quote your estimate for socket-level SIR2017 bumps for Turin over Genoa (both 96c and 128c) and be done with it.
> I'm asking about the how and why of thread scheduling inside the socket.

It consumes all 128t available on the socket before spilling to the next P.
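The placement order being described (fill each core's two threads, walk every core on the socket, and only spill to the second socket once all 128t are consumed) can be sketched as a toy generator. `placement_order` and its parameters are made-up names for illustration, not a real scheduler API:

```python
def placement_order(sockets=2, cores_per_socket=64, threads_per_core=2):
    """Yield (socket, core, thread) slots in the order described above:
    1c/2t increments across one socket, then spill to the next."""
    for s in range(sockets):
        for c in range(cores_per_socket):
            for t in range(threads_per_core):
                yield (s, c, t)

order = list(placement_order())
print(order[:3])   # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
print(order[128])  # (1, 0, 0): the 129th thread spills to socket 1
```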
> Have your skin in the game.

Why would I have one?
> But it is to the second.

Cinememe doesn't scale to 2p in any relevant way, shape or form.
> Cinememe doesn't scale to 2p in any relevant way, shape or form.

To clarify, SIR = SpecInt Rate?
Just quote your SIR bumps in %.
> To clarify, SIR = SpecInt Rate?

Yes, socket SIR score is how vendors love guiding their things. You see it everywhere: Intel, AMD, ARM, whatever. I.e. Turin is %redacted% faster in SIR, EMR is low teens over SPR, GNR is ~2x EMR, and so on and so forth. Not the be-all-end-all metric, but a useful proxy.

> Yes, socket SIR score is how vendors love guiding their things.

Gotcha.
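Taking the guidance above literally, the quoted ratios can be chained. The 1.12 multiplier standing in for "low teens" is my assumption, not a vendor figure:

```python
# Chaining the quoted socket-level SIR guidance.
spr = 1.00        # SPR as the baseline
emr = spr * 1.12  # "EMR is low teens over SPR" (1.12 is an assumed stand-in)
gnr = emr * 2.0   # "GNR is ~2x EMR"
print(f"GNR is ~{gnr / spr:.2f}x SPR at the socket level")
```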
> Oh, and for similar cores... so 64 cores gets 615 and 60 cores of SR get 495. So at the closest we can bench, it's 25% faster for 7% more Genoa cores.

That's not a perfect apples-to-apples comparison. You'd want to compare single-socket configurations with single-socket, or dual-socket with dual-socket; a dual-socket setup always scores less than 2x a single socket.
[Attachment 86497: benchmark screenshot]
> That's not a perfect apples-to-apples comparison.

Best that I could find. From experience with mine, I am sure the SR chips are far inferior. This and the benchmarks I have posted elsewhere show that as well.
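For what it's worth, normalizing the quoted scores per core (615 across 64 Genoa cores vs 495 across 60 SR cores) separates the headline gap from the per-core gap:

```python
genoa_score, genoa_cores = 615, 64  # figures quoted above
sr_score, sr_cores = 495, 60
total_ratio = genoa_score / sr_score
per_core_ratio = (genoa_score / genoa_cores) / (sr_score / sr_cores)
print(f"total: +{total_ratio - 1:.0%}, per core: +{per_core_ratio - 1:.0%}")
# total: +24%, per core: +16%
```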
So, if the slides are to be believed (and they look quite authentic), the core ends up being much more similar to Alder Lake than I anticipated. But still noticeably fatter.
The biggest unknown for me is how they plan to feed the beast. There are no mentions of any decoder changes; surely that would be an absurd bottleneck if left unchanged?
- The same 12-way 48KB L1 cache as Golden Cove (hopefully without the latency penalty)
- 8-wide dispatch (+2 vs Alder Lake and Zen 4)
- 6 ALUs (+1 vs Alder Lake, +2 vs Zen 4)
- 4 loads / 2 stores per cycle (vs 3/2 for Golden Cove, 2/1 for Zen 4)
  - If I'm reading this right, these are 512-bit (64-byte)? That's a massive uplift from Zen 4 if true (4x the throughput in ideal AVX-512 scenarios)
Anyway, looking forward to comparisons with the Arrow Lake core. In the end, they could end up pretty similar in width, so it would all come down to execution.
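A quick sanity check on the "4x" figure, using the per-cycle load counts from the list above. The 256b (32B) load width for Zen 4 is my assumption based on its 2x256b data paths:

```python
# Peak L1D load bandwidth per cycle, bytes.
zen5 = 4 * 64  # 4 loads/cycle x 512b (64B), per the slide numbers above
zen4 = 2 * 32  # 2 loads/cycle x 256b (32B), assumed for Zen 4
print(zen5 / zen4)  # 4.0
```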
> The same 12-way 48KB L1 cache as Golden Cove (hopefully without the latency penalty)

From the slide, it says 4-cycle L1D (compared to 5 cycles for GLC). So no regression even though they doubled the LD/ST width.
> Does not seem terribly bloated

Wait until you see the ROB and PRF sizes.
> Looking at this again, vs Z4

Do you think you can do us a favor and prepare this data into a table which compares various architectures (e.g. Zen 3, Zen 4, Zen 5, GLC)? Please and thank you!
Does not seem terribly bloated; it would indeed seem akin to the Z2 -> Z3 evolution. The unknowns, however, do seem like the kind of big-ticket items. I think the zero-bubble conditional branch could be tied to the "2 basic block fetch".
- +2 rename/dispatch
- +2 ALUs
- +1 LD/cycle
- 512b FP width
- 64B LD/ST queues
- 48K L1D
- OOO structures increased
- Usual generational architectural improvements scattered around
- New BP with larger BTBs -> zero-bubble conditional branches; sounds like the patent I listed before, where a second BP scans the other conditional branch
- Decode width unknown; doubtful it goes beyond 6-wide, if they increase it at all.
- uop cache unknown
- "2 basic block fetch" --> Does this mean 2x fetch and decode blocks akin to Tremont?
However, a major departure from the Zen 3/4 series is the return to Zen 2-style unified schedulers for INT and FP. Would be interesting to see the latencies with Zen 5.

Low Power core
- Probably the low-power core option is not having 512b FP pipes or 64B LD/ST queues (they mentioned FP 512 variants, which would mean 512b pipes and data structures are not standard across all cores)
- Denser node/efficiency optimized libs as usual
- Cache reduction as usual
- If the 2x basic block fetch is akin to what I described, they could clock gate the second fetch block aggressively for mobile
> Wait until you see the ROB and PRF sizes.

If they implemented something like what is described below in their patents, it won't be as bloated as other designs.
> Do you think you can do us a favor and prepare this data into a table which compares various architectures (e.g. Zen 3, Zen 4, Zen 5, GLC)? Please and thank you!

I have a table from Zen 1 to Zen 4, but Zen 5 has too many unknowns, so I'm not posting it at the moment.