Info: 64MB V-Cache on 5XXX Zen3, Average +15% in Games


Kedas

Senior member
Dec 6, 2018
355
339
136
Well, we now know how they will bridge the long wait to Zen 4 on AM5 in Q4 2022.
Production start for V-Cache is at the end of this year, which is too early for Zen 4, so this is certainly coming to AM4.
The +15% is, as Lisa said, "like an entire architectural generation"
 
Reactions: Tlh97 and Gideon

DrMrLordX

Lifer
Apr 27, 2000
21,709
10,983
136
The only things I really miss about Win11 at this point are the two-click path to get to the program list (which you can't fix) and the way it groups taskbar items instead of letting them have their own tabs.

So um

About Zen3D

Anyone else find it funny that we're already seeing pre-orders for Milan-X EPYC but nothing for Vermeer-X?
 
Reactions: Joe NYC

Joe NYC

Platinum Member
Jun 26, 2021
2,072
2,585
106
Anyone else find it funny that we're already seeing pre-orders for Milan-X EPYC but nothing for Vermeer-X?

Just speculation, but I think AMD may be taking a hard turn to Milan-X. There was a cost-analysis article I came across recently showing the SRAM to be super cheap to produce (as I have been saying).

Also, I am not sure whether AMD is going to switch to the B2 stepping for regular Milan or make it Milan-X only.

As I said, speculation only, but it seems that Google and Microsoft want to jump on Confidential Computing, yet it has so far been available only in a pre-release state. Maybe it needs the B2 stepping for full release. And maybe AMD will say B2 comes only with Milan-X.

Milan-X is going to carry a bit of a price increase, so maybe AMD really wants to make hay while the sun shines...

As far as Vermeer-X goes, I guess we will hear something in 4 days, but there are no leaks at all, as you said. Nothing. No hints coming from any reviewers, so I don't think they have it. A January 4 release looks to be very much in doubt. It is probably going to continue to be treated like a stepchild...
 

DrMrLordX

Lifer
Apr 27, 2000
21,709
10,983
136
Just speculation, but I think AMD may be taking a hard turn to Milan-X.

That is possible. It sort of depends on how many of the B2-stepping dice are flexible enough to bin for either Vermeer-X or Milan-X. Vermeer-X isn't going to make them anywhere near as much money as Milan-X. If Raphael is indeed going to be announced/launched at CES, AMD doesn't have long to wait before burying Alder Lake and Raptor Lake in one shot (like they even care; Vermeer is surprisingly competitive with Alder Lake as it is).
 
Reactions: Tlh97 and Joe NYC

Joe NYC

Platinum Member
Jun 26, 2021
2,072
2,585
106
That is possible. It sort of depends on how many of the B2-stepping dice are flexible enough to bin for either Vermeer-X or Milan-X. Vermeer-X isn't going to make them anywhere near as much money as Milan-X.

I think the B2 stepping will improve the binning a little bit, say by just 100 MHz; then the clock-speed binning AMD has been offering becomes almost a non-issue.

Also, with half or more of the cores disabled in the 16- and 32-core SKUs, finding 2 or 4 really good cores out of 8 should not be a problem at all.

I think AMD has an easy trajectory to reach, or even exceed, 25% server market share by mid-2022, so that is the priority.

If, as some people (namely MLID) say, it is the substrate that is the bottleneck rather than silicon from TSMC, then shifting capacity from Vermeer to Milan would mean selling more silicon, and more expensive silicon, per substrate used.

If Raphael is indeed going to be announced/launched at CES, AMD doesn't have long to wait before burying Alder Lake and Raptor Lake in one shot (like they even care; Vermeer is surprisingly competitive with Alder Lake as it is).

I think Raphael may at best be teased and officially put back on the roadmap (after it disappeared from it).

Absolutely zero leaks from anywhere, including mobo makers, means no chance of a Raphael launch anywhere close to CES.

Vermeer-X does not need any platform changes, so even with zero leaks it is still possible Vermeer-X could get launched, maybe with a shipping date shortly after CES... That would be within the realm of possibility.
 

biostud

Lifer
Feb 27, 2003
18,281
4,806
136
I think they'll roll out B2 to all lines; for EPYC and Threadripper it will be B2 only, so they will not need to validate different steppings for server/professional use. B2 will probably silently take over Ryzen as the older stepping goes out of production/stock.
 
Reactions: Tlh97 and Joe NYC

tomatosummit

Member
Mar 21, 2019
184
177
116
I think the B2 stepping will improve the binning a little bit, say by just 100 MHz; then the clock-speed binning AMD has been offering becomes almost a non-issue.
*redacted*

I think Raphael may at best be teased and officially put back on the roadmap (after it disappeared from it).
Regarding B2 clocks, whether attributed to the stepping or just another year of process maturity, it'll depend more on what clocks the top stacked-cache part will have. If the 6950X3D can still hit 5 or 4.9 GHz then that'll be the limit, and you can probably drop the rest of the line in increments from there. It would be a tragedy if the halo part has to drop clocks.
Another question is whether the non-stacked parts will have the same or slightly reduced peak turbos.
Not that I think it matters too much; the all-core clock speeds will be more telling. Non-stacked parts could have a higher sustained all-core clock, for example.

As for Raph, there was the Gigabyte leak, which revealed the existence of the 6?-series motherboards, so the engineering information is out there.
But it's been said many times at this point that AMD has gotten very good at not leaking much, and that's especially true for performance metrics.

Regardless, Raphael is in a troublesome spot. Its biggest competition is going to be Vermeer-X, so revealing it too early would be damaging to their product cycle, no matter how much people like us are craving any information.
Also, will Raphael have a stacked-cache option at or very close to launch? Without 3D cache it might not fully outperform Vermeer-X. There are only questions for Raph, and keeping it all under wraps is the best option for now.
 

biostud

Lifer
Feb 27, 2003
18,281
4,806
136
The last official word about B2 says nothing about higher max clocks, but it might be able to sustain all-core turbo boost at max frequency for longer periods of time.
I really don't think B2 is anything groundbreaking.
 
Reactions: Tlh97 and Joe NYC

Joe NYC

Platinum Member
Jun 26, 2021
2,072
2,585
106
Regarding B2 clocks, whether attributed to the stepping or just another year of process maturity, it'll depend more on what clocks the top stacked-cache part will have. If the 6950X3D can still hit 5 or 4.9 GHz then that'll be the limit, and you can probably drop the rest of the line in increments from there. It would be a tragedy if the halo part has to drop clocks.
Another question is whether the non-stacked parts will have the same or slightly reduced peak turbos.
Not that I think it matters too much; the all-core clock speeds will be more telling. Non-stacked parts could have a higher sustained all-core clock, for example.

Well, there is a good reason for that. Without V-Cache, on a cache miss, the core is idle at high clock speed, doing nothing and keeping cool, but with V-Cache there will be fewer cache misses, so the core is doing more work and generating more heat.

So the work done is a more important metric than all-core clock speed, because comparing the clock speed of a core without and with V-Cache is comparing apples with oranges.


Regardless, Raphael is in a troublesome spot. Its biggest competition is going to be Vermeer-X, so revealing it too early would be damaging to their product cycle, no matter how much people like us are craving any information.
Also, will Raphael have a stacked-cache option at or very close to launch? Without 3D cache it might not fully outperform Vermeer-X. There are only questions for Raph, and keeping it all under wraps is the best option for now.

Also, as far as the Osborne effect goes, it is Vermeer vs. Vermeer-X, because they are actually swappable.

As far as Raphael without V-Cache vs. Vermeer-X goes, my guess would be that Raphael would, on average, outperform Vermeer-X. But Vermeer-X will win some...

I am not sure how much of an Osborne effect there would be between the two. Raphael is a whole new platform, with a higher cost than a Vermeer-X upgrade for people who may already have the mobo and RAM.

The V-Cache option for Raphael, I think, will depend on when Raphael launches. If it launches late in the year, V-Cache may be available at launch. If it comes out early, say mid-year, then probably not...
 
Reactions: Tlh97 and Vattila

Joe NYC

Platinum Member
Jun 26, 2021
2,072
2,585
106
Doesn't sound like it. Depending on when Zen 5 launches it could be a mid cycle refresh.

I think it depends more on whether the technical challenges have been solved and whether it is manufacturable.

First, there is the TSMC timeline, which says N5 will become available for stacking, as the bottom die, only in H2 2022.

Then, I think there is another challenge in getting more than one layer of cache to work. I am guessing that AMD will want to have that feature/option for Zen 4 processors...

As a different way to fix the broken Moore's Law...
 

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
Well, there is a good reason for that. Without V-Cache, on a cache miss, the core is idle at high clock speed, doing nothing and keeping cool, but with V-Cache there will be fewer cache misses, so the core is doing more work and generating more heat.

On an OoO processor, the core is never idle. It may push latency into some sequences, but you would have to work hard, custom-purposefully-bad-code hard, to make a stall that long with such a large register file (Zen: 168 entries).
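
For anyone curious what that kind of purposefully bad code looks like, here is a minimal sketch (a toy of my own, with arbitrary buffer and iteration counts): a dependent pointer chase through a randomized buffer bigger than any L3, so nearly every step is a miss and the next load can't even issue until the previous one comes back.

Code:
/* chase.c - dependent pointer chase, one load at a time.
 * Build: gcc -O2 chase.c -o chase
 * 128 MiB working set, so even a 96 MiB L3 can't hold it. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ELEMS (128u * 1024 * 1024 / sizeof(size_t))   /* ~16M pointers */
#define ITERS (16u * 1024 * 1024)

int main(void)
{
    size_t *buf = malloc(ELEMS * sizeof *buf);
    if (!buf) return 1;

    /* Sattolo shuffle: one big cycle, so the chase visits the whole buffer
     * and the prefetchers get no usable pattern. rand() bias is fine here. */
    for (size_t i = 0; i < ELEMS; i++) buf[i] = i;
    for (size_t i = ELEMS - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    size_t idx = 0;
    for (size_t i = 0; i < ITERS; i++)
        idx = buf[idx];        /* each load depends on the previous one */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg %.1f ns per load (checksum %zu)\n", ns / ITERS, idx);

    free(buf);
    return 0;
}

The serialized chain reports roughly full DRAM latency per step, which is the point: the backend has nothing else from this thread to overlap with the miss. Normal code almost never looks like this.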
 

Kedas

Senior member
Dec 6, 2018
355
339
136
I'm pretty sure B2 is mainly a needed optimization for V-Cache support.
And they probably also took the opportunity to fix a few things in the microcode that needed fixing; anything more than that and I would be surprised.

I'm even a bit surprised that the B2 stepping isn't out there yet; maybe it is coming soon, or people just didn't notice. Zen 3 production must have switched to B2 some time before they started making the V-Cache versions last year.

edit: never mind, here it is: https://min.news/en/tech/b43fdca64913b7776ea173b95ec971fa.html

edit2: why not 6nm? Because TSMC 6nm doesn't support V-Cache...
 
Reactions: Schmide and Joe NYC

Kedas

Senior member
Dec 6, 2018
355
339
136
Kind of a head-scratcher since TSMC is mostly pushing their N7 customers on to N6 where possible. But Vermeer/Milan will stay N7, even with stacked cache.
Well, it increases wafer production capacity by 20% for TSMC (hence the push), and on top of that customers get about 15% more dies per wafer, so APUs on 6nm are an easy choice.
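Back-of-the-envelope, if you take both of those figures at face value and let them compound: 1.20 × 1.15 ≈ 1.38, so roughly 38% more dies out of the same litho capacity once a product moves from N7 to N6. Rough numbers on my part, but it shows why TSMC keeps nudging customers over.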
Zen4 release will mostly be determined by TSMC 5nm V-cache stacking support.
 
Reactions: Joe NYC

DrMrLordX

Lifer
Apr 27, 2000
21,709
10,983
136
Well, it increases wafer production capacity by 20% for TSMC (hence the push), and on top of that customers get about 15% more dies per wafer, so APUs on 6nm are an easy choice.

Makes sense, hence my confusion about all Vermeer products staying on N7. And where did you hear that N6 didn't support stacked cache?
 

Mopetar

Diamond Member
Jan 31, 2011
7,936
6,233
136
Well, there is a good reason for that. Without V-Cache, on a cache miss, the core is idle at high clock speed, doing nothing. . . .

If that occurs and the pipeline is completely stalled due to dependencies, it will just start utilizing SMT. The core isn't really idle either. It's still running and just inserting NOPs because it doesn't know at the time how long it will need to stall because it can't know how bad the cache miss is until it's worked its way through to that point. Either it runs the hyper-thread while that's being worked out or it tries to execute other instructions that don't have dependencies.

It may be generating less heat just because control lines are being forced to 0 and as a result fewer transistors are switching, but most of that logic is still operating, it's just that the results are being discarded. The only way for it to idle is for the core to communicate the stall to the OS scheduler and for that to down-clock the core. Of course it could just load another thread and start that if it knows the other thread will be stalled for a while.
 

Kedas

Senior member
Dec 6, 2018
355
339
136
Makes sense, hence my confusion about all Vermeer products staying on N7. And where did you hear that N6 didn't support stacked cache?
I haven't seen it stated as 'not supported', but the half nodes are missing when they list 3, 5, 7.
That could be carelessness, or it could be correct that the half nodes don't support stacking.
 

Doug S

Platinum Member
Feb 8, 2020
2,318
3,663
136
I haven't seen it stated as 'not supported', but the half nodes are missing when they list 3, 5, 7.
That could be carelessness, or it could be correct that the half nodes don't support stacking.

TSMC considers nodes like N6 and N4 (which are NOT true half nodes) to be optimizations of N7 and N5, respectively. So it is quite possible that support for stacking on N7 means it will also support N6.
 

Kedas

Senior member
Dec 6, 2018
355
339
136
TSMC considers nodes like N6 and N4 (which are NOT true half nodes) to be optimizations of N7 and N5, respectively. So it is quite possible that support for stacking on N7 means it will also support N6.
Yes, but the thing is that it is also a shrink, and for stacking the die size is important, hence extra work that would need to be done again, so they may have decided to skip that part (certainly since the number of stacking requests is limited for now).
 

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
It's still running and just inserting NOPs because it doesn't know at the time how long it will need to stall because it can't know how bad the cache miss is until it's worked its way through to that point.

NOPs are actual instructions and although they were designed to do no operation, they do take up space and used to have the effect of incrementing the instruction pointer. They are superfluous today because they are removed by the decoder and are not inserted into the pipeline. Moreover, by the time the pipeline gets to the point where there is any possibility of a complete stall, everything is in the μop cache anyway.

Regardless, with deep pipelines where instructions have latencies of 14+ clocks and register files with 150+ entries, there are plenty of things to do even with a 60+ clock trip to memory.
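
To put a toy number on that (my own sizes, nothing rigorous): take the same pointer-chase idea but interleave a few independent chains. Each chain is still miss-bound on its own, but since they don't depend on each other the OoO core can keep several misses in flight, and the measured time per step drops accordingly.

Code:
/* chains.c - several independent pointer chases in one loop.
 * Build: gcc -O2 chains.c -o chains
 * Compare CHAINS = 1 vs. 4: time per load shrinks as the core
 * overlaps the misses, even though each load still goes to DRAM. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ELEMS  (64u * 1024 * 1024 / sizeof(size_t))   /* 64 MiB working set */
#define ITERS  (8u * 1024 * 1024)
#define CHAINS 4

int main(void)
{
    size_t *buf = malloc(ELEMS * sizeof *buf);
    if (!buf) return 1;

    /* Single-cycle random permutation (Sattolo), same trick as before. */
    for (size_t i = 0; i < ELEMS; i++) buf[i] = i;
    for (size_t i = ELEMS - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }

    size_t idx[CHAINS];
    for (int c = 0; c < CHAINS; c++) idx[c] = (ELEMS / CHAINS) * c;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < ITERS; i++)
        for (int c = 0; c < CHAINS; c++)
            idx[c] = buf[idx[c]];   /* chains are independent of each other */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    size_t sum = 0;
    for (int c = 0; c < CHAINS; c++) sum += idx[c];
    printf("avg %.2f ns per load (checksum %zu)\n",
           ns / ((double)ITERS * CHAINS), sum);

    free(buf);
    return 0;
}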
 

Mopetar

Diamond Member
Jan 31, 2011
7,936
6,233
136
NOPs are actual instructions and although they were designed to do no operation, they do take up space and used to have the effect of incrementing the instruction pointer.

Maybe it's architecture dependent, but I don't think you'd want to advance the program counter on a NOP, at least not in every case. Even in a simple pipeline there are plenty of cases where it can't advance just due to physical limitations. Anything with OOE and an actual program running probably has something else in the instruction queue that can be executed, but if you've got a simple program that's specifically designed to benchmark certain kinds of performance or better understand those characteristics of the chip, it might not have that.

Suppose we've got some program written and specifically designed to test out cache performance in the CPU. Even the fastest caches are still usually at least 3 clock cycles away, which means that the next operation needs to be delayed until that data becomes available (whether that means being written to the register file or forwarded in some manner), so the processor needs some mechanism to stall on a specific instruction until it can actually execute it with the correct data.

But it's been well over a decade since I took a course in CPU architectures, so it's entirely possible that the state of the art has advanced beyond that and I'm operating on some outdated assumptions or ideas. I mean, it would be ideal if someone could figure out how to keep a CPU pipeline completely fed with potentially useful calculations, but I'm not sure that's possible in reality for anything as complex as x86, and if it were, it likely means performance is being left on the table somewhere else to accommodate such a design.
 

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
Maybe it's architecture dependent, but I don't think you'd want to advance the program counter on a NOP, at least not in every case. Even in a simple pipeline there are plenty of cases where it can't advance just due to physical limitations. Anything with OOE and an actual program running probably has something else in the instruction queue that can be executed, but if you've got a simple program that's specifically designed to benchmark certain kinds of performance or better understand those characteristics of the chip, it might not have that.

Typically on x86 you say instruction pointer because instructions are variable size and you can't just increment. On RISC and ARM the term program counter is more apt, as you have a fixed instruction size and work with relative offsets.

In the past NOPs were used for timing, alignment, self-modifying code, and a few other edge cases. With out-of-order execution and decoupled decoders they really serve zero purpose.

Suppose we've got some program written and specifically designed to test out cache performance in the CPU. Even the fastest caches are still usually at least 3 clock cycles away, which means that the next operation needs to be delayed until that data becomes available (whether that means being written to the register file or forwarded in some manner), so the processor needs some mechanism to stall on a specific instruction until it can actually execute it with the correct data.

That's the beauty of out-of-order execution. It's just a bunch of queues and mapped registers. The load/store unit need only move data in and out of the register file. When the data makes it there, it is sent down a pipeline to be worked on. As dependencies are met, more data can be fed into the pipelines. Instructions don't execute in a single cycle, but multiple instructions can move through a pipeline, with the latency amortizing out to an average per-cycle throughput. The memory hierarchy (register file, cache, main memory) need only evict when more space is needed or when an explicit barrier or flush is called.

Any program written to measure the memory hierarchy is not executing and measuring one instruction at a time. Typically it's iterating over a sequence of memory, measuring throughput, then increasing the span. (repeat)

So let's put this in context. We're talking about the difference between stalling on a huge 32 MiB vs. a 96 MiB multi-way L3 victim cache, i.e. a span operation larger than 32 MiB, 16-way. I think getting that to stall falls into the "custom purposefully bad code hard" category.
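
For reference, the usual shape of that kind of sweep, sketched out (sizes and repeat counts are arbitrary picks on my part, and a serious run would also pin the thread and randomize the access pattern): walk a span, time it, double the span, repeat. The reported GB/s steps down each time the span spills out of a cache level, and with a 96 MiB L3 the last step would simply move out from around 32 MiB to around 96 MiB.

Code:
/* span.c - crude cache-span sweep: sum a working set, report GB/s,
 * then double the span. Build: gcc -O2 span.c -o span */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
    const size_t max_bytes = 256u * 1024 * 1024;   /* well past any L3 */
    uint64_t *buf = malloc(max_bytes);
    if (!buf) return 1;
    for (size_t i = 0; i < max_bytes / 8; i++) buf[i] = i;

    for (size_t span = 16 * 1024; span <= max_bytes; span *= 2) {
        size_t n    = span / 8;
        size_t reps = max_bytes / span;   /* same total bytes touched per step */
        uint64_t sum = 0;

        double t0 = now_ns();
        for (size_t r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                sum += buf[i];
        volatile uint64_t sink = sum;     /* force the work before stopping the clock */
        double t1 = now_ns();
        (void)sink;

        double gbs = (double)span * reps / (t1 - t0);  /* bytes per ns == GB/s */
        printf("span %8zu KiB : %6.1f GB/s\n", span / 1024, gbs);
    }

    free(buf);
    return 0;
}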
 
Reactions: Mopetar

NostaSeronx

Diamond Member
Sep 18, 2011
3,687
1,222
136
Yes, but the thing is that it is also a shrink, and for stacking the die size is important, hence extra work that would need to be done again, so they may have decided to skip that part (certainly since the number of stacking requests is limited for now).
The shrink only occurs on the new-tapeout side. The re-tapeout side has no shrink and gets the benefit of moving from 193i SAQP to single-patterned EUV.

Die A with Mask Set A (193i SAQP) is the same exact Die A with Mask Set B (EUV single-patterned).
 