Speculation: Ryzen 4000 series/Zen 3

maddie · Jul 16, 2019

amd6502 said:
What do you mean by this? GF cancelled 7nm finfet.

SMT is symmetric. The front end and L1 have changed in Zen2, and the front end is much more capable despite only 4 decode/cycle. They almost doubled up the op-cache.

https://www.anandtech.com/show/14525/amd-zen-2-microarchitecture-analysis-ryzen-3000-and-epyc-rome/8

They feed the L2 instruction code effectively using a new unit. Then they have shrunk the L1i (halved?) but doubled the op-cache size.

The op cache can also now feed up to 8 ops per cycle versus the old 6 ops maximum; link above is new zen2 and this link is zen1 https://www.anandtech.com/show/1057...lers-micro-op-cache-memory-hierarchy-revealed

The decoder still chugs only 4. It may or may not bottleneck the performance. It just depends on how the op cache is performing. They did a great job, so I think it's likely rare that the 4-wide decode is an issue. Because Zen3 is likely mobile focused, I think doubling up the decoder might not happen unless there are energy efficiency tricks. (I remember Kaveri getting doubled-up decode, and I think it has a lot to do with why those little APUs made such good space heaters and why 8c Steamroller was cancelled.) Maybe they are designing a 6-wide decode.

This is total speculation and likely totally wrong; but it's my best guess.

I think they will widen the core a little more and do 4-way multithreading, so with four threads and wider core they may need to widen the decoder. I don't think it will be SMT4 though, but think they will add a "Threadrip" mode, that allows a pair of opportunistic threads to run on top of SMT2 and help keep the execution units busy. This would be similar to big-little in the acorn world. These small threads would run completely without speculation (taking turns pausing on branches) and out-of-order execution would be very limited.

For consumer enabling Threadrip would be a benefit; with quadcore APU having 8 strong threads, and 8 "small" threads. Building a kernel with -j16 would be a big speedup over a kernel build with only SMT2 enabled. Little threads would also have no vulnerability to spec execution. The OS can use it for itself and system processes. Browsers can be made to use it (eg offloading incessant and useless javascript threads associated with bg tabs). Little threads would be most useful for high latency and high FPU code and could be useful in parallel compute datacentres.

Zen3 would be primarily mobile focused, secondarily server focused; and hopefully the first to see this core would be a quadcore sub 10W APU, followed by 2 CCX chiplets for the server and consumer markets.

As far as product lines, I think unlike 3000 gen, consumer 4000 MCM's would be strictly single CPU chiplet APUs, with 8 CU Vega built into the IOX, and available for both Zen2 and Zen3 chiplets--Zen2 arriving late H1 and Zen2 in late H2 or early 2021.

I wanted him to answer the question as the claim was that the front end was not efficient because the SMT yield was greater.

statement I replied to:
"The SMT yield is higher, which means that the front end is still not efficiently feeding the execution units."

amd6502 · Jul 17, 2019

maddie said:
I wanted him to answer the question as the claim was that the front end was not efficient because the SMT yield was greater.

statement I replied to:
"The SMT yield is higher, which means that the front end is still not efficiently feeding the execution units."

I don't see how that would work. The front end is probably pushed to its max during full load (both threads active). If the SMT yield is higher, it kind of points to the front end doing its job. However, there isn't a way of knowing; perhaps if the front end were improved, the SMT yield would rise even further. Basically, I don't think there's any info, and that statement doesn't make sense.

But it does seem plausible the 4-wide decoder could be holding back performance. (The likelihood also increases if AMD goes beyond SMT2.)

It seems the improved SMT yield is probably mostly from the wider core (up to 3 loads per cycle now).

Now the big question is, where will they widen the core next (assuming it's widened in Zen3).

It could very well be that the width remains the same (4+3).

Would increasing the maximum writes/stores per cycle from 1 to 2 cause a lot of issues or complexity?

tamz_msc · Jul 17, 2019

Abwx said:
At least 4: the two I quoted and the two I linked. Dunno about the rest, but those are obscure benches apart from 7-Zip and CB. I see no SPEC, no WebXPRT (an Intel bench, but not "good enough for the purpose"...); anything where AMD does well is removed and Blender/CB used as a cover..

It's all in your head - you're looking for hidden agendas where there exist none.

And as expected, you have no idea what to look for in Geekbench ST numbers. Here let me do the calculations for you.

From your link,
3700X
Integer score = 5284; Floating Point score = 5483; median frequency = 4238 MHz
Score/GHz Integer = 1247; Floating point = 1294
9900K
Integer score = 6381; Floating Point score = 6045; median frequency = 5038 MHz
Score/GHz Integer = 1267; Floating Point = 1200

So CFL vs Zen 2, CFL is 1.6 percent ahead in integer, 7.3 percent behind in Floating Point

Now Ice Lake (https://browser.geekbench.com/v4/cpu/13476154)
Integer score = 5410; Floating Point score = 5609; median frequency = 3825 MHz
Score/GHz Integer = 1414; Floating point = 1466

ICL vs Zen 2, ICL is 13.4 percent ahead in integer, 13.3 percent ahead in Floating Point.

Therefore Zen 3 needs to have at least 13% IPC improvement to catch up to Ice Lake in Geekbench, which is why I said 15% uplift is necessary.
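
For anyone who wants to re-run the arithmetic, here is a minimal Python sketch of the score-per-GHz calculation. It uses only the scores and median frequencies quoted in this post; the dictionary layout and the function names are made up for illustration, and it assumes the reported median frequency is what the chip actually ran at.

```python
# Score/GHz as a rough per-clock proxy, using only the Geekbench 4
# single-thread numbers quoted in this post.
RUNS = {
    "3700X": {"int": 5284, "fp": 5483, "mhz": 4238},
    "9900K": {"int": 6381, "fp": 6045, "mhz": 5038},
    "ICL": {"int": 5410, "fp": 5609, "mhz": 3825},
}


def per_ghz(score, mhz):
    """Score divided by the median frequency in GHz."""
    return score / (mhz / 1000.0)


def lead_percent(name, kind, baseline="3700X"):
    """Percent by which `name` leads `baseline` per clock (negative = behind)."""
    run, base = RUNS[name], RUNS[baseline]
    ratio = per_ghz(run[kind], run["mhz"]) / per_ghz(base[kind], base["mhz"])
    return (ratio - 1.0) * 100.0


for name in ("9900K", "ICL"):
    for kind in ("int", "fp"):
        print(f"{name} vs 3700X, {kind}: {lead_percent(name, kind):+.1f}%")
```

Rounded to one decimal this gives the same 1.6 / 7.3 / 13.4 / 13.3 percent figures as above. The result is only as good as the frequency reported for each run.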

Thunder 57 · Jul 17, 2019

Let me start by saying I thought Zen was server first, everything else be damned? You stated multiple times that you expect Zen 3 to focus on mobile, and I disagree.

amd6502 said:
What do you mean by this? GF cancelled 7nm finfet.

SMT is symmetric. The front end and L1 have changed in Zen2, and the front end is much more capable despite the same old 4 decode/cycle. They almost doubled up the op-cache.

https://www.anandtech.com/show/14525/amd-zen-2-microarchitecture-analysis-ryzen-3000-and-epyc-rome/8

They feed the L2 instruction code effectively using a new unit. Then they have shrunk the L1i (halved?) but doubled the op-cache size.

The op cache can also now feed up to 8 ops per cycle versus the old 6 ops maximum; link above is new zen2 and this link is zen1 https://www.anandtech.com/show/1057...lers-micro-op-cache-memory-hierarchy-revealed

The decoder still chugs only 4. It may or may not bottleneck the performance. It just depends on how the op cache is performing. They did a great job, so I think it's likely rare that the 4-wide decode is an issue. Because Zen3 is likely mobile focused, I think doubling up the decoder might not happen unless there are energy efficiency tricks. (I remember Kaveri getting doubled-up decode, and I think it has a lot to do with why those little APUs made such good space heaters and why 8c Steamroller was cancelled.) Maybe they are designing a 6-wide decode.

I'm surprised at what they did with the uop cache, doubled it?? Much larger than what Intel will have even with Sunny Cove. Cutting the L1i in half to help allow that seems like a great move.

This is total speculation and likely totally wrong; but it's my best guess.

I think they will widen the core a little more and do 4-way multithreading, so with four threads and wider core they may need to widen the decoder. I don't think it will be SMT4 though, but think they will add a "Threadrip" mode, that allows a pair of opportunistic threads to run on top of SMT2 and help keep the execution units busy. This would be similar to big-little in the acorn world. These small threads would run completely without speculation (taking turns pausing on branches) and out-of-order execution would be very limited.

For consumer enabling Threadrip would be a benefit; with quadcore APU having 8 strong threads, and 8 "small" threads. Building a kernel with -j16 would be a big speedup over a kernel build with only SMT2 enabled. Little threads would also have no vulnerability to spec execution. The OS can use it for itself and system processes. Browsers can be made to use it (eg offloading incessant and useless javascript threads associated with bg tabs). Little threads would be most useful for high latency and high FPU code and could be useful in parallel compute datacentres.

Zen3 would be primarily mobile focused, secondarily server focused; and hopefully the first to see this core would be a quadcore sub 10W APU, followed by 2 CCX chiplets for the server and consumer markets.

As far as product lines, I think unlike 3000 gen, consumer 4000/5000 MCM's would be strictly single CPU chiplet APUs, with 8 CU Vega built into the IOX, and available for both Zen2 and Zen3 chiplets--Zen2 arriving late this year and Zen2 5000 in late H2 or early 2021. Monolithic mobile quadcore Zen3 4000 APU arriving mid 2020 and mid H2 for AM4.

I have to admit you lost me a couple of times there. I may have to read this again in the morning when my brain is working again. I guess for now I'll say I agree on there being no SMT4. But then it looks like you say it will be wider and may need to widen the decoder. I was saying that, and I thought you were just saying how Zen 2 does a much better job at decoding already? Not that I don't believe you; the 2x uop cache is surely helpful here.

amd6502 · Jul 17, 2019

Thunder 57 said:
Let me start by saying I thought Zen was server first, everything else be damned? You stated multiple times that you expect Zen 3 to focus on mobile, and I disagree.

I'm surprised at what they did with the uop cache, doubled it?? Much larger than what Intel will have even with Sunny Cove. Cutting the L1i in half to help allow that seems like a great move.

I have to admit you lost me a couple of times there. I may have to read this again in the morning when my brain is working again. I guess for now I'll say I agree on there being no SMT4. But then it looks like you say it will be wider and may need to widen the decoder. I was saying that, and I thought you were just saying how Zen 2 does a much better job at decoding already? Not that I don't believe you; the 2x uop cache is surely helpful here.

Well, mobile and server are very alike; both go for low power budget and high efficiency.

With Rome and Naples their server offering is already so strong that they have leadership position already, while in mobile, they could benefit greatly and achieve leadership position from the efficiency advantages of a 7nm monolithic.

A focus on low power and higher threads would also complement the HPC-datacenter focused Zen2.

Zen1 (Naples) = general server.
Zen2 (Rome) = general and HPC/supercomputing.
Zen3 (Milan?) = ??? = highest threadcount at low power?

Going beyond SMT2 would actually be more of a benefit for certain servers. But it also benefits 2 to 4 core mobile CPUs. 4c/8t is enough threads for the mainstream, but having 16t and close to the multithreaded performance of a 6c/12t processor in a quadcore APU is a nice feature that will appeal to power users.

As for the front end, the change looks impressive. It looks very area and power efficient. They were able to halve the L1i (and L1 in general) by making the L2 much smarter in prefetching code. From the AnandTech article linked:

Ian Cutress said:
AMD is still using a hashed perceptron prefetch engine for L1 fetches, which is going to be as many fetches as possible, but the TAGE L2 branch predictor uses additional tagging to enable longer branch histories for better prediction pathways. This becomes more important for the L2 prefetches and beyond, with the hashed perceptron preferred for short prefetches in the L1 based on power.
[....]
One other major change is the L1 instruction cache. We noted that it is smaller for Zen 2: only 32 KB rather than 64 KB, however the associativity has doubled, from 4-way to 8-way. Given the way a cache works, these two effects ultimately don’t cancel each other out, however the 32 KB L1-I cache should be more power efficient, and experience higher utilization. The L1-I cache hasn’t just decreased in isolation – one of the benefits of reducing the size of the I-cache is that it has allowed AMD to double the size of the micro-op cache. These two structures are next to each other inside the core, and so even at 7nm we have an instance of space limitations causing a trade-off between structures within a core. AMD stated that this configuration, the smaller L1 with the larger micro-op cache, ended up being better in more of the scenarios it tested.
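
To put numbers on the size/associativity trade-off in the quote above, here is a small Python sketch of the usual set-count formula for a set-associative cache. The sizes and way counts come from the quote; the 64-byte line size is my assumption (the article excerpt does not state it), and `cache_sets` is just an illustrative helper.

```python
def cache_sets(size_bytes, ways, line_bytes=64):
    """Number of sets = capacity / (ways * line size).

    line_bytes=64 is an assumption, not a figure from the quoted article.
    """
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("capacity must be a multiple of ways * line size")
    return size_bytes // (ways * line_bytes)


zen1_l1i_sets = cache_sets(64 * 1024, ways=4)  # 64 KB, 4-way
zen2_l1i_sets = cache_sets(32 * 1024, ways=8)  # 32 KB, 8-way

print("Zen 1 L1-I sets:", zen1_l1i_sets)
print("Zen 2 L1-I sets:", zen2_l1i_sets)
```

Under that assumption the set count drops by 4x while each set holds twice as many lines, which is one way to see why halving the size and doubling the associativity are not simply offsetting changes.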

lopri · Jul 17, 2019

4-way SMT sounds silly. Has it ever been attempted before? Resource contention is going to be a huge issue, as well as scheduling. Add worse energy efficiency as well. (No ARM SoCs have tried to incorporate SMT.) The benefits of SMT are already shaky as it is (see 9700K v. 9900K). Besides which, why would you want virtual cores when AMD/Intel are just now starting to give more physical cores for less?

IntelUser2000 · Jul 17, 2019

lopri said:
4-way SMT sounds silly. Has it ever been attempted before? Resource contention is going to be a huge issue as well as scheduling. Add worse energy efficiency as well.

Energy efficiency of SMT is better than outright increasing resources. Extracting ILP generally costs resources on the order of the square of the performance it brings, while for SMT it's often better than linear.

4-way SMT has been done. The IBM Power series uses 8-way SMT. However, the IBM Power chips put significant effort into boosting SMT effectiveness, which won't be ideal on consumer chips. Sun's chips I think also used 4-way SMT.

On Intel/AMD chips, 2-way SMT adds <5% to core area(not total die). On Power 5, its SMT added 24% to core area. On top of that, going above 2-way SMT offers diminishing returns. 2-way SMT might offer 30% benefit. Doubling that again might offer 15%.

The difficulty of implementing SMT is also not so much about power or area as the validation required to make things work. This is why the Merom core didn't have SMT, since the Haifa team had no experience. Nehalem brought back SMT because it was the same team that designed Netburst.

Software-wise, the application doesn't really know the difference between 8 core/16 threads and 16 core/16 threads. Performance also reflects this. Those that benefit from 16 cores benefit from 16 threads.
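
A back-of-envelope Python sketch of the two rules of thumb above. It assumes the "square of the performance" rule applies literally, reads the 15% for 4-way as a gain on top of 2-way, and takes the "<5%" core-area figure at its 5% upper bound; none of these are measured values, and the function names are only for illustration.

```python
def ilp_resource_factor(perf_gain):
    """Resources needed for a single-thread gain if cost grows with its square."""
    return (1.0 + perf_gain) ** 2


def stacked_smt_throughput(gains):
    """Total throughput when each SMT step multiplies the previous one."""
    total = 1.0
    for gain in gains:
        total *= 1.0 + gain
    return total


# 30% more performance from ILP alone under the square rule.
print("ILP resources for +30%:", round(ilp_resource_factor(0.30), 2))

# 2-way SMT: roughly +30% throughput for at most ~5% more core area (assumed bound).
print("SMT2 throughput:", stacked_smt_throughput([0.30]), "for ~1.05x core area")

# 4-way SMT, reading the extra 15% as stacking on top of 2-way.
print("SMT4 throughput:", round(stacked_smt_throughput([0.30, 0.15]), 3))
```

So under these assumptions, getting 30% from ILP costs about 1.69x the resources, while the second doubling of threads adds much less than the first.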

Atari2600 · Jul 17, 2019

I see here that 7nm+ is supposed to give a 10% uplift in performance (I assume == clocks) and 20% density increase.

Of course, Zen3 will only be able to take advantage of that if the 12nm IO chip is not the clock limiter. Assuming it's not, maybe a 5% clock increase is possible?

So a 4.5GHz ceiling would become ~4.75GHz.

[But the density increase is more than I expected; could see some nice refinements out of that.]

Abwx · Jul 17, 2019

tamz_msc said:
Now Ice Lake (https://browser.geekbench.com/v4/cpu/13476154)
Integer score = 5410; Floating Point score = 5609; median frequency = 3825 MHz
Score/GHz Integer = 1414; Floating point = 1466

ICL vs Zen 2, ICL is 13.4 percent ahead in integer, 13.3 percent ahead in Floating Point.

Therefore Zen 3 needs to have at least 13% IPC improvement to catch up to Ice Lake in Geekbench, which is why I said 15% uplift is necessary.

The frequency of Zen 2 in this test is 4.25 GHz. Assuming ICL scales 100% with frequency (which would include bandwidth and latency...), it would score 6253, which is 6.4% better than Zen 2. Dunno how you managed to reach 13.4%; sure, by selecting the relevant subscores that's not difficult.

As pointed out by a member, you can't remove latency and RAM bandwidth that easily, as these are used in GB's computation of total performance and not including them will skew the result.

For instance, the LZMA test in isolation can give very good numbers if the executed file is very small, but once you increase the size it becomes very sensitive to these characteristics. The best example is Zen 1, whose 7-Zip numbers were nowhere close to its integer execution capability; in x264, where they matter less, it performed much better comparatively. FP is generally not as sensitive.

moinmoin · Jul 17, 2019

Thunder 57 said:
Oh yes I know. I didn't word that well. What I meant was that the time already spent building on Zen 2, in addition to another 12-18 months, is not enough time to just nuke the CCX concept and do something else. Besides, it's working well for them, why mess with success? (Not directed at you, but whoever my original quote was responding to).

I personally also don't think AMD will move away from CCX. Though the handling of L3$ is an area where we may see changes either way. There have to be more efficient ways in tapping into this ever increasing cache pool.

Thunder 57 said:
I'm not sure where that came from, but that quote is not attributable to me. I even clicked my handle and reread my original post to make sure I wasn't losing my mind.

Sorry about that, I apparently messed up multiple quoting there (it was by turtile). Corrected the post.

exquisitechar · Jul 17, 2019

From what I know, Zen 3 was originally supposed to be the bigger core update and Zen 2 a minor one. The latter part has been officially confirmed by AMD, the planned Zen 2 IPC increase was less than 10% originally, before they decided to be more aggressive and even pulled forward the TAGE branch predictor from Zen 3. I still think Zen 3 will be a significant improvement in that regard, maybe even on par with Zen 2's IPC improvement. No idea about the clock speeds, we will likely see another increase though.

Abwx said:
The frequency of Zen 2 in this test is 4.25 GHz. Assuming ICL scales 100% with frequency (which would include bandwidth and latency...), it would score 6253, which is 6.4% better than Zen 2. Dunno how you managed to reach 13.4%; sure, by selecting the relevant subscores that's not difficult.

As pointed out by a member, you can't remove latency and RAM bandwidth that easily, as these are used in GB's computation of total performance and not including them will skew the result.

For instance, the LZMA test in isolation can give very good numbers if the executed file is very small, but once you increase the size it becomes very sensitive to these characteristics. The best example is Zen 1, whose 7-Zip numbers were nowhere close to its integer execution capability; in x264, where they matter less, it performed much better comparatively. FP is generally not as sensitive.

That's not what that member said. He said that crypto, int and fp already benefit from faster memory. The memory score just skews the results, I believe I heard that even the author of the benchmark said that it will be separated from the overall score in future releases. Still, I do want to see more comparisons before claiming that Zen 3 needs a 15% IPC increase to match Sunny Cove. As for The Stilt's testing, I have my issues with his testing suite too. It's a bit vector heavy if you ask me.

mikk · Jul 17, 2019

Abwx said:
There's the SPEC comparison at AT which shows Zen 2 being faster than CFL. He tested at their rated frequencies and made a frequency normalisation; dunno if he accounted for the fact that his Ryzen setup wasn't boosting at the proper frequency in his first tests.

This is a pretty much useless IPC test because they only tested Specfp and I'm not even sure if this is a good comparison because they tested 8C vs 12C SKUs. A frequency normalisation is a bad decision as well, they have to clock all CPUs at the same frequency because they can't guarantee a perfect clock scaling, this is a really flawed IPC comparison from AT. By the way in Geekbench the IPC is exactly the same as well if both are using the same RAM. As you can see the IPC test from Stilt is by far the best.

Thunder 57 · Jul 17, 2019

moinmoin said:
I personally also don't think AMD will move away from CCX. Though the handling of L3$ is an area where we may see changes either way. There have to be more efficient ways in tapping into this ever increasing cache pool.

I find cache structure interesting and I'm coming up with a few ideas as to what they may try do, but I don't think I have anything plausible yet.

Sorry about that, I apparently messed up multiple quoting there (it was by turtile). Corrected the post.

No worries

NostaSeronx · Jul 17, 2019

amd6502 said:
What do you mean by this? GF cancelled 7nm finfet.

It might have been revived... with my rumor division saying it has been stealth restarted.

7nm FinFET is EUV(~175 wph) with e-beam inspection(slow-ish) with development restarting in 2019 to HVM planned for 2021.
3nm GAA is also EUV(>200 wph) but with actinic inspection(fast-ish) with development in 2022 to HVM planned for 2024.

What's happening, who's funding it, no answers as of yet.

Thunder 57 · Jul 17, 2019

NostaSeronx said:
It might have been revived... with my rumor division saying it has been stealth restarted.

7nm FinFET is EUV(~175 wph) with e-beam inspection(slow-ish) with development restarting in 2019 to HVM planned for 2021.
3nm GAA is also EUV(>200 wph) but with actinic inspection(fast-ish) with development in 2022 to HVM planned for 2024.

What's happening, who's funding it, no answers as of yet.

If only there were an eyeroll reaction. You have been right about FD-SOI and CMT making a comeback exactly 0% of the time. Even a broken clock is accurate sometimes.

coercitiv · Jul 17, 2019

tamz_msc said:
And as expected, you have no idea what to look for in Geekbench ST numbers.

I suspect you're not far ahead either. Using a benchmark with extremely high variance on memory speed and less than adequate ways of reporting memory speed, timings and clocks in general is a recipe for disaster for IPC comparisons.

Here's another 9900K bench:
https://browser.geekbench.com/v4/cpu/13833746
Score/GHz Integer = 6810/5 = 1362

Suddenly Ice Lake is just 4% faster and all AMD needs to do is a Zen 2 refresh similar to Zen+.

Just because Geekbench scores are all we have on Ice Lake doesn't mean we should use these scores to evaluate IPC gains. I can come up with R7 2700X scores vs. R7 2700 scores that indicate Zen+ is 5% faster than... Zen+ in ST integer.

Timorous · Jul 17, 2019

ApTeM said:
The usual questions spring to mind:

Will it support currently existing motherboards (300/400/500 series chipsets)?

What kind of IPC increase are we talking about?

Will AMD manage to squeeze more frequencies?

What node will it use?

What will be its TDP?

Will it support AVX512 instructions?

When and if we can expect Ryzen 4000 CPUs with modern onboard graphics (e.g. Navi10/Navi20)?

1. Yes, can't see them changing socket until DDR5.
2. 5-10% on average
3. maybe another 10%
4. 7nm +
5. Same but at higher clocks.
6. No
7. Q1 2021.

As a side note I also expect the IO die will be made on a smaller process and it could then conceivably contain some L4 cache which all cores can share to further improve effective latency and cross core data sharing.

IntelUser2000 · Jul 17, 2019

Atari2600 said:
I see here that 7nm+ is supposed to give a 10% uplift in performance (I assume == clocks)

Assuming it's not, maybe a 5% clock increase is possible?

So a 4.5GHz ceiling would become ~4.75GHz.

Maybe nothing. The numbers for a process have to do with individual transistors, and won't necessarily reflect what we'll see. Also, with desktop chips, the frequencies they're running at are so high that it's amazing they get there in the first place.

coercitiv said:
Here's another 9900K bench:
https://browser.geekbench.com/v4/cpu/13833746
Score/GHz Integer = 6810/5 = 1362

Actually,

That chip is running at nearly 5.3GHz. The one Abwx and tamz_msc are arguing about runs nearly 250MHz lower.

It's better taking manufacturer claims at face value than arguing over user-submitted scores.
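
To show how much the assumed clock moves the result, here is a small Python sketch using the 6810 integer score from the quoted run and the Ice Lake figures (5410 at 3825 MHz) from earlier in the thread. The two frequencies are the 5 GHz used in the quote and the roughly 5.3 GHz mentioned above; the function name is just for illustration.

```python
def icl_lead_percent(cfl_score, cfl_ghz, icl_score=5410, icl_ghz=3.825):
    """Percent by which Ice Lake leads per clock for a given assumed 9900K clock."""
    cfl_per_ghz = cfl_score / cfl_ghz
    icl_per_ghz = icl_score / icl_ghz
    return (icl_per_ghz / cfl_per_ghz - 1.0) * 100.0


for assumed_ghz in (5.0, 5.3):
    lead = icl_lead_percent(6810, assumed_ghz)
    print(f"9900K assumed at {assumed_ghz} GHz: Ice Lake leads by {lead:.1f}%")
```

At an assumed 5.0 GHz the lead comes out around 4%, and at 5.3 GHz it is around 10%, so a 0.3 GHz error in the reported clock changes the conclusion.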

maddie · Jul 17, 2019

IntelUser2000 said:
Energy efficiency of SMT is better than outright increasing resources. Extracting ILP generally costs resources on the order of the square of the performance it brings, while for SMT it's often better than linear.

4-way SMT has been done. The IBM Power series uses 8-way SMT. However, the IBM Power chips put significant effort into boosting SMT effectiveness, which won't be ideal on consumer chips. Sun's chips I think also used 4-way SMT.

On Intel/AMD chips, 2-way SMT adds <5% to core area(not total die). On Power 5, its SMT added 24% to core area. On top of that, going above 2-way SMT offers diminishing returns. 2-way SMT might offer 30% benefit. Doubling that again might offer 15%.

The difficulty of implementing SMT is also not so much about power or area as the validation required to make things work. This is why the Merom core didn't have SMT, since the Haifa team had no experience. Nehalem brought back SMT because it was the same team that designed Netburst.

Software-wise, the application doesn't really know the difference between 8 core/16 threads and 16 core/16 threads. Performance also reflects this. Those that benefit from 16 cores benefit from 16 threads.

Well said.

A lot still don't acknowledge the area and power efficiency of SMT versus additional cores, up to a point of course. With 7nm+ only having a small increase in transistor density, the obvious way for AMD to aggressively increase thread count is to increase SMT values. A boon for some (a lot?) server loads.

I know a lot of us ridicule NostaSeronx, but it might be starting to resemble CMT slightly as we increase the integer-FP ratios of a core. Not identical, but a merging or fusion of design philosophies.

IntelUser2000 · Jul 17, 2019

maddie said:
Well said.

A lot still don't acknowledge the area and power efficiency of SMT versus additional cores, up to a point of course. With 7nm+ only having a small increase in transistor density, the obvious way for AMD to aggressively increase thread count is to increase SMT values. A boon for some (a lot?) server loads.

I don't agree with 4-way SMT, helper threads or otherwise. Besides needing applications to further optimize it, the gains are going to be diminished and it'll probably increase power consumption just as much as going to 2-way SMT. Even with the POWER chips with the uarch built for SMT, 8-way doesn't do much in most cases. 4-way is optimal. For Intel/AMD chips, that's 2-way.

Atari2600 · Jul 17, 2019

SMT4 is more trouble than it is worth.

AMD have plenty of other things to be doing other than throwing significant resource at a development path that will always deliver very incremental and very intermittent improvements.

tamz_msc · Jul 17, 2019

coercitiv said:
I suspect you're not far ahead either. Using a benchmark with extremely high variance on memory speed and less than adequate ways of reporting memory speed, timings and clocks in general is a recipe for disaster for IPC comparisons.

Here's another 9900K bench:
https://browser.geekbench.com/v4/cpu/13833746
Score/GHz Integer = 6810/5 = 1362

Suddenly Ice Lake is just 4% faster and all AMD needs to do is a Zen 2 refresh similar to Zen+.

Just because Geekbench scores are all we have on Ice Lake doesn't mean we should use these scores to evaluate IPC gains. I can come up with R7 2700X scores vs. R7 2700 scores that indicate Zen+ is 5% faster than... Zen+ in ST integer.

The 9900K in your link is running at 5.3 GHz. The variance in Geekbench is high, but you can always find useful information if you know what to look for. In this case you forgot to find out exactly what clock speeds the CPU was running at.

tamz_msc · Jul 17, 2019

Abwx said:
The frequency of Zen 2 in this test is 4.25 GHz. Assuming ICL scales 100% with frequency (which would include bandwidth and latency...), it would score 6253, which is 6.4% better than Zen 2. Dunno how you managed to reach 13.4%; sure, by selecting the relevant subscores that's not dif

I'm not looking at individual subtest scores, just the overall integer and floating point scores. How I got 13.4 percent is elaborated in my post. It's not my problem that you fail to grasp the method.

Ajay · Jul 17, 2019

amd6502 said:
I don't see how that would work. The front end is probably pushed to its max during full load (both threads active). If the SMT yield is higher, it kind of points to the front end doing its job. However, there isn't a way of knowing; perhaps if the front end were improved, the SMT yield would rise even further. Basically, I don't think there's any info, and that statement doesn't make sense.

But it does seem plausible the 4-wide decoder could be holding back performance. (The likelihood also increases if AMD goes beyond SMT2.)

It seems the improved SMT yield is probably mostly from the wider core (up to 3 loads per cycle now).

Now the big question is, where will they widen the core next (assuming it's widened in Zen3).

It could very well be that the width remains the same (4+3).

Would increasing the maximum writes/stores per cycle from 1 to 2 cause a lot of issues or complexity?

SMT is a way to extract ILP from the incoming instruction stream (binary asm code). Resource-limited SMT implementations extract the ILP by running another thread when (typically) there is a load delay (especially if the load comes from DRAM vs cache). There are other cases, but this is the most prevalent by far. Ways in which the processor extracts ILP on a per-thread basis are, for example, more effective prefetch and speculative execution. The more accurate these are, the higher the effective ILP of a given thread. The higher SMT yield is probably due in part to Zen2's lower memory bandwidth and higher latency vs Intel. I will eventually read more and have a better understanding of this topic myself.

Development for an SMT4 configuration has several pitfalls.

Requires more resources per core, considerably more IIRC from looking at IBM's POWER series.
Applications have to be re-written to run more threads, and performance can vary quite a lot depending on the amount of shared data structures.
I think compilers need to be smarter to get the max utilization rate, not a compiler guy, so I don't know this impact.
OSes need to be patched so that they can optimally manage the new architecture (as has happened twice with Windows and Zen already).

So SMT4 - useless for client. Could be useful for servers, though I think more cores per chiplet, and maybe more chiplets per CPU are feasible and offer better scalability for now.
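
A toy Python model of the latency-hiding argument, under heavy assumptions: each thread alternates a fixed number of compute cycles with a fixed number of stall cycles, threads never contend for caches or other shared resources, and switching is free. The cycle counts are hypothetical and only there to show the shape of the curve.

```python
def utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles the execution units stay busy in this simple model."""
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)


# Hypothetical workloads: (compute cycles, stall cycles) per burst.
workloads = {
    "mostly compute": (60, 40),
    "memory bound": (20, 80),
}

for label, (compute, stall) in workloads.items():
    for threads in (1, 2, 4):
        busy = utilization(threads, compute, stall)
        print(f"{label}, {threads} thread(s): {busy:.0%} busy")
```

In this model the mostly-compute case is already saturated at two threads, so four threads add nothing, while the memory-bound case keeps scaling. Real cores will do worse than this because the threads share caches, queues and bandwidth.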

Abwx · Jul 17, 2019

tamz_msc said:
I'm not looking at individual subtest scores, just the overall integer and floating point scores. How I got 13.4 percent is elaborated in my post. It's not my problem that you fail to grasp the method.

So that 13.4% includes AVX512; you can see the SFFT score, which is highly inflating your "estimation".

So apparently AES, which skewed the result way less, is no good but AVX512 is, since it helps give the illusion of a point. I guess that's why I had difficulty grasping your "method"..

Also, it seems that Intel created instructions (or a hardware block) to improve the throughput in LZMA (7-Zip in their slide). That's exactly the same as the AES hardware blocks, but once more, never mind, anything goes to boost the "IPC" numbers, and they rely a lot on those new instructions, among others AES vectorization, lol...

Anyway be prepared for an abrupt landing once the dust of those truncated numbers has settled...

https://www.computerbase.de/2019-05/intel-ice-lake-ueberblick/
