AMD Bristol/Stoney Ridge Thread

krumme · Feb 7, 2019

ET said:
True, but you and I have no clue about how to manage these costs for AMD, and if there's one thing that's really bad for business it's saving as much as possible. If you want to grow and seize market opportunities, you need to spend money. There's no way around it.

AMD has a very good place to get money from: about $6.5B revenue in 2018. And it has a very good idea what to do with it: spend it on R&D. As I said before: "most arguments against it are only speculations based on what people think AMD should do". And if your only argument is that cost cutting is important, well, that's really bad business if it leaves a company short of a competitive product in a specific market.

In the end, AMD has a lot more data for making that decision. The only point I'm trying to make is that what makes sense to you isn't necessarily the right decision, because your speculation isn't based on any facts, and the fact that your arguments get more vague underlines that.

I have read creative 22FDX SOI BD/Excavator ideas weekly for 5+ years. That's all good. But history has shown that next to nothing happened. It went the exact opposite way. I think it's time to depart from it.

If you want to argue why making a new low freq cpu more or less directly derived from the bulldozer arch using some marginal process node is a good business case you are welcome. I am all ears.

dark zero · Feb 7, 2019

The only way BD survives is to add L3 cache on the APU

NTMBK · Feb 7, 2019

dark zero said:
The only way BD survives is to add L3 cache on the APU

Even with a big L3, Bulldozer sucked.

amd6502 · Feb 7, 2019

All valid points. It's unclear whether they should focus purely on Zen for the whole range or take a branch.

NostaSeronx said:
The total investment for a 22FDX CPU, a 22FDX GPU, and a 22FDX APU would ultimately fit within a single 14-nm project.

In most cases, the 22FDX CPU can re-use 14-nm Matisse I/O die and re-use a modified Orochi floorplan. Orochi didn't have a Northbridge or Southbridge, just HT links and DDR.
Orochi L3+uncore - 192.4 mm² => There is no need for on-die L3/DRAM, and the needed HT links would be converted to Infinity Fabric links.
Orochi modules - 123.6 mm² => (28nm) 78 mm²

https://blog.globalfoundries.com/wp-content/uploads/2018/12/AI_blog_2.jpg
Meanwhile, the 7LP 3D SRAM project is switched to an FDX 3D SRAM project.
2 x >78 mm squared CPU dies with each CPU having a 1x >36 square mm SRAM stack. With 1H-4H SRAM being 8(4@min) to 32(16@min) Megabyte high

$40M is not a trivial amount of money, and in the past they've been very disciplined about such bets. Almost all of their projects were either (a) doubled up with server (or higher volume embedded) and/or (b) pretty long lived products (~3 to 6 years). For example Vishera (bigtime: a+b), Kaveri (barely: b), Carrizo/BR (a+b), Stoney (barely a+b), RR/Picasso (b), Summit (a), Pinnacle (b), and all of their GPU projects (a).

I don't think they would branch all those three categories to FDX nor can I imagine they would enter the server market with a NG-dozer (it would have to be a very niche server area), as Zen family is optimally suited for server (it was designed for servers after all). For gpu I see them 100% focused on Navi which should be very long lived (5+ years), but GCN 1.2 seems just fine as a general purpose and consumer iGPU.

So the only sensible, minimally risky bet is a low end APU successor to Stoney that would have a lifespan of 5+ years. And it would be a good bet in my opinion, as I buy the marketing that 22FDX is indeed a goldilocks node (don't forget 12LP is pricier but also close to goldilocks) and think a 4 thread 5W-25W minimalist x86 APU will continue to be relevant far past 5 years. If 22FDX proves itself with such a bet, then in the future they might consider some GPU or CPU projects on 12/14FDX or whatever the leading node is down the road.

If they net ~$10 a unit, then to recover the investment, 4 million units are needed over the product lifetime, or about a million units a year. Worldwide desktop and notebook sales are on the order of 100M and 160M per year, probably not including DIY; tablets over 100M units https://www.statista.com/statistics...forecast-for-tablets-laptops-and-desktop-pcs/. So annually, 1M units would be 0.5% in a competing market of 200M composed of relevant low end laptops + x86 tablets + all in ones. I think (wild guess) even Athlon (2c RR die salvage) might exceed those sales.
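For anyone who wants to play with the numbers, here is a rough sketch of that break-even arithmetic. The $40M, ~$10 per unit and 200M market figures are the ballpark values from this post; the 4-year lifetime is an assumption picked only so the result lands on "about a million units a year", and the function name is made up for illustration.

```python
# Break-even sketch using the ballpark figures quoted in this post.
# None of these inputs are AMD data; they are forum guesstimates.

def break_even_units(investment_usd: float, net_per_unit_usd: float) -> float:
    """Units that must be sold to recover a fixed up-front investment."""
    return investment_usd / net_per_unit_usd

investment = 40e6      # ~$40M project cost (from the post)
net_per_unit = 10.0    # ~$10 net per unit (from the post)
lifetime_years = 4     # ASSUMPTION: chosen to match "about a million units a year"
market_units = 200e6   # ~200M relevant units sold per year (from the post)

units_total = break_even_units(investment, net_per_unit)
units_per_year = units_total / lifetime_years
share = units_per_year / market_units

print(f"{units_total:,.0f} units over the product lifetime")
print(f"{units_per_year:,.0f} units per year")
print(f"{share:.2%} of the addressable market per year")
```

Stretching the lifetime to 5+ years lowers the required yearly volume further, so the 0.5% share is on the pessimistic side under these assumptions.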

amd6502 · Feb 7, 2019

dark zero said:
The only way BD survives is to add L3 cache on the APU

After PD they never used it as a performance core; it was deprived of an L3. I would like to see Athlon 200GE comparisons with Bristol A12 redone with the Athlon's L3 disabled. We'd see the effectiveness of XV tested on a more level playing field.

For ultra low wattage, I also think an L3 isn't going to be that convenient, unless you could disable/sleep it whenever the APU goes into very lowest p-state for some significant amount of time.

ET · Feb 10, 2019

NTMBK said:
Even with a big L3, Bulldozer sucked.

Excavator shows over 30% IPC improvement in some tasks over previous construction core architectures (see Anandtech's comparison for example, but it was also shown in other tests, IIRC at Notebookcheck). Excavator has a very small cache, so a bigger cache should help that IPC improvement shine in more cases.

Insert_Nickname said:
BR and Stoney are nothing more than reheated 28nm Carrizo.

There was no Stoney equivalent for Carrizo, so that at least is new work. Granted it's simpler work than a new architecture, but it's still layout, validation, etc.

ET · Feb 10, 2019

amd6502 said:
$40M is not a trivial amount of money, and in the past they've been very disciplined about such bets. Almost all of their projects were either (a) doubled up with server (or higher volume embedded) and/or (b) pretty long lived products (~3 to 6 years). For example Vishera (bigtime: a+b), Kaveri (barely: b), Carrizo/BR (a+b), Stoney (barely a+b), RR/Picasso (b), Summit (a), Pinnacle (b), and all of their GPU projects (a).

I don't really think that AMD has always been that disciplined. A lot of that seems to me to be decisions of the moment, and this list ignores stuff that didn't pan out, like the ARM CPUs. Won't surprise me if AMD had a Vishera upgrade in the pipeline which was scrapped. I'm not sure why you list Pinnacle Ridge as a long term product. The RX 590 is another example. Sure, they are both just refinements, but I think they land under 'low hanging fruit' rather than your (a) or (b).

Insert_Nickname · Feb 10, 2019

ET said:
Excavator shows over 30% IPC improvement in some tasks over previous construction core architectures (see Anandtech's comparison for example, but it was also shown in other tests, IIRC at Notebookcheck). Excavator has a very small cache, so a bigger cache should help that IPC improvement shine in more cases.

It also showed some regressions compared to Kaveri here and there. I agree a bigger cache would help, but it's a pretty moot point as Zen is 52% faster per clock. Any way you cut it, the Bulldozer architecture generally was an unmitigated failure. Even with improvements.

AMD has thankfully been able to move on. But if they can make a few bucks on already established and paid for architecture, even if it was a failure, one can't blame them.

I do suspect there is a lot more Excavator in Zen than commonly realised. But that is pure speculation on my part.

ET · Feb 10, 2019

Insert_Nickname said:
It also showed some regressions compared to Kaveri here and there.

The regressions are likely because of its smaller cache size (1MB per module vs. 2MB for previous architectures). That was the point of my post, that Excavator with a larger cache would likely be a decent performer.

True, it would still be slower than Zen, but I think it would be closer to Zen than to Piledriver (which is what the 52% faster figure refers to).

NTMBK · Feb 10, 2019

ET said:
The regressions are likely because of its smaller cache size (1MB per module vs. 2MB for previous architectures). That was the point of my post, that Excavator with a larger cache would likely be a decent performer.

True, it would still be slower than Zen, but I think it would be closer to Zen than to Piledriver (which is what the 52% faster figure refers to).

Even with all those improvements, it still lags behind a Core i3 Sandy Bridge from 2011. And while increasing the L2 cache size would improve hit rate, it would also make L2 latency worse. There's a good reason why Zen has a small L2, backed up by a large L3.
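The hit rate vs. latency tradeoff can be sketched with the textbook average memory access time formula (hit time + miss rate × miss penalty). Every number below is a made-up illustration, not a measured Excavator or Zen figure, and the function name is hypothetical.

```python
# Average memory access time (AMAT) seen at the L2 level.
# All cycle counts and miss rates here are HYPOTHETICAL, for illustration only.

def amat(hit_cycles: float, miss_rate: float, miss_penalty_cycles: float) -> float:
    """Textbook AMAT: hit time plus miss rate times miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# Smaller L2: faster hits, more misses going out to memory.
small_l2 = amat(hit_cycles=12, miss_rate=0.10, miss_penalty_cycles=100)

# Larger L2: fewer misses, but every hit costs more cycles.
large_l2 = amat(hit_cycles=20, miss_rate=0.07, miss_penalty_cycles=100)

print(f"small L2: {small_l2:.1f} cycles, large L2: {large_l2:.1f} cycles")
```

With these invented inputs the larger cache comes out slower on average; a bigger drop in miss rate or a higher miss penalty would flip the result, which is why the outcome depends on the workload.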

ET · Feb 10, 2019

NTMBK said:
And while increasing the L2 cache size would improve hit rate, it would also make L2 latency worse.

Good point, but considering that faster L2 cache wasn't part of the architecture advancements discussed for Excavator, I'm not sure there's a significant difference there.

NTMBK · Feb 10, 2019

ET said:
Good point, but considering that faster L2 cache wasn't part of the architecture advancements discussed for Excavator, I'm not sure there's a significant difference there.

Lower latency was mentioned in the marketing slides.

Abwx · Feb 10, 2019

NTMBK said:
Lower latency was mentioned in the marketing slides.

That's the latency of the cache.

A comparison clock/clock of Piledriver/Kaveri/Excavator :

http://www.planet3dnow.de/cms/18564...cavator-leistungsvergleich-der-architekturen/

And among others IMC latency :

Obviously there's a big difference between PD and XV; the AIDA tests display much improved theoretical max throughput, and it's not entirely due to AVX2.

NTMBK · Feb 10, 2019

Abwx said:
That's the latency of the cache.

A comparison clock/clock of Piledriver/Kaveri/Excavator :

http://www.planet3dnow.de/cms/18564...cavator-leistungsvergleich-der-architekturen/

And among others IMC latency :

Obviously there's a big difference between PD and XV; the AIDA tests display much improved theoretical max throughput, and it's not entirely due to AVX2.

I know, I was discussing the latency of the L2; that's why I posted that image. But yes, overall memory latency got worse for some reason. Different memory controller, perhaps? Didn't AMD move to some generic IP on Carrizo, instead of their in-house IMC on Kaveri?

ET · Feb 10, 2019

NTMBK said:
Lower latency was mentioned in the marketing slides.

But in the context of the L1 cache (which got larger). It's possible that it does refer to L2, but I see no particular reason to interpret it that way.

NTMBK · Feb 10, 2019

ET said:
But in the context of the L1 cache (which got larger). It's possible that it does refer to L2, but I see no particular reason to interpret it that way.

The section is titled "Improved caches", indicating that it referred to both L1 and L2.

naukkis · Feb 10, 2019

ET said:
True, it would still be slower than Zen, but I think it would be closer to Zen than to Piledriver (which is what the 52% faster figure refers to).

AMD said that Zen IPC is 52% higher than Excavator. Difference to Piledriver is more than that.

Abwx · Feb 10, 2019

NTMBK said:
I know, I was discussing the latency of the L2; that's why I posted that image. But yes, overall memory latency got worse for some reason. Different memory controller, perhaps? Didn't AMD move to some generic IP on Carrizo, instead of their in-house IMC on Kaveri?

The article I linked has a relevant slide. Latency is actually the same as Kaveri at the L1 level; nothing is said about the L2, but considering the clock gating implementation it's likely slower latency-wise.

Besides, the smaller L2 will increase reliance on the main memory pool, and at some point this becomes a limiting factor for IPC vs. frequency. Indeed, substantially increasing the CPU throughput/Hz while halving the L2 was contradictory and undoubtedly dictated by the economics...

https://www.planet3dnow.de/cms/wp-content/gallery/amd-carrizo-techday/11-Carrizo-Architecture.png

NTMBK · Feb 10, 2019

Abwx said:
The article I linked has a relevant slide. Latency is actually the same as Kaveri at the L1 level; nothing is said about the L2, but considering the clock gating implementation it's likely slower latency-wise.

Besides, the smaller L2 will increase reliance on the main memory pool, and at some point this becomes a limiting factor for IPC vs. frequency. Indeed, substantially increasing the CPU throughput/Hz while halving the L2 was contradictory and undoubtedly dictated by the economics...

https://www.planet3dnow.de/cms/wp-content/gallery/amd-carrizo-techday/11-Carrizo-Architecture.png

So you found a slide saying that L1 latency stayed the same, and I found one saying that an unspecified cache latency improved... So what you're saying is that the L2 latency went down, like I was saying?

amd6502 · Feb 10, 2019

Insert_Nickname said:
I do suspect there is a lot more Excavator in Zen than commonly realised. But that is pure speculation on my part.

Vice versa too; the lower latency L2 was taken from Zen. Each core has as much as a Zen core, even though Zen runs two threads and XV just a single thread.

Looking at the slide NTMBK posted, another interesting thing to see is the oversized ID, which was doubled up in XV from PD. If there were to be a NostaDozer, that would maybe get shrunk again to the PD arrangement, or to a PD front end + Zen-like op cache.

These are some guesstimate ballpark numbers:

Zen ~ 1 B/ccx transistors or 250M per core inclusive the caches.
BD ~ 100M per core or 213M per module inclusive larger L2 cache.
XV ~ 115M transistors inclusive caches.
jaguar ~ 30M per core inclusive caches

VirtualLarry · Feb 10, 2019

amd6502 said:
XV ~ 115M transistors inclusive caches.
jaguar ~ 30M per core inclusive caches

If Windows could handle the scheduling, I wouldn't mind a big.LITTLE hybrid of those two.

amd6502 · Feb 10, 2019

NTMBK said:
On the topic of production cost- take a look at the actual layout of the Raven Ridge die:

Notice the large amounts of space that are devoted to neither the GPU nor the CPU. Now imagine that the GPU was cut down by 8/11ths (from 11CU to 3CU), and the CPU was cut down by half (from 4 core to 2 core). You still need all the other stuff in place to manage IO, decode video, etc. At that point the two CPU cores really aren't that big a part of the remaining die area- the non-CPU, non-GPU part is a much bigger proportion of that cut down die.

Now compare that theoretical die with one that replaces the two Zen cores with a (shrunk and tweaked) Excavator module. Is the die smaller? Sure, but not that much smaller. All the non-CPU stuff hasn't shrunk. Is the dramatic loss in system responsiveness worth that small tradeoff in die area? Is the slight saving in production costs worth the millions in up-front costs to port Excavator to 12nm FinFET, instead of reusing Zen?

A 6CU dual core variant would be about 70mm2 less (subtract half a CCX area and about half green vega area).

So down from 210mm2 to about 140mm2. With the lower margins, maybe this wasn't worth the cost to them. The GPU at 6CU would still match or even exceed 8CU BR which was handicapped with 2400 dual channel limit.

So it's this or a 3CU version just under 120mm2 versus some FDSOI low end product of roughly similar die size and somewhat lesser capability.

Or, if it's not worth the ~100M investment, make do with die salvage, and mostly exit the atom sub 10W and ultrabudget market.

VirtualLarry said:
If Windows could handle the scheduling, I wouldn't mind a big.LITTLE hybrid of those two.

I really don't think the software part would be a big deal. How well it would work with common software like Chrome and Firefox is another question.

An XV module and two Jaguars would be under 300M transistors (~20% more than a Zen core) and handle twice the threads, and the whole APU could be under 1.5B transistors and the same size as Stoney.
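A quick sanity check of that budget, using the guesstimate numbers posted earlier in the thread. It assumes the "XV ~ 115M" figure is per core (two cores per module); the constant and function names are just for illustration.

```python
# Transistor budget for a hypothetical XV module + 2 Jaguar cores.
# Inputs are the thread's ballpark guesstimates, in millions of transistors.
XV_PER_CORE_M = 115      # ASSUMPTION: the ~115M figure is per XV core, caches included
JAGUAR_PER_CORE_M = 30   # ~30M per Jaguar core, caches included
ZEN_PER_CORE_M = 250     # ~250M per Zen core, caches included

def hybrid_budget_m(xv_cores: int, jaguar_cores: int) -> int:
    """Total CPU transistor count (millions) for a mixed XV/Jaguar design."""
    return xv_cores * XV_PER_CORE_M + jaguar_cores * JAGUAR_PER_CORE_M

total_m = hybrid_budget_m(xv_cores=2, jaguar_cores=2)  # one XV module = 2 cores
ratio_vs_zen_core = total_m / ZEN_PER_CORE_M

print(f"hybrid CPU: ~{total_m}M transistors")
print(f"vs one Zen core: {ratio_vs_zen_core:.2f}x")
```

Under those assumptions the hybrid runs four threads against the two of a single SMT Zen core, which is where the "twice the threads" comparison comes from.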

DrMrLordX · Feb 10, 2019

NTMBK said:
Even with all those improvements, it still lags behind a Core i3 Sandy Bridge from 2011. And while increasing the L2 cache size would improve hit rate, it would also make L2 latency worse. There's a good reason why Zen has a small L2, backed up by a large L3.

Can somebody weigh in on how this is not true instead of just downvoting? Is XV still behind Sandy Bridge? I'd like to see an honest comparison here if someone disagrees.

amd6502 · Feb 10, 2019

At least as far as Integer (or mixed int-fpu) it's no contest. 2c/4t Sandy is outclassed by 4c/4t XV in multithread.

https://browser.geekbench.com/geekbench3/compare/8705545?baseline=7350750

The only win for Sandy is memory performance, and a few limited single thread int and fpu tests.

DrMrLordX · Feb 10, 2019

amd6502 said:
At least as far as Integer (or mixed int-fpu) it's no contest. 2c/4t Sandy is outclassed by 4c/4t XV in multithread.

https://browser.geekbench.com/geekbench3/compare/8705545?baseline=7350750

The only win for Sandy is memory performance, and a few limited single thread int and fpu tests.

Thanks, that's a good place to start. Though XV has a 400 MHz clockspeed advantage there according to the benchmark. An i3-2130 would have fared better:

https://browser.geekbench.com/v4/cpu/11934784

Still sucks in MP though. For a broader perspective (note that Bench does not have the i3-2130 in its database):

https://www.anandtech.com/bench/product/1684?vs=1901
