AMD Carrizo APU Details Leaked

Page 10 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
alright

so what about integrating audio and radio processors?

will these be easier?
Integration is simply a matter of having the die space for additional components -- it's not a question of difficulty.

The more important bits get integrated first. Things like L2 and floating point units were integrated first. Audio doesn't really matter, in a relative sense, so it hasn't been integrated on most x86 processors.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,803
1,286
136
And widening the decode/retirement/data bus limits the core scaling. The way I see it, the area growth (front end, retirement, even the bypass between the FMAs) required in moving from a 2x128b to a "true" (see below) 4x128b design is significantly larger and more power hungry, and probably costs more frequency, than moving from 2x128b to a 2x256b design. So you can fit more FP units in a much narrower core... hence better core scaling.
The problem is small units scale better.
Internally the FMA units in Bulldozer to Steamroller are all 64-bit in size. Smaller units are more efficient and less power hungry than bigger units.
Now my question is, what's the front end you would put in to feed this CPU? If you say 4 wide, then for applications that can scale to 256b, you have excess decode width. If you say 2 wide, then you don't get the full utilization of a 4x 128b. Flexibility comes with some overhead. I like reconfigurable stuff, but if you want maximum perf/power and hence efficiency, you have to start targeting.
15h is already 4 wide and 30h-4Fh made it 8-wide.
so are they trying to move the floating point operations over to the gpu?
Possibly via HSAIL. AMD might also extend the AMD64 ISA to include the ARM ISA and GPU ISA at a much later time. That way the ISA itself is physically heterogeneous.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
The problem is small units scale better.
Internally the FMA units in Bulldozer to Steamroller are all 64-bit in size. Smaller units are more efficient and less power hungry than bigger units.

A 4x128b FMA will have 8x double precision (64b) FMA units. 2x256b FMA will have 8x double precision FMA units. So 'scaling' looks pretty equal so far. You're not going to build a new IEEE extended precision 256-bit FP format.

- A fully connected 4x 128-bits execution unit means you need to schedule 4 uops, and any of those uops can take data from the result of any of the other FMAs (including itself).
- A fully connected 2x 256-bit execution unit means you need to schedule 2 uops and any of those uops can take data from any of the other FMAs (including itself). So you see, if your workloads can readily take advantage of 256-bit vectors, 2x256b scales better than 4x128b.
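
As a rough sketch of the comparison above, the lane count, the uops to schedule per cycle, and the number of result-forwarding paths can be tallied for each layout. The function name `fma_config` is made up for illustration, and counting one bypass path per (producer, consumer) pair is a simplifying assumption: it ignores the width of each path and the number of source operands.

```python
def fma_config(units: int, width_bits: int, lane_bits: int = 64) -> dict:
    """Summarize a cluster of `units` FMA pipes, each `width_bits` wide.

    Hypothetical helper for illustration only; not a model of any real core.
    """
    # Total double-precision (64-bit) FMA lanes across all pipes.
    lanes = units * (width_bits // lane_bits)
    # One uop is scheduled per pipe per cycle.
    uops_per_cycle = units
    # Fully connected: every pipe can forward its result to every pipe,
    # including itself, so the pair count grows with units squared.
    bypass_paths = units * units
    return {
        "lanes": lanes,
        "uops_per_cycle": uops_per_cycle,
        "bypass_paths": bypass_paths,
    }


four_by_128 = fma_config(units=4, width_bits=128)
two_by_256 = fma_config(units=2, width_bits=256)

print(four_by_128)
print(two_by_256)
```

Under these assumptions both layouts end up with the same number of 64-bit lanes, while the 4x128b layout has twice the uops to schedule and four times the forwarding pairs.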

15h is already 4 wide and 30h-4Fh made it 8-wide.

I believe if you're talking about Steamroller/Bulldozer, maybe you're thinking of it as 4-wide x 2 threads, not 8-wide x 1 thread.

The 2x2x128b FMA that Bulldozer has is a pretty nice concept to deal with the AVX to AVX2 transition (or apps that can't even use AVX2) but it definitely is not logically equivalent to a 4x128b FMA.

Sorry to everyone else for veering way off topic. If you still disagree, I can take this to PM.
 

norseamd

Lifer
Dec 13, 2013
13,990
180
106
What benefit would you get with that idea?

that is what i am asking about

so integrating a small number of powerful cores with a large number of more efficient cores would be what i am thinking about. what benefits would there be to this?

would the gpu just do anything the arm cores might?

the arm cores are more powerful than the gpu cores per calculation unit right?
 

norseamd

Lifer
Dec 13, 2013
13,990
180
106
do the arm cores take up less room and heat than x86 cores do?

and gpu cores have only limited instruction sets right?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,803
1,286
136
Why not use 15h and 16h side by side then?

Quad-core Steamroller/Excavator + Quad-core Jaguar/Puma
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
what do you guys think the prospects for integrating x86 and arm cores are?
None. What do you want ARM cores to do that x86 cores can't?

What benefit would you get with that idea?
As far as I know, none.

that is what i am asking about

so integrating a small number of powerful cores with a large number of more efficient cores would be what i am thinking about. what benefits would there be to this?

would the gpu just do anything the arm cores might?

the arm cores are more powerful than the gpu cores per calculation unit right?
Why would it have to be ARM cores? I think something like that already exists, and it's called Xeon Phi. big.LITTLE, if that's what you mean, might in certain cases be a possibility, but I don't really see the benefits of that. To me it just seems like a marketing thing designed by ARM to justify integrating gazillions of cores onto a small, thermally bounded SoC.

ARM cores, or any CPU cores, are indeed more powerful than GPU cores (e.g. you need ~100 Phi cores to get a few TFLOPS, but you need thousands of GPU cores).
 

jpiniero

Lifer
Oct 1, 2010
16,186
6,635
136
so are they trying to move the floating point operations over to the gpu?

That was the original idea of Fusion. AMD's going to go bankrupt before they see the vision realized however.

Why not use 15h and 16h side by side then?
Quad-core Steamroller/Excavator + Quad-core Jaguar/Puma

Windows doesn't/can't support anything like big.LITTLE. If it was possible, MS would have done it with the original Surface. So that pretty much makes it a non-starter on x86, at least until Microsoft goes away. Ironically, I think it would be much more useful on high-wattage devices like desktops or even laptops where even on Broadwell idle is still 60 W.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
big.LITTLE isn't a solution. It's a hotfix for companies with cores unable to run proper low power operations as well as scale in performance when needed. Qualcomm and Apple, for example, aren't doing big.LITTLE.

And x86 CPUs certainly don't need it.
 

Fjodor2001

Diamond Member
Feb 6, 2010
4,017
444
126
big.LITTLE isn't a solution. It's a hotfix for companies with cores unable to run proper low power operations as well as scale in performance when needed. Qualcomm and Apple, for example, aren't doing big.LITTLE.

And x86 CPUs certainly don't need it.

No. A single CPU core type can never cover the same frequency range as well as two separate CPU core types, when performance and power consumption are taken into account. But it's already been discussed here.
 

DeathReborn

Platinum Member
Oct 11, 2005
2,786
789
136
I'd actually like to see a low-mid range APU with a pair of ARM cores on board for running Android/ChromeOS as well as Windows without having to reboot or run a VM.

Something like a quad Steamroller/Excavator with a pair of A53 cores. It would be very niche, but it would be perfect for AIOs as an ultra-low-power mode for basic web browsing, email, etc.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
I'd actually like to see a low-mid range APU with a pair of ARM cores on board for running Android/ChromeOS as well as Windows without having to reboot or run a VM.

Something like a quad Steamroller/Excavator with a pair of A53 cores. It would be very niche, but it would be perfect for AIOs as an ultra-low-power mode for basic web browsing, email, etc.

First problem is they can't share memory. And it serves no purpose either, because you can run Android and ChromeOS on x86.
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
No. A single CPU core type can never cover the same frequency range as well as two separate CPU core types, when performance and power consumption are taken into account. But it's already been discussed here.
Core scales well over a full order of magnitude. And over its full range, there isn't a single architecture that does better at just 1 point of the TDP curve.
 

Fjodor2001

Diamond Member
Feb 6, 2010
4,017
444
126
Core scales well over a full order of magnitude. And over its full range, there isn't a single architecture that does better at just 1 point of the TDP curve.

A Haswell core will never be as power efficient as an ARM Cortex-A7 when running low processing loads.

Check this out for how ARM Cortex-A7 and A15 relate. A Haswell curve would be even further to the top right than the A15.



But anyway, I think that discussion is off topic in this thread. I suggest the existing thread on that topic be revived instead if needed.
 

NTMBK

Lifer
Nov 14, 2011
10,401
5,638
136
ARM cores, or any CPU cores, are indeed more powerful than GPU cores (e.g. you need ~100 Phi cores to get a few tflops, but you need thousands of GPU cores).

Don't fall for GPU marketing madness. In GPU marketing language, each Xeon Phi core would be referred to as 16 "cores"- it has 16 SIMD lanes, meaning it can operate on 16 elements in parallel. As such the Phi would be a 960 "core" GPU.
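
The conversion described here is a single multiplication, sketched below. The figure of 60 physical cores is an assumption inferred from the 960 and 16 quoted in the post, and `marketing_cores` is a made-up name.

```python
def marketing_cores(cpu_cores: int, simd_lanes_per_core: int) -> int:
    """Count "cores" the way GPU marketing does: one per SIMD lane."""
    return cpu_cores * simd_lanes_per_core


# Assumed: 60 physical cores, inferred from 960 / 16 in the post above.
phi_physical_cores = 60
phi_simd_lanes = 16

phi_as_gpu_cores = marketing_cores(phi_physical_cores, phi_simd_lanes)
print(phi_as_gpu_cores)
```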
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
A Haswell core will never be as power efficient as an ARM Cortex-A7 when running low processing loads.

Check this out for how ARM Cortex-A7 and A15 relate. A Haswell curve would be even further to the top right than the A15.



But anyway, I think that discussion is off topic in this thread. I suggest the existing thread on that topic be revived instead if needed.
This is a marketing slide comparing an in-order design to an A15, which isn't really efficient, and we're discussing desktops. It might be interesting though, something like quadcore Goldmont + dualcore Skylake.

Don't fall for GPU marketing madness. In GPU marketing language, each Xeon Phi core would be referred to as 16 "cores"- it has 16 SIMD lanes, meaning it can operate on 16 elements in parallel. As such the Phi would be a 960 "core" GPU.
Thank you for your clarification, there doesn't seem to be a lot of difference then.
 

NTMBK

Lifer
Nov 14, 2011
10,401
5,638
136
Thank you for your clarification, there doesn't seem to be a lot of difference then.

Well the Xeon is lacking the dedicated hardware that a GPU has, like hardware texture sampling units, rasterization engine, and so on. It's a purely compute oriented beast.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
It's quite obvious that big.LITTLE isn't optimal when the 2 biggest ARM companies have rejected it, and PR slides don't change that fact. Meanwhile the 3rd largest mainly uses it to sell the notion of moar cores.

big.LITTLE is simply for companies that can't afford the R&D needed. They will suffer in extended production and software cost instead.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
I find it hilarious when people dismiss one company's slides as PR but use another company's PR slides to showcase what they want.
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
Well the Xeon is lacking the dedicated hardware that a GPU has, like hardware texture sampling units, rasterization engine, and so on. It's a purely compute oriented beast.

I mean the number of flops/core is about the same if I take your 960 Xeon cores number.
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
I find it hilarious when people dismiss one company's slides as PR but use another company's PR slides to showcase what they want.

Do you have any specific examples of that? Not all marketing slides are good/bad. Even bad marketing slides can provide useful information.
 

NTMBK

Lifer
Nov 14, 2011
10,401
5,638
136
I mean the number of flops/core is about the same if I take your 960 Xeon cores number.

Well yeah, FLOPS is a pretty straightforward number: it's literally how many floating point operations you can carry out per second if you were going flat out with every core fed. So we get:

2 (fused multiply add is 2 ops) * number of lanes per SIMD unit * number of SIMD units per core * number of cores * clock speed.
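
A minimal sketch of that formula as code, in case it helps to plug numbers in. The function name `peak_flops` and all the example inputs are made up for illustration; the 1 GHz clock in particular is a round placeholder, not a spec of any real part.

```python
def peak_flops(
    lanes_per_simd_unit: int,
    simd_units_per_core: int,
    cores: int,
    clock_hz: int,
    ops_per_fma: int = 2,
) -> int:
    """Theoretical peak FLOPS, assuming every lane issues an FMA every cycle."""
    return (
        ops_per_fma            # a fused multiply-add counts as 2 operations
        * lanes_per_simd_unit
        * simd_units_per_core
        * cores
        * clock_hz
    )


# Hypothetical inputs: 16 lanes, 1 SIMD unit per core, 60 cores, 1 GHz.
example = peak_flops(
    lanes_per_simd_unit=16,
    simd_units_per_core=1,
    cores=60,
    clock_hz=1_000_000_000,
)
print(example / 1e12, "TFLOPS")
```

This is a best-case ceiling only: it says nothing about whether memory bandwidth or the front end can actually keep every lane fed.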
 