AMD Carrizo APU Details Leaked


norseamd

Lifer
Dec 13, 2013
13,990
180
106
APUs have many benefits, like speeding up work on large files and, as noted before, reducing the load time on picture thumbnails when you have a ton of them. They also benefit gaming, although how that works alongside a discrete GPU I do not know. Maybe the physics calculations on OpenCL might be stronger. There is likely a lot you can do with OpenCL for gaming and such.
 

norseamd

Lifer
Dec 13, 2013
13,990
180
106
So floating point units used to be independent processors, and they are now integrated. How is integrating GPUs any different from those previous integrations?
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
like actually slower

why?
Well, they're using high density, automated libraries more extensively. The core's also getting a lot wider -- double the ALUs and AGUs, double the FPU width... among other things. It'll undoubtedly be faster than Steamroller overall, but it might be clocked lower. But who knows.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,803
1,286
136
Well, they're using high density, automated libraries more extensively.
Personally, I think Steamroller was the start of the automated library. Steamroller could also be using high density libraries as well.
The core's also getting a lot wider -- double the ALUs and AGUs, double the FPU width... among other things.
I wouldn't put my hopes on AMD doubling the ALUs or AGUs. FPU width isn't AMD's game; it's all about more units:
2 x 256b FMA = bad
4 x 128b FMA = good
It'll undoubtedly be faster than Steamroller overall, but it might be clocked lower. But who knows.
Who knows.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Personally, I think Steamroller was the start of the automated library. Steamroller could also be using high density libraries as well.
I wouldn't put my hopes on AMD doubling the ALUs or AGUs. FPU width isn't AMD's game; it's all about more units:
2 x 256b FMA = bad
4 x 128b FMA = good

It's a "mixed" good. For applications that can't unroll and fill out a full 256b vector, 4x 128b FMA would be more flexible. However, for HPC or widely parallel applications, 2x 256b is much more power efficient due to the narrower decode/retirement/memory pipelines. 4x 128b just comes at a much higher cost around the FMAs to keep them fed.
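A toy model makes the tradeoff concrete (an illustrative sketch only; the unit counts and widths are simplified assumptions, not AMD's actual microarchitecture):

```python
def fp32_lanes_per_cycle(n_units, unit_bits, instr_bits):
    """Peak FP32 FMA lanes per cycle in a simplified FPU model.

    An instruction narrower than a unit leaves the rest of that unit
    idle; an instruction wider than a unit is cracked across units,
    so each unit contributes at most its own width either way.
    """
    return n_units * min(unit_bits, instr_bits) // 32

# 2x 256b FPU: full throughput needs 256b (AVX) instructions;
# 128b (SSE) code idles half of each unit.
print(fp32_lanes_per_cycle(2, 256, 256))  # 16
print(fp32_lanes_per_cycle(2, 256, 128))  # 8

# 4x 128b FPU: same 16-lane peak whether the code is SSE or AVX
# (a 256b op is simply split across two 128b units).
print(fp32_lanes_per_cycle(4, 128, 128))  # 16
print(fp32_lanes_per_cycle(4, 128, 256))  # 16
```

The flip side, per the post above, is that filling four units takes a wider and hungrier front end, which is where 2x 256b wins for code that vectorizes fully.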
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
Personally, I think Steamroller was the start of the automated library.
Well, it's impossible to say, given the eyesore of a die shot we got from AMD. I was really disappointed with it. At least they're not Nvidia, though... tired of looking at pictures of silicon that look like a clown took a runny dump all over.
 

NTMBK

Lifer
Nov 14, 2011
10,401
5,638
136
I wouldn't put my hopes on AMD doubling the ALUs or AGUs. FPU width isn't AMD's game; it's all about more units:
2 x 256b FMA = bad
4 x 128b FMA = good

Given that AMD's modules are still bottlenecked on instruction fetch, could they really keep 4 FMACs fed with enough SSE instructions? (AVX is obviously not an issue.)
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,803
1,286
136
Given that AMD's modules are still bottlenecked on instruction fetch, could they really keep 4 FMACs fed with enough SSE instructions? (AVX is obviously not an issue.)
Execution bit width != Instruction byte size.

A 256-bit AVX/AVX2 instruction takes only a few bytes to encode. Steamroller could feed 6 FMACs + 6 AGUs with scalar instructions.
It's a "mixed" good. For applications that can't unroll and fill out a full 256b vector, 4x 128b FMA would be more flexible. However, for HPC or widely parallel applications, 2x 256b is much more power efficient due to the narrower decode/retirement/memory pipelines. 4x 128b just comes at a much higher cost around the FMAs to keep them fed.
2x 256b = 2x 128b or 2x 256b; it isn't efficient or effective.

4x 128b = 4x 128b or 2x 256b; it is efficient and effective.

HPC and/or widely parallel applications are more worried about core scaling than the internal FPU decode/retirement/data bus.
 
Last edited:

NTMBK

Lifer
Nov 14, 2011
10,401
5,638
136
Execution bit width != Instruction byte size.

A 256-bit AVX/AVX2 instruction takes only a few bytes to encode. Steamroller could feed 6 FMACs + 6 AGUs with scalar instructions.

Yes, I am well aware of that. But the point I am making is that in order to keep your 4 128bit FMACs fed with SSE, you would need to handle twice as many instructions in the front end as you would with AVX. And if splitting the FMACs into 128bit units gives no benefit in SSE, then there is no point and you may as well go with 2x256bit units, getting slightly improved performance in AVX code.
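The front-end pressure argument can be put in numbers with a small sketch (a hypothetical toy model of issue, not real decoder behavior):

```python
import math

def issue_rate_needed(n_fmacs, fmac_bits, instr_bits):
    """Instructions per cycle the front end must supply to keep every
    FMAC busy, assuming a wide op is cracked into FMAC-sized pieces."""
    units_per_instr = max(1, math.ceil(instr_bits / fmac_bits))
    return n_fmacs // units_per_instr

# 4x 128b FMACs: SSE (128b) code needs twice the front-end
# throughput that AVX (256b) code does.
print(issue_rate_needed(4, 128, 128))  # 4 instructions/cycle
print(issue_rate_needed(4, 128, 256))  # 2 instructions/cycle

# 2x 256b FMACs: 2 instructions/cycle either way.
print(issue_rate_needed(2, 256, 128))  # 2
print(issue_rate_needed(2, 256, 256))  # 2
```

This is the point above in miniature: filling 4x 128b from SSE code doubles the demand on fetch/decode relative to AVX on the same units.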
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,803
1,286
136
Yes, I am well aware of that. But the point I am making is that in order to keep your 4 128bit FMACs fed with SSE, you would need to handle twice as many instructions in the front end as you would with AVX. And if splitting the FMACs into 128bit units gives no benefit in SSE, then there is no point and you may as well go with 2x256bit units, getting slightly improved performance in AVX code.
Even Bulldozer could keep up with its FPU (4x 128b FMACs and 4x 128b MMX) and (4 ALUs and 4 AGUs) per core, with its "crippled" front-end.
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
Gaming performance is not the primary objective for AMD's APUs; compute is. A higher iGPU shader count will bring higher computational performance with OpenCL, and that is what AMD is after.

But why do consumers care about that?
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
As a general consumer who runs office apps and games, can this benefit me?

It has already started to benefit us; there are already consumer applications that can be OpenCL accelerated, like WinZip, Adobe Premiere, LibreOffice and more. Also, you get higher gaming performance with every new generation of APUs at lower power consumption.

But why do consumers care about that?

Because that is what future applications will need. It is the reason AMD, Intel, NVIDIA, Qualcomm, Altera, Samsung etc. are heavily investing in iGPUs and heterogeneous computing.
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
It has already started to benefit us; there are already consumer applications that can be OpenCL accelerated, like WinZip, Adobe Premiere, LibreOffice and more. Also, you get higher gaming performance with every new generation of APUs at lower power consumption.



Because that is what future applications will need. It is the reason AMD, Intel, NVIDIA, Qualcomm, Altera, Samsung etc. are heavily investing in iGPUs and heterogeneous computing.

Yep. It's taking forever to truly go mainstream, though.
 

NTMBK

Lifer
Nov 14, 2011
10,401
5,638
136
Floating point units also used to be an add-in card. Are they that much easier to integrate than GPUs?

Even when they sat in a separate socket, they were always units that processed specific types of x86 instructions, and as such were very tightly integrated into the CPU. GPUs are still nowhere near that tightly integrated.
 

norseamd

Lifer
Dec 13, 2013
13,990
180
106
Even when they sat in a separate socket, they were always units that processed specific types of x86 instructions, and as such were very tightly integrated into the CPU. GPUs are still nowhere near that tightly integrated.

Alright.

So what about integrating audio and radio processors?

Will these be easier?
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
2x 256b = 2x 128b or 2x 256b; it isn't efficient or effective.
4x 128b = 4x 128b or 2x 256b; it is efficient and effective.

HPC and/or widely parallel applications are more worried about core scaling than the internal FPU decode/retirement/data bus.

And widening the decode/retirement/data bus limits the core scaling. The way I see it, the area growth (front end, retirement, even the bypass between the FMAs) required to move from a 2x128b to a "true" (see below) 4x128b design is significantly larger, more power hungry, and probably lower frequency than moving from 2x128b to a 2x256b design. So you can fit more FP units in a much narrower core... hence better core scaling.

2x 256b = 2x 128b or 2x 256b
4x 128b = 4x 128b or 2x 256b; it is efficient and effective

Edit: Ah I see what you mean. So what you're calling a 4x128b, you actually mean a reconfigurable machine that tries to get the best of both worlds. I would have two points to make:

1) If you want 4x 128b to act like a TRUE 4x 128b, you will have significantly more area/power and less frequency. It would impact the design even when all you really want is 2x256b. So if you really wanted to go for that reconfigurable option without tanking your whole design, you would have to make some tradeoffs similar to what Bulldozer did in how the execution units get their data.

2) Now my question is, what's the front end you would put in to feed this CPU? If you say 4-wide, then for applications that can scale to 256b, you have excess decode width. If you say 2-wide, then you don't get the full utilization of 4x 128b. Flexibility comes with some overhead. I like reconfigurable stuff, but if you want maximum perf/power and hence efficiency, you have to start targeting.
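The decode-width dilemma can be sketched the same way (a hypothetical toy model; decode slots and unit counts are illustrative, not any shipping design):

```python
import math

def fmac_utilization(decode_width, n_units, unit_bits, instr_bits):
    """Fraction of FMAC units a front end decoding `decode_width`
    instructions/cycle can keep busy, in a toy model where a wide
    op is cracked into unit-sized pieces."""
    units_per_instr = max(1, math.ceil(instr_bits / unit_bits))
    return min(decode_width * units_per_instr, n_units) / n_units

# 4-wide decode fully feeds 4x 128b with SSE (128b) code...
print(fmac_utilization(4, 4, 128, 128))  # 1.0
# ...but a 2-wide decode only half-fills it with SSE code,
print(fmac_utilization(2, 4, 128, 128))  # 0.5
# ...while still fully feeding it with AVX (256b) code.
print(fmac_utilization(2, 4, 128, 256))  # 1.0
```

In other words: size the decoder for SSE and it sits half-idle on AVX code; size it for AVX and SSE code starves the units, which is the targeting tradeoff described above.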
 
Last edited: