AMD Carrizo APU Details Leaked


norseamd

Lifer
Dec 13, 2013
13,990
180
106
APUs have many benefits, like speeding up work on large files and, as noted before, reducing the load time on picture thumbnails when you have a ton of them. They also benefit gaming, although how that works alongside a discrete GPU I do not know. Maybe the physics calculations on OpenCL might be stronger. There is likely a lot you can do with OpenCL for gaming and such.
 

norseamd

Lifer
Dec 13, 2013
13,990
180
106
So floating point units used to be independent processors, and they are now integrated. How is integrating GPUs any different from those previous integrations?
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
like actually slower

why?
Well, they're using high density, automated libraries more extensively. The core's also getting a lot wider -- double the ALUs and AGUs, double the FPU width... among other things. It'll undoubtedly be faster than Steamroller overall, but it might be clocked lower. But who knows.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,803
1,286
136
Well, they're using high density, automated libraries more extensively.
Personally, I think Steamroller was the start of the automated library. Steamroller could also be using high density libraries as well.
The core's also getting a lot wider -- double the ALUs and AGUs, double the FPU width... among other things.
I wouldn't put my hopes on AMD doubling the ALUs or AGUs. FPU width isn't AMD's game; it's all about more units:
2 x 256b FMA = bad
4 x 128b FMA = good
It'll undoubtedly be faster than Steamroller overall, but it might be clocked lower. But who knows.
Who knows.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Personally, I think Steamroller was the start of the automated library. Steamroller could also be using high density libraries as well.
I wouldn't put my hopes on AMD doubling the ALUs or AGUs. FPU width isn't AMD's game; it's all about more units:
2 x 256b FMA = bad
4 x 128b FMA = good

It's a "mixed" good. For applications that can't unroll and fill out a full 256b vector, 4x 128b FMA would be more flexible. However, for HPC or widely parallel applications, 2x 256b is much more power efficient due to the narrower decode/retirement/memory pipelines. 4x 128b just comes at a much higher cost around the FMAs to keep them fed.
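A toy model makes the tradeoff concrete (an illustrative sketch only; the unit counts and widths are simplified assumptions, not AMD's actual microarchitecture):

```python
def fp32_lanes_per_cycle(n_units, unit_bits, instr_bits):
    """Peak FP32 FMA lanes per cycle in a simplified FPU model.

    An instruction narrower than a unit leaves the rest of that unit
    idle; an instruction wider than a unit is cracked across units,
    so each unit contributes at most its own width either way.
    """
    return n_units * min(unit_bits, instr_bits) // 32

# 2x 256b FPU: full throughput needs 256b (AVX) instructions;
# 128b (SSE) code idles half of each unit.
print(fp32_lanes_per_cycle(2, 256, 256))  # 16
print(fp32_lanes_per_cycle(2, 256, 128))  # 8

# 4x 128b FPU: same 16-lane peak whether the code is SSE or AVX
# (a 256b op is simply split across two 128b units).
print(fp32_lanes_per_cycle(4, 128, 128))  # 16
print(fp32_lanes_per_cycle(4, 128, 256))  # 16
```

The flip side, per the post above, is that filling four units takes a wider and hungrier front end, which is where 2x 256b wins for code that vectorizes fully.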
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
Personally, I think Steamroller was the start of the automated library.
Well, it's impossible to say, given the eyesore of a die shot we got from AMD. I was really disappointed with it. At least they're not Nvidia, though... tired of looking at pictures of silicon that look like a clown took a runny dump all over.
 

NTMBK

Lifer
Nov 14, 2011
10,401
5,638
136
I wouldn't put my hopes on AMD doubling the ALUs or AGUs. FPU width isn't AMD's game; it's all about more units:
2 x 256b FMA = bad
4 x 128b FMA = good

Given that AMD's modules are still bottlenecked on instruction fetch, could they really keep 4 FMACs fed with enough SSE instructions? (AVX is obviously not an issue.)
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,803
1,286
136
Given that AMD's modules are still bottlenecked on instruction fetch, could they really keep 4 FMACs fed with enough SSE instructions? (AVX is obviously not an issue.)
Execution bit width != Instruction byte size.

A 256-bit AVX/AVX2 instruction takes only a few bytes to encode. Steamroller could feed 6 FMACs + 6 AGUs with scalar instructions.
It's a "mixed" good. For applications that can't unroll and fill out a full 256b vector, 4x 128b FMA would be more flexible. However, for HPC or widely parallel applications, 2x 256b is much more power efficient due to the narrower decode/retirement/memory pipelines. 4x 128b just comes at a much higher cost around the FMAs to keep them fed.
2x 256b = 2x 128b or 2x 256b; it isn't efficient or effective.

4x 128b = 4x 128b or 2x 256b; it is efficient and effective.

HPC and/or widely parallel applications are more worried about core scaling than the internal FPU decode/retirement/data bus.
 
Last edited:

NTMBK

Lifer
Nov 14, 2011
10,401
5,638
136
Execution bit width != Instruction byte size.

A 256-bit AVX/AVX2 instruction takes only a few bytes to encode. Steamroller could feed 6 FMACs + 6 AGUs with scalar instructions.

Yes, I am well aware of that. But the point I am making is that in order to keep your 4 128bit FMACs fed with SSE, you would need to handle twice as many instructions in the front end as you would with AVX. And if splitting the FMACs into 128bit units gives no benefit in SSE, then there is no point and you may as well go with 2x256bit units, getting slightly improved performance in AVX code.
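The front-end pressure argument can be put in numbers with a small sketch (a hypothetical toy model of issue, not real decoder behavior):

```python
import math

def issue_rate_needed(n_fmacs, fmac_bits, instr_bits):
    """Instructions per cycle the front end must supply to keep every
    FMAC busy, assuming a wide op is cracked into FMAC-sized pieces."""
    units_per_instr = max(1, math.ceil(instr_bits / fmac_bits))
    return n_fmacs // units_per_instr

# 4x 128b FMACs: SSE (128b) code needs twice the front-end
# throughput that AVX (256b) code does.
print(issue_rate_needed(4, 128, 128))  # 4 instructions/cycle
print(issue_rate_needed(4, 128, 256))  # 2 instructions/cycle

# 2x 256b FMACs: 2 instructions/cycle either way.
print(issue_rate_needed(2, 256, 128))  # 2
print(issue_rate_needed(2, 256, 256))  # 2
```

This is the point above in miniature: filling 4x 128b from SSE code doubles the demand on fetch/decode relative to AVX on the same units.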
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,803
1,286
136
Yes, I am well aware of that. But the point I am making is that in order to keep your 4 128bit FMACs fed with SSE, you would need to handle twice as many instructions in the front end as you would with AVX. And if splitting the FMACs into 128bit units gives no benefit in SSE, then there is no point and you may as well go with 2x256bit units, getting slightly improved performance in AVX code.
Even Bulldozer could keep up with its FPU (4x 128b FMACs and 4x 128b MMX) and (4 ALUs and 4 AGUs) per core, with its "crippled" front-end.
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
Gaming performance is not the primary objective for AMD's APUs; compute is. A higher iGPU shader count will bring higher computational performance with OpenCL, and that is what AMD is after.

But why do consumers care about that?
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
As a general consumer who runs office apps and games, can this benefit me?

It has already started to benefit us; there are already consumer applications that can be OpenCL accelerated, like WinZip, Adobe Premiere, LibreOffice and more. Also, you get higher gaming performance with every new generation of APUs at lower power consumption.

But why do consumers care about that?

Because that is what future applications will need. It is the reason AMD, Intel, NVIDIA, Qualcomm, Altera, Samsung etc. are heavily investing in iGPUs and heterogeneous computing.
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
It has already started to benefit us; there are already consumer applications that can be OpenCL accelerated, like WinZip, Adobe Premiere, LibreOffice and more. Also, you get higher gaming performance with every new generation of APUs at lower power consumption.



Because that is what future applications will need. It is the reason AMD, Intel, NVIDIA, Qualcomm, Altera, Samsung etc. are heavily investing in iGPUs and heterogeneous computing.

Yep. It's taking forever to truly go mainstream, though.
 

NTMBK

Lifer
Nov 14, 2011
10,401
5,638
136
Floating point units also used to be an add-in card. Are they that much easier to integrate than GPUs?

Even when they sat in a separate socket, they were always units that processed specific types of x86 instructions, and as such were very tightly integrated into the CPU. GPUs are still nowhere near that tightly integrated.
 

norseamd

Lifer
Dec 13, 2013
13,990
180
106
Even when they sat in a separate socket, they were always units that processed specific types of x86 instructions, and as such were very tightly integrated into the CPU. GPUs are still nowhere near that tightly integrated.

Alright.

So what about integrating audio and radio processors?

Will these be easier?
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
2x 256b = 2x 128b or 2x 256b; it isn't efficient or effective.
4x 128b = 4x 128b or 2x 256b; it is efficient and effective.

HPC and/or widely parallel applications are more worried about core scaling than the internal FPU decode/retirement/data bus.

And widening the decode/retirement/data bus limits the core scaling. The way I see it, the area growth (front end, retirement, even the bypass between the FMAs) required to move from a 2x128b to a "true" (see below) 4x128b design is significantly larger, more power hungry, and probably lower frequency than moving from 2x128b to a 2x256b design. So you can fit more FP units in a much narrower core... hence better core scaling.

2x 256b = 2x 128b or 2x 256b
4x 128b = 4x 128b or 2x 256b; it is efficient and effective

Edit: Ah I see what you mean. So what you're calling a 4x128b, you actually mean a reconfigurable machine that tries to get the best of both worlds. I would have two points to make:

1) If you want 4x 128b to act like a TRUE 4x 128b, you will have significantly more area/power and less frequency. It would impact the design even when all you really want is 2x256b. So if you really wanted to go for that reconfigurable option without tanking your whole design, you would have to make some tradeoffs similar to what Bulldozer did in how the execution units get their data.

2) Now my question is, what's the front end you would put in to feed this CPU? If you say 4-wide, then for applications that can scale to 256b, you have excess decode width. If you say 2-wide, then you don't get the full utilization of 4x 128b. Flexibility comes with some overhead. I like reconfigurable stuff, but if you want maximum perf/power and hence efficiency, you have to start targeting.
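The decode-width dilemma can be sketched the same way (a hypothetical toy model; decode slots and unit counts are illustrative, not any shipping design):

```python
import math

def fmac_utilization(decode_width, n_units, unit_bits, instr_bits):
    """Fraction of FMAC units a front end decoding `decode_width`
    instructions/cycle can keep busy, in a toy model where a wide
    op is cracked into unit-sized pieces."""
    units_per_instr = max(1, math.ceil(instr_bits / unit_bits))
    return min(decode_width * units_per_instr, n_units) / n_units

# 4-wide decode fully feeds 4x 128b with SSE (128b) code...
print(fmac_utilization(4, 4, 128, 128))  # 1.0
# ...but a 2-wide decode only half-fills it with SSE code,
print(fmac_utilization(2, 4, 128, 128))  # 0.5
# ...while still fully feeding it with AVX (256b) code.
print(fmac_utilization(2, 4, 128, 256))  # 1.0
```

In other words: size the decoder for SSE and it sits half-idle on AVX code; size it for AVX and SSE code starves the units, which is the targeting tradeoff described above.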
 
Last edited: