The Official AVX2 Thread

Abwx · Jun 15, 2012

BenchPress said:
Yes, now look at the score for the 3000 GFLOPS GTX 680. Pathetic! How can GPGPU possibly be the future when such a last-gen high-end GPU loses against a 230 GFLOPS CPU without AVX2?

Since the previous-generation GTX 580 does so much better, it is obvious that the GTX 680 is not optimised for OpenCL, which the HD 7970 is, isn't it? And in that comparison the CPU gets destroyed...

Flashy that you didn't acknowledge this....

ShintaiDK · Jun 15, 2012

I see the thread ran into AVX2 again.

pelov · Jun 15, 2012

BenchPress said:
Sigh. I've already covered why those benchmarks are worthless:

Right, you kind of explained everything except why AMD wouldn't have AVX2 support which you never mentioned at all.

Let's say that AMD -- all of this is hypothetical -- were to introduce an FMA of their own that serves much the same purpose as an Intel-favored FMA. But now, what if that Intel FMA gained more traction because of Intel's market share and was coupled with an upcoming ISA... oh I dunno, let's call this ISA AVX2. Then let's say, hypothetically of course, that AMD supported both FMAs, despite having their own favorite (and having it before Intel), and implemented both on a half-way HSA-built APU -- let's call this APU Trinity. Would that AMD support AVX2 as an ISA?

Hypothetically speaking, of course.

You must forgive me. I don't see how AVX2 favors Intel at all when AMD is already headed in the same direction.

Olikan · Jun 15, 2012

ShintaiDK said:
I see the thread ran into AVX2 again.

i am a wizard, i can see the future!!!

bronxzv · Jun 15, 2012

pelov said:
I wonder if we can find some openCL GPU-accelerated programs that are already here that outperform optimized CPU-only software on comparable platforms and $$$...

Hmmm... well that's quite an improvement, wouldn't you say? And unlike AVX2 which is sometime in the future, you can have this now. Also unlike AVX2, only AMD supports this and Intel doesn't whereas AMD will almost certainly support AVX2.

But wait! there's more!

Photoshop too? Wh0ataete~!@!@

What was the max theoretical throughput of AVX2? 4x throughput of SSE4? Looks like that won't be enough to match these results we've seen here on a single APU without discrete graphics, never mind an AMD APU with AVX2 as well and coupled with a GCN GPU, or, God forbid we see an AMD APU with AVX2 and shared memory address space with GCN on-die graphics!!! Could you imagine the slaughter?

Developers are already supporting openCL and will likely support HSA. In fact, if these are the improvements with GPGPU on a Llano without HSA can you imagine the Trinity benchmarks? Or the Kaveri benchmarks? Sweet Jesus! And openCL already has more traction in the consumer space than CUDA has ever had.

Tip from somebody who's been hanging around tech forums since the 90s:

Don't ever get excited about an instruction set. They always promise to deliver the world (or so the fanboys claim) and they always disappoint. I can't for the life of me remember the last time an instruction set delivered on its purported promises yet every single time we see a new instruction set, regardless of the microarchitecture, there's always someone thumping it as the best thing since the fleshlight.

this looks much like different algorithms used on both platforms, like a two-pass separable Gaussian on the GPU and a slow single-pass method on the CPU; I'm sure you understand it's too good to be true

pelov · Jun 15, 2012

bronxzv said:
this looks much like different algorithms used on both platforms, like a two-pass separable Gaussian on the GPU and a slow single-pass method on the CPU; I'm sure you understand it's too good to be true

Or it's updated software run on CPU only compared to openCL with GPU acceleration enabled.

No. Definitely a conspiracy.

BenchPress · Jun 15, 2012

pelov said:
Right, you kind of explained everything except why AMD wouldn't have AVX2 support which you never mentioned at all.

It's not about AMD versus Intel. It's about heterogeneous versus homogeneous computing. If AMD implements AVX2, that means they're not confident enough in HSA to become a dominant throughput computing technology.

But they just won't have a choice. There are too many great things about AVX2, and developers are quite excited about adopting it. So not implementing AVX2 would be the end of AMD.

You must forgive me. I don't see how AVX2 favors Intel at all when AMD is already headed in the same direction.

Exactly. AMD is heading in the same direction, meaning AVX2 becomes ubiquitous while HSA will be abandoned just like 3DNow!. And again, this isn't about AMD versus Intel. I don't need AVX2 to favor Intel to make my point that it's superior to HSA.

bronxzv · Jun 15, 2012

pelov said:
Or it's updated software run on CPU only compared to openCL with GPU acceleration enabled.

No. Definitely a conspiracy.

not a conspiracy, lazy testing I'd say; why not simply test the very same code, since there are good OCL JIT compilers for AVX?

pelov · Jun 15, 2012

So let me get this straight...

You think AMD won't have any FPUs with AVX2 in their upcoming APUs/CPUs while embracing HSA and shared memory address spaces?

Uh... Why? Are you assuming that they'll only use the GPU side for compute?

bronxzv said:
not a conspiracy, lazy testing I'd say; why not simply test the very same code, since there are good OCL JIT compilers for AVX?

That's an AMD APP and outperforms Intel's APP in the same task. It doesn't provide anywhere near the performance benefit. It would have changed those benchmarks from an absolute slaughter to an absolute slaughter.

Also aliens

BenchPress · Jun 15, 2012

pelov said:
Or it's updated software run on CPU only compared to openCL with GPU acceleration enabled.

No. Definitely a conspiracy.

Then why didn't they run the same OpenCL code on the CPU to prove that the non-OpenCL code is more optimal? The benchmarks are absolutely worthless without such a comparison. Tom's Hardware should be ashamed to publish these results.

bronxzv · Jun 15, 2012

pelov said:
That's an AMD APP and outperforms Intel's APP in the same task. It doesn't provide anywhere near the performance benefit. It would have changed those benchmarks from an absolute slaughter to an absolute slaughter.

Also aliens

it looks like you don't get the different-algorithms scenario; it's something that I have often seen in practice, in this world, not in the twilight zone

people simply optimize for a platform and keep the baseline unchanged, so that in the end there are so many changes in the optimized version that you don't know what is being compared

if there were really a 100:1 speedup ratio, rest assured AMD would use it in their slides

pelov · Jun 15, 2012

In these and subsequent tests, you’ll notice the obvious results gap next to our HP notebook where OpenCL results should be—because today’s Sandy Bridge-based HD Graphics engines don’t support OpenCL (and we still haven't been able to get our hands on any Ivy Bridge-based Core i5 machines). Still, we left the Intel platform in this mix for comparison, because there are some cases in which the performance of Intel’s CPU working only in software makes for an interesting counterpoint to GPU-based acceleration. After all, with GPU-assist still in its toddler stage and many applications not yet optimized for the new technology, it’s important to keep one eye on how non-accelerated platforms behave.

In these GIMP tests, though, the benefits of OpenCL-based GPU acceleration are glaring. Even stating the difference as a percentage or multiple seems irrelevant. The point is that without acceleration, these filters are nearly unusable on any system. Workflow comes to a complete stop as the system creeps through adding the blur one block at a time. With OpenCL turned on, suddenly we see very even, expected performance scaling as we edge up from mobile to desktop APU and APU into discrete. Note how it’s not just the graphics processor doing all of the work. Depending on the test, the CPU side still contributes another 20% to 40% to the end result.


bronxzv said:
if there were really a 100:1 ratio, rest assured AMD would use it in their slides

They have.

bronxzv said:
it looks like you don't get the different-algorithms scenario; it's something that I have often seen in practice, in this world, not in the twilight zone

people simply optimize for a platform and keep the baseline unchanged, so that in the end there are so many changes in the optimized version that you don't know what is being compared

I do. The problem here is that Intel hasn't bothered so why should they? When AMD's optimizations/compiler works better than Intel's own -- and again, only provides a negligible performance gain in the range of 30% (the fact that 30% gain is negligible in these benchmarks should be smacking you in the face right now) -- then it doesn't matter. It's no secret Intel wants nothing to do with openCL other than "officially" support it because Apple is one of the biggest companies pushing it as a standard. Supporting anything GPU-accelerated means the death of Intel and anything x86.

Riek · Jun 15, 2012

BenchPress said:
Then why didn't they run the same OpenCL code on the CPU to prove that the non-OpenCL code is more optimal? The benchmarks are absolutely worthless without such a comparison. Tom's Hardware should be ashamed to publish these results.

You can see the impact of the cpu already

FX8150 +7970 and llano 3850+7970 gpu.

So you can see CPU differences (be it due to AVX, FMA, :: support in Bulldozer or just because the code path is more optimized). If there is little difference, it means the CPU impact is rather small.
Some CPU tests look single-threaded. While OpenCL would have improved performance there too, it would still fall short of the GPU's capabilities... that one can be verified by comparing Llano and Llano+7970. That's even with the extra cost of transport.

Also, if the application performs better on a CPU and/or GPU due to the OpenCL redesign... AVX2 will do diddly squat for that app with just a recompile. They will need the OpenCL program to extract that performance (better grouping of the data).. (why again were you dismissing OpenCL?)

bronxzv · Jun 15, 2012

pelov said:
They have.

ah, do you have a link to share?

pelov said:
I do. The problem here is that Intel hasn't bothered so why should they?

they (Tom) should do it because it will be better information for their readers

pelov said:
When AMD's optimizations/compiler works better than Intel's own -

so let's use the AMD JIT if it's more mature; the key goal is to use the same code to guarantee the same algorithm is used. Tom says that they have increased the radius of the filter kernels to ensure a good speedup (now that's a "methodology"); if one algorithm has quadratic behavior in the radius and the other one linear behavior, it's clear that you will get exactly what we see here
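
To make the quadratic-versus-linear point concrete, here is a minimal Python sketch. It is not the GIMP code or anything from the article; the function names, the clamp-to-edge border handling, and the tiny test image are assumptions made for illustration. A direct single-pass 2D Gaussian needs (2r+1)^2 multiply-adds per pixel, while a two-pass separable Gaussian needs 2(2r+1), so raising the filter radius widens the gap even on identical hardware.

```python
import math


def gaussian_kernel_1d(radius, sigma):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _clamp(v, lo, hi):
    """Clamp-to-edge border handling (an assumption of this sketch)."""
    return max(lo, min(hi, v))


def blur_single_pass(img, k):
    """Direct 2D convolution: (2r+1)^2 multiply-adds per pixel."""
    h, w = len(img), len(img[0])
    r = len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    macs = 0
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                row = img[_clamp(y + dy, 0, h - 1)]
                for dx in range(-r, r + 1):
                    # The 2D weight is the product of two 1D weights; a real
                    # implementation would precompute it, so count one MAC.
                    acc += row[_clamp(x + dx, 0, w - 1)] * (k[dy + r] * k[dx + r])
                    macs += 1
            out[y][x] = acc
    return out, macs


def blur_separable(img, k):
    """Horizontal pass, then vertical pass: 2*(2r+1) multiply-adds per pixel."""
    h, w = len(img), len(img[0])
    r = len(k) // 2
    macs = 0
    tmp = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dx in range(-r, r + 1):
                acc += img[y][_clamp(x + dx, 0, w - 1)] * k[dx + r]
                macs += 1
            tmp[y][x] = acc
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                acc += tmp[_clamp(y + dy, 0, h - 1)][x] * k[dy + r]
                macs += 1
            out[y][x] = acc
    return out, macs


if __name__ == "__main__":
    # Synthetic 8x6 image; the values are arbitrary.
    image = [[float((x * 7 + y * 13) % 32) for x in range(8)] for y in range(6)]
    for radius in (1, 3, 6):
        kernel = gaussian_kernel_1d(radius, sigma=radius / 2.0)
        _, direct = blur_single_pass(image, kernel)
        _, separable = blur_separable(image, kernel)
        print(radius, direct, separable, direct / separable)
```

Both functions produce the same blurred image up to floating-point rounding, which is why a benchmark that pits one against the other measures the algorithm as much as the hardware.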

pelov · Jun 15, 2012

bronxzv said:
so let's use the AMD JIT if it's more mature; the key goal is to use the same code to guarantee the same algorithm is used. Tom says that they have increased the radius of the filter kernels to ensure a good speedup (now that's a "methodology"); if one algorithm has quadratic behavior in the radius and the other one linear behavior, it's clear that you will get exactly what we see here

Great. Go do it or ask them to do it. You've read the article and you know why they didn't bother (and nor does Intel).

It's also not just an openCL AMD/Apple/ARM thing but MS has a horse in the race. Microsoft is using the GPU for compute for gaming.

bronxzv said:
ah, do you have a link to share?

http://ir.amd.com/phoenix.zhtml?c=74093&p=irol-2012analystday

They had a couple of slides with performance gains in GPU-accelerated apps in the order of 5x+

bronxzv · Jun 15, 2012

pelov said:
Great. Go do it or ask them to do it. You've read the article and you know why they didn't bother (and nor does Intel).

actually I have just read the beginning of the article, where they say the speedup wasn't there so they have increased the size of the filter

pelov said:
http://ir.amd.com/phoenix.zhtml?c=74093&p=irol-2012analystday

They had a couple of slides with performance gains in GPU-accelerated apps in the order of 5x+

I was speaking of the 100x gain at Tom for one of the benchmarks, anyway thanks I'll have a look

Schmide · Jun 15, 2012

BenchPress said:
A cutting-edge 3000 GFLOPS GPU is losing to a 230 GFLOPS CPU. That's the only relevant reality here. You can't say GPGPU is a success and then ignore this. And again, the average GPU isn't a GTX 680 or HD 7970. It's much weaker. So a CPU with AVX2 is going to win across the whole board.

As much as the argument is that NVIDIA isn't going to eat its own, Intel isn't going to eat its own either. By "eat its own" I mean that they deliberately segment their own processors to prevent them from competing with each other.

nVidia crippled its double precision floating point to prevent it competing with its Tesla line.

For LuxMark (Room scene) variance:

i7 2600k 329 328 283 278
i7 3820 275
i5 2500k 194 164 149

lot of variance here and hyper threading seems to love this benchmark.

ATI 6970 Cayman XT 367
ATI 6x70 (Turks) 133 (The ball park figure for Tahiti desktop. It would probably be closer to 100)

Edit: If we go price to performance CPU/GPU

ATI 7770 Capeverde 347 (a $95 GPU scoring greater than all of the above CPUs)

So we can say with some confidence that the benchmark will follow the tiers for the segments.

It would be very irresponsible to say AVX2 is going to win across the whole board.

pelov · Jun 15, 2012

bronxzv said:
I was speaking of the 100x gain at Tom for one of the benchmarks, anyway thanks I'll have a look

The 100x gain took me by surprise as well, but most of the gains are in the order of 2x-10x. Still very significant but not altogether surprising considering the rise of brute force GPU computing in HPC and the desktop/server.

The Fusion Developer's Summit ended yesterday but I can't seem to find public PDFs. If the PDFs are uploaded they should certainly have quite a bit more info.

bronxzv · Jun 15, 2012

pelov said:
http://ir.amd.com/phoenix.zhtml?c=74093&p=irol-2012analystday

They had a couple of slides with performance gains in GPU-accelerated apps in the order of 5x+

in which PDF?, there is nothing in sight in the HSA PDF (or I'm blind/tired)

ViRGE · Jun 15, 2012

BenchPress said:
GPGPU hasn't proven a whole lot. Else why is it that a 3000 GFLOPS GPU loses against a 230 GFLOPS CPU? And that's still a CPU without AVX2, and the average system will have a much weaker GPU!

I'd be very careful using that comparison. The issue is entirely down to drivers; NVIDIA has specifically been avoiding any OpenCL optimization for Kepler at this time. As a result that one API in particular is far slower than other APIs or what the hardware is capable of.

BenchPress · Jun 15, 2012

Riek said:
You can see the impact of the cpu already

FX8150 +7970 and llano 3850+7970 gpu.

I'm not talking about the impact of the CPU. I'm talking about the impact of the code. Badly written code will be slow on both an Atom and a Core i7. That doesn't prove the CPU is worse than the GPU. It only proves the code isn't optimized.

So to get an idea of what the code quality is like, they should have run the same OpenCL code on the CPU. It should be slower than the code specifically tuned for the CPU. If it's not, that means they're comparing fast code on the GPU versus slow code on the CPU. NVIDIA tried the same thing a few years ago, but they got caught.
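
As a rough illustration of that methodology: with three timings (the native CPU path, the OpenCL kernel run on the CPU, and the same OpenCL kernel run on the GPU), the headline speedup splits into a code-quality factor and a hardware factor. The sketch below is hypothetical; the function name and the example timings are invented for illustration and are not measurements from the article.

```python
def decompose_speedup(cpu_native_s, cpu_opencl_s, gpu_opencl_s):
    """Split a headline speedup into code and hardware contributions.

    All arguments are wall-clock times in seconds for the same workload:
      cpu_native_s  -- the application's existing non-OpenCL CPU path
      cpu_opencl_s  -- the OpenCL kernel executed on the CPU
      gpu_opencl_s  -- the same OpenCL kernel executed on the GPU
    """
    if min(cpu_native_s, cpu_opencl_s, gpu_opencl_s) <= 0:
        raise ValueError("timings must be positive")
    return {
        # What a review reports if it only compares native CPU vs. GPU.
        "headline": cpu_native_s / gpu_opencl_s,
        # Above 1.0 means the OpenCL code beats the native path on the
        # same CPU, i.e. the baseline is the less optimized code.
        "code": cpu_native_s / cpu_opencl_s,
        # Same code on both devices, so this is the hardware difference.
        "hardware": cpu_opencl_s / gpu_opencl_s,
    }


if __name__ == "__main__":
    # Made-up timings, purely for illustration.
    result = decompose_speedup(cpu_native_s=100.0, cpu_opencl_s=10.0, gpu_opencl_s=2.0)
    for name, factor in result.items():
        print(f"{name}: {factor:.1f}x")
```

The headline factor is always the product of the other two, so without the middle measurement there is no way to tell how much of a reported gain comes from the GPU itself.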

Also, if the application performs better on a CPU and/or GPU due to the OpenCL redesign... AVX2 will do diddly squat for that app with just a recompile. They will need the OpenCL program to extract that performance (better grouping of the data).. (why again were you dismissing OpenCL?)

What? OpenCL is a layer in between the application and the hardware. It's impossible for AVX2 to help only OpenCL but not other code directly targeting the hardware.

In fact that's one of AVX2's strengths. You don't have to shoehorn the code into the limited programming model of OpenCL, and you can use any language you like. Compilers with AVX2 support are already out there.

Olikan · Jun 15, 2012

hum....how did i become the OP?

ROFL...that first post...soooo 4chan

piesquared · Jun 15, 2012

Where are the official HSA and OpenCL threads?

If I find a need for one I'll make it. But when half a dozen threads over the past week end up being sidelined by AVX2 discussion, it's helpful to have it all in one location.
-ViRGE

bronxzv · Jun 15, 2012

pelov said:
You've read the article and you know why they didn't bother (and nor does Intel).

I've read the article up to this point now :

http://www.tomshardware.com/reviews/photoshop-cs6-gimp-aftershot-pro,3208-5.html

“OpenCL not only gives GPU acceleration, but we can also use OpenCL in the CPU to provide good multi-threading, which GIMP lacked, and vectorization support,” says Oliveira.

in other words the CPU path is scalar single-threaded code

not using OpenCL for the CPU test is even more lame than I was thinking, since they say explicitly that they target OpenCL to use vector instructions and multi-core on the CPU

ShintaiDK · Jun 15, 2012

Olikan said:
hum....how did i become the OP?

ROFL...that first post...soooo 4chan

And my post now looks like trolling or something
