The Official AVX2 Thread

pelov · Jun 17, 2012

CPUarchitect said:
No it doesn't. Some high-end GPUs running OpenCL are losing against CPUs, prior to AVX2. You can't ignore this. Mainstream GPUs are much weaker too. And while HSA indeed aims to improve GPGPU efficiency, it will not be widely supported.

That's akin to running CUDA on an AMD GPU and saying it sucks at GPU-computing.

CPUarchitect said:
NVIDIA has decided to back away from GPGPU in the consumer market, as evident by the Kepler design sacrifices

From the desktop, yes. From HPC? Workstation? The Kepler desktop cards are essentially the same architecture as their bigboy Tesla/Quadro cards except for GPGPU focused necessities like bus width and additional vRAM, ECC, etc. The desktop GTX680s are neutered in CUDA because nVidia was losing sales on Quadro/Tesla cards and that's where their margins are higher.

CPUarchitect said:
And Intel isn't going to implement HSA either

No, but they've implemented openCL. Their "HSA" is a ton of little x86 cores aka MIC.

CPUarchitect said:
Hence HSA will be limited to a fraction of the market, also decreasing the "cross-platform" value of OpenCL.

openCL is tied only to HSA in that HSA can benefit openCL performance. You don't need an HSA-style architecture for good openCL performance. Hell, even Intel's CPUs benefit from AMD's openCL APP when it comes to throughput.

CPUarchitect said:
Homogeneous throughput computing, spearheaded by AVX2, is far more promising than GPGPU will ever be.

Except that it's not here and tied to only one manufacturer. So you're right, except that that scenario only works if you plan on buying Intel for everything...ever. While you might be okay with that (and if you are I'd highly suggest you look into buying yourself an Itanium PC. I hear their throughput is amazing) but I'm not. Proprietary ISAs = slow progress and no competition. Take a look at the blistering pace we've been moving at on the desktop over the years where on-die GPUs are outpacing CPU advancements 3:1 or 4:1. But you're right, that Haswell GPU should only be good at drawing pretty little triangles rather than helping with compute because utilizing that growing GPU (in die size and performance) is definitely a bad idea.

AVX2 = x86.

If you can name me the last time we saw this kind of improvement over a single generation in x86 with this "competition" between AMD and Intel then please let me know. Btw, that's not an ISA optimized benchmark but all architectural.

That's what true competition looks like and is capable of. Instead we get two manufacturers, a silly instruction set limited to a proprietary x86 and you're claiming it's the winner by default. Good luck with that.

edit - and it's not just qualcomm. The A15 is rumored to have nearly the same throughput thus should perform roughly the same as the S4. I'd rather head to an ISA that isn't tied to a specific hardware manufacturer with real competition (that's where HSA/openCL comes in as ARM is part of the HSA foundation) rather than Intel and AMD beefing up the GPU and a comparative stall on CPU performance.

CPUarchitect · Jun 17, 2012

pelov said:
And you know you're cherry-picking yet still denying it

It's not cherry picking. I'm not ignoring the other results. But they just don't matter. You can't say GPGPU is a promising technology when a major GPU manufacturer's cutting-edge high-end consumer GPU can't keep up in an OpenCL benchmark with a CPU without AVX2.

If I wanted to cherry pick, I wouldn't pick a 3000 GFLOPS GPU versus a 230 GFLOPS CPU, I wouldn't pick a benchmark aimed at GPUs, and I'd wait for AVX2 to arrive.

So call it what you want, it doesn't change the fact that for the average consumer, AVX2 will have a much bigger impact on his/her computing experience than GPGPU.

Dresdenboy · Jun 17, 2012

pelov said:
AMD/Intel aren't going in separate directions. In fact, they're headed in the exact same direction but taking parallel routes: AMD branching off x86 and adopting an open standard while Intel is looking to cement x86 into the GPU portion as well - -though that hinges on whether or not Larrabee 2.0 aka MIC is another flop. If it is then they might have screwed the GPGPU pooch for good.

Intel isn't bound to x86 by its own interests but by a huge software base. Not so long ago it looked like Intel was about to replace IA32 by IA64.

pelov · Jun 17, 2012

CPUarchitect said:
So call it what you want, it doesn't change the fact that for the average consumer, AVX2 will have a much bigger impact on his/her computing experience than GPGPU.

For the average desktop x86 user? That's a maybe.

openCL bypasses that by applying to everybody, regardless of CPU/GPU proprietary hardware (you can see that by AMD's openCL APP with Intel's own x86 chips using CPU only code). Throw in GPU compliance where available (and it's available on future and current ARM chips as well as x86 and GPUs) and competition across a variety of ISAs and platforms with a baseline architecture that openCL/HSA can benefit from then you've got a far wider target audience (both x86 and ARM, both GPU and CPU and regardless of the architecture) and you're still arguing that x86-tied AVX2 is going to have a bigger impact?

We're not living in an x86 world anymore. Most software developed is being developed for ARM. Most of the competition is on ARM or between ARM and x86 and not between x86 manufacturers. The faster that proprietary ISA dies the better off we all are. Emulate it if you need it or grab its apps via cloud. The sooner it dies the better.

CPUarchitect · Jun 17, 2012

pelov said:
That's akin to running CUDA on an AMD GPU and saying it sucks at GPU-computing.

No it's not. AMD doesn't support CUDA, at all. NVIDIA does support OpenCL, the best it can.

AVX2 is language agnostic. So I don't see why we'd have to use NVIDIA's proprietary language. There's not even a reason to assume it would be any faster. And being forced to rewrite their code twice isn't very appealing to developers.

From the desktop, yes. From HPC? Workstation?

NVIDIA indeed still caters for HPC. But how is that relevant here? This thread was created as a response to Kaveri, a consumer product.

If you want to discuss the HPC market, I suggest a GK110 versus Knights Corner thread.

CPUarchitect · Jun 17, 2012

pelov said:
If you can name me the last time we saw this kind of improvement over a single generation in x86 with this "competition" between AMD and Intel then please let me know. Btw, that's not an ISA optimized benchmark but all architectural.

That's what true competition looks like and is capable of.

Krait achieves this through a higher decode and issue width, and more aggressive out-of-order execution. x86 already has all of that, and making things even wider or more aggressive is prohibitively expensive. It would in fact lower performance/Watt.

We've reached the limit for ILP. And TLP doesn't scale very well either because it demands considerable developer effort to multi-thread applications (although Haswell's TSX technology will address that as well). So the focus has shifted on extracting DLP to increase performance.

Regardless of whether that's done through homogeneous or heterogeneous throughput computing technology, there's a new instruction set involved. So you can't use that as an argument against AVX2. In fact GPGPU is way more disruptive in terms of what developers have to do to adopt it.

Sooner or later mobile architectures will follow as well. They're only several years behind (with Krait maximizing the ILP extraction). So they too will have to choose between either adding a vector instruction set extension suitable for SPMD processing, or demand that developers rewrite things for the GPU.

Since apparently you favor a speedup that takes minimal effort and is widely applicable, your support should go to homogeneous computing.

Dresdenboy · Jun 17, 2012

I know, I'm late to the AVX2 256b Integer SIMD + FP FMA3 + gather instructions vs. HSA/OpenCL party here, but it's never to late to join a not yet finished discussion ;-)

CPUarchitect said:
Actually scalar floating-point code can become up to 16x faster. But the worst case for parallel compute limited code is 2x. Bandwidth also goes up by 2x and there's gather support, so that should be a pretty consistent minimum.

Well, in some situations AVX1 FP code or AVX2 FMA3/gathering FP code might even be more than 16X faster than old fashioned scalar FP code of the last century. The trick here is that scalar code might leave some L/S throughput on the table by using only 32 of 64 or more bits of internal busses

In general:
One of Intel's first OpenCL related publications was about using OpenCL to run multiple parallel loops of calculations simultaneously using SIMD operations. They said that it is difficult to fully utilize SIMD throughput all the time with scalar code and some vector data structures. Using OpenCL it was easier to fully utilize SIMD registers because it "only needs loops" (independend parallel operations on data) as the base of parallelism. A compiler tuned OpenCL solution reached >80% of the performance of a hand assembler optimized vector solution.

Extending the point of someone during AFDS, OpenCL is CUDA is not the hot topic since it's a rather low level discussion. The POI is how to develop software using todays compute capabilities. In other words, it's like discussing, which hammer to use to hit the nail vs. how to build a house.

And there is a big hurdle to overcome when using GPGPU or other new types of computing or changing ISAs: changing an existing software base to use this new stuff. HSA is there to address that.

And discussions about HSA vs. AVX2 miss the mark. It's more like: how can we do the most with a given power budget (since this is still limited, while we can add transistors with smaller nodes) and how can we make it easier to access these capabilities.

Even Intel spoke about heterogenous computing in the past, talking the possibility of having fat and small cores mixed together (like ARM's big.LITTLE).

Video Codec engines are another sign of this change. Have specialized units, which can be powered off when idle (costing only die area, not leakage or TDP headroom), but are much more efficient at specific tasks than generic scalar/SIMD processing units (CPU cores), or even generic parallel/wide SIMD processing units (GPGPU cores).

But even using CUDA or OpenCL, the programmer has a lot to do to make use of better suited computing hardware. HSA would remove this task and allow some flexibility. A bit like Java VMs in the past on different target hardware.

If Gentoo can be optimized for any target platform, why should compute software be kept from being optimized for a target compute system? And soon there will be all big OS's on mainstream CPU platforms, think Windows 8. This means common OS platforms.

bronxzv · Jun 17, 2012

Dresdenboy said:
data structures. Using OpenCL it was easier to fully utilize SIMD registers because it "only needs loops" (independend parallel operations on data) as the base of parallelism.

it looks like some old story, current tools like Intel Cilk+ / Parallel Studio XE 2011 allow you to do just that for native code (even parallelize + vectorize across function calls if the source code is available) with the full expressive power of C++ instead of refactoring all your code to match Open CL and all its clumsy restrictions

Edrick · Jun 18, 2012

pelov said:
Most software developed is being developed for ARM. Most of the competition is on ARM or between ARM and x86 and not between x86 manufacturers.

Did you pull that out of your arse? Or do you read a few websites and think you are an expert on all related industries?

Edrick · Jun 18, 2012

CPUarchitect said:
So call it what you want, it doesn't change the fact that for the average consumer, AVX2 will have a much bigger impact on his/her computing experience than GPGPU.

This +1

Munky · Jun 18, 2012

bronxzv said:
which is the exact purpose of using voxels for near-field global illumination approximation, I was just pointing out that triangles aren't the only primitive anymore

Which game engine is doing anything with voxels? AFAIK the last time voxels were used was when Blood was released in 1997, using a modified Build engine from Duke 3D, and even that was in limited quantities. What advantage is there to voxels over triangles? Triangles have proven themselves as the de-facto primitive for the last 15 years, and I see no mention of voxels in any modern game engine.

bronxzv · Jun 18, 2012

Munky said:
Which game engine is doing anything with voxels?

for example Unreal 4, http://blog.icare3d.org/2012/06/unreal-engine-4-demo-with-real-time-gi.html

there is a bunch of voxels related papers in the GPU Pro series of book, notably in the context of global illumination and/or realtime worldspace ambiant occlusion

the general idea is to replace triangle meshes by other simpler representations (voxels, octrees, PBR, cubemaps, 2D maps,...) for some passes with lower sampling frequency requirements, the key idea is generally to compute the alternate representations on the fly depending on the current observer/camera position to avoid the huge memory which will be required if everything was in the alternate form in the main model database

Munky · Jun 18, 2012

bronxzv said:
for example Unreal 4, http://blog.icare3d.org/2012/06/unreal-engine-4-demo-with-real-time-gi.html

there is a bunch of voxels related papers in the GPU Pro series of book, notably in the context of global illumination and/or realtime ambiant occlusion

the general idea is to replace triangle meshes by other simpler representations (voxels, PBR, image based,...) for some passes with lower sampling frequency requirements

There's a bunch of papers on all kinds of technologies that never materialize. Even if Unreal 4 does use voxels, it's not replacing triangles, but merely using voxels to enhance lighting.

Olikan · Jun 18, 2012

CPUarchitect said:
So call it what you want, it doesn't change the fact that for the average consumer, AVX2 will have a much bigger impact on his/her computing experience than GPGPU.

yeah... for desktop users, HSA is pointless...

HSA is for servers, HPC, consoles and phones... and at some extent laptops

and hey!...desktops are dyeing...

bronxzv · Jun 18, 2012

Munky said:
There's a bunch of papers on all kinds of technologies that never materialize. Even if Unreal 4 does use voxels, it's not replacing triangles, but merely using voxels to enhance lighting.

mmm do you have something interesting to tell us?

Munky · Jun 18, 2012

bronxzv said:
mmm do you have something interesting to tell us?

Not sure if serious....

bronxzv · Jun 18, 2012

Munky said:
Not sure if serious....

sorry but I can't see how you contributed to this thread so far

btw GigaVoxels / cone tracings / volume mipmapping ideas first appeared in GPU Pro 1 published in 2010 and 2 years later we have an example of a major engine using such techniques with Unreal 4, I'm quite sure CryENGINE will use soon (if not already yet) such hybrid representations for the geometry given the huge speedup we see in practice

Munky · Jun 18, 2012

bronxzv said:
sorry but I can't see how you contributed to this thread so far

btw GigaVoxels / cone tracings / volume mipmapping ideas first appeared in GPU Pro 1 published in 2010 and 2 years later we have an example of a major engine using such techniques with Unreal 4, I'm quite sure CryENGINE will use soon (if not already yet) such hybrid representations for the geometry given the huge speedup we see in practice

Oh no? Let's see, someone came here claiming how triangles don't matter any more and voxels are the future, when reality is nothing like it. I don't care if a certain game engine uses voxels, they are not replacing triangles as the defacto primitive, period. I could name you a bunch of technologies that were already implemented in HW, not just in one or two game engines based off some thesis paper, and those never amounted to much either. For examples, see Nvidia's NV1 quadrangle-based renderer, Ati's Trueform, and OpenGL accumulation buffer.

bronxzv · Jun 18, 2012

Munky said:
they are not replacing triangles as the defacto primitive

that's why I said they are replacing triangles only for *some passes*, also your question was "Which game engine is doing *anything* with voxels?"

now where do you think we see hotspots with global illumination?
1) primary rays / scan conversion
2) virtual light source rays/cones

EDIT food for thought:
http://www.facebook.com/Kribi3D#!/p...6655068.120435.156090777738758&type=1&theater

Olikan · Jun 18, 2012

Munky said:
Oh no? Let's see, someone came here claiming how triangles don't matter any more and voxels are the future, when reality is nothing like it. I don't care if a certain game engine uses voxels, they are not replacing triangles as the defacto primitive, period. I could name you a bunch of technologies that were already implemented in HW, not just in one or two game engines based off some thesis paper, and those never amounted to much either. For examples, see Nvidia's NV1 quadrangle-based renderer, Ati's Trueform, and OpenGL accumulation buffer.

tesselation existed in hardware for like 5 years before dx11, IIRC

Dresdenboy · Jun 18, 2012

bronxzv said:
it looks like some old story, current tools like Intel Cilk+ / Parallel Studio XE 2011 allow you to do just that for native code (even parallelize + vectorize across function calls if the source code is available) with the full expressive power of C++ instead of refactoring all your code to match Open CL and all its clumsy restrictions

That was from SIGGRAPH 2010.

Is Intel's solution able to create code which adapts to the hardware ir's being run on?

bronxzv · Jun 18, 2012

Dresdenboy said:
That was from SIGGRAPH 2010.

Is Intel's solution able to create code which adapts to the hardware ir's being run on?

from what I gather Intel promote mostly solutions allowing to use the same source code but that must be compiled for all your targets i.e. your deliverables will have several native code paths (SSE/AVX/AVX2/etc.)

They support Open CL (also on "Xeon Phi") but don't push it very hard apparently

CPUarchitect · Jun 18, 2012

Dresdenboy said:
And discussions about HSA vs. AVX2 miss the mark. It's more like: how can we do the most with a given power budget...

It would be missing the mark to think this is only about power consumption. Developers gladly trade some power efficiency for ease of programming. So time and money are huge factors too. It's all about ROI for the manufacturers, the developers, and the consumers.

AVX2 costs very little in terms of die space, is straightforward to develop for, and offers twice the throughput (or up to 8x for scalar code). That's a fantastic ROI for all parties.

And even when looking at power consumption, the GPU isn't doing great. The HD 7970 is only twice as fast as the i7-3820 when taking peak TDP into account. Since AVX2 doubles the throughput and adds gather support, that gap should be closed entirely. And then homogeneous computing wins hands down on ease of programming, and total cost. Heck, the GPU still runs a driver on the CPU, so we should be adding up the power consumption of both...

And then there's still AVX-1024 to lower the CPU's power consumption.

bronxzv · Jun 18, 2012

CPUarchitect said:
It would be missing the mark to think this is only about power consumption.

btw I'm quite sure native AVX2 code will be more power efficient than Open CL code JIT compiled for running on the CPU

Denithor · Jun 18, 2012

It seems to me the whole point of AVX2 is to enable the CPU to perform more parallel functions, thereby widening the pipeline. Correct or not?

And the objective of openCL/HSA seems to be the same thing, only the parallel computing gets run on the GPU cores instead of in the CPU.

From what I've read in this and other threads it appears that Haswell will have 64 gpu-like cores to widen the pipeline.

Now, unless I'm mistaken, I think these can still coexist just fine. Haswell will handle parallelism up to 64 'whatevers' wide and the GPU will take over and run anything requiring more cores than that.

Make sense?

I think it's actually pretty cool that AMD/Intel have been able to pack as much GPU onto the die as they have. Once we have tools like openCL working with all the architectures we should see pretty significant improvements in the amount of work that can be done.

Also, one question - how much advantage do AMD's APUs get by being on-die with the GPU? In other words, if you compared an APU to an equivalent discrete GPU/CPU pair (same cores/speeds) how much better throughput would the APU offer?

The Official AVX2 Thread

Diamond Member

Senior member

Golden Member

Diamond Member

Senior member

Senior member

Golden Member

Senior member

Golden Member

Golden Member

Diamond Member

Senior member

Diamond Member

Platinum Member

Senior member

Diamond Member

Senior member

Diamond Member

Senior member

Platinum Member

Golden Member

Senior member

Senior member

Senior member

Diamond Member