The HSA roadmap (the architecture used by AMD's APUs) runs until at least 2014. So yes, it's definitely a longer-term plan.
But it's really not just a hardware problem. They have to try and convince developers to adopt a quirky heterogeneous way of computing to access an integrated GPU that...
Not really. An APU is an "Accelerated" Processing Unit, meaning a CPU and GPU on a single die, with the explicit intention of using the GPU instead of the CPU to perform generic high-throughput workloads. This is heterogeneous computing.
Haswell has a GPU too, but its CPU cores are more...
Don't forget vector workloads, and any scalar floating-point workload for that matter. All of these benefit from having execution ports 0 and 1 available for vector or floating-point operations, while the new port 6 takes over the ALU, shift and branch operations from port 0 (with port 5...
It's the same process node, but they've added 33% more execution ports! So is it really that tough to imagine?
Also, the IPC gain from AVX2 is... wait for it... nada. AVX2 isn't about Instructions Per Clock, it's all about doing twice the amount of work per instruction.
That said...
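To put the IPC versus work-per-instruction distinction in numbers, here is a minimal sketch. The IPC value is an arbitrary assumption chosen for illustration, not a measurement, and `elements_per_cycle` is a hypothetical helper. The port counts (6 before Haswell, 8 on Haswell) are what the "33% more execution ports" figure refers to.

```python
def elements_per_cycle(ipc, vector_bits, element_bits=32):
    """Elements processed per cycle = instructions per cycle * lanes per instruction."""
    lanes = vector_bits // element_bits
    return ipc * lanes

# Assumed IPC, identical in both cases: AVX2 does not change it.
ipc = 2.0

narrow = elements_per_cycle(ipc, 128)  # 128-bit integer vectors
wide = elements_per_cycle(ipc, 256)    # 256-bit integer vectors (AVX2)

# Going from 6 to 8 execution ports is a one-third increase.
port_increase = (8 - 6) / 6

print(f"128-bit: {narrow} elements/cycle, 256-bit: {wide} elements/cycle")
print(f"ratio: {wide / narrow}, extra ports: {port_increase:.0%}")
```

The instruction count per cycle stays the same; only the number of lanes per instruction doubles.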
That doesn't mean a slower CPU would give you equal performance in this game!
Think of some application tasks as a relay race. If you have four runners then each of them is active for only 1/4 of the time, but when they do run they run as fast as they can. With slower runners the activity is...
Yes but that's because the load ports are still 128-bit each and they have to sync up to handle 256-bit. Hence dealing with unaligned 256-bit data is very problematic. Haswell will make them 256-bit each so vmovups will become faster than two 128-bit loads.
And none of this is even relevant...
Indeed. In the same spot where contiguous data is extracted from a cache line, just eight times in parallel. It's really not a whole lot of extra circuitry. It's basically just simple unidirectional shifters with byte granularity and a narrow 32-bit output (combining multiple ones for 64-bit...
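As a rough software model of the "byte-granular shifter with a 32-bit output" idea, the sketch below extracts eight 32-bit elements from a single 64-byte cache line. It is only an illustration of the concept, not a description of the actual circuitry: it assumes all eight elements sit in the same line, and the function names are made up for this example.

```python
CACHE_LINE_BYTES = 64

def extract_dword(line: bytes, byte_offset: int) -> int:
    """Shift the line right by byte_offset bytes and keep the low 32 bits."""
    if len(line) != CACHE_LINE_BYTES:
        raise ValueError("expected a full 64-byte cache line")
    if not 0 <= byte_offset <= CACHE_LINE_BYTES - 4:
        raise ValueError("element must lie entirely within the line")
    as_int = int.from_bytes(line, "little")
    # Unidirectional shift with byte granularity, then a narrow 32-bit output.
    return (as_int >> (8 * byte_offset)) & 0xFFFFFFFF

def gather_from_line(line: bytes, byte_offsets):
    """Eight independent extractions from the same line, one per lane."""
    return [extract_dword(line, off) for off in byte_offsets]

line = bytes(range(CACHE_LINE_BYTES))      # byte i holds the value i
offsets = [0, 4, 8, 12, 1, 17, 33, 60]     # arbitrary, partly unaligned
lanes = gather_from_line(line, offsets)
print([hex(v) for v in lanes])
```

Each lane does the same work as one contiguous extraction; the only difference is that every lane has its own offset.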
Because that's an averaged out result and the two arithmetic uops can execute on either port 0 or port 5. Also look at the result for the add instruction in your example: it's a single uop which can execute on port 0, 1, or 5, and hence the numbers for each port add up to 1.0.
No, that's...
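The averaging described above can be sketched as follows. This assumes each uop is spread evenly over the ports it is allowed to use, which is a simplification: the real scheduler, and tools that report port pressure, may distribute uops differently. `average_port_pressure` is a hypothetical helper.

```python
def average_port_pressure(uops):
    """uops: one tuple of eligible ports per uop.
    Returns a dict mapping port -> average uops per iteration."""
    pressure = {}
    for eligible in uops:
        share = 1.0 / len(eligible)  # assumed even split across eligible ports
        for port in eligible:
            pressure[port] = pressure.get(port, 0.0) + share
    return pressure

# A single-uop add that can go to port 0, 1, or 5.
add_pressure = average_port_pressure([(0, 1, 5)])

# Two arithmetic uops that can each go to port 0 or port 5.
two_uop_pressure = average_port_pressure([(0, 5), (0, 5)])

print(add_pressure, sum(add_pressure.values()))
print(two_uop_pressure, sum(two_uop_pressure.values()))
```

The per-port numbers are fractional, but they always sum to the number of uops: 1.0 for the add, 2.0 for the two-uop case.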
I couldn't find any evidence of VMASKMOV loads being unexpectedly slow in any way. And I'm not lazy, I tested it in practice too: it has a 1 cycle reciprocal throughput. Also in the Intel thread you're linking to, engineer Mark Buxton explains why they are "extremely useful". The only thing...
AVX2 is not just about floating-point performance. It also doubles the throughput of parallel integer workloads. I didn't mean to "derail" this thread at all. In fact the thread was split into an official AVX2 thread, but the discussion here still ended up being about how Kaveri and HSA will...
Why contest my theory when you can't defend your own?
And if I was "just a software guy" like you then why am I able to present a perfectly plausible uop breakdown of Haswell's gather support, while you can't? Please don't make such assumptions to try and get personal because you're out of...
1 cycle. And that's not an estimate but a fact, confirmed both by timing an unrolled loop containing only VMASKMOV instructions, and by its uop decomposition in IACA. And it seems pretty obvious what each of those uops does by comparing its functionality against VMOVMSK and VBLEND.
Reciprocal...
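For reference, this is the arithmetic behind turning a timed unrolled loop into a reciprocal throughput figure. The cycle and iteration counts below are placeholders invented for the example, not the measurements from the test mentioned above, and `reciprocal_throughput` is a hypothetical helper.

```python
def reciprocal_throughput(total_cycles, iterations, instructions_per_iteration):
    """Average cycles per instruction when independent instructions are issued back to back."""
    return total_cycles / (iterations * instructions_per_iteration)

# Placeholder numbers: a loop unrolled 8x, run 1,000,000 times, taking 8,000,000 cycles.
cycles_per_instruction = reciprocal_throughput(
    total_cycles=8_000_000,
    iterations=1_000_000,
    instructions_per_iteration=8,
)
print(cycles_per_instruction)
```

The instructions in the loop must be independent of each other; with a dependency chain the same calculation would measure latency instead.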
I don't think that's a correct conclusion. VMASKMOV consists of more uops than VMOVAPS, so if you're already occupying the ports for the extra uops then it adds to the critical path. It's just a coincidence that your VMOVAPS can use underutilized ports so you basically got it for free. That...
After the detailed analysis I concluded it will likely be capable of sustaining a peak throughput of one gather operation each cycle. The mask register can be initialized using vcmpeq on port 1, it can then be compacted by a vmovmsk on port 0, then port 3 can do the actual gather load, port 2...