Recent content by CPUarchitect

4th Generation Intel Core, Haswell summarized

The HSA roadmap (the architecture used by AMD's APUs) runs until at least 2014. So yes, it's definitely a longer-term plan. But it's really not just a hardware problem. They have to try and convince developers to adopt a quirky heterogeneous way of computing to access an integrated GPU that...
- CPUarchitect
- Post #150
- Sep 18, 2012
- Forum: CPUs and Overclocking
4th Generation Intel Core, Haswell summarized

Not really. An APU is an "Accelerated" Processing Unit, meaning a CPU and GPU on a single die, with the explicit intention of using the GPU to perform generic high throughput workloads instead of the CPU. This is heterogeneous computing. Haswell has a GPU too, but its CPU cores are more...
- CPUarchitect
- Post #145
- Sep 18, 2012
- Forum: CPUs and Overclocking
4th Generation Intel Core, Haswell summarized

Don't forget vector workloads, and any scalar floating-point workload for that matter. All of these benefit from having execution ports 0 and 1 available for vector or floating-point operations, while the new port 6 takes over the ALU, shift and branch operations from port 0 (with port 5...
- CPUarchitect
- Post #144
- Sep 18, 2012
- Forum: CPUs and Overclocking
4th Generation Intel Core, Haswell summarized

It's the same process node, but they've added 33% more execution ports! So is it really that tough to imagine? Also, the IPC gain from AVX2 is... wait for it... nada. AVX2 isn't about Instructions Per Clock, it's all about doing twice the amount of work per instruction. That said...
- CPUarchitect
- Post #136
- Sep 18, 2012
- Forum: CPUs and Overclocking
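
The distinction drawn in Post #136 between instructions per clock and work per instruction can be put into numbers. What follows is a minimal sketch, not a measurement: the function names `lanes` and `elements_per_cycle` and the IPC value of 1.0 are illustrative assumptions. Only the vector widths (128-bit for SSE integer operations, 256-bit for AVX2) are fixed facts.

```python
def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def elements_per_cycle(ipc: float, vector_bits: int, element_bits: int = 32) -> float:
    """Work per cycle = instructions per cycle * elements per instruction."""
    return ipc * lanes(vector_bits, element_bits)


# Hypothetical IPC, held constant to isolate the effect of vector width.
ASSUMED_IPC = 1.0

sse_rate = elements_per_cycle(ASSUMED_IPC, 128)   # 4 x 32-bit lanes
avx2_rate = elements_per_cycle(ASSUMED_IPC, 256)  # 8 x 32-bit lanes

print(f"128-bit: {sse_rate} elements/cycle")
print(f"256-bit: {avx2_rate} elements/cycle")
print(f"speedup at unchanged IPC: {avx2_rate / sse_rate}x")
```

With IPC held at the same value, the only thing that changes is the lane count, which is why the gain appears as throughput rather than as IPC.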
Wait, is an i5-3450 doing this?

That doesn't mean a slower CPU would give you equal performance in this game! Think of some application tasks as a relay race. If you have four runners then each of them is active for only 1/4 of the time, but when they do run they run as fast as they can. With slower runners the activity is...
- CPUarchitect
- Post #19
- Jul 20, 2012
- Forum: CPUs and Overclocking
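
The relay-race analogy in Post #19 can be sketched as a toy model. This assumes four strictly sequential legs of equal length and hypothetical runner speeds; the `relay` function and all numbers are made up for illustration and are not taken from the thread.

```python
def relay(leg_work, speed):
    """Run the legs one after another; return total time and per-runner activity.

    leg_work: work units in each sequential leg (one runner per leg).
    speed:    work units per second for every runner (assumed equal).
    """
    leg_times = [work / speed for work in leg_work]
    total_time = sum(leg_times)
    activity = [t / total_time for t in leg_times]
    return total_time, activity


legs = [10, 10, 10, 10]  # hypothetical equal legs

fast_total, fast_activity = relay(legs, speed=2.0)
slow_total, slow_activity = relay(legs, speed=1.0)

print("fast runners:", fast_total, fast_activity)
print("slow runners:", slow_total, slow_activity)
```

In this model each runner is active for a quarter of the time at either speed, yet the slower team takes twice as long to finish, so a low per-core utilization figure does not imply that a slower core would perform equally well.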
AMD summit today; Kaveri cuts out the middle man in Trinity.

Yes, but that's because the load ports are still 128-bit each and they have to sync up to handle 256-bit. Hence dealing with unaligned 256-bit data is very problematic. Haswell will make them 256-bit each so vmovups will become faster than two 128-bit loads. And none of this is even relevant...
- CPUarchitect
- Post #154
- Jul 3, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Indeed. In the same spot where contiguous data is extracted from a cache line, just eight times in parallel. It's really not a whole lot of extra circuitry. It's basically just simple unidirectional shifters with byte granularity and a narrow 32-bit output (combining multiple ones for 64-bit...
- CPUarchitect
- Post #152
- Jul 3, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Because that's an averaged out result and the two arithmetic uops can execute on either port 0 or port 5. Also look at the result for the add instruction in your example: it's a single uop which can execute on port 0, 1, or 5, and hence the numbers for each port add up to 1.0. No, that's...
- CPUarchitect
- Post #149
- Jul 2, 2012
- Forum: CPUs and Overclocking
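
The averaging described in Post #149 can be illustrated with a small sketch. It assumes each uop is spread evenly over the ports it is allowed to use, which is a simplification: neither a real scheduler nor an analysis tool is guaranteed to distribute uops this way. The function `average_port_pressure` is a hypothetical helper, not part of any tool.

```python
from fractions import Fraction


def average_port_pressure(uops):
    """Average per-port load for a list of uops.

    uops: one tuple of eligible port numbers per uop.
    Assumption: each uop is split evenly across its eligible ports.
    """
    pressure = {}
    for ports in uops:
        share = Fraction(1, len(ports))
        for port in ports:
            pressure[port] = pressure.get(port, 0) + share
    return pressure


# A single-uop add that may execute on port 0, 1, or 5.
add_pressure = average_port_pressure([(0, 1, 5)])

# Two arithmetic uops that may each execute on port 0 or port 5.
two_uop_pressure = average_port_pressure([(0, 5), (0, 5)])

print("add:", add_pressure, "sum =", sum(add_pressure.values()))
print("two uops:", two_uop_pressure, "sum =", sum(two_uop_pressure.values()))
```

Under this assumption the per-port numbers for a single uop always sum to exactly 1, and for two uops they sum to 2, regardless of how many ports are eligible.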
AMD summit today; Kaveri cuts out the middle man in Trinity.

I couldn't find any evidence of VMASKMOV loads being unexpectedly slow in any way. And I'm not lazy, I tested it in practice too: it has a 1 cycle reciprocal throughput. Also in the Intel thread you're linking to, engineer Mark Buxton explains why they are "extremely useful". The only thing...
- CPUarchitect
- Post #136
- Jun 30, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

AVX2 is not just about floating-point performance. It also doubles the throughput of parallel integer workloads. I didn't mean to "derail" this thread at all. In fact the thread was split into an official AVX2 thread, but the discussion here still ended up being about how Kaveri and HSA will...
- CPUarchitect
- Post #133
- Jun 30, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Why contest my theory when you can't defend your own? And if I was "just a software guy" like you then why am I able to present a perfectly plausible uop breakdown of Haswell's gather support, while you can't? Please don't make such assumptions to try and get personal because you're out of...
- CPUarchitect
- Post #132
- Jun 30, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

Could you tell me the uop breakdown that would result in a 2 cycle reciprocal issue throughput?
- CPUarchitect
- Post #129
- Jun 29, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

1 cycle. And that's not an estimate but a fact, confirmed both by timing an unrolled loop containing only VMASKMOV instructions, and by its uop decomposition in IACA. And it seems pretty obvious what each of those uops does by comparing its functionality against VMOVMSK and VBLEND. Reciprocal...
- CPUarchitect
- Post #127
- Jun 28, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

I don't think that's a correct conclusion. VMASKMOV consists of more uops than VMOVAPS, so if you're already occupying the ports for the extra uops then it adds to the critical path. It's just a coincidence that your VMOVAPS can use underutilized ports so you basically got it for free. That...
- CPUarchitect
- Post #123
- Jun 28, 2012
- Forum: CPUs and Overclocking
AMD summit today; Kaveri cuts out the middle man in Trinity.

After a detailed analysis I concluded it will likely be capable of sustaining a peak throughput of one gather operation each cycle. The mask register can be initialized using vcmpeq on port 1, it can then be compacted by a vmovmsk on port 0, then port 3 can do the actual gather load, port 2...
- CPUarchitect
- Post #120
- Jun 28, 2012
- Forum: CPUs and Overclocking
