Intel "Haswell" Speculation thread

ksec · Jan 14, 2012

Well, proper software optimization for AVX2, along with other tweaks and the new cache system, would bring us enough of an IPC increase (a double-digit percentage) for a new-generation uArch CPU.

I am more concerned about the GPU. With AMD Radeon cards managing to get down to 5W at idle, I wonder if I still need a decent iGPU from Intel, which is still lacking in hardware and even more so in drivers. Wouldn't it be better now to focus on a super-efficient GPU on the CPU die to handle graphics acceleration for the UI, browsers, etc., and leave the heavy lifting to a real GPU?

AtenRa · Jan 14, 2012

bronxzv said:
why do you think Haswell will have a larger register file ?

That was about 128-bit SSE vs. 256-bit AVX.

bronxzv · Jan 14, 2012

AtenRa said:
That was about 128-bit SSE vs. 256-bit AVX.

Ah, sorry, I asked because you mentioned AVX2. Sandy Bridge already has 128+16 256-bit FP registers, so that looks like enough for Haswell. FMA code requires slightly fewer registers, so it doesn't look like AVX2 calls for any increase in the FP register file size.

Tuna-Fish · Jan 14, 2012

bronxzv said:
Sandy Bridge already has 128+16 256-bit FP registers

Why did you split the register amount like that? AFAIK, SNB has 144 identical FP registers.

bronxzv · Jan 14, 2012

Tuna-Fish said:
Why did you split the register amount like that? AFAIK, SNB has 144 identical FP registers.

To emphasize the fact that there are 16 registers used for the architected state and 128 rename registers, but indeed they are all the same.

BTW, one more reason I think Haswell won't get more physical registers is the AVX2 code generated by the Intel compiler: it uses the L1D cache much like virtual registers, even when logical registers are available. There is an example of mine here: http://software.intel.com/en-us/forums/showpost.php?p=170736

Nemesis 1 · Jan 14, 2012

Hasn't Intel covered this? Right now I believe Intel has 3 registers but Intel says 4. The same applies to FMA3: Intel may use 4 registers, but Intel will likely call it 5 registers. So yeah, there seems to be a virtual register. But that's not really the correct way to describe what Intel is doing. They count the 4th or 5th register as part of the 3-4 register setup.

bronxzv · Jan 14, 2012

Nemesis 1 said:
seems to be a virtual register

I'm not sure what you have in mind, but in the example I provided, an intrinsic in the source code asks to load a register, and that register is then used 4 times. Instead, the compiler uses indexed addressing from "memory" 4 times. It's apparently an optimization (only new versions of the compiler do this, and timings aren't worse for AVX code on Sandy Bridge), so the value must be mapped to a buffer of some kind. I confess that I don't know exactly how it works, and I'd be interested to learn.

Tuna-Fish · Jan 14, 2012

bronxzv said:
To emphasize the fact that there are 16 registers used for the architected state and 128 rename registers, but indeed they are all the same.

Remember that there are 32 architectural regs. (2 threads)
With the PRF, architectural registers use physical registers from the same array as the rename logic, and between 1 and 32 regs are used for architectural state (reg-to-reg copy makes 2 register names point to same physical register, and zeroing a register with an instruction that always zeroes it will make it point to a dedicated 0 register.)
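The aliasing described above can be sketched as a toy rename table. This is only an illustration of the idea: the class and method names, the choice of physical register 0 as the zero register, and the four architectural names are all assumptions made for the example, and register freeing (reference counting) is left out entirely.

```python
# Toy model of renaming onto a physical register file (PRF).
# All names here (RenameTable, ZERO_PREG, ...) are hypothetical.

ZERO_PREG = 0  # assumed dedicated always-zero physical register


class RenameTable:
    def __init__(self, num_physical=144, arch_names=("ymm0", "ymm1", "ymm2", "ymm3")):
        # Physical register 0 is reserved as the shared zero register.
        self.free = list(range(1, num_physical))
        # Every architectural name starts out zeroed, i.e. pointing at preg 0.
        self.map = {name: ZERO_PREG for name in arch_names}

    def write(self, dst):
        """A normal instruction result gets a fresh physical register."""
        self.map[dst] = self.free.pop(0)
        return self.map[dst]

    def move(self, dst, src):
        """Reg-to-reg copy: alias dst to src's physical register, no allocation."""
        self.map[dst] = self.map[src]

    def zero(self, dst):
        """Zeroing idiom: point the name at the dedicated zero register."""
        self.map[dst] = ZERO_PREG

    def physical_in_use(self):
        """Number of distinct physical registers holding architectural state."""
        return len(set(self.map.values()))
```

With every name zeroed, `physical_in_use()` is 1, which is the lower bound mentioned above; the count only grows as instructions write distinct results, and copies or zeroing idioms bring it back down.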

bronxzv · Jan 14, 2012

Tuna-Fish said:
Remember that there are 32 architectural regs. (2 threads)

Indeed, 32 in 64-bit mode and 16 otherwise, and yes, I know they come from the same pool. Anyway, I'll simply write "144" next time...

Nemesis 1 · Jan 14, 2012

Tuna-Fish said:
Remember that there are 32 architectural regs. (2 threads)
With the PRF, architectural registers use physical registers from the same array as the rename logic, and between 1 and 32 regs are used for architectural state (reg-to-reg copy makes 2 register names point to same physical register, and zeroing a register with an instruction that always zeroes it will make it point to a dedicated 0 register.)

I was trying to find the article on how Intel uses its operands. It wasn't really clear, but somehow Intel uses the 3 registers that exist, and the fourth operand somehow uses 1 of those 3 registers. I don't really get it, as either Intel isn't very clear about this or I am not up to the level needed to understand what Intel is saying.

Nemesis 1 · Jan 14, 2012

This is the best I could find. It's on SB, but it has a lot to offer. I guess I had registers and ports mixed up.

http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/3

CPUarchitect · Jan 14, 2012

denev2004 said:
And...I'd like to ask is there any information about the percentage of the use of MUL, ADD and MAD in most software?

See the paper I've linked in post #156.

CPUarchitect · Jan 14, 2012

AtenRa said:
Those two could be the reasons why Intel chose the FMA3 instead of FMA4 that AMD chose and implemented with Bulldozer.

The only reason they chose FMA3 is that it allows the micro-instructions to be shorter, so more of them fit in the same amount of uop cache. FMA4 would have been the only instruction requiring four operands, so it would have wasted uop space for every other instruction (note that the 4-operand blend instructions, which are already supported by Sandy Bridge, are split into two uops).

denev2004 · Jan 15, 2012

CPUarchitect said:
The only reason they chose FMA3 is that it allows the micro-instructions to be shorter, so more of them fit in the same amount of uop cache. FMA4 would have been the only instruction requiring four operands, so it would have wasted uop space for every other instruction (note that the 4-operand blend instructions, which are already supported by Sandy Bridge, are split into two uops).

Really? I heard AMD might switch sides to FMA3, or at least support both instructions.

bronxzv · Jan 15, 2012

denev2004 said:
Really? I heard AMD might switch sides to FMA3, or at least support both instructions.

FMA4 was in Intel's original AVX specs and was later revised to FMA3 (to be supported by AMD soon in Trinity + Piledriver, so *AMD will support FMA3 before Intel*). The industry-standard x86 ISA is thus the one with FMA3.

FMA3 is a more sensible choice for the reason given by CPUarchitect. Some people wrongly assume that FMA3 will require a lot of extra register-to-register moves vs. FMA4. That's not the case in practice, since there are several variants of the FMA instructions that let you choose which operand is overwritten. Here is a concrete example of typical FMA3 code without a single extra move:

vfmsub213ps ymm1, ymm8, YMMWORD PTR [6208+r13]
vfmadd132ps ymm8, ymm7, YMMWORD PTR [1152+r13]
vfmadd231ps ymm0, ymm9, YMMWORD PTR [6240+r13]
vfmadd231ps ymm3, ymm9, YMMWORD PTR [6336+r13]
vfmadd231ps ymm1, ymm9, YMMWORD PTR [6144+r13]
vfmadd132ps ymm9, ymm8, YMMWORD PTR [1088+r13]
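
For readers unfamiliar with the 132/213/231 suffixes in the listing above, here is a minimal scalar sketch of what each ordering computes, using Intel operand order (destination first). The function names simply mirror the mnemonics and are made up for this illustration. Real FMA hardware rounds once after the multiply-add, while plain Python floats round twice, so this only shows which operands are multiplied and which one is overwritten.

```python
# Scalar model of the FMA3 operand orderings (Intel syntax: op dst, src2, src3).
# The digits name which operands are multiplied (first two) and added (third).
# Each function returns the value that would be written back to dst.

def vfmadd132(dst, src2, src3):
    return dst * src3 + src2   # operands 1 and 3 multiplied, operand 2 added


def vfmadd213(dst, src2, src3):
    return src2 * dst + src3   # operands 2 and 1 multiplied, operand 3 added


def vfmadd231(dst, src2, src3):
    return src2 * src3 + dst   # operands 2 and 3 multiplied, operand 1 added


def vfmsub213(dst, src2, src3):
    return src2 * dst - src3   # same ordering as 213, but subtracts operand 3
```

The 231 form is the natural accumulator (`acc = vfmadd231(acc, x, mem)`), while the 132 and 213 forms overwrite one of the multiplicands instead, which is why the compiler can usually pick a form that destroys a value it no longer needs.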

Tuna-Fish · Jan 15, 2012

bronxzv said:
FMA3 is a more sensible choice for the reason given by CPUarchitect. Some people wrongly assume that FMA3 will require a lot of extra register-to-register moves vs. FMA4,

And even if it was the case in practice, with register renaming and PRF, reg -> reg moves are very nearly free. They cost no latency, and so long as you are not decode-limited, they cost no throughput either.

bronxzv · Jan 15, 2012

Tuna-Fish said:
And even if it was the case in practice, with register renaming and PRF, reg -> reg moves are very nearly free. They cost no latency, and so long as you are not decode-limited, they cost no throughput either.

Indeed, they can even be removed by the decoder (AMD does just that in Bulldozer, IIRC), so the only (small) impact will be on code density, and thus a slightly increased L1I miss rate.

Idontcare · Jan 15, 2012

bronxzv said:
FMA4 was in the original Intel's AVX specs and was later revised to FMA3 (to be supported by AMD soon in Trinity + Piledriver, so *AMD will support FMA3 before Intel*), the industry standard x86 ISA is thus with FMA3

That is smart of AMD, to pursue support for both and on a timeline that beats Intel's. But given the lag between hardware availability and software adopting new ISA extensions (think tessellation in DX11, for example), there is little benefit for AMD to leverage from this "win".

Arachnotronic · Jan 15, 2012

Idontcare said:
That is smart of AMD, to pursue support for both and on a timeline that beats Intel's. But given the lag between hardware availability and software adopting new ISA extensions (think tessellation in DX11, for example), there is little benefit for AMD to leverage from this "win".

It's kind of pathetic that Intel isn't going to have FMA until 2013, but AMD has had it since 2011.

tweakboy · Jan 15, 2012

If you use a 64-bit, HT-aware app like Photoshop, Premiere, or Sonar X1, it's like having 8 cores... amazing.

Tuna-Fish · Jan 16, 2012

bronxzv said:
Indeed, they can even be removed by the decoder (AMD does just that in Bulldozer, IIRC), so the only (small) impact will be on code density, and thus a slightly increased L1I miss rate.

Not by the decoder, by the register rename phase. There is a distinction because the CPU can only rename 4 registers per clock and the 256-bit ops are already split, so if there are, for example, two 256-bit ops and two reg-reg moves, they can't all be renamed in a single clock.

Oh, and Intel does this too, on SNB.

bronxzv · Jan 16, 2012

Tuna-Fish said:
Oh, and Intel does this too, on SNB.

Are you really sure? Isn't it a planned change for IVB?

"In Ivy Bridge MOVs are executed by simply pointing one register at the location of the destination register."
below slide 6 here:

http://www.anandtech.com/show/4830/intels-ivy-bridge-architecture-exposed/2

gorydetails · Jan 16, 2012

That's false. A MOV takes an execution port on Sandy Bridge.

CPUarchitect · Jan 16, 2012

Intel17 said:
It's kind of pathetic that Intel isn't going to have FMA until 2013, but AMD has had it since 2011.

Having a separate 256-bit MUL and 256-bit ADD unit per core is more powerful than two 128-bit FMA units per module. The theoretical peak throughput is the same, but according to this, FMA instructions on average only constitute about 1/3 of all floating-point instructions, resulting in 50% higher performance for Intel's configuration. Using the same numbers, Bulldozer should have a 12% advantage at legacy 128-bit SSE workloads, but benchmarks show that in practice it's slower in that scenario as well. That's probably due to the long latencies of its non-cascaded FMA units.

And until AMD widens both FMA units to 256-bit, Haswell will further increase Intel's lead.
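
The 50% figure for 256-bit code can be reproduced with a rough issue-slot count. This is a sketch under stated assumptions, not the calculation from the linked paper: it assumes the non-FMA two thirds of the mix split evenly between MUL and ADD (the thread does not say), and it ignores latency, dependencies and everything outside the FP units. The function names are invented for the example. The 128-bit figure depends on numbers from the paper and is not modeled here.

```python
from fractions import Fraction


def cycles_split_mul_add(n_fma, n_mul, n_add):
    """One full-width MUL port plus one full-width ADD port, no FMA support.

    Each FMA has to be issued as a MUL and an ADD, and the busier of the
    two ports determines how many cycles the mix takes.
    """
    return max(n_mul + n_fma, n_add + n_fma)


def cycles_two_half_width_fma(n_fma, n_mul, n_add):
    """Two half-width FMA-capable units.

    Every full-width op is split in two and occupies both units for one
    cycle, whatever its type.
    """
    return n_fma + n_mul + n_add


# Assumed mix: one third FMA, with the rest split evenly between MUL and ADD.
mix = dict(n_fma=1, n_mul=1, n_add=1)
speedup = Fraction(cycles_two_half_width_fma(**mix), cycles_split_mul_add(**mix))
```

Under these assumptions the split MUL/ADD design needs 2 cycles for the three operations and the two half-width FMA units need 3, so `speedup` comes out as 3/2.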

Rumpelstiltskin · Jul 8, 2012

Haswell will be a significant upgrade, but I fear it may be one of the last significant upgrades for desktop PC enthusiasts. It's looking more and more like Haswell could deliver the final killing blow to AMD's APUs. Substantially more EUs could very well bring the IGP on par with or above Trinity's successor. A 95W TDP and AVX2/TSX could also be the final nail in the coffin on the CPU side. I also don't see Steamroller being able to come anywhere close to Haswell-E. It looks like AMD is already preparing to exit this market and focus on the ARM/low power mobile segment. Intel is aligning to compete in this segment as well. Even Microsoft seems to be abandoning PC enthusiasts with Windows 8. Get ready for much slower advancements after 2013, from 2014 onward it will be all about tablets.
