Intel "Haswell" Speculation thread


ksec

Senior member
Mar 5, 2010
420
117
116
Well, proper software optimization with AVX2, along with other tweaks and the new cache system, would bring us enough IPC (a double-digit % increase) for a new-generation uArch CPU.

I am more concerned about the GPU. With AMD Radeon managing to get down to 5W at idle, I wonder if I still need a decent iGPU from Intel, which still lacks in both hardware and, even more so, in software drivers. Wouldn't it be better now to focus on a super-efficient GPU on the CPU die to do all the graphics acceleration for the UI, browsers, etc., and leave the real heavy lifting to a real GPU?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
That was between the 128bit SSEs vs 256bit AVX.

Ah sorry, I asked because you mentioned AVX2. Sandy Bridge already has 128+16 256-bit FP registers, so that looks like enough for Haswell. FMA code requires slightly fewer registers, so it doesn't look like AVX2 is asking for any increase in the FP register file size.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Why did you split the register amount like that? AFAIK, SNB has 144 identical FP registers.

To emphasize the fact that there are 16 registers used for the architected state and 128 rename registers, but indeed they are all the same.

BTW, one more reason for me to think there will be no more physical registers in Haswell is the AVX2 code generated by the Intel compiler: it uses the L1D cache much like virtual registers, even when logical registers are available. There is an example of mine here: http://software.intel.com/en-us/forums/showpost.php?p=170736
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Hasn't Intel covered this? Right now I believe Intel has 3 registers, but Intel says 4. The same applies to FMA3: Intel may use 4 registers but will likely call it 5. So yeah, there seems to be a virtual register, but that's not really the correct way to describe what Intel is doing. They count the 4th/5th register as part of the 3-4 register setup.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
seems to be a virtual register

I'm not sure what you have in mind, but in the example I provided, an intrinsic in the source code asks to load a register and then use that register 4 times. Instead, the compiler uses indexed addressing from "memory" 4 times. It's apparently an optimization (only new versions of the compiler do that, and timings aren't worse for AVX code on Sandy Bridge), so the value must be mapped to a buffer of some kind. I confess that I don't know exactly how it works, and I'd be interested to learn.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,624
2,399
136
To emphasize the fact that there are 16 registers used for the architected state and 128 rename registers, but indeed they are all the same.

Remember that there are 32 architectural regs. (2 threads)
With the PRF, architectural registers use physical registers from the same array as the rename logic, and between 1 and 32 regs are used for architectural state (reg-to-reg copy makes 2 register names point to same physical register, and zeroing a register with an instruction that always zeroes it will make it point to a dedicated 0 register.)
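
As a rough illustration of the renaming behaviour described above, here is a toy Python model of a rename table backed by a single physical register file. It is only a sketch: the class name, the choice of physical register 0 as the dedicated zero register, and the 144-entry size are assumptions made for the example, and freeing of old physical registers is left out entirely.

```python
# Toy model of register renaming with a physical register file (PRF).
# Names, sizes and the zero-register index are illustrative assumptions,
# not details taken from Intel documentation.

ZERO_REG = 0  # hypothetical physical register that always reads as zero


class RenameTable:
    def __init__(self, num_physical=144):
        # Physical register 0 is reserved as the shared zero register.
        self.free = list(range(1, num_physical))
        self.mapping = {}  # architectural name -> physical register index

    def write(self, arch_reg):
        """Ordinary write: allocate a fresh physical register for arch_reg."""
        phys = self.free.pop(0)
        self.mapping[arch_reg] = phys
        return phys

    def move(self, dst, src):
        """Reg-to-reg copy: dst now names the physical register src already uses."""
        self.mapping[dst] = self.mapping[src]

    def zero(self, arch_reg):
        """Zeroing instruction: arch_reg now names the dedicated zero register."""
        self.mapping[arch_reg] = ZERO_REG

    def physical_in_use(self):
        """Number of distinct physical registers holding architectural state."""
        return len(set(self.mapping.values()))


table = RenameTable()
table.write("ymm0")
table.write("ymm1")
table.move("ymm2", "ymm0")  # no new physical register is allocated
table.zero("ymm3")
table.zero("ymm4")          # shares the zero register with ymm3
print(table.mapping)
print(table.physical_in_use())
```

Five architectural names end up backed by only three physical registers here, which is the effect behind the "between 1 and 32" range: copies and zeroed registers share physical entries instead of each taking their own.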
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Remember that there are 32 architectural regs. (2 threads)

Indeed, 32 in 64-bit mode and 16 otherwise, and yes, I know they come from the same pool. Anyway, I'll simply write "144" next time...
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Remember that there are 32 architectural regs. (2 threads)
With the PRF, architectural registers use physical registers from the same array as the rename logic, and between 1 and 32 regs are used for architectural state (reg-to-reg copy makes 2 register names point to same physical register, and zeroing a register with an instruction that always zeroes it will make it point to a dedicated 0 register.)

I was trying to find the article on how Intel uses its operands. It wasn't really clear, but somehow Intel uses the 3 registers that exist, and the fourth operand somehow uses 1 of those 3 registers. I don't really get it, as either Intel isn't really clear about this or I'm not up to the level needed to understand what Intel is saying.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Those two could be the reasons why Intel chose the FMA3 instead of FMA4 that AMD chose and implemented with Bulldozer.
The only reason they chose FMA3 is that it allows the micro-instructions to be shorter, so more of them fit in the same amount of uop cache. FMA4 would have been the only instruction requiring four operands, so it would have wasted uop space for every other instruction (note that the 4-operand blend instructions, which are already supported by Sandy Bridge, are split into two uops).
 

denev2004

Member
Dec 3, 2011
105
1
0
The only reason they chose FMA3 is that it allows the micro-instructions to be shorter, so more of them fit in the same amount of uop cache. FMA4 would have been the only instruction requiring four operands, so it would have wasted uop space for every other instruction (note that the 4-operand blend instructions, which are already supported by Sandy Bridge, are split into two uops).
Really? I heard AMD might switch sides to FMA3, or at least support both instruction sets.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Really? I heard AMD might switch sides to FMA3, or at least support both instruction sets.

FMA4 was in Intel's original AVX specs and was later revised to FMA3 (to be supported by AMD soon in Trinity + Piledriver, so *AMD will support FMA3 before Intel*). The industry-standard x86 ISA thus uses FMA3.

FMA3 is a more sensible choice for the reason given by CPUarchitect. Some people wrongly assume that FMA3 will require a lot of extra register-to-register moves vs. FMA4. That's not the case in practice, since there are several variants of the FMA instructions that let you choose which operand is overwritten. Here is a concrete example of typical FMA3 code without a single extra move:

vfmsub213ps ymm1, ymm8, YMMWORD PTR [6208+r13]
vfmadd132ps ymm8, ymm7, YMMWORD PTR [1152+r13]
vfmadd231ps ymm0, ymm9, YMMWORD PTR [6240+r13]
vfmadd231ps ymm3, ymm9, YMMWORD PTR [6336+r13]
vfmadd231ps ymm1, ymm9, YMMWORD PTR [6144+r13]
vfmadd132ps ymm9, ymm8, YMMWORD PTR [1088+r13]
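
For readers unfamiliar with the 132/213/231 suffixes used in the listing above, the following Python sketch models which operands are multiplied and which is added in each form. In every form the result overwrites the first operand. The helper name `fma3` and the scalar values are made up for illustration, and the model rounds twice (a plain multiply followed by an add), whereas a real fused multiply-add rounds only once, so it shows operand ordering only.

```python
# Scalar model of FMA3 operand ordering (illustration only).
# The three digits say which operands play the roles a, b, c in a*b + c,
# numbered in Intel-syntax order: op1 is the destination, op3 may be memory.
# Note: Python's a * b + c rounds twice; hardware FMA rounds once.

def fma3(form, op1, op2, op3, subtract=False):
    """Return the value written back to op1 for a given form ('132', '213', '231')."""
    operands = {"1": op1, "2": op2, "3": op3}
    a, b, c = (operands[digit] for digit in form)
    return a * b - c if subtract else a * b + c


op1, op2, op3 = 2.0, 3.0, 5.0

r132 = fma3("132", op1, op2, op3)                 # op1*op3 + op2
r213 = fma3("213", op1, op2, op3)                 # op2*op1 + op3
r231 = fma3("231", op1, op2, op3)                 # op2*op3 + op1
s213 = fma3("213", op1, op2, op3, subtract=True)  # op2*op1 - op3 (vfmsub213)

print(r132, r213, r231, s213)  # 13.0 11.0 17.0 1.0
```

Because the three forms cover the cases where the overwritten register is a multiplicand or the addend, the compiler can usually pick a form that destroys a value it no longer needs, which is why the listing needs no extra moves.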
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,624
2,399
136
FMA3 is a more sensible choice for the reason given by CPUarchitect. Some people wrongly assume that FMA3 will require a lot of extra register-to-register moves vs. FMA4.

And even if it was the case in practice, with register renaming and PRF, reg -> reg moves are very nearly free. They cost no latency, and so long as you are not decode-limited, they cost no throughput either.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
And even if it was the case in practice, with register renaming and PRF, reg -> reg moves are very nearly free. They cost no latency, and so long as you are not decode-limited, they cost no throughput either.

Indeed, they can even be removed by the decoder (AMD does just that in Bulldozer IIRC), so the only (small) impact will be on code density, and thus a slightly increased L1I miss rate.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
FMA4 was in Intel's original AVX specs and was later revised to FMA3 (to be supported by AMD soon in Trinity + Piledriver, so *AMD will support FMA3 before Intel*). The industry-standard x86 ISA thus uses FMA3.

That is smart of AMD, to pursue support for both on a timeline that bests Intel. But given the reality of the lag between hardware availability and software adopting new ISA extensions (think tessellation in DX11, for example), there is little benefit for AMD to leverage from this "win".
 
Mar 10, 2006
11,715
2,012
126
That is smart of AMD, to pursue support for both on a timeline that bests Intel. But given the reality of the lag between hardware availability and software adopting new ISA extensions (think tessellation in DX11, for example), there is little benefit for AMD to leverage from this "win".

It's kind of pathetic that Intel isn't going to have FMA until 2013, but AMD has had it since 2011.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,624
2,399
136
Indeed, they can even be removed by the decoder (AMD does just that in Bulldozer IIRC), so the only (small) impact will be on code density, and thus a slightly increased L1I miss rate.

Not by the decoder, by the register rename phase. There is a distinction because the CPU can only rename 4 registers per clock and the 256-bit ops are already split, so if there are, for example, two 256-bit ops and two reg-reg moves, they can't all be renamed in a single clock.

Oh, and Intel does this too, on SNB.
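
The rename-bandwidth point above is simple slot counting, sketched below in Python. The width of 4 and the split of each 256-bit op into two halves come from the post. The function name is hypothetical, and the model assumes every half and every move costs exactly one rename slot, which is a simplification.

```python
import math

RENAME_WIDTH = 4  # renames per clock, as stated in the post


def rename_cycles(n_256bit_ops, n_moves, halves_per_256bit_op=2):
    """Return (rename slots needed, clocks needed) for a small instruction group.

    Assumes each half of a split 256-bit op and each reg-reg move takes one slot.
    """
    slots = n_256bit_ops * halves_per_256bit_op + n_moves
    return slots, math.ceil(slots / RENAME_WIDTH)


slots, clocks = rename_cycles(n_256bit_ops=2, n_moves=2)
print(slots, clocks)  # 6 2
```

So even though the moves never reach an execution unit, they still compete with real work for rename slots: the example group needs 6 slots and therefore spills into a second clock.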
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
It's kind of pathetic that Intel isn't going to have FMA until 2013, but AMD has had it since 2011.
Having a separate 256-bit MUL unit and a separate 256-bit ADD unit per core is more powerful than two 128-bit FMA units per module. The theoretical peak throughput is the same, but according to this, FMA instructions on average constitute only about 1/3 of all floating-point instructions, resulting in 50% higher performance for Intel's configuration. Using the same numbers, Bulldozer should have a 12% advantage in legacy 128-bit SSE workloads, but benchmarks show that in practice it's slower in that scenario as well. That's probably due to the long latencies of its non-cascaded FMA units.

And until AMD widens both FMA units to 256-bit, Haswell will further increase Intel's lead.
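
The 50% figure above can be reproduced with a back-of-the-envelope throughput model, sketched here in Python. The post only says that FMA is about 1/3 of floating-point instructions, so the even split of the remaining 2/3 between multiplies and adds is an assumption of this sketch, as are the function names. It ignores latency, scheduling and memory effects, and it does not attempt to reproduce the 12% figure for 128-bit SSE, which depends on numbers from the lost link.

```python
from fractions import Fraction


def cycles_mul_add_units(n_fma, n_mul, n_add):
    """One 256-bit MUL unit plus one 256-bit ADD unit, no FMA support:
    each FMA is done as one multiply and one add, and the busier unit sets the pace."""
    return max(n_mul + n_fma, n_add + n_fma)


def cycles_two_128bit_fma_units(n_fma, n_mul, n_add):
    """Two 128-bit FMA units: every 256-bit instruction occupies two unit-slots,
    and two slots are available per cycle."""
    slots = 2 * (n_fma + n_mul + n_add)
    return Fraction(slots, 2)


# Assumed mix per 3 instructions: 1 FMA, 1 multiply, 1 add (all 256-bit).
mix = dict(n_fma=1, n_mul=1, n_add=1)

split_units = cycles_mul_add_units(**mix)        # 2 cycles
fma_units = cycles_two_128bit_fma_units(**mix)   # 3 cycles
speedup = fma_units / split_units

print(f"{float(speedup - 1):.0%}")  # 50%
```

With that mix the separate MUL and ADD units finish in 2 cycles while the pair of 128-bit FMA units needs 3, which gives the 3/2 ratio. A less balanced multiply/add mix would shrink the gap, since one of the two separate units would sit idle more often.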
 

Rumpelstiltskin

Junior Member
Jul 8, 2012
11
0
66
Haswell will be a significant upgrade, but I fear it may be one of the last significant upgrades for desktop PC enthusiasts. It's looking more and more like Haswell could deliver the final killing blow to AMD's APUs. Substantially more EUs could very well bring the IGP on par with or above Trinity's successor. A 95W TDP and AVX2/TSX could also be the final nail in the coffin on the CPU side. I also don't see Steamroller being able to come anywhere close to Haswell-E. It looks like AMD is already preparing to exit this market and focus on the ARM/low power mobile segment. Intel is aligning to compete in this segment as well. Even Microsoft seems to be abandoning PC enthusiasts with Windows 8. Get ready for much slower advancements after 2013, from 2014 onward it will be all about tablets.
 