Intel "Haswell" Speculation thread


Tuna-Fish

Golden Member
Mar 4, 2011
1,624
2,399
136
I've often wondered just how much opportunity is even out there for ISA-based performance improvements to IPC. After tweaking on x86 for what now, 40yrs?

Is there much left to improve upon (in the ISA itself) or is it really the microarchitecture that delivers the performance/watt and IPC improvements here out?

Well, gather is the obvious example, as is transactional memory. In terms of normal register-to-register instructions, ignoring SIMD lane crossing, we pretty much have everything that's even remotely general-purpose. But there is a lot of more-expensive-to-implement stuff that many programmers would love.

For another example, I know that what I just implemented would be sped up a lot by a little bit of content-addressable memory.
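Since hardware CAMs aren't exposed to most programmers, the usual software stand-in is a hash map: lookup by content in one step instead of scanning every entry. A minimal Python sketch of the difference (the table contents here are made up purely for illustration):

```python
# A content-addressable memory (CAM) returns the entry whose *contents*
# match a key, in a single step. In software the closest analogue is a
# dict; the fallback is a linear scan over every entry.

entries = [("alpha", 10), ("beta", 20), ("gamma", 30)]  # hypothetical data

def linear_lookup(key):
    # O(n): what you are stuck with without content-addressing
    for k, v in entries:
        if k == key:
            return v
    return None

cam = dict(entries)  # O(1) average: the "content-addressable" version

assert linear_lookup("beta") == cam["beta"] == 20
```

The hardware version does the comparison against all entries in parallel in one cycle, which is exactly the property a linear scan lacks.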
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Well, that's where AMD is going with Fusion. But right now it seems AMD has the lead in compute FP power, and I don't see Haswell being able to change that with AVX alone.

Apples and oranges. That's like saying the Nvidia GTX 580 has the lead in compute FP power over Intel SB. No kidding, different products.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Apples and oranges. That's like saying the Nvidia GTX 580 has the lead in compute FP power over Intel SB. No kidding, different products.

Haswell will be an APU chip much like Llano, Trinity and the future Fusion chip from AMD in 2013.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Haswell will be an APU chip much like Llano, Trinity and the future Fusion chip from AMD in 2013.

Yes, some Haswell models will be an APU chip like Llano. But you are comparing Haswell's CPU FP performance to Llano's GPU FP performance. No one has any idea how well Haswell's GPU will perform in compute FP. Sticking to the CPU side, I do not think any chip will match Haswell. On the GPU side, I can guess that Trinity may be superior. But I still think your comparison is unfair and somewhat skewed, based on what we know today.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Yes, some Haswell models will be an APU chip like Llano. But you are comparing Haswell's CPU FP performance to Llano's GPU FP performance. No one has any idea how well Haswell's GPU will perform in compute FP. Sticking to the CPU side, I do not think any chip will match Haswell. On the GPU side, I can guess that Trinity may be superior. But I still think your comparison is unfair and somewhat skewed, based on what we know today.

Since AMD and Intel are trying to merge/fuse CPUs with GPUs (APUs), when talking about APUs we should not divide the CPU from the GPU compute capabilities. Haswell APUs will go head to head against AMD's future Fusion chips, which will incorporate AMD's next-gen CPU cores and the GCN GPU architecture.

Perhaps Haswell's CPU FP units will have higher compute power than AMD's, but I believe it will not be enough to compete against future Bulldozer-based + GCN APUs.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
But Nehalem doesn't
Indeed, but what's your point? Physical register files are more about reducing power consumption than anything else. How does it relate to the discussion of increasing throughput?
What I mean is that even if we can't improve it, we should think of ways not to reduce it much...
Why would Intel reduce it in any way? They've got a good architecture to build on. I'm not expecting them to radically change anything that would jeopardize IPC. They are more likely to make incremental changes, like extending macro-op fusion to create 3-operand instructions.
Isn't Intel's AVX unit made of two 128-bit units, meaning legacy code can also be supported?
Yes, there are two 128-bit "lanes" which perform the same operation, so SSE instructions use only the lower half.

But as TuxDave already indicated that's not what I meant. Not having two FMA units will result in no performance gain for older code which only contains MUL and ADD instructions (and for numerical consistency reasons even new code may opt not to use FMA since it rounds the result differently). So there's good incentive for Intel to not promote one unit to FMA, but both.
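The rounding difference is easy to demonstrate: a true FMA computes a*b + c with a single rounding at the very end, while separate MUL and ADD round twice. In the sketch below (values chosen purely to expose the effect), the fused result is emulated with exact rational arithmetic:

```python
from fractions import Fraction

# Values chosen so that the MUL's rounding discards the interesting bits.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Separate MUL then ADD: a*b = 1 - 2**-60 rounds to exactly 1.0 in
# double precision, so adding c gives 0.0.
unfused = (a * b) + c

# Fused multiply-add: compute a*b + c exactly, round once at the end.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

assert unfused == 0.0
assert fused == -2.0**-60   # the bits the unfused version lost
```

This is why code that depends on bit-exact reproducibility may deliberately avoid FMA even when it is available.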

Also note that even with one FMA unit and 256-bit integer operations they are forced to double the cache bandwidth(s). Without it there would hardly be any gains. But it would be a shame to still only partially take advantage of that extra bandwidth with a crippled floating-point implementation. I really don't see why on a 22 nm process they would hold back and make damaging compromises like that.

Two FMA units simply results in the best cost/gain ratio.
BTW, there is another opinion

"Anyway my opinion is that the optimal design choice is probably FMA+ADD. This is not just for power/cost reasons but also performance: a FMA unit will have very similar latency for MUL but significantly higher latency for ADD than a standalone unit. AMD had to compromise ADD latency on Bulldozer which doesn't seem ideal to me at this point in time."
Yes, FMA+ADD is probably a better alternative than FMA+MUL. I've done a bit of research and found some interesting data in Figure 3 (a) and Table 1 of this paper: Latency Sensitive FMA Design. The average instruction mix contains 10.7% FMA, 12.5% ADD and 7.7% MUL. So if 100 instructions consisted of this distribution, it would take FMA+FMA units 15.45 cycles, FMA+ADD units 18.4 cycles, and FMA+MUL units 23.2 cycles to execute the floating-point instructions. Note that this assumes there are no dependencies between them, which is obviously wrong. But note that in practice, with dependencies, FMA+FMA units would perform even better relative to the alternatives, because there are more combinations of instructions that can execute simultaneously.

Furthermore, the paper shows it's highly application-specific. For instance 173.applu, a physics application, would theoretically execute its floating-point instructions 50% faster on FMA+MUL versus FMA+ADD, and about the same on FMA+FMA (again, only in theory). FMA+FMA also runs ADD-heavy cases optimally, while FMA+MUL would not. So whichever asymmetric design you choose, a significant number of applications won't benefit.
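The cycle counts quoted above fall out of a simple two-port model: each instruction can only issue to a unit that supports it, work both units support is balanced between them, and the busier unit sets the total. A sketch of that model (this is my reading of the post's arithmetic, not the paper's exact methodology):

```python
# Ideal throughput for 100 instructions: 10.7 FMA, 12.5 ADD, 7.7 MUL,
# one instruction per unit per cycle, no dependencies.
MIX = {"fma": 10.7, "add": 12.5, "mul": 7.7}

def cycles(unit0, unit1):
    """unit0/unit1: sets of instruction types each unit can execute.
    Work only one unit supports is pinned there; shared work is
    balanced; the busier unit determines the cycle count."""
    only0 = sum(n for op, n in MIX.items() if op in unit0 and op not in unit1)
    only1 = sum(n for op, n in MIX.items() if op in unit1 and op not in unit0)
    shared = sum(n for op, n in MIX.items() if op in unit0 and op in unit1)
    total = only0 + only1 + shared
    return max(total / 2, only0, only1)

# An FMA unit can also execute plain ADD and MUL.
fma_fma = cycles({"fma", "add", "mul"}, {"fma", "add", "mul"})  # 15.45
fma_add = cycles({"fma", "add", "mul"}, {"add"})                # 18.4
fma_mul = cycles({"fma", "add", "mul"}, {"mul"})                # 23.2
```

The asymmetric configurations lose because FMAs (and whichever other type the second unit can't handle) all queue on the single FMA unit.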

But here's the most perplexing result of all: On a MUL+ADD configuration the average instruction mix would ideally execute in 17.85 cycles. That's faster than the FMA+ADD configuration! In other words, more hardware is leading to lower performance. To understand this you need to realize that FMA instructions can't be split between the two units; you're forcing them to use a single unit, the same unit responsible for executing MUL, leaving the ADD unit idle for part of the time. So the only guarantee for higher performance is to have two FMA units.

Another vital result of this paper is their comparison of a "cascaded" FMA implementation, where executing an ADD takes a shortcut so it doesn't suffer from the high latency. The area and power consumption are practically the same, while total performance improves by 4-6% over a non-cascaded implementation. So higher latency is no argument against replacing an ADD unit with an FMA unit!

I rest my case. Haswell has to have dual 256-bit FMA units.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Perhaps Haswell's CPU FP units will have higher compute power than AMD's, but I believe it will not be enough to compete against future Bulldozer-based + GCN APUs.

Again, you are not taking into account the compute power of the Haswell IGP, which is an unknown at this point. That, coupled with the new Haswell instructions, should put up some decent numbers.

We will get some idea of how well (or poorly) Intel's IGP can handle compute FP with IB in a few months. IB will be the first Intel IGP to support OpenCL.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Again, you are not taking into account the compute power of the Haswell IGP, which is an unknown at this point. That, coupled with the new Haswell instructions, should put up some decent numbers.

We will get some idea of how well (or poorly) Intel's IGP can handle compute FP with IB in a few months. IB will be the first Intel IGP to support OpenCL.

I'm taking both CPU and iGPU compute power into consideration; I'm just saying that even if Haswell's CPU FP unit is much faster than AMD's, it will not be enough to counter the much faster iGPU in a future Fusion GCN APU.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Crap, you mean we aren't going to get something for nothing after all?
We'd get a mixed bag if they go for an FMA+ADD or FMA+MUL configuration, which is sometimes slower than what we have today (see my previous post).

However we do get a guaranteed floating-point performance improvement if they go with dual FMA units. And it doesn't require a recompile of old software to benefit from it either. Of course we don't completely get it for "nothing" either, but it's a relatively low hardware cost and unlike the alternatives all software wins. It offers a pretty compelling ROI.
I've often wondered just how much opportunity is even out there for ISA-based performance improvements to IPC. After tweaking on x86 for what now, 40yrs?

Is there much left to improve upon (in the ISA itself) or is it really the microarchitecture that delivers the performance/watt and IPC improvements here out?
Both. On the ISA side there's AVX2 and BMI1/2 that will have a significant impact on some software, but will require a recompile. That's stuff that's currently on the roadmap. Going forward we should expect to see AVX-1024 reduce the power consumption of the front-end while also offering latency hiding.

As far as the micro-architecture goes, we can still get an IPC benefit for legacy code by extending macro-op fusion to obtain 3-operand instructions. x86 has long been criticized for not having non-destructive instructions, so that would fix it without even breaking compatibility.
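For readers unfamiliar with the terminology: a destructive 2-operand instruction overwrites one of its sources, so preserving that source costs an extra register-to-register copy, and eliminating that copy is exactly what a fused 3-operand form buys. A toy Python model of the instruction counts (the "instructions" here are illustrative, not real mnemonics):

```python
# Model registers as a dict and count "instructions" executed.
regs = {"a": 3.0, "b": 4.0}

# Destructive 2-operand style: dst op= src clobbers dst. To compute
# c = a + b while keeping a alive, a copy is needed first:
count = 0
regs["c"] = regs["a"]; count += 1   # mov  c, a
regs["c"] += regs["b"]; count += 1  # add  c, b   (destroys c's old value)
two_op = count

# Non-destructive 3-operand style: one instruction, both sources survive.
count = 0
regs["d"] = regs["a"] + regs["b"]; count += 1  # add  d, a, b
three_op = count

assert two_op == 2 and three_op == 1 and regs["a"] == 3.0
```

If the front-end can fuse the mov+add pair into one internal 3-operand macro-op, old binaries get the benefit without recompilation, which is the point being made above.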
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
Ummmm....

I don't think so...

If you want to talk about RAM speed... DDR3-2100 was possible on LGA1366.

If you ask me... the idiot at Intel wanted a different look for an Intel platform vs. AMD, because AMD started copying names off Intel.

Are you the only one that knows what I mean? Did I not make sense?

In case you guys didn't know... Sandy Bridge-E has RAM slots sandwiching the CPU.

Now tell me, was there really truly a reason to stack the RAM modules next to the CPU socket?
Is someone really going to tell me it made a revolutionary difference in how the CPU accepts RAM?

No, I'm talking about the physical RAM layout of the board...

Do you know how tough that makes it for us having to cool stuff down? Also the limitations on the MOSFETs which can be fitted on the board?
True, I know we never needed that many MOSFETs, but... it makes the MOSFET placement more staggered, and makes cooling it more difficult..

Having designed traces for latency- and noise-sensitive circuitry, there are definitely reasons for compacting them and separating signal wires like that. IDC is likely right that this will allow for lower latency and higher clock speeds. Having too many traces in parallel causes signal-noise issues, which reduces the frequency you can run that signal at.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
I'm taking both CPU and iGPU compute power into consideration; I'm just saying that even if Haswell's CPU FP unit is much faster than AMD's, it will not be enough to counter the much faster iGPU in a future Fusion GCN APU.

That may very well be the case. But without knowing anything about Haswell's IGP, it is a bold claim to make.

On a side note, I would love to see Intel mix a Haswell CPU with a Knights Corner co-processor. That would be a nice little cruncher.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
A penny for your thoughts.

I'll have to speak on very broad terms so I can't go into what ISA improvements we HAVE found...

But let me take another angle on the question: if we've been poking at x86 forever and it had ISA improvements available, why didn't we put them in a long time ago? Believe it or not, out of all the features I pushed, only one was rejected because we couldn't come up with a logic solution for it. Most of it comes from studying new application traces, and since we're using the CPU in more places, we regularly get new ideas based on how software is evolving (or where we WANT it to evolve). So there are new applications that are becoming common, and when we spot these and see what's going on, we come up with new ISA. At least that's how my world on the chip evolves.

Just wanted to make one more personal observation about how competition helps innovation. In a way it's sort of true, but let me use the analogy of homework. People are good at homework because they know there's a solution, so they work really hard. If we suck at something compared to the competition, we know for sure there's a better way to do it, and we find it. Meanwhile, in the opposite case where we're faster than everyone else, finding a way to make it faster is almost like getting homework which may or may not have an answer. So it's not that we get lazy; it's just that when we suck, the solution is a little more obvious. Maybe it's a mental thing.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
I'm taking both CPU and iGPU compute power into consideration; I'm just saying that even if Haswell's CPU FP unit is much faster than AMD's, it will not be enough to counter the much faster iGPU in a future Fusion GCN APU.

You need to keep in mind the type of code that is running. Throw some branchy control code at a GPU and what happens? Performance plummets.

If Haswell has significantly upgraded x86 FP capabilities, it's possible it could outperform a GPU in real-world tasks.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Llano uses 1866MHz, Trinity will use 2133MHz. Memory prices drop every year and memory speeds rise too. 2013-2014 will see the introduction of DDR4, and perhaps Haswell will utilize a DDR4 memory controller too.
That's only a measly 14% bandwidth increase for Trinity. To put things in perspective: the transistor count still doubles every two years.

Caches can offer a partial solution to the increasing bandwidth demands, but ironically Llano has no L3 cache and neither will Trinity. If they did, they'd be sacrificing computational density or they would cost more.
Well, Llano has 480 GFLOPS in 2011, Trinity will have 50% more compute power in 2012, and AMD's next-gen Fusion chip on 22nm SOI HKMG in 2013 will probably have double Trinity's compute power (GCN). That is only the compute from the iGPU ALUs; add the CPU FP units as well, and AMD's 22nm APUs with GCN will be a beast in FP compute power.
Too bad they won't be able to use half of that power because there's no bandwidth for it.

Seriously, it's just not going to happen. Even AMD itself only claims 30% higher performance for Trinity. Neither should you expect huge leaps for its successor, unless the price goes up accordingly.
 

denev2004

Member
Dec 3, 2011
105
1
0
Indeed, but what's your point? Physical register files are more about reducing power consumption than anything else. How does it relate to the discussion of increasing throughput?

I just mean that widening the execution units means having larger register files, which take up a lot of area and consume a lot of power. Sandy Bridge largely avoided this problem relative to Nehalem because of the PRF, but Haswell doesn't get another chance to make that switch, which means it might have higher power consumption.

Why would Intel reduce it in any way? They've got a good architecture to build on. I'm not expecting them to radically change anything that would jeopardize IPC. They are more likely to make incremental changes, like extending macro-op fusion to create 3-operand instructions.
It sounds like, if they widen the vector units and we don't take the 22nm process into account, they'll reduce it by increasing power consumption.


Yes, there are two 128-bit "lanes" which perform the same operation, so SSE instructions use only the lower half.

But as TuxDave already indicated that's not what I meant. Not having two FMA units will result in no performance gain for older code which only contains MUL and ADD instructions (and for numerical consistency reasons even new code may opt not to use FMA since it rounds the result differently). So there's good incentive for Intel to not promote one unit to FMA, but both.

Also note that even with one FMA unit and 256-bit integer operations they are forced to double the cache bandwidth(s). Without it there would hardly be any gains. But it would be a shame to still only partially take advantage of that extra bandwidth with a crippled floating-point implementation. I really don't see why on a 22 nm process they would hold back and make damaging compromises like that.
Oops, I didn't know that.
 

denev2004

Member
Dec 3, 2011
105
1
0
And... I'd like to ask: is there any information about the percentage of use of MUL, ADD and MAD instructions in most software?
 

gplnpsb

Member
Sep 4, 2011
25
0
0
I wonder if the return to 95W TDP from 77W is related to the increase from 256K L2 cache in IVB to the reported 1MB L2 per core for Haswell.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
But the solution is AVX2. It doubles integer and floating-point vector performance, brings gather support for parallel memory accesses, and all of this is unified into the CPU cores. So the heterogeneous communication overhead is eliminated; out-of-order execution allows it to achieve higher instruction-level parallelism, so less data-level parallelism is required and hence it suffers less from Amdahl's Law; it also handles register pressure more elegantly, and better cache hit rates mean it requires less RAM bandwidth.

AVX is a SIMD extension, and as such it needs memory-level parallelism; all SSE instructions are SIMD, and they need memory parallelism to achieve more data parallelism. ILP has come to a point where we cannot increase it substantially anymore, so we are turning towards DLP and TLP to increase performance.
AVX2 will need more memory bandwidth (cache or system), and because of its larger register files it will increase CPU power consumption and die size.
Those two could be the reasons why Intel chose FMA3 instead of the FMA4 that AMD chose and implemented with Bulldozer. Again, FMA is a SIMD instruction.

SSE: Streaming SIMD Extensions
AVX: Advanced Vector Extensions
SIMD: Single Instruction, Multiple Data
MIMD: Multiple Instruction, Multiple Data
ILP: Instruction-Level Parallelism
TLP: Thread-Level Parallelism
DLP: Data-Level Parallelism
FMA: Fused Multiply-Add
 

tech97

Junior Member
Dec 27, 2011
5
0
0
Will AMD have something to compete with the Haswell architecture,
or will it leave the high-end market as I heard?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
AVX is a SIMD instruction
Not only; there is also a scalar variant of most instructions. For example, when recompiling code for AVX2 targets, scalar FMA instructions are used (in non-vectorizable routines).

AVX2 will need more memory bandwidth (cache or system)
Yes, but only if there are indeed two FMA units per core; otherwise FMA allows for more efficient code with fewer register fills/spills, and thus less pressure on the L1D cache. Note that current AVX already needs more bandwidth to reach its full potential: the SSE/AVX-128 to AVX-256 speedups are quite disappointing on Sandy Bridge due to the lackluster L1D and L2/LLC bandwidth. I hope that Ivy Bridge will improve the situation; the remark about the 1.25x speedup in Excel 2010 in the leaked IVB benches http://www.tomshardware.com/gallery/intel_ivy_bridge_performance_1,0101-317230-0-2-3-0-jpg-.html makes me think there are some strong L1D and/or L2 bandwidth enhancements.

and because of its larger register files
Why do you think Haswell will have a larger register file?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Haswell is all about AVX2. Its key feature is 'gather' support, which in some cases enables an eightfold increase in performance.

It looks like you are expecting way too much performance increase from packed gather. I wouldn't expect more than a 2x-4x speedup vs. software-synthesized gather at the gather-instruction level (assuming 0% cache misses); it should amount to something like a 1.5x-2x speedup at the inner-loop level, and maybe a 1.1x speedup for a whole "gather-intensive" application.

Also, it remains to be seen how AVX2 gather behaves with SMT. With software-synthesized gather, two threads can make progress in parallel; we are not sure of the behavior of hardware gather. If it's a serializing instruction, it will hurt MT code.
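For context, a gather loads each SIMD lane from its own separately computed address; the "software synthesized" version is just a scalar loop over the lanes, which is what AVX2's hardware gather collapses into a single instruction. A minimal Python model of the semantics (table and indices are illustrative):

```python
# Software-synthesized gather: one scalar load per SIMD lane.
# AVX2's vgatherdps performs the equivalent of this loop in one
# instruction for 8 single-precision lanes.

def gather(base, indices):
    """out[i] = base[indices[i]] for each lane."""
    return [base[i] for i in indices]

table = [x * x for x in range(16)]  # illustrative lookup table
idx = [3, 0, 7, 7, 1, 15, 2, 4]     # one index per lane (8 lanes)

assert gather(table, idx) == [9, 0, 49, 49, 1, 225, 4, 16]
```

Since each lane can hit a different cache line, the speedup over the scalar loop is bounded by memory-port throughput rather than by the instruction count alone, which is why the expectations above are modest.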
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
And... I'd like to ask: is there any information about the percentage of use of MUL, ADD and MAD instructions in most software?

I can't speak for other FP-intensive applications, but for 3D rendering, an area I know quite well, FMADD/FMSUB usage can be as high as 80%, with roughly 10% independent ADD/SUB and 10% MUL.
 