Haswell Integrated GPU = Larrabee3?

IonusX

Senior member
Dec 25, 2011
392
0
0

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Looks like HSW will be using Larrabee cores as part of its iGPU.
That is highly surprising considering that Haswell is confirmed to add AVX2 support, which includes 'gather' and fused multiply-add instructions. That makes it practically equivalent to the Larrabee instruction set. So it doesn't make sense to have CPU and IGP cores with instruction sets which are very similar, yet not identical.

When GPUs started to have similar vertex and pixel processing capabilities, their cores unified. The primary thing that is lacking from AVX2 is power efficiency. But that can be solved with AVX-1024: The AVX encoding is known to be extendable to 1024-bit instructions. By executing them in four cycles on the existing 256-bit SIMD units, the power-hungry front-end of the CPU can be clock gated to achieve considerable power savings and rival the efficiency of a GPU.
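To put rough numbers on that idea, here is a toy model, not a description of any real Intel design. It assumes a single 256-bit execution port that is always fed, and a hypothetical AVX-1024 instruction that occupies that port for four consecutive cycles. The function name `frontend_activity` and its parameters are made up for illustration.

```python
def frontend_activity(total_bits, instr_bits, unit_bits=256):
    """Toy model: one execution port of width unit_bits, always kept busy.

    Returns (instructions issued, execution cycles,
             fraction of cycles in which the front-end must deliver an instruction).
    """
    instructions = total_bits // instr_bits
    cycles_per_instruction = instr_bits // unit_bits  # 1 for AVX-256, 4 for AVX-1024
    exec_cycles = instructions * cycles_per_instruction
    return instructions, exec_cycles, instructions / exec_cycles


work = 1 << 20  # bits of vector data to process (arbitrary amount)
print(frontend_activity(work, 256))   # (4096, 4096, 1.0)
print(frontend_activity(work, 1024))  # (1024, 4096, 0.25)
```

In this model the execution time is identical in both cases, but the front-end only has to supply an instruction on one cycle in four, which is the window in which it could in principle be clock gated. Real cores have multiple ports, scalar work, and SMT, all of which this sketch ignores.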
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
I've got DH saying that Haswell will only be 50% better than the current Intel offering, so I'm going to say that, in the event it does have Larrabee relations, in practice it will be so weak that it won't matter.


http://www.donanimhaber.com/islemci/haberleri/intelin-Haswell-GPUsu-ivy-Bridge-GPUsundan-50-daha-hizli-olacak-.htm

Well, Intel says 15x. Intel has been really good about keeping up with its graphics performance metrics for unreleased products. 15x sounds about right compared with what Intel has done in the last 3 years.
 

Khato

Golden Member
Jul 15, 2001
1,251
321
136
Tsk tsk, that resume is quite the blunder. I really do wonder what he was thinking putting down actual project details rather than generalities - potential employers don't exactly like an employee who reveals confidential product details well before release, after all. Heh, the fact that he started work at Intel in 2004 makes me wonder if he's following the frequent practice of never intending to return from the 7-year sabbatical.

Oh yeah, if you feel like believing Donanimhaber, then you might at least consider reading the article correctly - it's stating a 50% gain on SNB -> IVB, and then another 50% gain on IVB -> HSW.
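For what it's worth, two successive 50% gains compound rather than add. A quick check, taking the rumored (unconfirmed) figures at face value; the variable names are just for illustration:

```python
# Rumored per-generation speedups (from the Donanimhaber article, not confirmed)
snb_to_ivb = 1.5
ivb_to_hsw = 1.5

# Gains multiply across generations
hsw_vs_snb = snb_to_ivb * ivb_to_hsw
print(hsw_vs_snb)  # 2.25
```

So if the article is right, that would put Haswell at 2.25x Sandy Bridge, not 1.5x.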
 

NTMBK

Lifer
Nov 14, 2011
10,409
5,673
136
If true, this could be an interesting development that could really sweep the legs out from under AMD's Fusion push. A large number of lightweight cores and half a dozen large and heavyweight cores all on the same chip, all running the same instruction set? It would take away a lot of the complications of current GPGPU toolkits, as well as letting Intel actually leverage the 20% of the chip currently given over to graphics in non-graphics related tasks.

Of course, this all depends on the idea of Intel actually getting their MIC architecture to a point where it can run graphics.
 

tweakboy

Diamond Member
Jan 3, 2010
9,517
2
81
www.hammiestudios.com
http://www.indeed.com/r/b1b4922f298d8c33



Looks like HSW will be using Larrabee cores as part of its iGPU. Discuss!


Doesn't matter, you're gonna disable onboard graphics cuz they blow goat. A CPU/GPU will not, and for now never will, defeat ATI and nVidia in performance.

There is a reason why a GPU costs 500 bucks and a top-of-the-line Sandy costs 300 bucks... onboard GPU will always blow. Onboard audio isn't that bad tho... gl
 

NTMBK

Lifer
Nov 14, 2011
10,409
5,673
136
Doesn't matter, you're gonna disable onboard graphics cuz they blow goat. A CPU/GPU will not, and for now never will, defeat ATI and nVidia in performance.

There is a reason why a GPU costs 500 bucks and a top-of-the-line Sandy costs 300 bucks... onboard GPU will always blow. Onboard audio isn't that bad tho... gl

But what if that integrated graphics consists of a large number of small x86 cores? Which can be used to run your applications instead of graphics if you decide to hook up a discrete card?
 

Khato

Golden Member
Jul 15, 2001
1,251
321
136
Well, before anyone gets too excited about the possibilities of Larrabee-based integrated graphics, I'll point out that this contradicts a slide that showed up around half a year ago: Workstation Processor Gfx Roadmap. That said, I do find the lack of information about Haswell Gfx somewhat surprising; maybe more details will be released around the time of the Ivy Bridge launch.
 

BallaTheFeared

Diamond Member
Nov 15, 2010
8,115
0
71
Intel isn't trying to produce the best gaming igpu... I wouldn't expect anything amazing on that front.

However, GPGPU parallel processing with support for x86 instructions sounds fun. Look at Quick Sync for an idea, IMO, of where Intel wants to take GPUs.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
I'm just going to put this out there....

People forget to update their resume...
 

lol123

Member
May 18, 2011
162
0
0
Intel has already stated repeatedly that Haswell will have upgraded GTxxx graphics just like Sandy Bridge. Further down the road they are bound to use either LRBni or AVX2 software rendering though.
 

Puppies04

Diamond Member
Apr 25, 2011
5,909
17
76
Doesn't matter, you're gonna disable onboard graphics cuz they blow goat. A CPU/GPU will not, and for now never will, defeat ATI and nVidia in performance.

There is a reason why a GPU costs 500 bucks and a top-of-the-line Sandy costs 300 bucks... onboard GPU will always blow. Onboard audio isn't that bad tho... gl

Can you point out where somebody said they were going to be playing BF3 on ultra with an integrated gpu? You are missing the point so completely I don't even have words for it.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
Looks like HSW will be using Larrabee cores as part of its iGPU. Discuss!

Highly doubt it. The late-2012 Knights Corner is a graphics-less derivative of Knights Ferry, which itself was Larrabee. The more-or-less official (but NDA) Intel slide says Haswell is a Gen 7 part, a heavily modified version of the architecture that the Sandy Bridge and Ivy Bridge GPUs use.

It may have been Larrabee-derivative at one point, but it sure isn't now.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
I'm just going to put this out there....

People forget to update their resume...

Oh you, always trying to be the voice of reason and sanity

Nice try, but you know the rumor and thread is going to live on until Haswell is finally released, right?
 

dealcorn

Senior member
May 28, 2011
247
4
76
Acronyms sure are confusing, especially when folks are so into it that they use defined terms which you may not really understand. I read:

"iLRB (Larabee-3 slice for Haswell Client product and Discrete Larabee-3)"

as iLRB (AVX2 and MIC). This is all known, previously disclosed stuff: it does not sound newsworthy in the way the OP suggested.

The interesting part is the suggestion that the AVX2 FUB for the Haswell client may be different from the AVX2 FUB for the Haswell non-client (server) product. AVX2 is a general-purpose instruction set that serves different applications. The instruction set is cast in stone, but it appears there may be differences in the implementation between client and server. Hmmm.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
By executing them in four cycles on the existing 256-bit SIMD units, the power-hungry front-end of the CPU can be clock gated to achieve considerable power savings

What can you clock gate on a 4-way superscalar core with SMT? How will you make it work?
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
What can you clock gate on a 4-way superscalar core with SMT? How will you make it work?
They can at least clock gate the in-order front-end, including L1I fetch, predecode, branch prediction, decoders, uop cache, and register renaming. This can be achieved by observing when the schedulers are full or nearly full.

It should also be possible to clock gate parts of the schedulers as well since the execution of an AVX-1024 instruction on a 256-bit unit means the next three cycles no other instruction should be scheduled. So the power savings can be very substantial, resulting in nearly the efficiency of an in-order architecture for high-throughput workloads.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
They can at least clock gate the in-order front-end, including L1I fetch, predecode, branch prediction, decoders

already achieved in SNB for inner loops with 100% uop cache hit

uop cache, and register renaming.

No way, or only if your "AVX-1024" instructions are executed one at a time and block all other instructions from the same thread and, worse, from the other thread(s). It also assumes a single FMA unit, or two units working in lockstep. All in all it looks like a bad idea: it's very typical to execute loads and stores in parallel with SIMD computations, for example. Also think of the GPR/scalar instructions that you don't want to serialize like that. After all, the whole point of a pipeline is to make progress on every clock.

since the execution of an AVX-1024 instruction on a 256-bit unit means the next three cycles no other instruction should be scheduled.
Exactly, that's why I mentioned 4-way superscalar and SMT; both go deeply against your idea (IMO, I'd be interested to hear what an actual CPU architect has to say about it).

So the power savings can be very substantial, resulting in nearly the efficiency of an in-order architecture for high-throughput workloads.

Sure, by executing far less work per clock there should be some savings but performance will be horrible
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
already achieved in SNB for inner loops with 100% uop cache hit
Indeed, and AVX-1024 would just extend it. In particular if you want to achieve latency hiding with AVX-256 it's easy to blow out the uop cache due to unrolling and spilling, while with AVX-1024 that's less likely to happen. And even in cases where it does happen, the average rate at which instructions need to be fetched is lower so you can still save power in these stages.
No way, or only if your "AVX-1024" instructions are executed one at a time and block all other instructions from the same thread
Why would clock gating the register renamer result in only being able to execute one instruction? The scheduler queues still contain plenty of instructions with already renamed registers.
and, worse, from the other thread(s), it's also assuming a single FMA unit or 2 units working in lockstep, all in all it looks like a bad idea, it's very typical to issue a load and a store on the same clock as SIMD computations for example, also think of the GPR/scalar instructions that you don't want to serialize like that
I wasn't suggesting clock gating the entire scheduler. Only the portions that are blocked for the next three cycles anyway because the port is occupied with the AVX-1024 instruction.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
In particular if you want to achieve latency hiding with AVX-256 it's easy to blow out the uop cache due to unrolling and spilling, while with AVX-1024 that's less likely to happen.

I don't see how the width of the vectors is related to the number of uops in your loops. With 2x wider vectors you simply execute half as many iterations of your loops, but the best amount of unrolling is more or less the same, and so the number of uops per loop is also much the same. Also note that a small amount of unrolling provides a speedup only for small loops (<50 uops) and is plain useless or even detrimental for bigger loops, so in practice the uop cache hit ratios are very high. For high-throughput code you enjoy very high temporal locality for your code, and 99.9% of the bandwidth/latency issues come from accessing your data, not from fetching code.
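A back-of-the-envelope sketch of that argument: the uop footprint of a loop depends on the size of the body and the unroll factor, while the vector width only changes the trip count. The function `loop_shape` and the 10-uop body are hypothetical, and remainder iterations are ignored.

```python
def loop_shape(n_elements, vector_bits, body_uops, unroll=1, element_bits=32):
    """Return (iterations, uops in the loop body) for a simple vectorized loop."""
    lanes = vector_bits // element_bits          # elements handled per vector op
    iterations = n_elements // (lanes * unroll)  # trip count shrinks with width
    uops_in_loop_body = body_uops * unroll       # footprint ignores vector width
    return iterations, uops_in_loop_body


print(loop_shape(4096, 128, 10))  # (1024, 10)
print(loop_shape(4096, 256, 10))  # (512, 10)
```

Doubling the width halves the iterations, but the loop body (and therefore its share of the uop cache) stays at 10 uops in both cases. Only the `unroll` parameter grows the footprint.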

Why would clock gating the register renamer result in only being able to execute one instruction?
At the very least it introduces a big bubble (3x4 uops), whatever the number of instructions already in flight down the pipe, doesn't it?
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
I don't see how the width of the vectors is related to the number of uops in your loops. With 2x wider vectors you simply execute half as many iterations of your loops, but the best amount of unrolling is more or less the same, and so the number of uops per loop is also much the same.
No, to achieve good latency hiding you need to repeat every instruction several times (each processing different data - here's an example). So this leads to more instructions, and like I said before there will be even more to deal with the increased register pressure.

With AVX-1024 executed on 256-bit units the unrolling/pipelining process would be implicit. It's as if you have four AVX-256 instructions, rolled into a single instruction. And it wouldn't suffer from additional register pressure.
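Here is a rough sketch of the bookkeeping behind that claim. It assumes a 4-cycle arithmetic latency and a loop body of 6 instructions using 5 vector registers; all three numbers are made up for illustration, as are the function names. The one hard fact used is that AVX exposes 16 vector registers in 64-bit mode.

```python
import math

YMM_REGISTERS = 16  # architectural AVX vector registers in 64-bit mode


def unroll_needed(latency_cycles, instr_bits, unit_bits=256):
    """Independent copies needed to keep one port busy for a full latency period."""
    cycles_per_instr = instr_bits // unit_bits  # cycles one instruction occupies the port
    return math.ceil(latency_cycles / cycles_per_instr)


def loop_body_cost(base_instructions, base_registers, latency_cycles, instr_bits):
    """Return (instructions, live vector registers) after explicit unrolling."""
    u = unroll_needed(latency_cycles, instr_bits)
    return base_instructions * u, base_registers * u


print(loop_body_cost(6, 5, 4, 256))   # (24, 20): more than 16 registers, so spills
print(loop_body_cost(6, 5, 4, 1024))  # (6, 5): no explicit unrolling needed
```

Under these assumptions the AVX-256 version needs 4x the instructions and runs past the 16 available registers, while the hypothetical AVX-1024 version keeps the original body because each instruction already covers the latency on its own.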
At the very least it introduces a big bubble (3x4 uops), whatever the number of instructions already in flight down the pipe, doesn't it?
No, the AVX-1024 instructions keep the 256-bit execution units busy for four cycles, so there is no bubble.
 