Haswell Integrated GPU = Larrabee3?

Arachnotronic · Jan 22, 2012

http://www.indeed.com/r/b1b4922f298d8c33

Technical Lead for Graphics and Media cluster's micro-architecture validation for iLRB (Larabee-3 slice for Haswell Client product and Discrete Larabee-3)

Looks like HSW will be using Larrabee cores as part of its iGPU. Discuss!

IonusX · Jan 22, 2012

Intel17 said:
http://www.indeed.com/r/b1b4922f298d8c33

Looks like HSW will be using Larrabee cores as part of its iGPU. Discuss!

Donanimhaber (DH) is reporting that Haswell will only be 50% better than the current Intel offering, so even if it does have Larrabee relations, I'd say that in practice it will be so weak that it won't matter.

http://www.donanimhaber.com/islemci/haberleri/intelin-Haswell-GPUsu-ivy-Bridge-GPUsundan-50-daha-hizli-olacak-.htm

CPUarchitect · Jan 22, 2012

Intel17 said:
Looks like HSW will be using Larrabee cores as part of its iGPU.

That is highly surprising considering that Haswell is confirmed to add AVX2 support, which includes 'gather' and fused multiply-add instructions. That makes it practically equivalent to the Larrabee instruction set. So it doesn't make sense to have CPU and IGP cores with instruction sets which are very similar, yet not identical.

When GPUs started to have similar vertex and pixel processing capabilities, their cores unified. The primary thing that is lacking from AVX2 is power efficiency. But that can be solved with AVX-1024: The AVX encoding is known to be extendable to 1024-bit instructions. By executing them in four cycles on the existing 256-bit SIMD units, the power-hungry front-end of the CPU can be clock gated to achieve considerable power savings and rival the efficiency of a GPU.
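To put rough numbers on the front-end saving described above, here is a minimal Python sketch of the instruction-count arithmetic. AVX-1024 is hypothetical, and the function name, the 32-bit element size, and the cost model (one instruction per vector, occupying the unit for one cycle per 256-bit chunk) are assumptions for illustration only, not anything Intel has specified.

```python
def frontend_vs_execution(num_floats, vector_bits, unit_bits=256):
    """Return (instructions delivered by the front-end, busy cycles on the SIMD unit).

    Toy model: one arithmetic instruction per vector of 32-bit floats. An
    instruction wider than the execution unit occupies that unit for
    vector_bits / unit_bits consecutive cycles.
    """
    floats_per_vector = vector_bits // 32
    instructions = -(-num_floats // floats_per_vector)  # ceiling division
    cycles_per_instruction = max(1, vector_bits // unit_bits)
    busy_cycles = instructions * cycles_per_instruction
    return instructions, busy_cycles


# Same 4096-element workload, two instruction widths, same 256-bit unit.
print(frontend_vs_execution(4096, 256))   # (512, 512)
print(frontend_vs_execution(4096, 1024))  # (128, 512)
```

In this model the execution unit is busy for the same number of cycles either way, but the front-end has to deliver a quarter as many instructions, which is where the clock gating opportunity would come from.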

Nemesis 1 · Jan 22, 2012

IonusX said:
Donanimhaber (DH) is reporting that Haswell will only be 50% better than the current Intel offering, so even if it does have Larrabee relations, I'd say that in practice it will be so weak that it won't matter.

http://www.donanimhaber.com/islemci/haberleri/intelin-Haswell-GPUsu-ivy-Bridge-GPUsundan-50-daha-hizli-olacak-.htm

Well, Intel says 15x. Intel has been really good about meeting its stated graphics performance targets for unreleased products. 15x sounds about right compared with what Intel has done in the last 3 years.

Khato · Jan 22, 2012

Tsk tsk, that resume is quite the blunder. I really do wonder what he was thinking putting down actual project details rather than generalities. Potential employers don't exactly like an employee who reveals confidential product details well before release, after all. Heh, the fact that he started work at Intel in 2004 makes me wonder if he's following the common practice of intending never to return from the 7-year sabbatical.

Oh yeah, if you feel like believing Donanimhaber, then you might at least consider reading the article correctly: it's stating a 50% gain from SNB -> IVB, and then another 50% gain from IVB -> HSW.
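For reference, the two steps compound rather than add. A quick Python check, taking the article's 50% per-generation figures at face value (the function name is just for illustration):

```python
def compound_gain(per_generation_gains):
    """Multiply out successive fractional gains into one overall speedup factor."""
    total = 1.0
    for gain in per_generation_gains:
        total *= 1.0 + gain
    return total


# SNB -> IVB (+50%), then IVB -> HSW (+50%)
print(compound_gain([0.50, 0.50]))  # 2.25
```

So if both figures hold, Haswell graphics would land at 2.25x Sandy Bridge, not 1.5x.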

NTMBK · Jan 22, 2012

If true, this could be an interesting development that could really sweep the legs out from under AMD's Fusion push. A large number of lightweight cores and half a dozen large and heavyweight cores all on the same chip, all running the same instruction set? It would take away a lot of the complications of current GPGPU toolkits, as well as letting Intel actually leverage the 20% of the chip currently given over to graphics in non-graphics related tasks.

Of course, this all depends on the idea of Intel actually getting their MIC architecture to a point where it can run graphics.

tweakboy · Jan 22, 2012

Intel17 said:
http://www.indeed.com/r/b1b4922f298d8c33

Looks like HSW will be using Larrabee cores as part of its iGPU. Discuss!

Doesn't matter, you're gonna disable onboard graphics because they blow goat. A CPU/GPU will not, at least for now, defeat ATI and nVidia in performance.

There is a reason why a GPU costs 500 bucks and a top-of-the-line Sandy costs 300 bucks. Onboard GPU will always blow. Onboard audio isn't that bad though. gl

NTMBK · Jan 22, 2012

tweakboy said:
Doesn't matter, you're gonna disable onboard graphics because they blow goat. A CPU/GPU will not, at least for now, defeat ATI and nVidia in performance.

There is a reason why a GPU costs 500 bucks and a top-of-the-line Sandy costs 300 bucks. Onboard GPU will always blow. Onboard audio isn't that bad though. gl

But what if that integrated graphics consists of a large number of small x86 cores? Which can be used to run your applications instead of graphics if you decide to hook up a discrete card?

Arachnotronic · Jan 22, 2012

NTMBK said:
But what if that integrated graphics consists of a large number of small x86 cores? Which can be used to run your applications instead of graphics if you decide to hook up a discrete card?

So...Fusion!

Khato · Jan 22, 2012

Well, before anyone gets too excited about the possibilities of Larrabee-based integrated graphics, I'll point out that this contradicts a slide that showed up around half a year ago: Workstation Processor Gfx Roadmap. That said, I do find the lack of information about Haswell Gfx somewhat surprising. Maybe more details will be released around the time of the Ivy Bridge launch.

BallaTheFeared · Jan 22, 2012

Intel isn't trying to produce the best gaming igpu... I wouldn't expect anything amazing on that front.

However, GPGPU parallel processing with support for x86 instructions sounds fun. Look at Quick Sync for an idea, IMO, of where Intel wants to take GPUs.

TuxDave · Jan 22, 2012

I'm just going to put this out there....

People forget to update their resume...

lol123 · Jan 22, 2012

Intel has already stated repeatedly that Haswell will have upgraded GTxxx graphics just like Sandy Bridge. Further down the road they are bound to use either LRBni or AVX2 software rendering though.

Puppies04 · Jan 22, 2012

tweakboy said:
Doesn't matter, you're gonna disable onboard graphics because they blow goat. A CPU/GPU will not, at least for now, defeat ATI and nVidia in performance.

There is a reason why a GPU costs 500 bucks and a top-of-the-line Sandy costs 300 bucks. Onboard GPU will always blow. Onboard audio isn't that bad though. gl

Can you point out where somebody said they were going to be playing BF3 on ultra with an integrated gpu? You are missing the point so completely I don't even have words for it.

Bman123 · Jan 22, 2012

IntelUser2000 · Jan 22, 2012

Intel17 said:
Looks like HSW will be using Larrabee cores as part of its iGPU. Discuss!

Highly doubt it. The late-2012 Knights Corner is a graphics-less derivative of Knights Ferry, which itself was Larrabee. The more or less official-but-NDA Intel slide says Haswell is a Gen 7 part, which is a heavily modified version of the architecture that Sandy Bridge's and Ivy Bridge's GPUs use.

It may have been Larrabee-derivative at one point, but it sure isn't now.

Idontcare · Jan 22, 2012

TuxDave said:
I'm just going to put this out there....

People forget to update their resume...

Oh you, always trying to be the voice of reason and sanity

Nice try, but you know the rumor and thread is going to live on until Haswell is finally released, right?

dealcorn · Jan 22, 2012

Acronyms sure are confusing, especially when folks are so into it that they use defined terms which you may not really understand. I read:

"iLRB (Larabee-3 slice for Haswell Client product and Discrete Larabee-3)"

as iLRB (AVX2 and MIC). This is all known, previously disclosed stuff: it does not sound newsworthy in the way the OP suggested.

The interesting part is the suggestion that the AVX2 FUB for the Haswell client may be different from the AVX2 FUB for the Haswell non-client (server) product. AVX2 is a general purpose instruction set that serves different applications. The instruction set is cast in stone but it appears there may be differences in the implementation between client and server. Hmmm.

bronxzv · Jan 23, 2012

Intel17 said:
http://www.indeed.com/r/b1b4922f298d8c33
Looks like HSW will be using Larrabee cores as part of its iGPU. Discuss!

He also mentions "Discrete Larabee-3" (sic) in his list. That doesn't make it a real product, though it does show how far along the Larrabee project was before being canceled (recycled for MIC).

bronxzv · Jan 23, 2012

CPUarchitect said:
By executing them in four cycles on the existing 256-bit SIMD units, the power-hungry front-end of the CPU can be clock gated to achieve considerable power savings

what can you clock gate out on a 4-way superscalar core with SMT ? how will you make it work ?

CPUarchitect · Jan 23, 2012

bronxzv said:
what can you clock gate out on a 4-way superscalar core with SMT ? how will you make it work ?

They can at least clock gate the in-order front-end, including L1I fetch, predecode, branch prediction, decoders, uop cache, and register renaming. This can be achieved by observing when the schedulers are full or nearly full.

It should also be possible to clock gate parts of the schedulers as well since the execution of an AVX-1024 instruction on a 256-bit unit means the next three cycles no other instruction should be scheduled. So the power savings can be very substantial, resulting in nearly the efficiency of an in-order architecture for high-throughput workloads.

bronxzv · Jan 23, 2012

CPUarchitect said:
They can at least clock gate the in-order front-end, including L1I fetch, predecode, branch prediction, decoders

already achieved in SNB for inner loops with 100% uop cache hit

CPUarchitect said:
uop cache, and register renaming.

no way, or only if your "AVX-1024" instructions are executed one at a time and block all other instructions from the same thread and, worse, from the other thread(s). It also assumes a single FMA unit or 2 units working in lockstep. All in all it looks like a bad idea: it's very typical to execute loads and stores in parallel with SIMD computations, for example. Also think of the GPR/scalar instructions that you don't want to serialize like that. After all, the whole point of a pipeline is to make progress on every clock.

CPUarchitect said:
since the execution of an AVX-1024 instruction on a 256-bit unit means the next three cycles no other instruction should be scheduled.

exactly, that's why I mentioned 4-way superscalar and SMT, both go deeply against your idea (IMO, I'd be interested to hear what an actual CPU architect has to say about it)

CPUarchitect said:
So the power savings can be very substantial, resulting in nearly the efficiency of an in-order architecture for high-throughput workloads.

Sure, by executing far less work per clock there should be some savings but performance will be horrible

CPUarchitect · Jan 23, 2012

bronxzv said:
already achieved in SNB for inner loops with 100% uop cache hit

Indeed, and AVX-1024 would just extend it. In particular if you want to achieve latency hiding with AVX-256 it's easy to blow out the uop cache due to unrolling and spilling, while with AVX-1024 that's less likely to happen. And even in cases where it does happen, the average rate at which instructions need to be fetched is lower so you can still save power in these stages.

no way, or only if your "AVX-1024" instructions are executed one at a time and block all other instructions from the same thread

Why would clock gating the register renamer result in only being able to execute one instruction? The scheduler queues still contain plenty of instructions with already renamed registers.

and, worse, from the other thread(s), it's also assuming a single FMA unit or 2 units working in lockstep, all in all it looks like a bad idea, it's very typical to issue a load and a store on the same clock as SIMD computations for example, also think of the GPR/scalar instructions that you don't want to serialize like that

I wasn't suggesting clock gating the entire scheduler. Only the portions that are blocked for the next three cycles anyway because the port is occupied with the AVX-1024 instruction.

bronxzv · Jan 23, 2012

CPUarchitect said:
In particular if you want to achieve latency hiding with AVX-256 it's easy to blow out the uop cache due to unrolling and spilling, while with AVX-1024 that's less likely to happen.

I don't see how the width of the vectors is related to the number of uops in your loops. With 2x wider vectors you simply execute half as many iterations of your loops, but the best amount of unrolling is more or less the same, and so the number of uops per loop is also much the same. Also note that a small amount of unrolling provides a speedup only for small loops (<50 uops) and is plain useless or even detrimental for bigger loops, so in practice the uop cache hit ratios are very high. High-throughput code enjoys very high temporal locality, and 99.9% of the bandwidth/latency issues come from accessing your data, not from fetching code.

CPUarchitect said:
Why would clock gating the register renamer result in only being able to execute one instruction?

at least it introduces a big bubble (3x4 uops) whatever the number of instructions already in flight down the pipe, doesn't it?

CPUarchitect · Jan 23, 2012

bronxzv said:
I don't see how the width of the vectors is related to the number of uops in your loops. With 2x wider vectors you simply execute half as many iterations of your loops, but the best amount of unrolling is more or less the same, and so the number of uops per loop is also much the same.

No, to achieve good latency hiding you need to repeat every instruction several times (each processing different data - here's an example). So this leads to more instructions, and like I said before there will be even more to deal with the increased register pressure.

With AVX-1024 executed on 256-bit units the unrolling/pipelining process would be implicit. It's as if you have four AVX-256 instructions, rolled into a single instruction. And it wouldn't suffer from additional register pressure.
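A rough way to quantify this is to count how many independent instructions a port needs in flight to stay busy every cycle. The Python sketch below is a toy model under stated assumptions: the 5-cycle latency is an illustrative number, AVX-1024 is hypothetical, and the model assumes each 256-bit quarter of a result can be forwarded to a dependent instruction as soon as that quarter completes.

```python
import math


def independent_instructions_needed(latency_cycles, vector_bits, unit_bits=256):
    """Independent instructions (and so accumulator register names) needed
    to keep one execution port busy on every cycle.

    Toy model: each instruction occupies the port for
    vector_bits / unit_bits cycles, and a dependent instruction can start
    latency_cycles after its producer started (per-quarter forwarding assumed).
    """
    occupancy = max(1, vector_bits // unit_bits)
    return math.ceil(latency_cycles / occupancy)


# Assumed 5-cycle arithmetic latency, for illustration only.
print(independent_instructions_needed(5, 256))   # 5
print(independent_instructions_needed(5, 1024))  # 2
```

Under these assumptions the 256-bit version has to be unrolled across five independent accumulators, while the 1024-bit version covers the same latency with two, so fewer instructions and fewer register names are needed in the loop body.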

at least it introduces a big bubble (3x4 uops) whatever the number of instructions already in flight down the pipe, doesn't it?

No, the AVX-1024 instructions keep the 256-bit execution units busy for four cycles, so there is no bubble.
