Haswell Core Count

Edrick · May 9, 2012

MisterMac said:
I'll wager 90% of consumer programs would benefit more from adding simple threading than having AVX2 plus being the easiest solution for the software suppliers both now and in future aspect.

I disagree here. New instructions is the best way to increace application performance in most cases. And it is easier to do from a programming level. Adding more threads (say up to 16), is much more difficult and more complex. And it may not even help in as many situations as new instructions.

MisterMac · May 9, 2012

Edrick said:
I disagree here. New instructions is the best way to increace application performance in most cases. And it is easier to do from a programming level. Adding more threads (say up to 16), is much more difficult and more complex. And it may not even help in as many situations as new instructions.

Disagree how? From what factual standpoint?

Your own personal experience?

My personal experience tells me the complete opposite - even with half arsed libraries implementing bad lock-scenarios.

New instructions requires compiler level optimizations.
Threading...not so much even with less than stellar implementations.

ninaholic37 · May 9, 2012

MisterMac said:
My personal experience tells me the complete opposite - even with half arsed libraries implementing bad lock-scenarios.

New instructions requires compiler level optimizations.
Threading...not so much even with less than stellar implementations.

What kind of programs are you making? Have you ever tried to compile a first person shooter so that it can fully utilize 16 threads at all times? What steps would you take to accomplish this?

Edrick · May 9, 2012

MisterMac said:
Disagree how? From what factual standpoint?

Your own personal experience?

My personal experience tells me the complete opposite - even with half arsed libraries implementing bad lock-scenarios.

New instructions requires compiler level optimizations.
Threading...not so much even with less than stellar implementations.

Yes, from my own experience. I develop investment banking software which is very difficult to multithread more than it is now, but adding new instructions can vastly improve speed.

And the Haswell AVX2 instructions are already included in many compilers.

BenchPress · May 9, 2012

Lonbjerg said:
GK104 has been neutered in GPGPU capabilities...it's common knowlegede....wait for Big K before pointing the finger at GPGPU...using the wrong...GPU :whiste:

That won't prevent people from buying the GTX 680. It's still an excellent Graphics Processing Unit. But that means NVIDIA has effectively killed GPGPU for the mainstream market.

The simple truth is that too many sacrifices for graphics performance are needed to make the GPU useful for anything else. Graphics is a streaming workload: you send the GPU data and drawing commands, but you never read the result back. The image gets displayed on the screen and it doesn't matter if this takes 20 ms.

General purpose throughput computing is very different. Take physics for example. You want the CPU to have access to the results, and you want the first results back very fast. So you want the work to be done locally, not on a GPU. And with AVX2 we're finally getting the same high throughput features as the GPU, within the CPU!

Lonbjerg · May 9, 2012

BenchPress said:
That won't prevent people from buying the GTX 680. It's still an excellent Graphics Processing Unit. But that means NVIDIA has effectively killed GPGPU for the mainstream market.

The simple truth is that too many sacrifices for graphics performance are needed to make the GPU useful for anything else. Graphics is a streaming workload: you send the GPU data and drawing commands, but you never read the result back. The image gets displayed on the screen and it doesn't matter if this takes 20 ms.

General purpose throughput computing is very different. Take physics for example. You want the CPU to have access to the results, and you want the first results back very fast. So you want the work to be done locally, not on a GPU. And with AVX2 we're finally getting the same high throughput features as the GPU, within the CPU!

The problem with physics is that it is MASSIVE parallel computations...and 4-8 cores will suffer compared to +1000 cores.

AtenRa · May 9, 2012

Edrick said:
Yea, and Intel could also keep selling their 6 core at $500 and make an even bigger profit. Without AMD stepping up, Intel can do what they want.

Lets see,

Intel will have to compete against their own CPUs first, no one will upgrade from 4 core IB to 4 core Haswell unless it has a huge IPC gain. Remember, both IB and Haswell will be on the same 22nm, dont expect to see a huge power reduction like SB to IB.

Secondly, in 2013 (same year Intel will introduce Haswell) AMD will introduce the SteamRoller. From what we know now, it seams that SteamRoller will have more than 8 threads. It will take an enormous IPC gain for a 4 core 8 threads Haswell to be competitive against a 10-12 threads SteamRoller in multithreading apps.

I believe we may see a cheaper 6 core Intel CPU with Haswell-E.

pelov · May 9, 2012

Lonbjerg said:
The problem with physics is that it is MASSIVE parallel computations...and 4-8 cores will suffer compared to +1000 cores.

and the limited bus width and the very small cache store. Physics processing on a CPU is still years away and on-die GPUs help but unless they're coupled with other local hardware that GPUs currently have we won't see CPUs for physics processing. GPUs have an edge with things like massive amounts of very fast RAM, its own small traces of cache, its own instruction units, a very wide bus and thousands of cores. An instruction set is not going to kill GPGPU anytime soon... unless you're delusional. Then AVX2 can do everything from physics computing efficiently despite obvious hardware limitations and cure cancer to boot!

In all seriousness, GPUs strictly for GPGPUs might be replaced by on-die GPUs as they get quicker and more efficient but for things like physics it always makes more sense to have a dedicated piece of hardware that's built for that specific purpose. Current CPUs are jacks of all trades and masters of none.

Edrick · May 9, 2012

AtenRa said:
Intel will have to compete against their own CPUs first, no one will upgrade from 4 core IB to 4 core Haswell unless it has a huge IPC gain. Remember, both IB and Haswell will be on the same 22nm, dont expect to see a huge power reduction like SB to IB.

People have upgraded from 4 core Yorkfield to 4 core Nehalem, to 4 core SB to 4 core IB. You really think they wont move to 4 core Haswell? The performance gain from IB to Haswell will be much larger than IB was to SB. And it should also be larger than SB was to Nehalem.

AtenRa said:
Secondly, in 2013 (same year Intel will introduce Haswell) AMD will introduce the SteamRoller. From what we know now, it seams that SteamRoller will have more than 8 threads. It will take an enormous IPC gain for a 4 core 8 threads Haswell to be competitive against a 10-12 threads SteamRoller in multithreading apps.

Really? You are putting that much faith in Steamroller? Did you learn nothing from BD? I am sure Intel is real scared. I will take 8 Haswell threads versus 12 Steamroller threads based on what we know today.

N4g4rok · May 9, 2012

MisterMac said:
Disagree how? From what factual standpoint?

Your own personal experience?

My personal experience tells me the complete opposite - even with half arsed libraries implementing bad lock-scenarios.

New instructions requires compiler level optimizations.
Threading...not so much even with less than stellar implementations.

Your average programmer probably won't be able to handle parallelism. The problem is that the ugly little problems you get with multiple threads are nearly impossible to test out. Splitting across 16 threads is worse. you'll be running into process deadlocks everywhere.

A new instruction is easier for both parties involved. The instruction is never variable in terms of what it does, and it provides a another layer of abstraction that keeps the programmer from worrying about intimate details that will cause them to break things.

The biggest complaint i have with new instructions is the lack of backward compatibility with previous platforms, but there isn't too much that can be done about that.

Edrick · May 9, 2012

pelov said:
An instruction set is not going to kill GPGPU anytime soon

You are right, it will not kill GPGPU. But AVX2 will close the gap considerably.

AtenRa · May 9, 2012

Edrick said:
People have upgraded from 4 core Yorkfield to 4 core Nehalem, to 4 core SB to 4 core IB. You really think they wont move to 4 core Haswell? The performance gain from IB to Haswell will be much larger than IB was to SB. And it should also be larger than SB was to Nehalem.

I have upgraded from 4 core YorkField to 4 core 8 threads Nehalem because it had more than 20-30% more performance. Do you actually believe that Haswell will have 20-30% higher IPC than IB ?? I dont, i expect an average 10% more IPC, iGPU on the other hand it will get higher performance.

Edrick said:
Really? You are putting that much faith in Steamroller? Did you learn nothing from BD? I am sure Intel is real scared. I will take 8 Haswell threads versus 12 Steamroller threads based on what we know today.

Since 8 threaded BD can be performance competitive to 8 threaded SB (Both in Legacy and SIMD) then i expect a 12 threaded SteamRoller will be performance competitive to 6 core 12 threaded Haswell.

pelov · May 9, 2012

Edrick said:
You are right, it will not kill GPGPU. But AVX2 will close the gap considerably.

What's to limit GPUs from embracing it? They're not closing any gap and Intel is well aware of it. How do I know this? Because Intel themselves is making a GPU for GPGPU purely because they know they need something outside of mere instruction sets...

Enter Larrabee 2.0 aka Intel MIC, or Many Integrated Cores. This is Intel's Tesla/GCN.

Until CPUs come with massive on-package incredibly fast ram (or gigabytes of cache), 512-bit bus width and thousands of cores they're not catching up to specially built co-processors. You need to dismiss the notion that AVX2 will be gamechanging and accept that it's just another instruction set along with all the others we've gotten. That is to say, useless until implemented years too late. And there's as much chance of AVX2 taking a foothold in most desktop software within the first 2 years of release as there is of me winning the lottery, getting struck by lightning and marrying a supermodel all in one night.

BenchPress · May 9, 2012

MisterMac said:
BD was designed with a modular threaded world in mind - sadly we aren't there.

That's because they failed to realize that without hardware transactional memory (HTM) it's too hard to get good performance scaling.

Threading is like having people run around in a room and avoid hitting each other. Without HTM it's like doing it with your eyes closed, using only touch. And obviously having more people in the room makes it harder and slows them down. With HTM, like Haswell's TSX, it's like doing it with your eyes open. You only slow down when you're on a collision course.

So anyway, AMD provided more cores without helping the developers in any way to make efficient use of this many cores.

Haswell introduces TSX at the right time. Four cores, with the option of Hyper-Threading, is the point where developers have to start using HTM before they scale things up to more cores.

Anyone providing a solid arguemnt, if we had 8 or 12 nehalem core chips now (instead of say, SB) - that we wouldn't see anything from windows to browsers being completely rewritten with threaded scalable core in mind - ill buy you a beer.

12 Nehalem cores and nothing else isn't scalable from a software perspective. There's a total of 66 interactions between cores you have to manage in software (actually 276 with Hyper-Threading). This will only become feasible when hardware assists with the synchronization, which is exactly what TSX is about.

We need the quad-core Haswell to become mainstream before there's any hope of advancing any further. And fortunately we'll also have AVX2 to vectorize loops by a factor 8!

Chiropteran · May 9, 2012

pelov said:
In all seriousness, GPUs strictly for GPGPUs might be replaced by on-die GPUs as they get quicker and more efficient but for things like physics it always makes more sense to have a dedicated piece of hardware that's built for that specific purpose. Current CPUs are jacks of all trades and masters of none.

More sense for who, nvidia so they can sell twice as many video cards? The reason physics is a failure so far is just that- requiring proprietary dedicated hardware. It needs to run on something everyone has before it will take off.

If a CPU is a "jack of all trade" it should be able to handle physics fine, just a bit slower than dedicated hardware.

pelov · May 9, 2012

Sorry, I mean physics for HPC calculations Those GPUs aren't the same GPUs you and I buy and no CPU will ever be able to come even close to the raw GFLOPS those are capable of. Physics processing for gaming and stuff is a completely different story and one that has indeed been limited by proprietary hardware.

Just to rain on this AVX2 for GPGPU parade a bit more,

It's 256-SIMD on AVX2 and not 1024AVX.
FMA3 will also be supported by Vishera/Trinity and that's coming in just a week and a half (and it's on laptops).
We don't know the raw GFLOPS of Haswell but the number 500 is being thrown around like it means something on various forums so let's assume it's accurate. AMD's 7970 GCN, an architecture both focused on gaming and GPGPU, is capable of putting out 947 GFLOPS at DP and 3.78 TFLOPS (!!!) at single precision with over two-thousand stream processors. We're talking orders of magnitude in difference here. Consider too that newer GPUs can be tacked on with the same instruction sets as their CPU brethren for FP-tasks then you'll quickly see why it makes zero sense to design the CPU to do anything like that.

They are co-processors with specific purposes and the CPU was never and will never be able to reach those same levels of performance or efficiency. For moderate openCL/CUDA stuff like photoshop/video editing, GPU-accelerated browsing and light gaming on-die GPUs are more than capable but past that it doesn't make sense.

BenchPress · May 9, 2012

pelov said:
GPUs have an edge with things like massive amounts of very fast RAM, its own small traces of cache, its own instruction units, a very wide bus and thousands of cores. An instruction set is not going to kill GPGPU anytime soon... unless you're delusional.

You've been fooled by GPU marketing. There is no such thing as a GPU with thousands of cores. So I'm terribly sorry but you're the only being delusional here.

GPU manufacturers count each SIMD lane as an individual core. Using the same "logic", mainstream Haswell will have 64 cores, running at over three times the clock frequency of the latest GPUs. For the record, the 22 nm HD 4000 has 128 of such "cores", however they're running at only 1150 Mhz. So there's really not that big a difference between a CPU and a GPU. We certainly don't need a big jump in core count.

That said, the instruction set is just part of the reason Haswell will kill mainstream GPGPU. CPUs can already put the latest GPUs to shame at GPGPU workloads. The reason for this is that there's no round-trip delay, no bandwidth bottleneck, and no hard register limit. And Haswell will strengthen these benefits with a GPU-like instruction set extension!

Haswell obviously won't kill mainstream GPGPU overnight, but it's blatantly obvious that GPGPU has no future. Adding AVX2 to the CPU won't cause any compromises. In contrast, for the GPU to become any better at GPGPU it has to sacrifice a considerable amount of graphics performance. It basically has to become more like a CPU. But that's downright silly. If becoming more like the CPU is the answer then why not let the CPU handle these workloads in the first place? AVX2 was the only missing bit to make that happen.

pelov · May 9, 2012

Because the CPU can't handle such HPC workloads. You need a massive bus width, gigabytes of attached very quick ram and thousands of processors! Furthermore, you can tally up the GPUs much as HPCs do today, scaling to hundreds of GPUs across PCIE lanes. You can at best hope to mimic that with CPUs but AMD has already snatched up SeaMicro for that. You can count cores however you want, at the end of the day the GPU has many many more than does the on-die CPU.

In contrast, for the GPU to become any better at GPGPU it has to sacrifice a considerable amount of graphics performance. It basically has to become more like a CPU. But that's downright silly. If becoming more like the CPU is the answer then why not let the CPU handle these workloads in the first place? AVX2 was the only missing bit to make that happen.

It does sacrifice performance and that's partly why GCN is behind Kepler as far as gaming goes (but still kicks the CPU's ass). Tesla, otoh, is completely HPC focused and trounces everything. Unlike the CPU, which is a general processing unit (jack of all trades and it has to be), HPC GPUs do only one thing well and that's raw GFLOPS in either single or double precision -- yes, that's FMA there too. Instruction sets as a CPU advantage only work as a CPU advantage if they stay in the CPU but GPUs adopt them as well so any potential advantage there disappears. Unlike CPUs they don't lug around useless ISAs and are strictly limited to FP-tasks, thus adopting AVX2 for GPGPU is not just likely but almost a certainty. Using AVX2 as a reason why CPUs will catch up won't work then.

Your benchmark only showed the same thing I mentioned in the post you're arguing with:

They are co-processors with specific purposes and the CPU was never and will never be able to reach those same levels of performance or efficiency. For moderate openCL/CUDA stuff like photoshop/video editing, GPU-accelerated browsing and light gaming on-die GPUs are more than capable but past that it doesn't make sense.

Then there's this gem off Intel's own site...

Intel® MIC products give developers a key advantage: They run on standard, existing programming tools and methods.

Intel® MIC architecture combines many Intel® CPU cores onto a single chip. Developers can program these cores using standard C, C++, and FORTRAN source code. The same program source code written for Intel® MIC products can be compiled and run on a standard Intel® Xeon processor. Familiar programming models remove developer-training barriers, allowing the developer to focus on the problems rather than software engineering.

So while CPUs are getting better at tasks that are almost entirely GPU/GPGPU based, so are GPUs. One of those, though, isn't being hamstrung by ISAs, TDP, socket compatibility and various other hardware. If you saturate the PCIE lanes then just increase the bandwidth and keep going or keep adding GPUs. If Intel was as confident in Haswell and AVX2 as you are then they wouldn't have bothered with Knight's Corner co-processor.

2is · May 9, 2012

If the CPU can handle GPGPU tasks as good or better than GPU then I say good riddance to GPGPU. My primary concern for a GPU is gaming performance and it seems like the cards that are powerful gaming platforms AND excel at GPGPU tasks are also very expensive, power hungry behemoths. I really like that nVidia went back to focusing on gaming performance with the 680.

BenchPress · May 9, 2012

pelov said:
We don't know the raw GFLOPS of Haswell but the number 500 is being thrown around like it means something on various forums so let's assume it's accurate. AMD's 7970 GCN, an architecture both focused on gaming and GPGPU, is capable of putting out 947 GFLOPS at DP and 3.78 TFLOPS (!!!) at single precision with over two-thousand stream processors. We're talking orders of magnitude in difference here.

You're comparing a chip that is much bigger and consumes a lot more power.

You'd have to compare it to something close to a 16-core Haswell, which probably results in ~1.5 TFLOPS for the same power budget. That's no longer "orders of magnitude" is it? Furthermore, the GPU actually never achieves its peak throughput for typical workloads. That's because it has only a tiny amount of cache space per work item, and it has to lower the thread count when running out of registers. Let's look at the graph of shame once more:

Even though the HD 7970 handily outperforms the GTX 680, it's still rather pathetic that a 3.78 TFLOP behemoth is only three times faster than a quad-core CPU which is only using 120 GFLOPS in this test. Now imagine what a 16-core with four times the per-core floating-point performance and gather support could do...

They are co-processors with specific purposes and the CPU was never and will never be able to reach those same levels of performance or efficiency.

And who's going to stop that from happening? The laws of physics are the same for CPUs and GPUs. CPUs used to be single-core while GPUs were multi-core. Now they're both multi-core. CPUs used to be mainly scalar while GPUs had vector instructions. With AVX2 the CPU will have wide vector instructions too. The GPU used to the only one with FMA instructions. Not any longer. Gather and vector-vector shift gave the GPU an advantage at SPMD programming. Haswell will have them too...

So exactly what kind of magical technology do you think the GPU still has left that the CPU could never integrate?

pelov · May 9, 2012

So you're implying that Intel should be making chips that are 4-times+ the size of Haswells to achieve GPGPU levels of performance that can be achieved by spending only $450, and somehow that's supposed to prove your point? You're completely ignoring hardware limitations and sizes here. Can you imagine how much a chip like that would cost? You could build an entire HPC machine with 10+ 7970s and a server-class CPU and high-grade SLC-level SSDs for the price of a single one of those CPUs. Now how many of those would you get per wafer?

Throw AVX2 out of the window as instruction sets, particularly FP-based ones, are incorporated and shared between hardware like CPUs and GPUs. That advantage isn't an advantage any longer. The raw GFLOPs/TFLOPs double/single precisions that GPUs are currently capable of is, though. We might see Sky Lake maybe inch closer to the current levels of throughput that GPUs are capable of today but by then we'll have had 2-3 generations of new GPUs with even higher throughput.

Luxmark on openCL doesn't represent an HPC-type workload. Throw up a CUDA benchmark and see how CPUs perform

ShintaiDK · May 9, 2012

HD7970=210W?, 365mm2.
Haswell=65-77W, 150-170mm2.

Not to mention one flop aint one flop. Some of the distributed code showed it. CPU took 1 flop, GPU used 4 flops. Saem result.

AtenRa · May 9, 2012

BenchPress said:
Even though the HD 7970 handily outperforms the GTX 680, it's still rather pathetic that a 3.78 TFLOP behemoth is only three times faster than a quad-core CPU which is only using 120 GFLOPS in this test. Now imagine what a 16-core with four times the per-core floating-point performance and gather support could do...

Im sorry but that behemoth has 3-4x times the performance with a TDP of 210W when Intel Core i7 3820 has a TDP of 125W. I believe that you will find that GPUs are far more efficient that your CPU for GPGPU. Just wait and see how much more efficient the GK100/110 will be.

It is the reason most super computers are using GPUs for highly compute scenarios than CPUs.

ShintaiDK · May 9, 2012

AtenRa said:
Im sorry but that behemoth has 3-4x times the performance with a TDP of 210W when Intel Core i7 3820 has a TDP of 125W. I believe that you will find that GPUs are far more efficient that your CPU for GPGPU. Just wait and see how much more efficient the GK100/110 will be.

It is the reason most super computers are using GPUs for highly compute scenarios than CPUs.

GPUs are actually dying out fast due to their limited abilities. Latest one used MICs to replace half of the GPUs or so.

Also performance depends what you do exactly. The CPU can easily be 1000 times faster than big Kepler for certain tasks. And then all those tasks that big Kepler cant do at all.

GPGPU computing is just so limiting.

pelov · May 9, 2012

Outside of HPC it's still limited and that sheer amount of GFLOPS/TFLOPS isn't necessary on the desktop/laptop yet. That could potentially change drastically with the increasing resolutions of screens/monitors/TVs. It takes a lot more to get decent FPS on a 4k resolution setup than it does at 1080p. But, like you said, it all depends on its potential uses. But as far as driving an ever-increasing pixel count you'd bet that GPUs are going to stay relevant, whether on-die or discrete. HPC is another beast entirely

Haswell Core Count

Golden Member

Senior member

Golden Member

Golden Member

Senior member

Diamond Member

Lifer

Diamond Member

Golden Member

Senior member

Golden Member

Lifer

Diamond Member

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Lifer

Lifer

Lifer

Diamond Member