Vector Computing

GlassGhost

Member
Jan 30, 2003
65
0
0
I've thought about this for quite some time and discussed the idea with a few colleagues, and I'd like to hear what you guys and gals here think of the subject. True vector processing is showing up in a few areas of computing and even in discussions of hardware architecture (SMT, Simultaneous Multi-Threading, the big brother of Hyper-Threading). Would it behoove AMD's and Intel's R&D labs to start looking into a standardized mechanism for allowing user-level processes to create and manage vector code? I'm not talking about threads; those have a ton of overhead and are different from simple operations that can be executed in parallel, i.e. a vector.

Some may argue that Intel's IA64 is just this, but in reality the EPIC architecture Intel has come up with is far from what most programmers would consider a vector architecture. Wouldn't it be nice to be able to send a vector processor a series of arbitrarily sized chunks of code to be executed in parallel with each other? I'm no computer engineer, but I can't see why reserving some real estate on the die, say 16 vector execution units, could hurt. Using a RISC-style selection of operations, these vector execution units could chew through code quite rapidly, I would imagine.

I know Apple did something akin to this with Motorola in their G4 chip, but I'm not knowledgeable enough to know how their vector-style solution was designed.
 

zephyrprime

Diamond Member
Feb 18, 2001
7,512
2
81
Vector processing abilities already exist on x86 in the form of MMX, 3DNow!, and SSE. The PowerPC version is AltiVec. SSE is underpowered compared to what the vector processors used in supercomputers can do, but Intel has to keep the transistor count down. 16 vector units would be a back-breaker for any x86 or PPC consumer processor.

I think you've got vector processors confused with multi-core processors and SMT.

I agree with you, though, that current programming technology isn't suited to take advantage of parallelism that's bigger than instruction-level parallelism but smaller than thread-level parallelism. The funny thing is, some old technologies could actually take advantage of parallelism like this. For example, Fortran could exploit loop-iteration parallelism by sending successive passes of a loop to different processors. I guess this sort of thing is problematic nowadays because of modern operating systems. In the near future, some sort of solution to easily exploit multi-core processors and SMT will have to be developed.
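To make that concrete, here's a rough sketch of loop-level parallelism in C with an OpenMP pragma (my example, not from the thread; assumes a compiler with OpenMP support, e.g. gcc -fopenmp). Successive chunks of the iteration space get handed to different processors, much like the Fortran approach:

/* Scale an array in parallel: iterations are split across
   processors, analogous to Fortran farming out loop passes. */
void scale(int n, float a, float *x)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] *= a;
}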
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
As zephyrprime pointed out, VLIW/EPIC and SMT are quite a bit different from vector computing....the former is a method of improving instruction-level parallelism, while the latter is a method of exploiting thread-level parallelism on-chip. Vector computing is a data-parallel model (also called SIMD, single-instruction multiple-data), where operations are performed on a vector of elements in lock-step.

That being said, the implementations of data-parallel computers fall into three camps. SIMD computers (the acronym applies to both the programming model and the implementation type) use a matrix of distinct functional units, and never enjoyed much success beyond the well-known examples of the ILLIAC IV and Thinking Machines.

The "multimedia" instruction sets used by nearly all CPU manufacturers use very small vectors, usually 128 bits long. To compensate for their short length, they usually employ variable-sized elements...for example, an operation can be performed on 2 64-bit values, 4 32-bit values, or 8 16-bit values, etc. The operations are typically meant to be executed on specialized functional units that both allow the variable-length elements and execute the operation on all elements simultaneously (with a few exceptions....the 128-bit multimedia instructions on the Athlon and P4 are executed on 64-bit datapaths, requiring two consecutive operations). Again, this functional-unit parallelism is needed for any speedup to occur, since the vectors are so short.
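For illustration, here's what that variable element width looks like from C using SSE2 intrinsics (a minimal sketch of my own; assumes an SSE2-capable compiler, e.g. gcc -msse2). The same 128-bit register is treated as 2, 4, or 8 lanes purely by the choice of instruction:

#include <emmintrin.h>

void lane_width_demo(void)
{
    __m128i a = _mm_set1_epi32(7);     /* fill four 32-bit lanes with 7 */
    __m128i b = _mm_set1_epi32(3);

    __m128i s64 = _mm_add_epi64(a, b); /* 2 x 64-bit adds in one op */
    __m128i s32 = _mm_add_epi32(a, b); /* 4 x 32-bit adds in one op */
    __m128i s16 = _mm_add_epi16(a, b); /* 8 x 16-bit adds in one op */

    (void)s64; (void)s32; (void)s16;   /* silence unused-variable warnings */
}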

*True* vector computers differ from the multimedia extensions in a number of ways. First of all, the vectors are much longer...the Cray-1, introduced in 1975, had eight vector registers, each with sixty-four 64-bit elements. Older memory-memory vector computers had even longer vectors. Vector computers also don't necessarily have wide enough functional units to perform the arithmetic operation on all 64 elements at once....in fact, typical configurations in early vector computers had a single vector add/sub unit, multiply unit, divide unit, and vector load/store unit.

At first glance it may not seem that these vector computers had any parallelism advantage over scalar...an add operation on two vector registers may still take 64 clock cycles. But what the long vector registers allow is a large space for explicitly named values....after the long startup time to load data from memory into the vector registers, the vector operations can proceed without interruption. The Cray-1 also introduced the practice of chaining, in which the partial results of one vector operation are immediately fed to another vector functional unit. So in the typical DAXPY operation (Y = a * X + Y), one add and one multiply can both be sustained each cycle. And since each element in a vector register is inherently independent of the others, it is easier to construct deeply pipelined functional units, which (along with innovative circuit design) allowed the original Cray products to achieve astounding clock rates. Later vector computers added more parallel functional units, allowing, for example, two adds to be performed from the same two vector register operands each cycle. Again, the inherent independence of the vector elements makes the arbitration of parallel functional units easier.
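For reference, DAXPY is just this loop (plain C; the signature follows the usual BLAS convention, though the code is my own sketch). On a Cray-style machine the compiler turns it into vector loads, a vector multiply chained into a vector add, and a vector store, with no per-element instruction fetch or branch overhead:

/* DAXPY: Y = a*X + Y over n elements */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}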

Nowadays, the clock-rate advantage of vector over scalar computers is no longer a factor...scalar computers adopted CMOS technology and single-chip designs more quickly, which allowed their clock rates to scale much faster. In addition, since scalar CPUs are produced in much, much higher volume than their vector counterparts, they typically have a much larger design budget. This allows for full-custom circuit designs, versus the more automated designs of modern vector CPUs, which again favors higher clock rates.

There was, however, a proposed extension to the Alpha ISA that I read about a while back...it had, I believe, eight 64-element x 64-bit vector registers like the Cray, along with a number of vector functional units. There are a few major setbacks for high-volume, single-chip vector CPUs...first of all, at between 4 KB and 16 KB, the register file is huge...having such a large register file on the critical path may hurt clock rates and lead to register accesses that are pipelined over a few clock cycles.

And even if the added space and complexity of the vector register file and multiple vector functional units can be handled, memory communication is a problem. Vector computers cannot use caches in the same way scalar computers can; while caches are beneficial to scalar CPUs (data exhibits temporal and spatial locality; if data is accessed, it is likely it will be accessed again soon, as will data near it), vector computers often access data with large strides, making caches useless and even detrimental in some cases. While I've heard that there has been some advance in vector caches in recent years, vector computers have always needed VERY high bandwidth main memory systems. To make a single-chip vector CPU work, it would need to integrate a sophisticated memory controller on-chip that can control a large number of memory banks. The added complexity of the memory controller and the large cost of a high-bandwidth memory system with a large number of banks would be difficult to swallow for a high-volume CPU.
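As a concrete example of that access pattern (a hypothetical illustration, names my own): summing one column of a row-major matrix strides through memory by cols elements per access, so consecutive accesses land in different cache lines and a conventional cache sees almost no reuse:

/* Large-stride access: each iteration jumps `cols` doubles ahead,
   defeating the spatial locality a cache depends on. */
double column_sum(const double *m, int rows, int cols, int col)
{
    double sum = 0.0;
    for (int r = 0; r < rows; r++)
        sum += m[r * cols + col];
    return sum;
}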

I think vector computers are very interesting, but given their large cost and a scope limited to niche scientific applications, it doesn't seem likely that a big-name company (Intel, AMD, IBM, Sun, etc.) would produce a high-volume, single-chip CPU with true vector extensions.

* not speaking for Intel Corp. *
 

DerekWilson

Platinum Member
Feb 10, 2003
2,920
34
81
Very informative post, Sohcan, thanks for that.

I would like to ask what you think of an option that Sony and Toshiba decided to explore with their MIPS-based PS2 system. Specifically, one of the coprocessors they defined for the system was a custom vector processing unit. Actually, they had two: one attached to the main CPU and one to the graphics processor.

I know they aren't on the same magnitude as a supercomputer, as they have 128-bit registers that can be divided up as you discussed in your post. I have done some programming on a PS2, but I haven't programmed the VPUs down to the metal, so I don't know for sure, but I think they are also 2-wide VLIW.

It seems to me that an architecture that didn't necessarily include on-die vector processing, but enabled it as an off-chip coprocessor, could be very useful. This would seem to be especially true in the areas of 3D rendering and content creation on large data sets. VPUs are very good at doing simple tasks in parallel over and over, and there are a lot of specific applications that could benefit.

I really think Sony and Toshiba were very smart to use a VPU as part of their graphics subsystem, as that seems its best-suited use in consumer markets. What do you guys think of the viability of such a system in the desktop PC market? Perhaps someone will want to try their hand at moving multimedia support off the core in order to keep die size down, increase clock rate, and allow for bigger, faster vector-style processing?

But then, I still think that everyone should be using SRAM on a 4-kbit system bus... Ahhh, if only cost weren't an object.
 

AbsolutDealage

Platinum Member
Dec 20, 2002
2,675
0
0
Originally posted by: DerekWilson
What do you guys think of the viability of such a system in the desktop PC market? Perhaps someone will want to try their hand at moving multimedia support off the core in order to keep die size down, increase clock rate, and allow for bigger, faster vector-style processing?

I doubt it. The current trend is to move more things onto the core, e.g. the memory controller and northbridge on AMD's Hammer core. I think that as far as Intel and AMD go, they will not take this route. The cost of designing and integrating a VPU on their cores would be prohibitive. A full-blown implementation of vector processing with 16 vector units would be a waste of money and a waste of die space for the average user, who would see more performance if that die space were used for something else. I think the closest you will ever see (for the foreseeable future) is the current situation (Hyper-Threading, SSE, etc.).
 

GlassGhost

Member
Jan 30, 2003
65
0
0
Originally posted by: zephyrprime
Vector processing abilities already exist on x86 in the form of MMX, 3DNow!, and SSE. The PowerPC version is AltiVec. SSE is underpowered compared to what the vector processors used in supercomputers can do, but Intel has to keep the transistor count down. 16 vector units would be a back-breaker for any x86 or PPC consumer processor.

I think you've got vector processors confused with multi-core processors and SMT.

I agree with you, though, that current programming technology isn't suited to take advantage of parallelism that's bigger than instruction-level parallelism but smaller than thread-level parallelism. The funny thing is, some old technologies could actually take advantage of parallelism like this. For example, Fortran could exploit loop-iteration parallelism by sending successive passes of a loop to different processors. I guess this sort of thing is problematic nowadays because of modern operating systems. In the near future, some sort of solution to easily exploit multi-core processors and SMT will have to be developed.

I made a reply post; it was deleted for whatever reason.

I am not mixing up vector processing with SMT.

Vector processing, as Sohcan mentioned, is the ability for a processor to execute several streams of code at the same time. These vectors of code aren't true threads (threads on x86 CPUs are given time slices in which to execute; they aren't run in parallel the way a vector would be), but rather small (compared to threads) parts of a program that can be executed at the same time. As Sohcan pointed out, this is also different from what MMX or SSE or SIMD is. Those are vector-like, but are only able to operate on a few pieces of data (4 max, if I remember correctly), and only one MMX/SSE instruction can execute at a given time.

SMT is the ability for a scalar CPU to recognize threads and organize instructions from different threads in a way that fills idle CPU cycles. It appears the CPU is executing the threads simultaneously, but it isn't.

i.e.

Thread 1   mov eax, 7
Thread 2   add ecx, 7
Thread 3   fmul st(0), st(1)
Thread 1   cmp ebx, eax
Thread 2   jne SomeLocation
Thread 3   fsqrt

The instructions are all executed in serial, but the CPU knows that it can order the microOps any way it wants.


Sohcan:

Interesting opinion. I agree that a cache system for a vector CPU would be a nightmare, and also agree that a pure vector CPU would only be a niche product, but I foresee scalar CPUs slowly being phased out in favor of a hybrid scalar/vector design that combines a highly clocked scalar processing pipeline with several lower-clocked vector processing pipelines. With a design like this, the compiler could say what code should run in parallel and what should run in serial, and CPU designers could do away with all the out-of-order logic found on most modern CPUs.


 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Derek Wilson: I'm not terribly familiar with the PS2's vector system, other than that it's a 128-bit vector processor.

Originally posted by: GlassGhost
SMT is the ability for a scalar CPU to recognize threads and organize instructions from different threads in a way that fills idle CPU cycles. It appears the CPU is executing the threads simultaneously, but it isn't.

i.e.

Thread 1   mov eax, 7
Thread 2   add ecx, 7
Thread 3   fmul st(0), st(1)
Thread 1   cmp ebx, eax
Thread 2   jne SomeLocation
Thread 3   fsqrt

The instructions are all executed in serial, but the CPU knows that it can order the microOps any way it wants.
SMT does have the ability to use a superscalar core's multiple functional units to execute instructions from multiple threads simultaneously...the P4 can execute instructions from two threads in the same cycle, and the now-defunct Alpha EV8 could have done so from four threads. What you seem to be describing is fine-grained multithreading, in which the CPU executes instructions from a different thread each cycle.

Originally posted by: GlassGhost
but I foresee scalar CPUs slowly being phased out in favor of a hybrid scalar/vector design that combines a highly clocked scalar processing pipeline with several lower-clocked vector processing pipelines
Interesting proposition...though I'm still skeptical given its cost and that most of the HPC market has moved from vector computing to parallel systems of fast commodity processors.

I found the paper with the proposed Alpha vector extension. Some interesting features:

- Based on the EV8 core, 2.5 GHz at 65 nm
- The ISA extension includes 32 vector registers, each holding 128 64-bit values (a 32 KB vector register file :Q)
- 16 vector lanes with two issue ports each, capable of 32 double-precision floating-point operations per cycle
- The vector core would have been as large as the scalar core, the two together taking up 30% of the 286 mm^2 die
- The 16 MB L2 cache was to be able to supply up to sixty-four 64-bit words each clock cycle (64 words x 8 bytes x 2.5 GHz), for a bandwidth of 1,280 GB/sec :Q, compared to 104 GB/sec for Itanium 2 at 1 GHz. If you look at figure 5, it's easy to see just how much of the die area would be sucked up by the L2 bus.
- The memory system was to control 32 RDRAM channels for over 64 GB/sec of main memory bandwidth

It's an ambitious design, but IMHO still infeasible at the 65 nm node, especially given that the vector core and much of the L2 and main-memory bandwidth would go unused by most target markets outside high-performance computing.

* not speaking for Intel Corp. *
 

GlassGhost

Member
Jan 30, 2003
65
0
0
Interesting article on the EV8. It's too bad Compaq went down the crapper and sold off the Alpha group to Intel. Now those great minds will be stuck engineering a CPU (Itanium) that IMHO isn't as interesting or as potentially powerful as the Alpha was.

Originally posted by: Sohcan
SMT does have the ability to use a superscalar core's multiple functional units to execute instructions from multiple threads simultaneously...the P4 can execute instructions from two threads in the same cycle, and the now-defunct Alpha EV8 could have done so from four threads. What you seem to be describing is fine-grained multithreading, in which the CPU executes instructions from a different thread each cycle.

Yes, I am aware.

The instructions are all executed in serial, but the CPU knows that it can order the microOps any way it wants.

That's what I meant by that line. My point was that the instructions were all retired in serial order, unlike how a true vector processor would execute them, all in parallel, thus pointing out the difference between SMT or Hyper-Threading and vector processing.

In my original post, I mentioned SMT and Hyper-Threading and MMX/SSE to point out that vector-like features have been of interest to CPU designers lately, not that SMT or Hyper-Threading *IS* vector processing.
 

DerekWilson

Platinum Member
Feb 10, 2003
2,920
34
81
OK, but you said:

in reality the EPIC architecture Intel has come up with is far from what most programmers would consider a vector architecture

... So, yeah, VLIW/EPIC are different from vector processing and SSE... but if I understood the question in your original post, it seemed the reason you think an integrated vector processing unit would help desktop systems is that it would allow multiple operations to complete in parallel, right?

The real key argument against vector processing actually helping very much is this: you have to do THE SAME THING to all parts of the vector at the same time. If you aren't doing large data ops, you just don't see any speedup, because if you only need one add it'll take the same amount of time as 16 adds. Of course, correct me if I am wrong, but this is my understanding of how it works.
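A small illustration of that constraint (hypothetical code of my own). The first loop applies one operation to every element, which is exactly what a vector unit wants; the second wants a different operation per element, so the lanes can't all do the same thing and the vector hardware gains little (masking/select tricks can help, but at a cost):

/* Vector-friendly: the same add on every element. */
void add_all(int n, int *a, const int *b)
{
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

/* Vector-hostile: data-dependent control flow per element. */
void add_or_sub(int n, int *a, const int *b)
{
    for (int i = 0; i < n; i++) {
        if (a[i] & 1)
            a[i] += b[i];
        else
            a[i] -= b[i];
    }
}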

Well, vector processing can compute large sets of data very quickly, but compilers aren't good at optimizing for it, and hardware isn't good enough to automatically discover when it can and cannot use vector processing... vector code really needs to be hand-coded to be effective. Of course, that's arguable and possibly won't be true at all in a few years, but that remains to be seen.

With EPIC you have the flexibility of being able to hand-code packets and specify ILP. If someone implemented a 16-wide EPIC machine, I'd take that any day over a scalar processor with an additional 16-wide vector processing unit. OK, OK, fine, so a 16-wide EPIC machine is not going to happen anytime soon (or not so soon) AFAIK. But I think a 4-wide EPIC machine would still work out better, because you get dynamic optimizations while the processor is running, along with the ability to either compile or hand-code software. Vector processing relies way too much on the coder, but EPIC still gives you a good bit of control so you aren't restrained (not that I'd EVER want to hand-code IA64 -- but EPIC-style processors don't need to be as complex as that). Besides, vector processing units can't do anything with higher-level CPU functionality... and they have a hella time with shifts. All they can do is basic math stuff, really. And with EPIC you still get multiple instructions retiring at the same time, so you do get parallelism out of it.

But ignoring all that, you can definitely still make a valid case that it could be effective for specific purposes. After all, adding additional processing power to something isn't gonna slow it down. About the only thing I could see it being a viable solution for is setting up complex graphics, and maybe doing really complex physics computations for games. That's why I think such a solution (even if it wasn't very wide vector processing) was useful in the PS2. And that's probably where it will stay in the consumer market.
 

GlassGhost

Member
Jan 30, 2003
65
0
0
http://www.nec.co.jp/press/en/0302/1001.html

NEC just introduced a highly parallel CPU with:

-Integration of 128 single-instruction multiple-data (SIMD) processing elements (PEs) (Note 2) that operate at a low frequency of about 100 megahertz (MHz). With each PE co-operating simultaneously to process the target picture, NEC has achieved high-performance operation at a low power-consumption rate and enabled extensive programmability through software application.

-Use of a memory structure that enables independent access to each PE, facilitating efficient execution of parallel software programmed to simultaneously process a picture through a number of newly devised hardware structures (Note 3)

-Development of a new architecture for image recognition processing that realizes a development environment based on an optimized C language (Note 4) compiler and increases software development efficiency.

Although it's highly specialized, I find this approach very interesting.
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
There was a paper on that presented at ISSCC on Monday (page 19)...interesting that they took the SIMD approach as opposed to the vector approach. I might be able to get a hold of the paper once it's up on Citeseer.
 