As zephyrprime pointed out, VLIW/EPIC and SMT are quite different from vector computing....the former is a method of improving instruction-level parallelism, while the latter is a method of exploiting thread-level parallelism on-chip. Vector computing is a data-parallel model (also called SIMD, single-instruction multiple-data), where operations are performed on a vector of elements in lock-step.
That being said, the implementations of data-parallel computers fall into three camps. SIMD computers (the acronym applies to both the programming model and this implementation type) use an array of distinct processing elements, and never enjoyed much success beyond the well-known examples of the ILLIAC IV and Thinking Machines' Connection Machine.
The "multimedia" instruction sets offered by nearly all CPU manufacturers use very short vectors, usually 128 bits long. To compensate for their short length, they usually employ variable-sized elements...for example, an operation can be performed on two 64-bit values, four 32-bit values, or eight 16-bit values, etc. The operations are typically executed on specialized functional units that both allow the variable-length elements and operate on all elements simultaneously (with a few exceptions....the 128-bit multimedia instructions on the Athlon and P4 are executed on 64-bit datapaths, requiring two consecutive operations). Again, this functional-unit parallelism is needed for any speedup to occur, since the vectors are so short.
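To make the lane arithmetic concrete, here's a minimal plain-C sketch of a 128-bit register viewed at different element widths. It models the behavior of a packed-add instruction (such as SSE2's PADDD at 32-bit granularity); it is an illustration, not real intrinsics, and the names are made up:

```c
#include <stdint.h>

/* A 128-bit "multimedia" register viewed at different element widths.
   The same 128 bits can be treated as 2 x 64-bit, 4 x 32-bit, or
   8 x 16-bit elements, as described above. */
typedef union {
    uint64_t u64[2];
    uint32_t u32[4];
    uint16_t u16[8];
} vec128;

/* Lane-wise add at 32-bit granularity: four independent adds, with no
   carries crossing lane boundaries -- what a packed-add instruction
   does in a single operation on suitable hardware. */
vec128 padd32(vec128 a, vec128 b) {
    vec128 r;
    for (int i = 0; i < 4; i++)
        r.u32[i] = a.u32[i] + b.u32[i];
    return r;
}
```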
*True* vector computers differ from the multimedia extensions in a number of ways. First of all, the vectors are much longer...the Cray-1, introduced in 1975, had eight vector registers, each holding sixty-four 64-bit elements. Older memory-memory vector computers had even longer vectors. Vector computers also don't necessarily have functional units wide enough to perform an arithmetic operation on all 64 elements at once....in fact, a typical configuration in early vector computers had a single vector add/subtract unit, multiply unit, divide unit, and vector load/store unit.
At first glance it may not seem that these vector computers had any parallelism advantage over scalar machines...an add operation on two vector registers may still take 64 clock cycles. But what the long vector registers allow is a large space for explicitly named values....after the long startup time to load data from memory into the vector registers, the vector operations can proceed without interruption. The Cray-1 also introduced the practice of chaining, in which the partial results of one vector operation are immediately fed to another vector functional unit. So in the typical DAXPY operation (Y = a * X + Y), one add and one multiply can both be sustained each cycle. And since the elements in a vector register are inherently independent of each other, it is easier to construct deeply pipelined functional units, which (along with innovative circuit design) allowed the original Cray products to achieve astounding clock rates. Later vector computers added more parallel functional units, allowing, for example, two adds to be performed from the same two vector register operands each cycle. Again, the inherent independence of the vector elements makes the arbitration of parallel functional units easier.
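For reference, the DAXPY kernel mentioned above looks like this in scalar C. The point is that no iteration depends on any other, which is exactly the property that lets a chaining machine forward each multiply result straight into the add pipeline (this is just the semantics of the operation, not Cray code):

```c
#include <stddef.h>

/* DAXPY: Y = a*X + Y over n elements.  Each iteration is independent
   of every other, so on a vector machine with chaining the multiply
   and add units can each retire one element per cycle once their
   pipelines fill. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* no loop-carried dependence */
}
```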
Nowadays, the clock rate advantage of vector over scalar computers is no longer a factor...scalar computers adopted CMOS technology and single-chip designs more quickly, which allowed their clock rates to scale much faster. In addition, since scalar CPUs are produced in much, much higher volume than their vector counterparts, they typically have a much larger design budget. That budget pays for full-custom circuit design, versus the more automated design flows of modern vector CPUs, which again allows for higher clock rates.
There was, however, a proposed extension to the Alpha ISA that I read about a while back...it had, I believe, eight 64-element x 64-bit vector registers like the Cray, along with a number of vector functional units. There are a few major obstacles to high-volume, single-chip vector CPUs...first of all, the register file is huge: eight registers of sixty-four 64-bit elements comes to 4 KB, and larger configurations reach 16 KB. Having such a large register file on the critical path may hurt clock rates, and lead to register accesses that are pipelined over a few clock cycles.
And even if the added space and complexity of the vector register file and multiple vector functional units can be handled, memory communication is a problem. Vector computers cannot use caches in the same way scalar computers can; while caches are beneficial to scalar CPUs (data exhibits temporal and spatial locality: if data is accessed, it is likely to be accessed again soon, as is the data near it), vector computers often access data with large strides, making caches useless and even detrimental in some cases. While I've heard that there have been some advances in vector caches in recent years, vector computers have always needed VERY high-bandwidth main memory systems. To make a single-chip vector CPU work, it would need to integrate a sophisticated memory controller on-chip that can control a large number of memory banks. The added complexity of the memory controller and the large cost of a high-bandwidth memory system with a large number of banks would be difficult to swallow for a high-volume CPU.
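Here's a minimal sketch of the kind of strided access in question: walking a column of a row-major matrix. The function and sizes are made up for illustration; the cache-line arithmetic in the comment assumes a common 64-byte line:

```c
#include <stddef.h>

/* Column walk of a row-major n x n matrix of doubles: consecutive
   accesses are n elements (8*n bytes) apart.  With 64-byte cache
   lines, any n >= 8 puts every access in a different line, so the
   cache fetches 64 bytes to use only 8 of them -- the large-stride
   pattern described above.  Vector machines instead spread such
   accesses across many interleaved memory banks. */
double column_sum(const double *m, size_t n, size_t col) {
    double s = 0.0;
    for (size_t row = 0; row < n; row++)
        s += m[row * n + col];    /* stride of n elements */
    return s;
}
```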
I think vector computers are very interesting, but given that their scope is limited to niche scientific applications and their cost is large, it doesn't seem likely that a big-name company (Intel, AMD, IBM, Sun, etc.) would produce a high-volume, single-chip CPU with true vector extensions.
* not speaking for Intel Corp. *