If AVX2 really is that good, how come it hasn't been added years ago? Is it hard to implement in HW, or what has been holding it back?
First of all, it's not like there wasn't anything before AVX2:
1996 - Pentium MMX: 64-bit integer (INT) vector instructions, 64-bit INT execution units
1999 - Pentium 3: 128-bit floating-point (FP) vector instructions, 64-bit execution units
2001 - Pentium 4: 128-bit FP + INT vector instructions, still 64-bit execution units
2006 - Core 2: 128-bit FP + INT vector instructions, 128-bit execution units, but lower frequency
2011 - Sandy Bridge: 256-bit FP vector instructions, 256-bit FP execution units
2013 - Haswell: 256-bit FP + INT vector instructions, 256-bit FP + INT execution units, gather support, FMA support
You have to think of doubling the vector width as almost an alternative to doubling the number of cores. If Haswell didn't have AVX2 but came with 8 cores by default, it would have been a pretty great chip too, and nobody would be asking "why didn't they think of this before?". Likewise, AVX2 is great from a performance perspective for a certain range of applications, but it's not so extraordinary from a technological point of view that you'd have to wonder why it wasn't done earlier. They've been adding vector capabilities to x86 since 1996, and the transistor budget has been the main limiting factor.
So why did they choose AVX2 over doubling the number of cores? It's all about balance. Software contains varying levels of task parallelism (TLP) and data parallelism (DLP). TLP is extracted using more cores/threads, while DLP is best extracted using wider vectors. DLP can actually also be extracted using more cores, but programming for many cores is hard while using wide vector instructions is relatively easy. Last but not least, wider vectors are also more power efficient than more cores.
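To make the DLP point concrete, here's a rough sketch of a data-parallel loop in C. The function name `add_arrays` is made up for illustration. The vector path is only compiled when the compiler has AVX enabled (e.g. `-mavx` with GCC/Clang); otherwise it falls back to the plain scalar loop, which is the same work done one element at a time.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical helper: c[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(const float *a, const float *b, float *c, size_t n)
{
    assert(n == 0 || (a && b && c));
    size_t i = 0;
#ifdef __AVX__
    /* One 256-bit instruction processes 8 floats at a time. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when AVX is not enabled. */
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Every iteration is independent of the others, which is what makes this DLP. In practice you'd usually just write the scalar loop and let the compiler vectorize it.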
What does set AVX2 apart from every previous vector instruction set extension is gather support: the ability to read multiple memory locations in parallel. This is a big deal because it enables the compiler to easily vectorize scalar code that reads memory through computed indices. So why wasn't gather added earlier? Well, it only starts making a real difference when the vectors are quite wide, so it wasn't worth the transistors before. Wide vectors and gather kind of belong together, but once you have both they enable a whole new programming model.
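As a rough illustration of what gather buys you, here's a table lookup in C. The name `lookup` is hypothetical. The AVX2 path is only compiled when AVX2 is enabled (e.g. `-mavx2`) and uses the `_mm256_i32gather_epi32` intrinsic; without it you get the scalar loop, which is the kind of indexed-load code that was awkward to vectorize before gather.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Hypothetical helper: out[i] = table[idx[i]] for i in [0, n).
   Every idx[i] must be a valid index into table. */
void lookup(const int32_t *table, const int32_t *idx, int32_t *out, size_t n)
{
    assert(n == 0 || (table && idx && out));
    size_t i = 0;
#ifdef __AVX2__
    for (; i + 8 <= n; i += 8) {
        /* Load 8 indices, then fetch 8 table entries with one gather. */
        __m256i vi = _mm256_loadu_si256((const __m256i *)(idx + i));
        /* Scale 4: each index counts 4-byte elements. */
        __m256i v = _mm256_i32gather_epi32((const int *)table, vi, 4);
        _mm256_storeu_si256((__m256i *)(out + i), v);
    }
#endif
    /* Scalar tail, and the whole loop when AVX2 is not enabled. */
    for (; i < n; ++i)
        out[i] = table[idx[i]];
}
```

Without gather, the vector version would have to extract each index, do eight separate scalar loads and reassemble the vector by hand, which eats most of the benefit of going wide.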