Well, if that is the case I guess we'll be stuck with x86 more or less forever, unless something changes. No point taking the penalty of switching ISA if the benefit is a mere 10% performance increase.
If you run Windows, and use software published across many different years, x86 is all you want, either way.
If not that, what's the big deal with switching? It's a pain on unpopular chips, since much software gets packaged so long as it compiles, but if many others have the same chips (or comparable ones), what's this penalty you're worried about? You're not alone in that, but geez, come on over to the FOSS side, and whip yourself up a little desktop or server on an old Mac, or an RPi, or something. It's not that bad. Really.
But what about power efficiency, then? Could a completely new ISA make any gains there? I mean, ARM is chosen for more or less all really low-power devices (smartphones and the like), so I guess its ISA must have an advantage over x86 when it comes to power efficiency? Is it, e.g., easier to implement a power-efficient CPU with the ARM ISA than with the x86 ISA?
Yes, but Intel has been willing to throw lots of R&D at the problem of x86's front end, which historically has had to run at full speed and is inherently low-ILP*. ARM's ISA has several features, most notably the folded shift+add and conditional instructions, that allowed it to offer decent performance without cache, and to use fewer instructions per unit of work than other simple in-order RISCs. RISCs in general, with fixed instruction sizes, also make it easier to build wide decoders. But if you want a wide OOOE CPU that needs to stay 100+ cycles ahead of your RAM, all of that amounts to very little, and RISCs need a bit more cache bandwidth and size (L1D for constants and such, L1I for bigger instructions), so the minor advantages even out.
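To make the shift+add point concrete, here's a toy sketch (Python standing in for assembly; the mnemonics in the comments are illustrative, not exact encodings): forming the address of `a[i]` for 4-byte elements takes one ARM data-processing instruction, but two dependent instructions on a plain RISC without shifted operands.

```python
# Toy register machine illustrating ARM's folded shift: one instruction
# does the work a plain RISC needs two for. Register names and mnemonics
# are illustrative, not real encodings.

regs = {"r1": 0x1000, "r2": 7}   # r1 = base address, r2 = index

# ARM-style:  add r0, r1, r2, LSL #2   (shift folded into the operand)
arm_instructions = 1
regs["r0"] = regs["r1"] + (regs["r2"] << 2)

# Plain in-order RISC: a separate shift, then an add -- two dependent
# instructions occupying two issue slots.
risc_instructions = 2
t0 = regs["r2"] << 2
r0_risc = regs["r1"] + t0

assert regs["r0"] == r0_risc == 0x101C   # same result either way
print(arm_instructions, risc_instructions)  # 1 2
```

Same answer, fewer instructions: on a cacheless in-order core, where every fetch costs you, that's a real win; on a wide OOOE core it's mostly noise.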
ARM has had advantages in their CPU designs, which have cyclically turned into ISA advantages for those kinds of CPUs. But when you're talking about 1GHz+, 2+ wide OOOE CPUs, that's all crap. x86 was designed for low-memory, cost-sensitive devices, and has some benefits and problems from it. ARM was designed for low-power, cost-sensitive devices, and has some pros and cons from that. But neither can make your RAM faster.
POWER, with its virtual memory, or MIPS/SPARC implemented in a way that respects register windows, would have some difficulties. But other than that, what needs to be worked out will differ, more than any one ISA being flat-out superior. And both x86 and ARM carry some legacy from the days of expensive xtors (transistors) and RAM, but both also either did important things pretty well (like virtual memory), or left them open to programmer/compiler interpretation (both allow more than one way to handle memory moving, setting, and copying, for instance).
IoW, the big issue with performance is that changing addresses of DRAM arrays is slow: as they get smaller, R may get much smaller, but L and/or C get relatively larger as it's all packed closer together. Adding more pins for more, narrower channels gets more expensive per unit of bandwidth (and won't use standard DIMMs), and still only offers minor gains, since you won't know which channel the data you need next will be on. The farther ahead of RAM you need to stay (which depends on clock speed and desired IPC), the more cache you'll need, and the better your speculation will need to be (also, a bigger cache than your workload actually needs can allow for more aggressive and/or sloppier prefetching). The actual concepts behind those are surprisingly simple, though far from intuitive; actually implementing them so they can advance every clock with a rolling history, while switching fairly few gates per cycle... frankly, that's impressive as all hell, to me.
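You can put rough numbers on "staying ahead of RAM" with the latency-bandwidth product (Little's law). The figures below are illustrative round numbers, not measurements of any particular part:

```python
# Back-of-the-envelope: how far ahead of DRAM must a core run?
# All numbers are illustrative round figures, not any specific CPU.

mem_latency_ns = 100     # ballpark DRAM round-trip latency
channel_gb_s = 25.6      # one DDR4-3200 channel at peak, in GB/s
line_bytes = 64          # cache line size

# Little's law: to keep the channel busy across that latency, this many
# bytes (and thus cache-line requests) must be in flight at once.
bytes_in_flight = mem_latency_ns * channel_gb_s   # GB/s * ns = bytes
lines_in_flight = bytes_in_flight / line_bytes
print(f"~{lines_in_flight:.0f} outstanding cache lines")  # ~40

# At 4 GHz, 100 ns is 400 cycles; at 4 instructions/cycle, the machine
# must find ~1600 instructions of useful work (or prefetch well enough)
# just to hide one demand miss.
ghz, ipc = 4.0, 4.0
cycles = mem_latency_ns * ghz
print(f"{cycles:.0f} cycles, ~{cycles * ipc:.0f} instructions to hide")
```

None of that arithmetic cares whether the instructions were x86 or ARM, which is the point: the memory wall dwarfs the ISA.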
Many of the older advantages and disadvantages, like register and stack allocation, slow memory copying**, and interlocking between arithmetic and address operations, have largely been either designed away, thanks to research and cheaper xtors, or compiled away, thanks to research and cheaper memory.
* You must decode instruction 1's length before you can start on instruction 2, and 2 before 3, and so on; with ARM, you just have to watch for Thumb, and otherwise decode 32 bits at a time.
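The footnote's dependency structure can be sketched in a few lines. This is a toy model, not a real x86 or ARM decoder: here the first byte of each "instruction" simply encodes its length, standing in for the real prefix/opcode/ModRM dance.

```python
# Toy model of the decode-serialization footnote. Instruction lengths
# are made up; only the dependency structure matters.

def variable_length_starts(stream):
    """x86-style: each instruction's start depends on the previous one's
    decoded length, so finding boundaries is inherently serial."""
    starts, pos = [], 0
    while pos < len(stream):
        starts.append(pos)
        # Real x86 derives length from prefixes/opcode/ModRM/SIB; here
        # the first byte just encodes the length, as a stand-in.
        pos += stream[pos]
    return starts

def fixed_width_starts(stream, width=4):
    """ARM-style (ignoring Thumb): every instruction is `width` bytes,
    so all boundaries are known up front and N decoders can attack N
    instructions in parallel."""
    return list(range(0, len(stream), width))

# Fake byte stream: first byte of each "instruction" is its length.
stream = [3, 0, 0, 1, 2, 0, 4, 0, 0, 0]
print(variable_length_starts(stream))   # [0, 3, 4, 6]
print(fixed_width_starts(bytes(8)))     # [0, 4]
```

The `while` loop's data dependence (each `pos` comes from the previous iteration) is exactly what hardware length-decoders have to break with speculation or brute force; the fixed-width version has no loop-carried dependence at all.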
** Well, this isn't a problem that's gone away, but it's now a software/hardware optimization problem, not a "these CPUs are really bad at this because of the ISA" thing.