For instance, a misprediction will stall the pipeline, and as a consequence there's no computation being done in the back end, hence it doesn't use as much power as when the pipeline is correctly filled and the execution units are performing actual work.
This is true of course, but you also use more energy in the frontend, you can create cache pollution, etc. I didn't mean to imply that a cache miss will always reduce instantaneous power consumption, just that there could be some scenarios where it does, if not on the prediction side then on the speculative execution side. Instantaneous power consumption isn't really something people think about much compared to task energy.
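Not anyone's actual methodology here, just a rough way to see the instantaneous-power vs task-energy distinction yourself: a minimal C sketch that runs a predictable-branch loop and a hard-to-predict one, then reports wall time, package energy, and average power. It assumes Linux with Intel RAPL exposed through the powercap sysfs (the path may differ or need root on your machine), and the array size and pass count are arbitrary.

    /* Rough sketch, not a rigorous benchmark. Assumes Linux + Intel RAPL
     * via the powercap sysfs; the energy node path may differ per system. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16M elements, arbitrary */

    static long long read_energy_uj(void) {
        /* Assumed RAPL node; returns -1 if it can't be read. */
        FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
        long long uj = -1;
        if (f) { if (fscanf(f, "%lld", &uj) != 1) uj = -1; fclose(f); }
        return uj;
    }

    static void run(const char *label, const unsigned char *data) {
        struct timespec t0, t1;
        long long e0 = read_energy_uj();
        clock_gettime(CLOCK_MONOTONIC, &t0);
        volatile long sum = 0;
        for (int pass = 0; pass < 50; pass++)
            for (long i = 0; i < N; i++)
                if (data[i] < 128)           /* the branch under test */
                    sum += data[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long long e1 = read_energy_uj();
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        if (e0 >= 0 && e1 >= e0) {
            double joules = (e1 - e0) / 1e6;
            printf("%s: %.2f s, %.2f J, %.2f W avg (sum=%ld)\n",
                   label, secs, joules, joules / secs, (long)sum);
        } else {
            printf("%s: %.2f s (energy counter unavailable, sum=%ld)\n",
                   label, secs, (long)sum);
        }
    }

    int main(void) {
        unsigned char *predictable = malloc(N), *noisy = malloc(N);
        if (!predictable || !noisy) return 1;
        for (long i = 0; i < N; i++) {
            predictable[i] = 0;          /* branch always taken: easy to predict */
            noisy[i] = rand() & 0xff;    /* ~50/50: frequent mispredictions */
        }
        run("predictable branch", predictable);
        run("random branch     ", noisy);
        free(predictable);
        free(noisy);
        return 0;
    }

Comparing the two runs shows time and total joules moving together even when the average wattage barely changes, which is the task-energy framing rather than the instantaneous one.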
The better proof is on the cache side.
The likelihood of increasing IPC while decreasing power is low. Not because it's impossible, but because it would imply the previous approach was less efficient and used more transistors to achieve that worse result.
There aren't many places in a modern CPU where you could expect to swap out hardware for asymptotically better performance while reducing the transistor count at the same time. That fruit was picked a long time ago.
There's also area to consider rather than just active transistors.
I somewhat agree on the low-hanging fruit part. In terms of the most fundamental aspects, yeah, there's not much left, at least not much that our human faculties can observe... probably.
In terms of more derived aspects, AMD leaves low-hanging fruit all the time, for example for the sake of agility: Zen2 with its dual CCXes, Zen4c with that plus a total lack of optimization for its target frequency while retaining many disadvantages of a more speed-demon-focused architecture, etc.
I'm also not sure we've seen the last Core 2 moment, Maxwell moment, etc. Before those monumental leaps forward, it also didn't seem like there was much room left for fundamental improvements. Maxwell especially was fewer transistors for more frequency and performance (and better performance/flop) at substantially less power on the same node, despite being clearly derived from Kepler, which shows that Kepler had a lot of fruit left to pick despite seeming advanced at the time.
Almost anything that reduces memory latency (at least in terms of raw cycle count, since absolute access time may decrease just as clock speed increases, keeping the ratio similar even though both get faster) is a result of adding extra cache, which requires more transistors. Those transistors need more power, but if they keep the rest of the CPU better fed, they might reduce the overall energy used for some workload even though the CPU is drawing more power at any individual point during the run.
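To put numbers on that power-versus-energy point, a toy calculation where every figure is invented purely for illustration: a chip that draws more watts but finishes the same job sooner can still end up spending fewer joules.

    /* Toy arithmetic only; the wattages and runtimes are made up. */
    #include <stdio.h>

    int main(void) {
        double p_small = 50.0, t_small = 10.0;  /* W and s without the extra cache */
        double p_big   = 55.0, t_big   =  8.0;  /* more W, but fewer stalls so less time */
        printf("small cache: %.0f J\n", p_small * t_small);  /* 500 J */
        printf("big cache:   %.0f J\n", p_big * t_big);      /* 440 J */
        return 0;
    }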
I would assume the extra transistors for cache usually consume less, but there are also the elements of cache size versus timing (Apple gets the best of both worlds with its L1) and organization. Zen2->Zen3 going from 2 CCXes per CCD to 1 didn't need to grow the cache physically.
If we froze our process tech and it never improved, newer chips would see a gradual increase in power use. Efficiency for some workloads could still improve if the extra transistors let the work finish fast enough that the higher power draw never accumulates past the previous energy total.
Perhaps in the long run being stuck in that situation would drive engineers to rework existing solutions to use less power, even if only because there's no additional room to increase the power or add transistors. But generally, it's been node shrinks reducing both capacitance and the required voltage that have led to lower power requirements.
I think you might see a few cycles like that. Reducing voltage wouldn't stop being a focus right away. There would be some space to take a break from constant shrinkage and optimize for the "final node." What happens next also depends on whether that process becomes increasingly economical or not.