@ Pelov
Yes, it should have a dynamically resizable L2 cache. It's more of a power-saving technique, since a single PD core within a module can already use up the whole L2 if possible. What they will need to do is improve L2 read and write performance, and this is what they claim they achieved (especially write). OoO loads and stores are also more efficient (supposedly).
Wouldn't that introduce problems too? Say, bumping up another thread's stores to LLC which isn't going to get any better, judging from AMD's last statements regarding Steamroller? For power consumption that would do wonders, though. With that much cache, you need a way of toggling that on and off.
The reads and writes have improved, although they still lag behind Intel.
IMO, this should be AMD's #1 concern: cache and IMC. Cache is still the most likely reason for the PD architecture underperforming in lightly-threaded workloads, and the IMC is going to be the biggest bottleneck in their APUs. They've made improvements, even if the L2's cycles don't seem to have changed, but not enough to be notable.
A more CMT-aware OS helps avoid non-optimal loading, which I believe is one of the reasons Linux comparisons of Bulldozer weren't quite as terrible as on Windows 7.
AMD advise against loading every module first and then doubling up on threads once you've hit them all. The power consumption while doing that would be much worse than it already is. The Win8 scheduler doesn't do that either. The scheduler is much more complicated, taking into account like threads sharing resources, turbo, and more.
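A toy sketch of the two placement policies, to show why spreading threads across modules first keeps more of the chip powered up. It assumes a 4-module, 8-core part; the function names and the "spread"/"pack" labels are hypothetical, and this is nowhere near what a real scheduler does (no turbo, no data sharing between threads):

```python
def place_threads(n_threads, policy, n_modules=4, cores_per_module=2):
    """Return the number of threads placed on each module (toy model)."""
    if n_threads > n_modules * cores_per_module:
        raise ValueError("more threads than cores")
    load = [0] * n_modules
    for i in range(n_threads):
        if policy == "spread":
            # One thread per module first, then double up.
            module = i % n_modules
        elif policy == "pack":
            # Fill both cores of a module before waking the next one.
            module = i // cores_per_module
        else:
            raise ValueError("unknown policy: " + policy)
        load[module] += 1
    return load


def active_modules(load):
    """Count modules that have at least one thread and so cannot be gated."""
    return sum(1 for threads in load if threads > 0)


if __name__ == "__main__":
    for policy in ("spread", "pack"):
        load = place_threads(4, policy)
        print(policy, load, "active modules:", active_modules(load))
```

With four threads, "spread" keeps all four modules awake, while "pack" leaves two of them idle at the cost of each pair of threads sharing a module's front end and L2.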
The two biggest reasons BD/PD do better in Linux are 1 - Linux tends to have fairly well threaded software, and 2 - the GCC compiler.