You have to say, it's a real eye-opener when a dev calls 20% more performance "not unreasonable", using hardware capability that has basically always been there but has hardly ever been used.
We know Mantle will give huge gains in CPU-bottlenecked scenarios, but the compute gains, which are basically free, apply to GPU-bottlenecked scenarios. I mentioned compute shader lighting before, and we might see that in BF4, or some kind of post-process AA. I think they talk about other gains in the deep dive from BSN; 5% from the new memory model was one figure mentioned, I believe.
There is just so much here, so many new things that devs can take advantage of. It's unbelievable that it's taken so long to realise it.
There are slide decks from the GPU13 and APU13 presentations outlining some of the elements from which Mantle's performance gains will be derived. I'll list some of them here for the sake of contextual clarity.
GPU13 - DICE slide deck
-Very low overhead rendering, loading & streaming 
-Perfect parallel rendering - utilize all 8 CPU cores 
-Avoid bottlenecking the GPU and the system 
-Highly optimized GPU usage 
-Full access to graphics hardware capabilities 
-Lots of low-level optimizations made possible 
APU13 - DICE slide deck:
Control
-Thin low-level abstraction to expose how hardware works 
-App explicit memory management 
-Resource CPU access tied to device context 
-Resources are globally accessible  
-App explicit resource state
App responsibility
-Tell when render target will be used as a texture ‒ And many more resource state transitions 
-Don’t destroy resources that the GPU is using ‒ Keep track with fences or frames (see the deferred-destroy sketch after this list)
-Manual dynamic resource renaming ‒ No DISCARD for driver resource renaming
-Resource memory tiling 
-Powerful validation layer will help!
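On the "don't destroy resources the GPU is using" point: under an explicit API the app keeps in-flight resources alive itself. Below is a minimal C++ sketch of fence-tracked deferred destruction; the Resource type and the fence plumbing are placeholders I've made up, not Mantle's actual API.
[code]
#include <cstdint>
#include <deque>

// Placeholder for a GPU resource handle; in a real engine this would wrap
// whatever the API hands back (memory, image, buffer, ...).
struct Resource { /* ... */ };

class DeferredDeleter {
public:
    // Called when the app wants to "destroy" a resource that recently
    // submitted command buffers may still reference.
    void queueDestroy(Resource* r, uint64_t submittedFenceValue) {
        pending_.push_back({submittedFenceValue, r});
    }

    // Called once per frame with the last fence value the GPU has completed
    // (queried from the API's fence/queue primitives).
    void collect(uint64_t completedFenceValue) {
        while (!pending_.empty() && pending_.front().fence <= completedFenceValue) {
            delete pending_.front().resource;   // safe: GPU is done with it
            pending_.pop_front();
        }
    }

private:
    struct Entry { uint64_t fence; Resource* resource; };
    std::deque<Entry> pending_;   // ordered by submission, oldest first
};
[/code]
Each frame is submitted with an increasing fence value, and once the GPU reports a fence as complete, everything queued at or before that value can be freed safely.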
Explicit control enables
-App high-level decisions & optimizations ‒ Has full scene information
-Easier to optimize performance & memory 
-Flexible & efficient memory management ‒ Linear frame allocators ‒ Memory pools ‒ Pinned memory (a frame-allocator sketch follows this list)
-Reduced development time ‒ For advanced game engines & apps ‒ Easier to get to target performance & robustness
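The "linear frame allocators" item is one of the memory strategies explicit control makes trivial: grab one big block per in-flight frame, hand out sub-allocations with a bump pointer, and reset the whole thing once that frame's fence has passed. A sketch in plain C++ (nothing here is Mantle-specific):
[code]
#include <cstddef>
#include <cstdint>
#include <vector>

// One big buffer per in-flight frame; allocations are just a pointer bump,
// and the whole buffer is reset once the GPU has finished that frame.
class LinearFrameAllocator {
public:
    explicit LinearFrameAllocator(size_t bytes) : storage_(bytes), offset_(0) {}

    void* allocate(size_t size, size_t alignment = 16) {
        size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned + size > storage_.size())
            return nullptr;               // frame budget exceeded
        offset_ = aligned + size;
        return storage_.data() + aligned;
    }

    // Called at the start of the frame once its previous use is GPU-complete.
    void reset() { offset_ = 0; }

private:
    std::vector<uint8_t> storage_;
    size_t offset_;
};
[/code]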
Explicit control enables (continued)
-Transient resources ‒ Alias render targets within frame ‒ Major memory savings ‒ No need to pre-allocate everything (see the aliasing sketch after this list)
-Light-weight driver ‒ Easier to develop & maintain ‒ Reduced CPU draw call overhead
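The transient-resource/aliasing point is the same idea taken further: two render targets whose lifetimes don't overlap within a frame can be placed over the same memory. A rough sketch of the bookkeeping, with made-up MemoryBlock/RenderTarget placement types rather than real Mantle objects:
[code]
#include <cassert>
#include <cstddef>

// Placeholder types: one backing allocation, and a "placed" render target
// that is just a view over some range of it. Not Mantle's API.
struct MemoryBlock  { size_t size; };
struct RenderTarget { const MemoryBlock* memory; size_t offset; size_t size; };

// Within one frame:
//   pass A writes and then consumes ssaoHalfRes,
//   pass B, which runs strictly later, needs a bloom buffer of similar size.
// Because their lifetimes never overlap, both can sit at offset 0 of the
// same block - the second "allocation" costs no extra memory.
int main() {
    MemoryBlock frameBlock{ 8u * 1024 * 1024 };

    RenderTarget ssaoHalfRes { &frameBlock, 0, 4u * 1024 * 1024 };
    RenderTarget bloomHalfRes{ &frameBlock, 0, 4u * 1024 * 1024 };

    // The app (not the driver) guarantees the ordering that makes this legal,
    // e.g. with a resource state transition between the two passes.
    assert(ssaoHalfRes.memory == bloomHalfRes.memory);
    return 0;
}
[/code]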
CPU performance
Descriptor sets
-Table with resource references to bind to graphics or compute pipeline
(Diagram: descriptor set entries ‒ Image, Memory, Sampler, Link)
-Replaces traditional resource stage binding ‒ Major performance & flexibility advantage ‒ Closer to how the hardware works 
-Example 1: Single simple dynamic descriptor set ‒ Bind everything you need for a single draw call ‒ Close to DX/GL model but share between stages
(Diagram: dynamic descriptor set holding VertexBuffer (VS), Constants (VS), Texture0 (VS+PS), Texture1 (PS), Texture2 (PS), Sampler0 (VS+PS))
-App managed - lots of strategies possible! ‒ Tiny vs huge sets ‒ Single vs multiple ‒ Static vs semi-static vs dynamic
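To make the descriptor-set model above concrete: instead of binding each texture or buffer to a pipeline slot one call at a time, the app fills a small table of resource references and binds the whole table for the draw. A C++ sketch of the shape of it; all the types and names here are mine, not Mantle's:
[code]
#include <cstdint>
#include <vector>

// Placeholder handles for GPU objects.
struct ImageView  { uint32_t id; };
struct BufferView { uint32_t id; };
struct Sampler    { uint32_t id; };

// One slot in the table; the shader addresses resources by slot index.
struct DescriptorSlot {
    enum class Kind { Image, Buffer, Sampler } kind;
    uint32_t handle;
};

// Mirrors the "single simple dynamic descriptor set" example in the slides:
// everything one draw call needs, shared across VS and PS.
std::vector<DescriptorSlot> buildDrawDescriptors(BufferView vertexData,
                                                 BufferView constants,
                                                 ImageView tex0, ImageView tex1,
                                                 ImageView tex2, Sampler samp0) {
    return {
        { DescriptorSlot::Kind::Buffer,  vertexData.id }, // VS
        { DescriptorSlot::Kind::Buffer,  constants.id  }, // VS
        { DescriptorSlot::Kind::Image,   tex0.id },       // VS+PS
        { DescriptorSlot::Kind::Image,   tex1.id },       // PS
        { DescriptorSlot::Kind::Image,   tex2.id },       // PS
        { DescriptorSlot::Kind::Sampler, samp0.id },      // VS+PS
    };
    // The real API would then bind this whole table to the pipeline in one
    // call, rather than issuing one state-setting call per resource.
}
[/code]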
Command buffers
-Issue pipelined graphics & compute commands into a command buffer ‒ Bind graphics state, descriptor sets, pipeline ‒ Draw calls ‒ Render targets ‒ Clears ‒ Memory transfers ‒ NOT: resource mapping 
-Fully independent objects ‒ Create multiple every frame ‒ Or pre-build up front and reuse
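Because command buffers are fully independent objects, a natural pattern is to record stable parts of the frame once and simply resubmit them, while rebuilding only the dynamic parts. A toy sketch with made-up Command/submit stand-ins:
[code]
#include <cstdint>
#include <vector>

// Stand-ins for recorded GPU commands and a submission step; not Mantle's API.
struct Command { uint32_t opcode; uint64_t arg; };
using  CommandBuffer = std::vector<Command>;

void submit(const CommandBuffer& cb) { /* hand off to the GPU queue */ (void)cb; }

int main() {
    // Pre-built once: e.g. a post-processing chain that never changes.
    CommandBuffer postProcessing = { {1, 0}, {2, 0}, {3, 0} };

    for (int frame = 0; frame < 3; ++frame) {
        // Rebuilt every frame: scene rendering with this frame's visible set.
        CommandBuffer scene;
        scene.push_back({0, static_cast<uint64_t>(frame)});

        submit(scene);           // fresh buffer each frame
        submit(postProcessing);  // same pre-recorded buffer, reused as-is
    }
    return 0;
}
[/code]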
Parallel dispatch with Mantle
(Diagram: per-core frame timeline ‒ CPU 0: Game, Game, Game; CPU 1-4: Render, Render, Render)
-App can go fully wide with its rendering – minimal latency  
-Close to linear scaling with CPU cores  
-No driver threads – no overhead – no contention  
- Frostbite’s approach on all consoles – and on PC with Mantle!
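That parallel-dispatch picture maps almost directly onto ordinary job/thread code: each worker records its own command buffer with no shared driver state, and the main thread submits them in a fixed order afterwards. A minimal C++ sketch (Command and the submit step are placeholders):
[code]
#include <cstdint>
#include <thread>
#include <vector>

struct Command { uint32_t drawId; };
using  CommandBuffer = std::vector<Command>;

// Record one slice of the frame's draw calls; no shared driver state,
// so no locks are needed while recording.
CommandBuffer recordSlice(uint32_t firstDraw, uint32_t drawCount) {
    CommandBuffer cb;
    cb.reserve(drawCount);
    for (uint32_t i = 0; i < drawCount; ++i)
        cb.push_back({firstDraw + i});
    return cb;
}

int main() {
    const uint32_t workerCount = 4;        // "CPU 1-4" in the diagram above
    const uint32_t drawsPerWorker = 2000;

    std::vector<CommandBuffer> buffers(workerCount);
    std::vector<std::thread> workers;
    for (uint32_t w = 0; w < workerCount; ++w)
        workers.emplace_back([&, w] {
            buffers[w] = recordSlice(w * drawsPerWorker, drawsPerWorker);
        });
    for (auto& t : workers) t.join();

    // Submission stays ordered and single-threaded; only recording went wide.
    for (const auto& cb : buffers) { /* submit(cb) to the graphics queue */ (void)cb; }
    return 0;
}
[/code]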
GPU performance
GPU performance optimizations 
-Thanks to improved CPU performance – CPU will rarely be a bottleneck for the GPU ‒ CPU could help GPU more: ‒ Less brute force rendering ‒ Improve culling 
-Resource states ‒ Gives driver a lot more knowledge & flexibility ‒ Apps can avoid expensive/redundant transitions, such as surface decompression (see the transition sketch after this list)
-Expose existing GPU functionality
-Shader pipeline object – driver optimizations ‒ Can optimize with pipeline state knowledge ‒ Can optimize across all shader stages ‒ Quad & Rect-lists ‒ HW-specific MSAA & depth data access ‒ Programmable sample patterns ‒ And more..
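On resource states: the app declares exactly when a surface changes role, so the driver only does the expensive work (e.g. decompressing a render target) at that one declared point. A sketch of what that looks like from the app's side, with invented enum and command names:
[code]
#include <cstdio>

// Invented names purely for illustration; Mantle's real enums/calls differ.
enum class ResourceState { RenderTarget, ShaderRead };

struct Image { const char* name; ResourceState state; };

// The app declares the transition explicitly instead of the driver guessing
// and potentially decompressing/flushing redundantly.
void cmdTransition(Image& img, ResourceState next) {
    std::printf("transition %s: %d -> %d\n", img.name,
                static_cast<int>(img.state), static_cast<int>(next));
    img.state = next;
}

int main() {
    Image shadowMap{"shadowMap", ResourceState::RenderTarget};

    // ... render the shadow pass into shadowMap ...
    cmdTransition(shadowMap, ResourceState::ShaderRead);   // one declared point
    // ... main pass samples shadowMap as a texture; no hidden transitions ...
    return 0;
}
[/code]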
Queues
-Modern GPUs are heterogeneous machines with multiple engines ‒ Graphics pipeline ‒ Compute pipeline(s) ‒ DMA transfer ‒ Video encode/decode ‒ More…
-Mantle exposes queues for the engines + synchronization primitives
(Diagram: GPU exposing Graphics, Compute and DMA queues)
-Async DMA transfers ‒ Copy resources in parallel with graphics or compute
(Diagram: DMA queue runs Copy alongside Render / Other render on the Graphics queue, which then uses the copy)
- Multiple compute kernels collaborating ‒ Copy resources in parallel with graphics or compute ‒ Can be faster than über-kernel ‒ Example: Compute geometry backend & compute rasterizer 
-Async compute together with graphics ‒ ALU heavy compute work at the same time as memory/ROP bound work to utilize idle units
(Diagram: Compute 0 runs Compute Geometry, Compute 1 runs Compute Rasterizer, Graphics runs Ordinary Rendering)
-Compute as frontend for graphics pipeline ‒ Compute runs asynchronously ahead and prepares & optimizes geometry for graphics pipeline
(Diagram: compute Process0 runs ahead of graphics Draw1, Draw2)
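A queue-level sketch of the async-compute idea: graphics work and an ALU-heavy compute job go to different queues and are synchronised only where the graphics pass consumes the compute result. The Queue/Semaphore types below are CPU-side stand-ins for illustration, not the real Mantle primitives:
[code]
#include <cstdio>
#include <string>
#include <vector>

// Stand-ins for GPU queues and a GPU-side semaphore; illustration only.
struct Queue     { const char* name; std::vector<std::string> submitted; };
struct Semaphore { bool signaled = false; };

void submitWork(Queue& q, const std::string& work) { q.submitted.push_back(work); }
void signalSemaphore(Queue& q, Semaphore& s) { submitWork(q, "signal"); s.signaled = true; }
void waitSemaphore(Queue& q, Semaphore& s)   { submitWork(q, s.signaled ? "wait (signaled)" : "wait"); }

int main() {
    Queue graphics{"graphics"};
    Queue compute {"compute"};
    Semaphore tileLightListsReady;

    // ALU-heavy work overlaps with ROP/bandwidth-heavy work on the other queue.
    submitWork(compute, "cull lights per tile");
    signalSemaphore(compute, tileLightListsReady);

    submitWork(graphics, "g-buffer pass");          // runs concurrently with compute
    waitSemaphore(graphics, tileLightListsReady);   // sync only where the result is consumed
    submitWork(graphics, "tiled lighting pass");

    std::printf("graphics queue: %zu items, compute queue: %zu items\n",
                graphics.submitted.size(), compute.submitted.size());
    return 0;
}
[/code]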
Programmability
Explicit Multi-GPU 
-Explicit control of GPU queues and synchronization, finally! ‒ Implement your own Alternate-Frame-Rendering ‒ Or something more exotic.. (an AFR sketch follows this list)
-Use case: Workstation rendering with 4-8 GPUs ‒ Super high-quality rendering & simulation ‒ Load balance graphics & compute job graphs across GPUs ‒ 20-40 TFlops in a single machine! 
-Use case: Low-latency rendering ‒ Important for VR and competitive games ‒ Latency optimized GPU job graph scheduling ‒ VR: Simultaneously drive 2 GPUs (1 per eye)
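To make the alternate-frame-rendering case concrete: with explicit queue control, AFR is just the app submitting frame N to GPU (N mod 2) and waiting on that GPU's fence from two frames back before reusing its per-frame resources. A sketch with placeholder per-GPU state:
[code]
#include <cstdint>
#include <cstdio>

// Placeholder per-GPU state; a real implementation would hold a device,
// queues, and per-frame resources for each adapter.
struct GpuContext {
    int index;
    uint64_t lastSubmittedFrame = 0;
};

void submitFrame(GpuContext& gpu, uint64_t frame) {
    // Record + submit this frame's command buffers on gpu's own queues.
    gpu.lastSubmittedFrame = frame;
    std::printf("frame %llu -> GPU %d\n", (unsigned long long)frame, gpu.index);
}

void waitForGpuFrame(const GpuContext& gpu, uint64_t frame) {
    // Block (or better: poll a fence) until 'gpu' has finished 'frame',
    // so its per-frame memory can be reused.
    (void)gpu; (void)frame;
}

int main() {
    GpuContext gpus[2] = { {0}, {1} };

    for (uint64_t frame = 0; frame < 8; ++frame) {
        GpuContext& target = gpus[frame % 2];        // alternate frames between GPUs
        if (frame >= 2)
            waitForGpuFrame(target, frame - 2);      // last time this GPU was used
        submitFrame(target, frame);
    }
    return 0;
}
[/code]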
New mechanisms 
-Command buffer predication & flow control ‒ GPU affecting/skipping submitted commands ‒ Go beyond DrawIndirect / DispatchIndirect ‒ Advanced variable workloads ‒ Advanced culling optimizations 
-Write occlusion query results into GPU buffer ‒ No CPU roundtrip needed ‒ Can drive predicated rendering ‒ Or use results directly in shaders (lens flares)
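The lens-flare example, sketched out: draw a small occlusion proxy, have the GPU copy the query result into a buffer, and predicate (or shade) the flare from that value so the CPU never reads the query back. Every cmd* name below is invented purely for illustration and is not a real Mantle entry point:
[code]
#include <string>
#include <vector>

// A toy command stream standing in for a recorded command buffer.
using CommandBuffer = std::vector<std::string>;

void buildLensFlareCommands(CommandBuffer& cb) {
    cb.push_back("cmdBeginOcclusionQuery(sunQuery)");
    cb.push_back("cmdDraw(sunProxyQuad)");                 // tiny quad at the sun's position
    cb.push_back("cmdEndOcclusionQuery(sunQuery)");

    // GPU writes the visible-sample count straight into a buffer the
    // flare shader (or a predicated draw) can read - no CPU round trip.
    cb.push_back("cmdCopyQueryResultToBuffer(sunQuery, visibilityBuffer)");
    cb.push_back("cmdDrawPredicated(lensFlare, visibilityBuffer)");
}

int main() {
    CommandBuffer cb;
    buildLensFlareCommands(cb);
    return cb.size() == 5 ? 0 : 1;
}
[/code]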
Bindless resources 
-Mantle supports bindless resources ‒ Shaders can select resources to use instead of static binding from CPU ‒ Extension of the descriptor set support
-Examples ‒ Performance optimizations – less data to update ‒ Logic & data structures that live fully on the GPU ‒ Scene culling & rendering ‒ Material representations
-Key component that will open up a lot of opportunities! ‒ Deferred shading ‒ Raytracing
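On the host side, bindless boils down to building one big table of resources once and passing a per-draw index into it, instead of rebinding descriptors per draw. A rough sketch of the CPU bookkeeping, with placeholder types:
[code]
#include <cstdint>
#include <vector>

// Placeholder texture handle; with bindless, the shader picks textures by
// index out of a large descriptor table instead of using fixed slots.
struct TextureHandle { uint32_t id; };

struct Material {
    uint32_t albedoIndex;   // index into the global texture table
    uint32_t normalIndex;
};

struct BindlessTable {
    std::vector<TextureHandle> textures;   // bound once, stays resident

    uint32_t add(TextureHandle t) {
        textures.push_back(t);
        return static_cast<uint32_t>(textures.size() - 1);
    }
};

int main() {
    BindlessTable table;

    // Register everything up front; per-draw work is now just writing indices.
    Material brick { table.add({101}), table.add({102}) };
    Material metal { table.add({201}), table.add({202}) };

    // Per draw: push {albedoIndex, normalIndex} in a small constant buffer and
    // let the shader index the table directly - no descriptor rebinds.
    (void)brick; (void)metal;
    return 0;
}
[/code]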
There's also a Nixxes slide deck here with numerous additional examples.
Operating on the logical assumption that each of the Mantle advantages listed above will result in some performance gain, and given that the list is by no means a complete delineation of the gains Mantle can enable, totalling up the individual improvements would logically result in a very substantial figure indeed. As the order-of-magnitude increase in draw calls alone can enable a 20% performance increase, a 'guesstimate' of a 40% overall performance increase would not be amiss.