Netburst was a bit of a braindead design: most of the resources that could have benefited SMT were eaten by its replay mechanism. If an instruction's data wasn't ready, instead of waiting for the data Netburst re-issued the instruction repeatedly until its data arrived. No wonder those CPUs ran...
Old single-threaded game engines were synced to the frame rate. In such a case the instructions per frame are pretty much constant. Multithreaded engines run the game/physics simulation asynchronously from the rendering engine, so instructions aren't totally tied to fps, but for the visible part that determines fps they still pretty much are, at...
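That decoupling is usually done with a fixed physics timestep plus an accumulator. A toy Python simulation (all numbers made up) shows physics stepping at ~60 Hz regardless of the render rate:

```python
def simulate(frame_ms, total_ms, physics_dt_ms=1000 // 60):
    # Fixed-timestep physics decoupled from a variable render rate:
    # physics "catches up" in whole steps however long a frame takes.
    accumulated = physics_steps = frames = 0
    for _ in range(total_ms // frame_ms):
        accumulated += frame_ms
        while accumulated >= physics_dt_ms:
            accumulated -= physics_dt_ms
            physics_steps += 1
        frames += 1
    return frames, physics_steps
```

Rendering at 40 fps or 100 fps both end up running the same ~62 physics steps over one simulated second; only the visual frame count changes.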
The game executes a given number of instructions per frame, if stupid things like spinning on locks are excluded. And if they're included, the performance that matters is still fps, not the count of useless instructions executed. So yeah, when comparing game performance, measure fps over any non-revealing metric.
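To make the fps-vs-IPC point concrete, here's a toy Python calculation (the numbers are made up): spin-wait instructions inflate the retired-instruction count and thus IPC, while the useful work per frame is unchanged:

```python
def ipc(instructions, cycles):
    # Retired instructions per clock cycle.
    return instructions / cycles

def frame_ipc(useful_instr, spin_instr, cycles):
    # A frame stalled on a lock retires extra spin-loop instructions
    # that produce no extra frames: IPC goes up, fps does not.
    return ipc(useful_instr + spin_instr, cycles)
```

Same frame, same cycles: with zero spin instructions IPC is 1.0, with an equal amount of spinning it's 2.0, yet the player sees exactly the same frame rate.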
IPC can be calculated for a game too, including the cycles stalled on locking, as said before. But the 5700 vs 5700X comparison removes that argument: both CPUs share the same 8-core CCX with similar clocks, and the 5700 actually has a slightly lower locking penalty since it has a built-in northbridge vs the external one in...
RDRAM was only used with Willamette and early Northwood; Intel switched to DDR RAM with Granite Bay long before Prescott. 32-bit Xeons had a 36-bit physical address space, hence the 64 GB RAM limit.
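The 64 GB figure follows directly from the address width, as a quick Python check shows:

```python
# 36 physical address bits -> 2**36 addressable bytes = 64 GiB.
PHYS_BITS = 36
max_ram_bytes = 2 ** PHYS_BITS
assert max_ram_bytes == 64 * 2 ** 30
```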
You only need to compare the 5700 against the 5700X. The 5700 has slightly better memory latencies from its unified design but only half the cache. It can also be compared to Zen 2 with its comparable 16 MB CCX caches. https://www.techspot.com/review/2802-amd-ryzen-5700/
Zen 3 doubled the per-thread cache. That's always a massive uplift for cache-sensitive applications, and in some places it gave a 100% IPC uplift. Direct AMD quote from the link posted before: "It also transitioned to a new "unified complex" design that brought 8 cores and 32MB of L3 cache into a single group of...
Separate cache pools waste much of the cache capacity on duplication. And AMD made it clear that the unified 32 MB cache pool of Zen 3 is responsible for most of the gaming speedup. https://www.amd.com/en/technologies/zen-core
The uop cache is about the energy-efficiency difference between decoding an instruction and fetching it from the mop cache. A mop cache takes silicon area, so for area efficiency it's better without one; simple instruction sets like ARM can live without a mop cache, but x86 just needs it to...
Didn't AMD say that the 512-bit FPU is optional in Zen 5 designs? Using 512-bit FPU pipelines and a 512-bit load/store engine and then trying to optimize that design for maximum density seems like a pretty backwards way to optimize a design.
The problem is that Intel extracted everything they could from the chip, and went too far: the chips became unstable. A big part of that is SMT; removing SMT from the chip design will reduce the chip's hot spots by simplifying critical-path routing. Have to wonder if anybody has actually tested what...
The point to note is that in Intel's hybrid CPU designs the best-case HT gain isn't 30% but at best 10%. It should be plainly obvious that SMT should be dropped from the P-cores, which should instead target the best possible single-thread speed, and highly-threaded workloads should be left to...
No, what they did with Zen 3 was use macro-ops instead of micro-ops: they reduced PRF usage by letting macro-ops transfer data directly between execution units, so they could increase concurrency without increasing PRF throughput.
In a hybrid CPU configuration the big cores are there for the best per-thread performance. If they still want to use SMT, the right cores to have it on are the E-cores: splitting the slow cores' performance in half for the best n-thread performance while still maintaining good single-thread performance elsewhere. Having SMT on their...
You do know that what you proposed means disabling HT. Splitting each core into two virtual cores = HT on; one thread per core = HT off. HT can of course be "disabled" by parking one core of each pair. But to get 100% single-thread performance, HT has to be disabled completely, since some hardware resources...
This was specifically a solution for utilizing wider cores. Vectorization (SIMD) only works when there are no dependencies between data; basically, dependencies have to be resolved at compile time. With loop unrolling it's also possible to resolve dependencies (calculate or predict variables from...
Many parts of loops aren't vectorizable but can be unrolled. Compilers do unroll loops to extract parallelism, but compile-time unrolling is more limited than runtime unrolling. Hardware loop unrolling is a pretty complicated scheme but has been known for ages, and today's hardware already has loop caches, which is...
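The vectorizable/unrollable split comes down to dependency structure, which a quick Python sketch can show (the structure is what matters, not the language): the first loop has fully independent iterations and maps straight onto SIMD lanes, while the second carries a dependency across iterations and so can only be unrolled, not naively vectorized:

```python
def double_each(a):
    # No cross-iteration dependency: each output needs only a[i],
    # so the iterations can run in parallel (SIMD-friendly).
    return [x * 2 for x in a]

def prefix_sum(a):
    # Loop-carried dependency: out[i] needs out[i-1], so iterations
    # cannot run independently; a compiler can only unroll this loop.
    out, total = [], 0
    for x in a:
        total += x
        out.append(total)
    return out
```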
There's something that might give good results on very wide cores and isn't utilized yet: hardware loop unrolling. Complex to do, but once done it makes it possible to run every iteration of a loop on its own hardware, making good use of very wide execution hardware. Though a proper ISA...
Everything can be buggy or just not work, so CPUs usually have the ability to switch off almost all performance features. But loading data through speculated pointers... why the hell does everything have to be general-purpose on today's CPUs? Why doesn't the ISA implement separate registers for data and...
The whole point of an NPU is optimized hardware for very short datatypes with simplified instructions. The FPU does very complex instructions on long datatypes, exactly the opposite optimization point from an NPU. And the FPU isn't actually a co-processor anymore; it's part of the ISA and it cannot be removed...
The speculation was about replacing the FPU with an NPU. The FPU doesn't usually even support FP16 math: its single precision is 32-bit and its double precision 64-bit (or more). OK, FPU SIMD today does also support FP16, but that's just an outlier for things like AI, which is what the NPU is aimed at. So no, you can't replace...
An NPU is the exact opposite of an FPU. Floating-point numbers have a floating point, so the representable range can be huge, like from 2^-64 to 2^64, and calculations can mix opposite extremes. An NPU instead relies on extremely short integers, like 4 or 8 bits, giving only 16 or 256 values. If we...
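A minimal Python sketch of what that 256-value range means in practice, assuming the common scale-factor quantization scheme used for int8 inference:

```python
def quantize_int8(x, scale):
    # int8 has only 256 representable values; a shared scale factor
    # maps the float range of interest onto [-128, 127].
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize(q, scale):
    # Recover an approximation; error is bounded by one scale step.
    return q * scale
```

Anything outside the chosen range simply saturates, which is exactly why NPU math works for neural-net weights but can't stand in for general floating point.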
Actually, higher leakage in a CPU means higher clocking potential. CPU manufacturers sometimes give away golden overclocking samples that are too leaky to be sold. Metal resistance grows with temperature, so keeping the silicon as cold as possible gives the transistors more current to switch...
I speculated in the RISC-V threads that CPU designs should move towards split register file designs. VISC is nothing more than a split-register-file core with a software abstraction layer, which isn't the way I think a split-register CPU should be made; instead it should rely on an ISA that allows using split...
Yeah, Intel should probably just have made Golden Cove much bigger and run 40 of them @ 2 GHz with at least 4-way SMT to be competitive with their rival's smaller cores. With such a strategy they'd soon be out of business. Actually they already are; have to wonder which parties are actually buying their...
That's absolutely not the core's fault. Intel packs them that way to easily add 4-core complexes to existing ring/mesh networks, but nothing stops them from implementing different L2 cache versions. They could for example make a 60-core E-core part with a fast 12 MB L2 per core in...
Intel's big cores are just wasted in their server-grade CPUs: they are enormous because they're designed to run at 6 GHz, yet they're clocked at about 3 GHz in the big configurations. Running them at 3 GHz with HT means their single-thread performance is actually lower than their small-core rivals'...
Intel HT is symmetrical threading. Both threads are equal: every other clock cycle, instructions are fed from the other thread. There is no primary/secondary thread; both threads execute at a bit more than half the speed of a single thread running alone on that core.
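That round-robin fetch can be modeled in a few lines of Python (a toy model, not how actual pipeline arbitration works in detail): each thread ends up with roughly half the core's fetch bandwidth, with neither one privileged.

```python
def round_robin_fetch(cycles):
    # Symmetric SMT toy model: alternate fetch between two threads
    # every cycle, so each gets ~half the core's fetch slots.
    fetched = [0, 0]
    for cycle in range(cycles):
        fetched[cycle % 2] += 1
    return fetched
```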
The big cores in hybrid designs are there to offer better per-thread performance. Using HT nullifies that, since HT splits a core's per-thread performance roughly in half. The only beneficial case for HT in those hybrid designs is massively parallelized loads where single-thread performance doesn't matter, and if...
Actually, 386 protected mode is just the same as the 286's, expanded to 32 bits, with a paging unit added. The 386 just allows misusing its segmentation through overlapping segments, which can be as big as the addressable memory. That sure makes programming easier, but it's actually a shame because 286/386...
x86 is hard to translate to any other ISA, yet Apple and Windows are doing it fine today for AArch64. ARM to RISC-V and the other way around is no big deal at all; you should expect near-native performance.
Legacy support ain't so important nowadays; it's pretty much enough to have an emulation layer for existing software. Google is building a RISC-V version of Android, and when it's ready and there's competitive hardware, RV will be a pretty viable alternative to ARM for phones...
The basic principle should be KISS: keep it simple, stupid. At Intel they time after time develop the exact opposite of that: iAPX 432, Itanium and so on. Probably no one is really on top of the design, so they put everything ever invented on the design list and try to...
They made the ISA implement everything a compiler could do, to make the core forward-proof without recompilation and to make the execution hardware as simple as possible: a strictly in-order design. They failed at both: code was not optimized for future hardware without recompiling, and the execution hardware...