I asked Greg a few things about Veyron afterward:
They can fuse non-adjacent instructions, so no compiler support is needed.
LMUL is handled late in the pipeline, so LMUL>1 instructions are scheduled as single instructions and only split at issue.
The talk is up on youtube:
Biggest additional info, not present in the slides: They plan to release an Athena (8x Ascalon) devboard and even a laptop for people to buy as a development platform.
This graph is actually the biggest source of SPEC2006/GHz estimates for newer processors I've seen so far.
Since RISC-V companies still mostly publish SPEC2006/GHz, I added some additional cores into the graph (without release date):
Full slides here: https://riscv.or.jp/wp-content/uploads/Japan_RISC-V_day_Spring_2025_compressed.pdf
The reported SPECint scores for Ascalon don't really match up with "Projected Zen5 performance in 2024".
Callandor looks insane though, 16-wide decode with a 2-ahead branch predictor and...
It's worse than that: apart from the A64FX, every SVE implementation reuses the NEON ALUs for SVE, AFAIK.
So on the Neoverse V1, you can use four-issue 128-bit NEON or two-issue 256-bit SVE.
Which makes the gain from SVE minimal.
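To spell out the arithmetic behind that claim, here is a minimal sketch that multiplies issue width by vector width. The function name `peak_bits_per_cycle` is made up for illustration, and it only models peak ALU throughput, ignoring loads, stores and front-end limits.

```c
#include <assert.h>

/* Peak vector ALU throughput in bits per cycle:
   number of vector instructions issued per cycle times vector width.
   Hypothetical helper, only meant to illustrate the arithmetic. */
unsigned peak_bits_per_cycle(unsigned issue_width, unsigned vector_bits) {
    assert(issue_width > 0 && vector_bits > 0);
    return issue_width * vector_bits;
}
```

With the numbers above, `peak_bits_per_cycle(4, 128)` and `peak_bits_per_cycle(2, 256)` both come out at 512 bits per cycle, so the wider SVE vectors don't add peak ALU throughput over NEON.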
Take the AVX10 spec, change the encodings of AVX10/128, AVX10/256 and AVX10/512 to overlap, remove the 0.1% of instructions that don't make sense anymore, and add an instruction that returns the vector length.
Now you have a scalable vector ISA and it's possible to write length agnostic code.
This maps...
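As a rough sketch of what "length agnostic" means in practice, here is a plain C model of such a loop. `get_vector_lanes` is a hypothetical stand-in for the "return the vector length" instruction, `hw_lanes` is an assumed knob to simulate different hardware, and the inner scalar loop models what would be a single (masked) vector add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: number of 32-bit lanes of the simulated hardware,
   e.g. 4 for 128-bit, 8 for 256-bit, 16 for 512-bit vectors. */
size_t hw_lanes = 8;

/* Hypothetical stand-in for an instruction that returns the vector length. */
size_t get_vector_lanes(void) { return hw_lanes; }

/* The same code works for any vector length: it never hardcodes the width. */
void add_arrays(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
    size_t lanes = get_vector_lanes();
    assert(lanes > 0);
    for (size_t i = 0; i < n; i += lanes) {
        /* The last iteration may cover fewer elements (the tail). */
        size_t chunk = (n - i < lanes) ? (n - i) : lanes;
        /* Models one vector add over `chunk` active lanes. */
        for (size_t j = 0; j < chunk; j++)
            dst[i + j] = a[i + j] + b[i + j];
    }
}
```

Changing `hw_lanes` changes how many iterations the outer loop takes, but not the result, which is the whole point of writing the loop against a queried length instead of a fixed one.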
https://godbolt.org/z/13xc73T3n
You just need to include v in the march string.
I also added -mrvv-max-lmul=dynamic, because that ends up with better codegen (tries to maximize LMUL).
IMO it should be the default option.
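For anyone who wants to try this locally, here is a minimal sketch. The loop is just an illustrative example and not the code from the godbolt link above, and the compiler name and `-O3` in the comment are assumptions about a typical GCC cross toolchain; only the `v` in the march string and `-mrvv-max-lmul=dynamic` come from the post.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed invocation (toolchain name and -O3 are guesses for a typical setup):
 *   riscv64-unknown-linux-gnu-gcc -O3 -march=rv64gcv -mrvv-max-lmul=dynamic -S saxpy.c
 * The "v" in rv64gcv enables the vector extension for auto-vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Comparing the assembly with and without `-mrvv-max-lmul=dynamic` should show which LMUL the compiler picked for the loop.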
@Nothingness Give me some micro benchmarks where you see RISC-V codegen as lacking, and I'll try to benchmark it on XiangShanV2 and XiangShanV3 in the next couple of days.
clang O2 vectorizes it on both arm and RISC-V: https://godbolt.org/z/TGvWKWch3
Both do 4 adds per loop, but Arm takes 10 instructions, while RISC-V takes 8. If we are fair and expand the load pair and the LMUL=2 instructions, then we get 12 uops for Arm and 20 uops for RISC-V.
clang currently defaults...
Except the important part, the inner loop, is 8 instructions for RISC-V and 6 instructions for Arm; however, it's 8 uops for both: https://godbolt.org/z/vMv4G98zf
Oh, and the RISC-V inner loop is 22 bytes, while the Arm inner loop is 24 bytes.
I've responded to the original Twitter thread, so I'll just copy-paste my comment on r/riscv that paraphrases the answers:
Regarding RVC decode complexity
I think the decode is missing part of the picture.
For a fixed-size ISA to be competitive it needs to have more complex...
I think it's because that's what the other RISC-V vendors use, as soon as one of them switches to SPEC2017 the others will likely follow.
Keep in mind that Tenstorrent for Ascalon and SiFive for the P870 both report >18 SPECint2006/GHz, but Tenstorrent has 8-wide decode, while the P870 has 6-wide decode (and...
The lane size is usually the vector length though. Or to put it differently, the width of the vector execution units, specifically for vector permutations, is usually the same as the vector length.
Look at AVX512, there you've got a low latency high throughput vpermb, heck even vpermi2b, which...