Blender on ARM


jhu

Lifer
Oct 10, 1999
11,918
9
81
It's not like Blender is very representative of common loads for an apps processor that's deployed almost entirely in mobile (phones and tablets). In that space, double-precision FP is rarely used heavily. I don't know what else Blender really stresses - I doubt it's just doubles, or Saltwell wouldn't perform that well either - but I do know it's an application that isn't popular on the platform in general.

That said, Krait 200 came in products about 10 months before Cortex-A15 did, so it's not like Qualcomm had it as an alternative. As far as Cortex-A9 goes, Krait 200 usually beats it, although not always - especially when it's running at its peak frequencies and not at the low frequency you have it clocked at. Krait 300 and 400 improve things a little further, but maybe not as much as Qualcomm would have hoped. The performance is really all over the place vs the competition. It does seem to have some pretty big glass jaws, like small L1 caches, a fairly high L1 dcache latency when the L0 cache is missed (and some loads will probably miss it pretty frequently), a very high L2 cache latency, and some weird decoding penalties - see here:

http://www.7-cpu.com/cpu/Krait.html
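
Latency numbers like the ones on that page are usually taken with a dependent pointer-chasing loop, where each load has to wait for the previous one to finish. A rough sketch of the technique - my own illustration, not the 7-cpu.com code, with arbitrary buffer sizes and iteration count:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Build a random cycle over the buffer (Sattolo-style shuffle) so the
 * hardware prefetcher can't hide the miss latency. */
static void build_chain(void **chain, size_t n)
{
    size_t *order = malloc(n * sizeof *order);
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        chain[order[i]] = &chain[order[(i + 1) % n]];
    free(order);
}

int main(void)
{
    /* Sweep working sets from 4 KiB to 8 MiB to walk past L1 and L2. */
    for (size_t bytes = 4 * 1024; bytes <= 8 * 1024 * 1024; bytes *= 2) {
        size_t n = bytes / sizeof(void *);
        void **chain = malloc(n * sizeof *chain);
        if (!chain)
            return 1;
        build_chain(chain, n);

        struct timespec t0, t1;
        void **p = chain;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < 10000000; i++)
            p = *p;                    /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%7zu KiB: %5.1f ns per load (%p)\n",
               bytes / 1024, ns / 10000000.0, (void *)p);
        free(chain);
    }
    return 0;
}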

Now that's just looking at performance; power efficiency and area are also huge factors, so it's hard to judge the core on that basis alone.

I think going with Cortex-A57 in their current flagship (810) is a way of conceding that their uarch has fallen too far behind. Not that adding 64-bit support is trivial, but if it came down to only that I think they could have managed it in time.

What I find interesting is that even the stock core performance is all over the place. My tests have the Cortex-A9-based OMAPs being faster than the Cortex-A9-based Exynos, but the Exynos device still feels faster.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
What I find interesting is that even the stock core performance is all over the place. My tests have the Cortex-A9-based OMAPs being faster than the Cortex-A9-based Exynos, but the Exynos device still feels faster.

How much different? I see for this test you've listed OMAP4 but no Exynos 4 SoC.

It's not surprising there'd be some difference in performance even between different implementations of ARM CPUs. There's still a fair number of variables, like configurable cache sizes, TLB/BTB/GHB sizes, L2 latencies, AXI link count/width, and processor revision (where newer ones can have optimizations, or fix errata that required disabling performance-improving features). Perhaps the biggest variable is memory controller performance, where Samsung's SoCs have traditionally outpaced TI's. On the other hand, OMAP4s had twice the L2 cache Exynos 4s did. So that could explain why they trade places in different tests/experiences.
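
To give a concrete idea of the memory-controller point: a sequential read over a buffer much larger than L2 mostly measures the memory subsystem rather than the core. A rough sketch - my own illustration, with an arbitrary 64 MiB buffer and repeat count, not anything from jhu's runs:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
    size_t n = 16u * 1024 * 1024;          /* 16M x 4 bytes = 64 MiB, well past any L2 */
    uint32_t *buf = malloc(n * sizeof *buf);
    if (!buf)
        return 1;
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint32_t)i;

    struct timespec t0, t1;
    uint64_t sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 8; rep++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];                 /* sequential reads stream from DRAM */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double mib = 8.0 * n * sizeof *buf / (1024.0 * 1024.0);
    printf("%.0f MiB read in %.2f s = %.0f MiB/s (sum %llu)\n",
           mib, sec, mib / sec, (unsigned long long)sum);
    free(buf);
    return 0;
}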
 
Dec 30, 2004
12,553
2
76
Wow, shows what I know about LLVM vs GCC performance.

also, that's funny regarding mpdecision

also, branch prediction

also, without custom CPU-arch-specific tweaking and another compile flag, the Exynos design may differ enough from TI's closer-to-stock A9 implementation, which lends itself more readily to the generic Cortex-A9 optimizations.
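
For what it's worth, the kind of flag I mean is GCC's per-core tuning; whether it would actually move these numbers is a guess on my part. Something like the toy kernel below, built two ways:

Code:
/* Toy hot loop just to illustrate the flags; it is not Blender code.
 *
 *   Generic ARMv7 build:
 *     gcc -O2 -march=armv7-a -mfpu=neon -mfloat-abi=hard -c kernel.c
 *   Cortex-A9 tuned build (same ISA, A9-specific scheduling):
 *     gcc -O2 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mfloat-abi=hard -c kernel.c
 *
 * -mtune only changes instruction selection/scheduling for the named
 * pipeline, so the tuned binary still runs on any ARMv7-A core. */
float dot(const float *a, const float *b, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}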
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
How much different? I see for this test you've listed OMAP4 but no Exynos 4 SoC.

From my Povray thread, it's about 20% higher IPC. I no longer have the Galaxy S2.

It's not surprising there'd be some difference in performance even between different implementations of ARM CPUs. There's still a fair number of variables, like configurable cache sizes, TLB/BTB/GHB sizes, L2 latencies, AXI link count/width, and processor revision (where newer ones can have optimizations, or fix errata that required disabling performance-improving features). Perhaps the biggest variable is memory controller performance, where Samsung's SoCs have traditionally outpaced TI's. On the other hand, OMAP4s had twice the L2 cache Exynos 4s did. So that could explain why they trade places in different tests/experiences.

That's interesting, and confusing!
 
Dec 30, 2004
12,553
2
76
LLVM will have good performance, eventually. Just as it took GCC a while to have good performance.
I just didn't know it was 'so far' behind. From what I heard regarding its use in PS4 game development, I thought it had reached parity with GCC and was even outperforming it.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
I just didn't know it was 'so far' behind. From what I heard regarding its use in PS4 game development, I thought it had reached parity with GCC and was even outperforming it.

They have other reasons for using LLVM. Plus, they're making optimization changes and sending their changes upstream. So it may be that LLVM is now faster than GCC on Jaguar.
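
If anyone wants to check that, the straightforward test is to build the same FP-heavy kernel with both compilers targeting Jaguar and time them; -march=btver2 is the Jaguar target both GCC and Clang accept. The kernel below is just a placeholder I made up, not Blender code:

Code:
/*   gcc   -O3 -march=btver2 mandel.c -o mandel-gcc
 *   clang -O3 -march=btver2 mandel.c -o mandel-clang
 *   time ./mandel-gcc && time ./mandel-clang          */
#include <stdio.h>

int main(void)
{
    /* Small FP-heavy kernel (Mandelbrot escape counts) just to have
     * something measurable. */
    long total = 0;
    for (int py = 0; py < 2000; py++) {
        for (int px = 0; px < 2000; px++) {
            double cr = -2.0 + 3.0 * px / 2000.0;
            double ci = -1.5 + 3.0 * py / 2000.0;
            double zr = 0.0, zi = 0.0;
            int it = 0;
            while (zr * zr + zi * zi < 4.0 && it < 256) {
                double t = zr * zr - zi * zi + cr;
                zi = 2.0 * zr * zi + ci;
                zr = t;
                it++;
            }
            total += it;
        }
    }
    printf("checksum %ld\n", total);
    return 0;
}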
 
Dec 30, 2004
12,553
2
76
They have other reasons for using LLVM. Plus, they're making optimization changes and sending their changes upstream. So it may be that LLVM is now faster than GCC on Jaguar.

I don't believe they were comparing against GCC 4.8.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Updated with Bay Trail CPU results. The original samples/s figure was off by a factor of 4, but the relative performance between CPUs remains unchanged. Atom sure has come a long way since the Bonnell days.

Seems the samples/s figures are still off. I'll update them later.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Added NVidia Tegra K1, 32-bit (Denver, Nexus 9) results. The 64-bit version keeps giving segfaults; still looking into this issue.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Updated with 64-bit Tegra K1 (Denver) results. It's significantly faster than the 32-bit result. Image rendering results are the same, so it's not an anomalous result.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Updated with 64-bit Tegra K1 (Denver) results. It's significantly faster than the 32-bit result. Image rendering results are the same, so it's not an anomalous result.

Does Blender use a lot of double precision math? This is not part of the SIMD in ARMv7 but is in ARMv8 AArch64. So it can offer a big performance gain.

nVidia says they actually do double precision vectorization in their translation but I doubt that it can do as well as a compiler.
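
As an illustration of the AArch64 point: a plain double-precision loop like the one below can be auto-vectorized in an AArch64 build, since Advanced SIMD there has 2-wide float64 lanes, while an ARMv7 build has to fall back to scalar VFP doubles. My own example, not Blender code:

Code:
/* Double-precision add. AArch64 NEON has 2 x float64 lanes, ARMv7 NEON has
 * none, so typically:
 *   aarch64: gcc -O3 -S dadd.c            -> packed fadd v0.2d, ...
 *   armv7-a: gcc -O3 -mfpu=neon -S dadd.c -> scalar VFP vadd.f64 */
void dadd(double *restrict dst, const double *restrict a,
          const double *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}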
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Does Blender use a lot of double precision math? This is not part of the SIMD in ARMv7 but is in ARMv8 AArch64. So it can offer a big performance gain.

nVidia says they actually do double precision vectorization in their translation but I doubt that it can do as well as a compiler.

For rendering, it's single precision, hence the effort at CUDA rendering support (which is significantly faster than CPU rendering).
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
Does Blender use a lot of double precision math? This is not part of the SIMD in ARMv7 but is in ARMv8 AArch64. So it can offer a big performance gain.
IIRC gcc will not vectorize even single precision for ARMv7 unless you compile with -ffast-math (due to vectorized SP not being IEEE-compliant on v7).
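
Concretely, for a single-precision loop like the one below, an ARMv7 GCC build usually stays on scalar VFP unless -ffast-math (or -funsafe-math-optimizations) says the non-IEEE NEON behaviour (flush-to-zero, no denormals) is acceptable. My own minimal example:

Code:
/* Single-precision add. NEON SP on ARMv7 flushes denormals, so GCC only
 * auto-vectorizes once it may relax IEEE conformance; typically:
 *   gcc -O3 -mfpu=neon -mfloat-abi=hard -S sadd.c             -> scalar VFP
 *   gcc -O3 -mfpu=neon -mfloat-abi=hard -ffast-math -S sadd.c -> NEON vadd.f32 q... */
void sadd(float *restrict dst, const float *restrict a,
          const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}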
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Can we get 486 and Pentium results?

Two issues here make that unfeasible.

1) I don't have a 486 or Pentium
2) Rendering the benchmark requires about 200MiB of RAM. You'll generally not see this much RAM on such old systems.
 

nismotigerwvu

Golden Member
May 13, 2004
1,568
33
91
I have an A8-3850-based system I'm using as an HTPC. I can run some benchmarks on it if you have any interest in seeing whether Llano performs any differently than Thuban/Deneb.
 