Blender on ARM


jhu

Lifer
Oct 10, 1999
11,918
9
81
It's not like Blender is very representative of common loads for an apps processor that's deployed almost entirely in mobile (phones and tablets). In that space, double-precision FP is rarely used heavily. I don't know what else Blender really stresses - I doubt it's just doubles, or Saltwell wouldn't perform that well either - but I do know it's an application that isn't popular on the platform in general.

That said, Krait 200 came in products about 10 months before Cortex-A15 did, so it's not like Qualcomm had it as an alternative. As far as Cortex-A9 goes, Krait 200 usually beats it, although not always - especially when it's running at its peak frequencies and not at the low frequency you have it clocked at. Krait 300 and 400 improve things a little further, but maybe not as much as Qualcomm would have hoped. The performance is really all over the place vs the competition. It does seem to have some pretty big glass jaws, like small L1 caches, a fairly high L1 dcache latency when the L0 cache is missed (and some loads will probably miss it pretty frequently), a very high L2 cache latency, and some weird decoding penalties - see here:

http://www.7-cpu.com/cpu/Krait.html
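
Latency numbers like the ones on that page are usually taken with a dependent pointer-chasing loop, where each load has to wait for the previous one to finish. A rough sketch of the technique - my own illustration, not the 7-cpu.com code, with arbitrary buffer sizes and iteration count:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Build a random cycle over the buffer (Sattolo-style shuffle) so the
 * hardware prefetcher can't hide the miss latency. */
static void build_chain(void **chain, size_t n)
{
    size_t *order = malloc(n * sizeof *order);
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        chain[order[i]] = &chain[order[(i + 1) % n]];
    free(order);
}

int main(void)
{
    /* Sweep working sets from 4 KiB to 8 MiB to walk past L1 and L2. */
    for (size_t bytes = 4 * 1024; bytes <= 8 * 1024 * 1024; bytes *= 2) {
        size_t n = bytes / sizeof(void *);
        void **chain = malloc(n * sizeof *chain);
        if (!chain)
            return 1;
        build_chain(chain, n);

        struct timespec t0, t1;
        void **p = chain;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < 10000000; i++)
            p = *p;                    /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%7zu KiB: %5.1f ns per load (%p)\n",
               bytes / 1024, ns / 10000000.0, (void *)p);
        free(chain);
    }
    return 0;
}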

Now that's just looking at performance; power efficiency and area are also huge factors, so it's hard to judge the core on that basis alone.

I think going with Cortex-A57 in their current flagship (810) is a way of conceding that their uarch has fallen too far behind. Not that adding 64-bit support is trivial, but if it came down to only that I think they could have managed it in time.

What I find interesting is that even the stock core performance is all over the place. My tests have the Cortex-A9-based OMAPs being faster than the Cortex-A9-based Exynos, but the Exynos device still feels faster.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
What I find interesting is that even the stock core performance is all over the place. My tests have the Cortex-A9-based OMAPs being faster than the Cortex-A9-based Exynos, but the Exynos device still feels faster.

How much different? I see for this test you've listed OMAP4 but no Exynos 4 SoC.

It's not surprising there'd be some difference in performance even between different implementations of ARM CPUs. There's still a fair number of variables, like configurable cache sizes, TLB/BTB/GHB sizes, L2 latencies, AXI link count/width, and processor revision (where newer ones can have optimizations, or fix errata that required disabling performance-improving features). Perhaps the biggest variable is memory controller performance, where Samsung's SoCs have traditionally outpaced TI's. On the other hand, OMAP4s had twice the L2 cache Exynos 4s did. So that could explain why they trade places in different tests/experiences.
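
To give a concrete idea of the memory-controller point: a sequential read over a buffer much larger than L2 mostly measures the memory subsystem rather than the core. A rough sketch - my own illustration, with an arbitrary 64 MiB buffer and repeat count, not anything from jhu's runs:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
    size_t n = 16u * 1024 * 1024;          /* 16M x 4 bytes = 64 MiB, well past any L2 */
    uint32_t *buf = malloc(n * sizeof *buf);
    if (!buf)
        return 1;
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint32_t)i;

    struct timespec t0, t1;
    uint64_t sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 8; rep++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];                 /* sequential reads stream from DRAM */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double mib = 8.0 * n * sizeof *buf / (1024.0 * 1024.0);
    printf("%.0f MiB read in %.2f s = %.0f MiB/s (sum %llu)\n",
           mib, sec, mib / sec, (unsigned long long)sum);
    free(buf);
    return 0;
}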
 
Dec 30, 2004
12,553
2
76
Wow, shows what I know about LLVM vs GCC performance.

also, that's funny regarding mpdecision

also, branch prediction

also, without custom CPU-arch-specific tweaking and another compile flag, the Exynos design may differ enough from TI's closer-to-stock A9 implementation, which lends itself more readily to the generic Cortex-A9 optimizations.
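
For what it's worth, the kind of flag I mean is GCC's per-core tuning; whether it would actually move these numbers is a guess on my part. Something like the toy kernel below, built two ways:

Code:
/* Toy hot loop just to illustrate the flags; it is not Blender code.
 *
 *   Generic ARMv7 build:
 *     gcc -O2 -march=armv7-a -mfpu=neon -mfloat-abi=hard -c kernel.c
 *   Cortex-A9 tuned build (same ISA, A9-specific scheduling):
 *     gcc -O2 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mfloat-abi=hard -c kernel.c
 *
 * -mtune only changes instruction selection/scheduling for the named
 * pipeline, so the tuned binary still runs on any ARMv7-A core. */
float dot(const float *a, const float *b, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}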
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
How much different? I see for this test you've listed OMAP4 but no Exynos 4 SoC.

From my Povray thread, it's about 20% higher IPC. I no longer have the Galaxy S2.

It's not surprising there'd be some difference in performance even between different implementations of ARM CPUs. There's still a fair number of variables, like configurable cache sizes, TLB/BTB/GHB sizes, L2 latencies, AXI link count/width, and processor revision (where newer ones can have optimizations, or fix errata that required disabling performance-improving features). Perhaps the biggest variable is memory controller performance, where Samsung's SoCs have traditionally outpaced TI's. On the other hand, OMAP4s had twice the L2 cache Exynos 4s did. So that could explain why they trade places in different tests/experiences.

That's interesting, and confusing!
 
Dec 30, 2004
12,553
2
76
LLVM will have good performance, eventually. Just as it took GCC a while to have good performance.
I just didn't know it was 'so far' behind. From what I heard regarding its use in PS4 game development, I thought it had reached parity with GCC and was even outperforming it.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
I just didn't know it was 'so far' behind. From what I heard regarding its use in PS4 game development, I thought it had reached parity with GCC and was even outperforming it.

They have other reasons for using LLVM. Plus, they're making optimization changes and sending their changes upstream. So it may be that LLVM is now faster than GCC on Jaguar.
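
If anyone wants to check that, the straightforward test is to build the same FP-heavy kernel with both compilers targeting Jaguar and time them; -march=btver2 is the Jaguar target both GCC and Clang accept. The kernel below is just a placeholder I made up, not Blender code:

Code:
/*   gcc   -O3 -march=btver2 mandel.c -o mandel-gcc
 *   clang -O3 -march=btver2 mandel.c -o mandel-clang
 *   time ./mandel-gcc && time ./mandel-clang          */
#include <stdio.h>

int main(void)
{
    /* Small FP-heavy kernel (Mandelbrot escape counts) just to have
     * something measurable. */
    long total = 0;
    for (int py = 0; py < 2000; py++) {
        for (int px = 0; px < 2000; px++) {
            double cr = -2.0 + 3.0 * px / 2000.0;
            double ci = -1.5 + 3.0 * py / 2000.0;
            double zr = 0.0, zi = 0.0;
            int it = 0;
            while (zr * zr + zi * zi < 4.0 && it < 256) {
                double t = zr * zr - zi * zi + cr;
                zi = 2.0 * zr * zi + ci;
                zr = t;
                it++;
            }
            total += it;
        }
    }
    printf("checksum %ld\n", total);
    return 0;
}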
 
Dec 30, 2004
12,553
2
76
They have other reasons for using LLVM. Plus, they're making optimization changes and sending their changes upstream. So it may be that LLVM is now faster than GCC on Jaguar.

I don't believe they were comparing against GCC 4.8.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Updated with Bay Trail CPU results. The original samples/s figure was off by a factor of 4, but the relative performance between CPUs remains unchanged. Atom sure has come a long way since the Bonnell days.

Seems the samples/s figures are still off. I'll update them later.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Added NVidia Tegra K1, 32-bit (Denver, Nexus 9) results. The 64-bit version keeps giving segfaults; still looking into this issue.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Updated with 64-bit Tegra K1 (Denver) results. It's significantly faster than the 32-bit result. Image rendering results are the same, so it's not an anomalous result.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Updated with 64-bit Tegra K1 (Denver) results. It's significantly faster than the 32-bit result. Image rendering results are the same, so it's not an anomalous result.

Does Blender use a lot of double precision math? This is not part of the SIMD in ARMv7 but is in ARMv8 AArch64. So it can offer a big performance gain.

nVidia says they actually do double precision vectorization in their translation but I doubt that it can do as well as a compiler.
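
As an illustration of the AArch64 point: a plain double-precision loop like the one below can be auto-vectorized in an AArch64 build, since Advanced SIMD there has 2-wide float64 lanes, while an ARMv7 build has to fall back to scalar VFP doubles. My own example, not Blender code:

Code:
/* Double-precision add. AArch64 NEON has 2 x float64 lanes, ARMv7 NEON has
 * none, so typically:
 *   aarch64: gcc -O3 -S dadd.c            -> packed fadd v0.2d, ...
 *   armv7-a: gcc -O3 -mfpu=neon -S dadd.c -> scalar VFP vadd.f64 */
void dadd(double *restrict dst, const double *restrict a,
          const double *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}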
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Does Blender use a lot of double precision math? This is not part of the SIMD in ARMv7 but is in ARMv8 AArch64. So it can offer a big performance gain.

nVidia says they actually do double precision vectorization in their translation but I doubt that it can do as well as a compiler.

For rendering, it's single precision, hence the effort at CUDA rendering support (which is significantly faster than CPU rendering).
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
Does Blender use a lot of double precision math? This is not part of the SIMD in ARMv7 but is in ARMv8 AArch64. So it can offer a big performance gain.
IIRC gcc will not vectorize even single precision for ARMv7 unless you compile with -ffast-math (due to vectorized SP not being IEEE-compliant on v7).
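
Concretely, for a single-precision loop like the one below, an ARMv7 GCC build usually stays on scalar VFP unless -ffast-math (or -funsafe-math-optimizations) says the non-IEEE NEON behaviour (flush-to-zero, no denormals) is acceptable. My own minimal example:

Code:
/* Single-precision add. NEON SP on ARMv7 flushes denormals, so GCC only
 * auto-vectorizes once it may relax IEEE conformance; typically:
 *   gcc -O3 -mfpu=neon -mfloat-abi=hard -S sadd.c             -> scalar VFP
 *   gcc -O3 -mfpu=neon -mfloat-abi=hard -ffast-math -S sadd.c -> NEON vadd.f32 q... */
void sadd(float *restrict dst, const float *restrict a,
          const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}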
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Can we get 486 and Pentium results?

Two issues here make that unfeasible.

1) I don't have a 486 or Pentium
2) Rendering the benchmark requires about 200MiB of RAM. You'll generally not see this much RAM on such old systems.
 

nismotigerwvu

Golden Member
May 13, 2004
1,568
33
91
I have an A8-3850-based system I'm using as an HTPC. I can run some benchmarks on it if you have any interest in seeing whether Llano performs any differently than Thuban/Deneb.
 