Blender on ARM

jhu

Lifer
Oct 10, 1999
11,918
9
81
Using the BMW blend file here.

Custom compiled binaries (Blender 2.71)
Code:
01. Core i5 6200U (2.7 GHz turbo, 2C/4T)        508011 samples/s ; 94076 samples/s/core/GHz
02. Core i5 3317U (2.4 GHz turbo, 2C/4T)        368928 samples/s ; 76860 samples/s/core/GHz
03. Core i7 2600 (3.5 GHz turbo, 4C/8T)        1032258 samples/s ; 73732 samples/s/core/GHz
04. Core i5 6200U (2.8 GHz turbo, 1 thread)     188505 samples/s ; 67323 samples/s/GHz
05. Core i5 4570 (3.4 GHz turbo, 4C/4T)         915012 samples/s ; 67280 samples/s/core/GHz
06. Core i5 3317U (2.6 GHz turbo, 1 thread)     156340 samples/s ; 60130 samples/s/GHz
07. FX 8350 (4.1 GHz turbo, 4M/8T)              962942 samples/s ; 58716 samples/s/module/GHz
08. Core i5 2400S (2.5 GHz turbo, 4C/4T)        599758 samples/s ; 57699 samples/s/core/GHz
09. FX 8350 (4.1 GHz turbo, 4M/8T)              876268 samples/s ; 53431 samples/s/module/GHz
10. Core 2 Duo E8400 (3 GHz, 2C/2T)             267630 samples/s ; 44605 samples/s/core/GHz
11. Tegra K1 (Denver 64-bit, 1.02 GHz 2C/2T)     78307 samples/s ; 38386 samples/s/core/GHz
12. Phenom II x6 (3.2 GHz, 6C/6T)               706364 samples/s ; 36790 samples/s/core/GHz
13. FX 8350 (4.2 GHz turbo, 1 thread)           151348 samples/s ; 36035 samples/s/GHz
14. PowerPC 970MP (2.0 GHz, 2C/2T)              133486 samples/s ; 33371 samples/s/core/GHz
15. FX 8350 (4.2 GHz turbo, 1 thread)           135564 samples/s ; 32277 samples/s/GHz
16. PowerPC 750 (underclocked 0.4 GHz, 1 thread) 11329 samples/s ; 28323 samples/s/GHz
16. PowerPC 7400 (0.466 GHz, 1 thread)           12896 samples/s ; 27673 samples/s/GHz
17. AMD E-450 (1.65 GHz, 2C/2T)                  81854 samples/s ; 24804 samples/s/core/GHz
18. Atom Z3735F (1.33 GHz, 4C/4T)               152314 samples/s ; 24100 samples/s/core/GHz
19. Tegra K1 (Denver, 1.02 GHz 2C/2T)            44099 samples/s ; 21617 samples/s/core/GHz
20. Exynos 5250 (1.7 GHz; 2C/2T)                 65434 samples/s ; 19245 samples/s/core/GHz
21. Atom N270 (1.6 GHz; 1C/2T)                   28980 samples/s ; 18113 samples/s/GHz
22. Pentium D 805 (2.66 GHz, 64-bit, 2C/2T)      81512 samples/s ; 15321 samples/s/core/GHz
23. TI OMAP 4470 (1.5 GHz, 2C/2T)                43042 samples/s ; 14347 samples/s/core/GHz
24. Atom N270 (1.6 GHz, 1C/1T)                   17706 samples/s ; 11067 samples/s/GHz
25. Pentium 3 (0.45 GHz, 1C/1T)                   4970 samples/s ; 11044 samples/s/GHz
26. MSM 8974ABv3 (1.037 GHz, 4C/4T)              44398 samples/s ; 10703 samples/s/core/GHz
27. APQ 8064 (1.026 GHz, 4C/4T)                  33238 samples/s ;  8099 samples/s/core/GHz

System notes:
  • 01, 04: Debian 8.2, gcc 4.9 -march=haswell
  • 02, 03, 06: Ubuntu 14.04, gcc 4.8 -march=core-avx-i
  • 05: Ubuntu 14.04, gcc 4.8 -mtune=core-avx2
  • 07, 13: Ubuntu 14.04, gcc 4.8 -march=bdver1 -mtune=bdver2
  • 09, 15: FreeBSD 10, llvm 3.3 -march=bdver2
  • 08, 10: Ubuntu 14.04, gcc 4.8 -march=core2
  • 11: Debian 9 (stretch, 64-bit), gcc 5.3 (Nexus 9)
  • 12: Ubuntu 14.04, gcc 4.8 -march=barcelona
  • 14: Debian 8.1, gcc 4.9, -mcpu=generic64 -mtune=970 (PowerMac 11,2 @ 2.0 GHz)
  • 16: Debian 8.1, gcc 4.9 -mcpu=7400 (PowerMac 3,4 @ 466 MHz)
  • 17: Ubuntu 14.04, gcc 4.8 -march=btver1
  • 18: Debian 8.1, gcc 4.9.2 -march=silvermont (ASUS EeeBook X205TA, 64-bit, max turbo 1.83 GHz, 1.58 GHz at constant 4 thread load)
  • 19: Debian 8.2 (32-bit), gcc 4.9.2 -march=cortex-a15 (Nexus 9)
  • 20: Debian 8, gcc 4.8 -march=armv7 -mtune=cortex-a15 (Samsung Chromebook)
  • 21: Debian 8.1, gcc 4.9 -march=bonnell
  • 22: Ubuntu 14.04, gcc 4.8 -march=nocona
  • 23: Debian 8, gcc 4.8 -march=armv7 -mtune=cortex-a9 (Barnes & Noble Nook HD+)
  • 24: Debian 8.1, gcc 4.9 -march=bonnell
  • 25: Debian 8.1, gcc 4.9 -march=pentium3
  • 26: Debian 8, gcc 4.8 -march=armv7 -mtune=cortex-a9 (HTC One M8, max clock 2.2 GHz, underclocked to 1.037 GHz; I needed to rename /system/bin/mpdecision, otherwise it overrides the CPU clock speed)
  • 27: Debian 8, gcc 4.8 -march=armv7 -mtune=cortex-a9 (Nexus 4, max clock 1.5 GHz, underclocked to 1.026 GHz)
Unable to compile the ARM versions with NEON support. Oh well.
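For anyone wanting to reproduce the second column of the table, here is a minimal Python sketch of the normalisation as I read it: raw samples/s divided by the number of cores (or modules, for the FX) times the clock in GHz. The function name `per_core_per_ghz` is made up for illustration; the figures are copied from the table above.

```python
def per_core_per_ghz(samples_per_s: float, units: int, ghz: float) -> float:
    """Normalise throughput by core (or module) count and clock in GHz."""
    return samples_per_s / (units * ghz)


# (label, samples/s, cores or modules, clock in GHz), taken from the table above
rows = [
    ("Core i5 6200U, 2C/4T", 508011, 2, 2.7),
    ("FX 8350, 4M/8T", 962942, 4, 4.1),
    ("Tegra K1 Denver 64-bit, 2C/2T", 78307, 2, 1.02),
]

for name, samples, units, ghz in rows:
    value = per_core_per_ghz(samples, units, ghz)
    print(f"{name:32s} {value:8.0f} samples/s/core/GHz")
```

For single-thread rows the same function applies with `units=1`, which reduces it to samples/s/GHz.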

Pre-compiled binaries (Blender 2.71 or 2.72)
Code:
00. Core i7 4770K (3.7 GHz turbo, 4C/8T)        1135846 samples/s ; 76746 samples/s/core/GHz
01. Core i5 3317U (2.4 GHz turbo, 2C/4T)         324000 samples/s ; 67500 samples/s/core/GHz
02. Core i7 2600 (3.5 GHz turbo, 4C/8T)          905660 samples/s ; 64690 samples/s/core/GHz
03. Core i5 4570 (3.4 GHz turbo, 4C/4T)          838156 samples/s ; 61629 samples/s/core/GHz
04. Core i5 3317U (2.6 GHz turbo, 1 thread)      140298 samples/s ; 53961 samples/s/GHz
05. FX 8350 (4.1 GHz turbo, 4M/8T)               898206 samples/s ; 54768 samples/s/module/GHz
06. Core i5 2400S (2.5 GHz turbo, 4C/4T)         529872 samples/s ; 52987 samples/s/core/GHz
07. Core 2 Duo E8400 (3 GHz, 2C/2T)              254586 samples/s ; 42431 samples/s/core/GHz
08. Core 2 Duo (2.2 GHz, 2C/2T)                  162370 samples/s ; 36902 samples/s/core/GHz
09. Phenom II x6 (3.2 GHz, 6C/6T)                680940 samples/s ; 35465 samples/s/core/GHz
10. FX 8350 (4.2 GHz turbo, 1 thread)            141896 samples/s ; 33785 samples/s/GHz
11. PowerPC 7400 (0.466 GHz, 1 thread)            11354 samples/s ; 24365 samples/s/GHz
12. AMD E-450 (1.65 GHz, 2C/2T)                   76898 samples/s ; 23302 samples/s/core/GHz
13. Atom Z3735F (1.33 GHz, 4C/4T)                146436 samples/s ; 23170 samples/s/core/GHz
14. Atom N270 (1.6 GHz, 1C/2T)                    26218 samples/s ; 16386 samples/s/GHz
15. Exynos 5250 (1.7 GHz; 2C/2T)                  54284 samples/s ; 15965 samples/s/core/GHz
16. Pentium D 805 (2.66 GHz, 64-bit, 2C/2T)       75026 samples/s ; 14102 samples/s/core/GHz
17. TI OMAP 4470 (1.5 GHz, 2C/2T)                 36720 samples/s ; 12240 samples/s/core/GHz
18. Atom N270 (1.6 GHz, single thread)            16814 samples/s ; 10509 samples/s/GHz
19. Pentium !!! (0.45 GHz, single thread)          4326 samples/s ;  9613 samples/s/GHz
20. Pentium 4m (1.5 GHz)                          10176 samples/s ;  6784 samples/s/GHz
21. MSM 8974ABv3 (??? GHz, 4C/4T)                 61880 samples/s ;  ???? samples/s/GHz

All the systems I have access to (not necessarily mine). Results are 64-bit, except for the processors that aren't (Exynos 5250, MSM 8974, Pentium 4m)

System notes:
  • 00: Nothingness's system running Fedora 19, official Blender 2.71 binary
  • 01 - 07, 09, 10, 12, 14, 16, 18 : Ubuntu 14.04, official Blender 2.71 binary
  • 08: Mac OS X 10.6.8, official Blender 2.71 binary
  • 11: Debian 8.1, Debian repository 2.72 binary
  • 13: Debian 8.1, official Blender 2.71 binary (ASUS EeeBook X205TA, 64-bit)
  • 15: Debian 8 in chroot running on top of ChromeOS 39.0.2171.96, Debian repository Blender 2.72b binary (Samsung Chromebook)
  • 17: Debian 8 in chroot running on top of Android 4.3, Debian repository Blender 2.72b binary (Barnes & Noble Nook HD+ running Cyanogenmod 10)
  • 19: Debian 8.1, Debian repository 2.72 binary
  • 20: Debian 7, official Blender 2.71 binary
  • 21: Debian 8 in chroot running on top of Android 4.4, Debian repository Blender 2.72 binary (HTC One M8 running Cyanogenmod 11)
 
Dec 30, 2004
12,553
2
76
AMD's turbo implementation is potentially significantly lacking

I have filed a support request directly with the Asus engineers to verify it's not just my problem or a BIOS issue with my FX8310.

I recommend using AmdMsrTweaker to manually set P-state 'P1' (officially Pb1's MSR registers) to Pb0's (P0's) voltage and FID, to make sure we've got the right bench.

For single-threaded benchmarking this should let you hit the max 4.1 GHz while staying within the TDP (i.e., avoid throttling).
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Haven't seen any issues with turbo on my FX 8350. 8 threads it runs at 4.1 GHz. 1 thread it runs at 4.2 GHz.
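If anyone wants to check what their own cores are doing during a run, here is a rough Python sketch that reads the per-core clock from the Linux cpufreq sysfs files. It assumes a Linux system that exposes `scaling_cur_freq` (values in kHz); some kernels, VMs and chroots don't, in which case it just prints nothing. The function names are my own.

```python
from pathlib import Path


def khz_to_ghz(text: str) -> float:
    """Convert a sysfs frequency string in kHz (e.g. '4200000\\n') to GHz."""
    return int(text.strip()) / 1_000_000


def current_clocks(root: str = "/sys/devices/system/cpu") -> dict:
    """Return {cpu name: clock in GHz} for every core that exposes cpufreq."""
    clocks = {}
    for path in sorted(Path(root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        # path is .../cpuN/cpufreq/scaling_cur_freq, so the core name is two levels up
        clocks[path.parent.parent.name] = khz_to_ghz(path.read_text())
    return clocks


if __name__ == "__main__":
    for cpu, ghz in current_clocks().items():
        print(f"{cpu}: {ghz:.2f} GHz")
```

Run it a few times while the render is going, once with 1 thread and once with all threads, to see whether the clock matches what you expect from turbo.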
 

greatnoob

Senior member
Jan 6, 2014
968
395
136
Thanks for posting your results. I used Excel to create a line graph from your data to make it more readable:

 
Dec 30, 2004
12,553
2
76
What do you have in mind?

just games in general. There's been a lot of discussion about the long-term potential of the FX platform, and I and several others think some 'funny business' is going on that leads to the lowest-end i3s performing competitively with the FX-8xxx series even in heavily multithreaded games like Battlefield 4.

My personal opinion is that, considering my FX-8310 gets ~220 fps in the Handbrake BBB Android mp4 encode and the i3s get ~80 fps, there's some unfair optimization going on in the BF4 engine and in other games; something along the lines of this, where Ars found that writing GENUINEINTEL to the CPUID registers of the Via Nano improved performance, due to 'unfair' optimizations by the Intel compiler, which ignored existing SIMD instruction support in the processor simply on the basis of whether the processor reads GENUINEINTEL or not.




my point being that based on your results the FX architecture is not nearly as bad as some would have us believe.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Indeed, armel doesn't seem to use hardware FP at all...

Last time I tested it, it does actually use the hardware FP. Unfortunately it doesn't pass FP in registers between OS functions and there's apparently a huge speed penalty for doing things that way.
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,358
136
Running nm -Du on the blender binary from Debian armel package shows things like this:
Code:
         U __aeabi_dadd
         U __aeabi_dcmpeq
         U __aeabi_dcmpge
         U __aeabi_dcmpgt
         U __aeabi_dcmple
         U __aeabi_dcmplt
         U __aeabi_dcmpun
         U __aeabi_ddiv
         U __aeabi_dmul
         U __aeabi_dsub
         U __aeabi_f2d
         U __aeabi_f2iz
         U __aeabi_f2lz
         U __aeabi_f2uiz
         U __aeabi_fadd
         U __aeabi_fcmpeq
         U __aeabi_fcmpge
         U __aeabi_fcmpgt
         U __aeabi_fcmple
         U __aeabi_fcmplt
         U __aeabi_fdiv
         U __aeabi_fmul
         U __aeabi_fsub
This means the executable will use software simulated FP D:
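To make the same check less manual, here is a small Python sketch that scans `nm -Du` output for the float/double helper routines listed above. The regular expression only covers the families of names shown in this post (add/sub/mul/div, comparisons, conversions from float or double), so treat it as a non-exhaustive assumption rather than a full list of EABI helpers. Finding them suggests soft-float code paths; it does not by itself prove every FP operation goes through them.

```python
import re

# Float (f) and double (d) helpers: arithmetic, comparisons, and conversions like f2d
AEABI_FP = re.compile(r"__aeabi_[df](?:add|sub|mul|div|cmp\w+|2\w+)")


def soft_float_helpers(nm_output: str) -> list[str]:
    """Return undefined (U) symbols that look like EABI soft-float helpers."""
    helpers = []
    for line in nm_output.splitlines():
        parts = line.split()
        # nm prints "<type> <name>" for undefined symbols (no address column)
        if len(parts) >= 2 and parts[-2] == "U" and AEABI_FP.fullmatch(parts[-1]):
            helpers.append(parts[-1])
    return helpers


sample = """\
         U __aeabi_dadd
         U __aeabi_fmul
         U __aeabi_f2uiz
         U malloc
"""
print(soft_float_helpers(sample))
```

In practice you would feed it the real output, e.g. by piping `nm -Du` on the binary into a file and reading that file.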
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,358
136
the 8350 is a throughput monster, come on amd moar coars now!
This is per module. It would be interesting to compare against Intel CPU with HT.
EDIT : look at 3317U result.
EDIT2 : misread the graph as point/core/GHz, sorry
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Ouch! That is awful. However, I did compile Povray before using softfp (-mfloat-abi=softfp, i.e. it uses the floating-point hardware but doesn't pass values in floating-point registers) and it was still abysmally slow. This was about two years ago and I never looked back. armel is just awful.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
just games in general. There's been a lot of discussion about the long-term potential of the FX platform, and I and several others think some 'funny business' is going on that leads to the lowest-end i3s performing competitively with the FX-8xxx series even in heavily multithreaded games like Battlefield 4.

My personal opinion is that, considering my FX-8310 gets ~220 fps in the Handbrake BBB Android mp4 encode and the i3s get ~80 fps, there's some unfair optimization going on in the BF4 engine and in other games; something along the lines of this, where Ars found that writing GENUINEINTEL to the CPUID registers of the Via Nano improved performance, due to 'unfair' optimizations by the Intel compiler, which ignored existing SIMD instruction support in the processor simply on the basis of whether the processor reads GENUINEINTEL or not.




my point being that based on your results the FX architecture is not nearly as bad as some would have us believe.

That's nearly 10 years ago. So I just tested Povray performance using icc 14 vs. gcc 4.8 on Ubuntu 14.04 on the FX 8350 and the results are below. I'll try to compile Blender with icc, but I'm having trouble even setting compiler flags for compiling Blender.

Code:
gcc -march=bdver2:             1985 pps
gcc -march=barcelona:          1692 pps
gcc -march=nocona:             1692 pps

icc -march=corei7-avx:         1789 pps
icc -march=corei7-avx-i:       1789 pps
icc -axCORE-AVX-I:             1789 pps
icc -march=corei7:             1692 pps
icc -march=core2:              1692 pps
icc -march=corei7-avx2:        segmentation fault

Looks like gcc beats icc on the FX 8350. Of note, icc's -ax switch does dispatching, and it appears that the corei7-avx-i path is taken rather than the core2 path (this is the opposite of what people, myself included, found several years ago). And, of course, Piledriver doesn't have AVX2, thus the seg fault.
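Since the AVX2 build only fails once it hits an unsupported instruction, it can help to check the CPU flags before picking a -march value. A minimal Python sketch, assuming x86 Linux where /proc/cpuinfo has a `flags` line (ARM kernels use `Features` instead, which this does not handle); the helper names are invented for this example.

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports(cpuinfo_text: str, feature: str) -> bool:
    """True if the named flag (e.g. 'avx2') is listed for the CPU."""
    return feature in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            text = fh.read()
    except OSError:
        text = ""  # not Linux, or /proc not mounted
    for feature in ("avx", "avx2", "fma4"):
        print(f"{feature}: {supports(text, feature)}")
```

On a Piledriver chip I would expect `avx` and `fma4` to show up and `avx2` not to, which matches the seg fault above.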
 
Apr 20, 2008
10,067
990
126
that's comforting to know, maybe I will indeed get a bios update out of this. what mobo?

It was very rare that I had only one intense thread making my turbo hit 4.2 GHz. 95%+ of the time it was 4.1 GHz. I just decided to disable turbo and set the clock to 4.2 GHz. Biostar TA-970 for me.
 

greatnoob

Senior member
Jan 6, 2014
968
395
136
I realised normalised readings don't actually show relative performance gains so these graphs should be much more readable and hopefully easier to understand:





The Exynos build was most likely compiled without any sort of feature set identification meaning VFP/NEON optimisations were left out. Every other processor on the list had some sort of SSE or AVX flag set when the binaries were being compiled.
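One way to put a number on the relative gain is to compare jhu's custom-compiled figure against the pre-compiled one for the same chip. A quick Python sketch, using values copied from the two tables in the first post; the `speedup` helper is just illustrative, and the comparison is only rough since some pre-compiled results are from Blender 2.72 rather than 2.71.

```python
def speedup(custom: float, prebuilt: float) -> float:
    """Ratio of custom-compiled to pre-compiled throughput (samples/s)."""
    return custom / prebuilt


# chip: (custom-compiled samples/s, pre-compiled samples/s), from the first post
pairs = {
    "Core i5 3317U, 2C/4T": (368928, 324000),
    "Core i7 2600, 4C/8T": (1032258, 905660),
    "FX 8350, 4M/8T": (962942, 898206),
    "Exynos 5250, 2C/2T": (65434, 54284),
}

for name, (custom, prebuilt) in pairs.items():
    gain = 100 * (speedup(custom, prebuilt) - 1)
    print(f"{name:24s} {gain:+.1f}% from custom compile")
```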
 
Dec 30, 2004
12,553
2
76
That's nearly 10 years ago. So I just tested Povray performance using icc 14 vs. gcc 4.8 on Ubuntu 14.04 on the FX 8350 and the results are below.

it just doesn't make sense for AMD to do so wonderfully well on normalized benchmarks

there's no guarantee it would show up on ICC, previously it was Intel's math libraries on Windows binaries.
 
Dec 30, 2004
12,553
2
76
Looks like gcc beats icc on the FX 8350. Of note, icc's -ax switch does dispatching, and it appears that the corei7-avx-i path is taken rather than the core2 path (this is the opposite of what people, myself included, found several years ago). And, of course, Piledriver doesn't have AVX2, thus the seg fault.

wouldn't it need to be gcc vs icc on an Intel chip, compared to the same on FX-8350?
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
That would be a different type of comparison. The Povray compilation via ICC and GCC is just to show which one is better on FX. Still, I haven't tried the Open64 compiler that AMD is supporting (I don't know why they don't just funnel that support into GCC and LLVM instead). I also haven't tested Intel's MKL libraries, since I'm more interested in rendering performance.
 

monstercameron

Diamond Member
Feb 12, 2013
3,818
1
0
AMD hasn't updated Open64 in a while. They are all in on GCC, as bdver4 (Excavator) support is already in there.
 