cellarnoise
Senior member, Mar 22, 2017
How is ~8 hours for the type of units in the upcoming contest?
PrimeGrid 9.51 Generalized Cullen/Woodall (LLR) (mt) llrGCW_662112230_0 02:03:44 (15:55:12) 96.49 25.935 05:52:17 09d,21:56:08 8C Running Turin
> Closer to 8.6 hours average, that is with lasso in 8 x 8 config.

As far as I can tell from merely 8 tasks which are running here on parsnip, the workunit sizes of current GCW tasks vary a lot:
cd boinc/; grep FFT slots/*/stderr.txt
slots/0/stderr.txt:Using zero-padded AVX-512 FFT length 3200K, Pass1=1K, Pass2=3200, clm=1, 8 threads.
slots/1/stderr.txt:Using zero-padded AVX-512 FFT length 2880K, Pass1=192, Pass2=15K, clm=4, 8 threads.
slots/2/stderr.txt:Using zero-padded AVX-512 FFT length 3200K, Pass1=1K, Pass2=3200, clm=1, 8 threads.
slots/3/stderr.txt:Using zero-padded AVX-512 FFT length 3200K, Pass1=1K, Pass2=3200, clm=1, 8 threads.
slots/4/stderr.txt:Using zero-padded AVX-512 FFT length 3M, Pass1=192, Pass2=16K, clm=2, 8 threads.
slots/5/stderr.txt:Using zero-padded AVX-512 FFT length 2880K, Pass1=192, Pass2=15K, clm=4, 8 threads.
slots/6/stderr.txt:Using zero-padded AVX-512 FFT length 3456K, Pass1=192, Pass2=18K, clm=4, 8 threads.
slots/7/stderr.txt:Using zero-padded AVX-512 FFT length 3200K, Pass1=1K, Pass2=3200, clm=1, 8 threads.
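For a quick overview of the spread, a throwaway tally along these lines works too (a minimal Python sketch; it assumes the same slot layout as the grep above and only parses the "FFT length …" part of those stderr.txt lines):

import glob
import re
from collections import Counter

# Tally the FFT lengths reported in each slot's stderr.txt.
counts = Counter()
for path in glob.glob("slots/*/stderr.txt"):
    with open(path) as f:
        for line in f:
            m = re.search(r"FFT length (\S+?),", line)
            if m:
                counts[m.group(1)] += 1

for fft, n in sorted(counts.items()):
    print(f"{fft}: {n} task(s)")

For the eight slots above, that would report four distinct FFT lengths (2880K, 3M, 3200K, 3456K), i.e. quite a spread.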
> Out of curiosity, is anyone else who runs PrimeGrid on Windows (rather than on Linux) also using Process Lasso? And if so, does it offer an explicit "bind to last level cache domains" setting, or does the user have to come up with something else which yields the same effect?

Anybody? No?

I am still a fan of tools which are made to do specifically what they are meant to achieve. :-) A reminder of the most prominent alternatives on Windows:
- EPYCs and Threadrippers can be set to 1 CCX = 1 NUMA domain in the BIOS. Caveat: I don't know how effective Windows' NUMA handling is; this method works nicely with Linux at least.
- Computers with a reasonably low number of CCXs could run one BOINC client instance per CCX, with a CPU affinity defined for each BOINC client process (see the sketch after this list).
- Pavel Atnashev's AffinityWatcher (github link)
- pschoefer's PowerShell script (SG forum, 2022 season thread, #535), perhaps with xii5ku's extras (SG forum, 2023 season thread, #485)

If I had a Windows computer at PrimeGrid or other projects which profit from CPU affinities, I would strongly gravitate toward the last option in this list, somehow. :-)
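To illustrate the second option in the list: a minimal sketch, assuming 8 CCXs with 8 physical cores each, SMT siblings numbered adjacently (core k -> logical CPUs 2k and 2k+1), and one already-running BOINC client instance per CCX. The PIDs are made up for the example; only psutil's cpu_affinity() is a real call.

import psutil

# Hypothetical PIDs of eight separately started BOINC client instances,
# one intended per CCX. Replace with the real PIDs on your system.
BOINC_PIDS = [1111, 2222, 3333, 4444, 5555, 6666, 7777, 8888]

CORES_PER_CCX = 8
LOGICAL_PER_CCX = 2 * CORES_PER_CCX   # SMT: two logical CPUs per core

for ccx, pid in enumerate(BOINC_PIDS):
    base = ccx * LOGICAL_PER_CCX
    # One logical CPU per physical core within this CCX: base+0, base+2, ...
    cpu_set = [base + 2 * c for c in range(CORES_PER_CCX)]
    psutil.Process(pid).cpu_affinity(cpu_set)
    print(f"client {pid} -> CCX {ccx}: {cpu_set}")

Pinning the client process itself is enough because child processes (the science apps the client launches) inherit the parent's affinity mask, on Windows as well as on Linux.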
> Well, since I am using simple settings, here is the Turin lasso screen. It's 400 watts TDP, that's all I know about power, no meters on any boxes right now. Since the tasks are so similar in elapsed time, I can only assume lasso is working as intended.

The topmost CPU utilization graph looks good insofar as all physical cores are running one software thread each. The process list below it is looking both good…
…and bad at the same time:
- The CPU sets should follow the formula
[ (0;2;4;6;8;10;12;14) + {0 or 16 or 32 or 48 or 64 or 80 or 96 or 112} ].
Only then would each of these eight CPU sets correspond to exactly one CCX, at least according to my understanding of how Windows numbers the logical CPUs. The CPU affinity lists shown in your screenshot do follow this formula (the sketch further below spells these sets out).
It seems like Process Lasso is doing the desired job only partially. Edit: Or Process Lasso has some sort of UI bug in which it displays CPU affinities of the first NUMA node only, not of the second. After all, the topmost CPU utilization graph clearly shows that both NUMA nodes are loaded.
- The CPU sets to which the PRST processes are bound each occur twice. It would be best if each PRST process had an individual CPU set which differs from all the others.
One step which *may* help with that would be to reboot into the BIOS, go to "Advanced" --> "ACPI Settings" --> set "ACPI SRAT L3 Cache As NUMA Domain" to "Enabled", and boot back into Windows. After that, the bottom line of Process Lasso should say 8 NUMA nodes instead of the current 2. Each CCX will then be a NUMA node, and I hope that the NUMA node boundaries directly translate to process scheduling boundaries. (On Linux, the latter would be true by default without extra tools. As soft boundaries though, not as hard boundaries; but the overall performance effect would be close to as if they were hard boundaries.)
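For reference, spelled out, the formula from the first bullet gives these eight CPU sets (a trivial sketch of the same arithmetic):

# Expand [ (0;2;4;...;14) + {0, 16, 32, ..., 112} ]:
# one CPU set per CCX, one logical CPU per physical core.
for ccx in range(8):
    offset = 16 * ccx
    print(f"CCX {ccx}: {[c + offset for c in range(0, 16, 2)]}")

Ideally each of the eight PRST processes would be bound to exactly one of these sets, with each set used exactly once.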
Do you have the computer plugged into a power meter? Or is HWMonitor or similar software showing the CPU socket power consumption? If yes, scheduling optimizations would show up as lower power fluctuations and a higher steady power draw.
1 | markfw | TeAm AnandTech | 3 088 689.19 | 58 |
2 | Icecold | TeAm AnandTech | 2 012 475.08 | 40 |
3 | tng | Antarctic Crunchers | 1 767 318.29 | 38 |
4 | EA6LE | Romania | 1 126 570.67 | 21 |
5 | crashtech | TeAm AnandTech | 826 747.07 | 15 |
> How is ~8 hours for the type of units in the upcoming contest?
> PrimeGrid 9.51 Generalized Cullen/Woodall (LLR) (mt) llrGCW_662112230_0 02:03:44 (15:55:12) 96.49 25.935 05:52:17 09d,21:56:08 8C Running Turin

8 hours? Ha! My Gold 6138 runs one task in 7 hours, using, uhm, all 20 cores. lol
I just ain't got the 'cache' to play with the big dogs.
edit: Just for giggles, Broadwell running 21 cores (AVX2) is coming in at about 9 hours. I expect to come in 2nd (from last) this challenge.

> 8 hours? Ha! My Gold 6138 runs one task in 7 hours, using, uhm, all 20 cores. lol

I thought the 3.5 GHz and AVX-512 (at half speed) was what gave my "herd" the power, but this must not use AVX-512 at all. Why the Turin does almost as well as the Genoa, I have no idea, since it runs a lot slower (3.5 vs 2.1).
> I thought the 3.5 GHz and AVX-512 (at half speed) was what gave my "herd" the power, but this must not use AVX-512 at all. …

It heavily uses AVX-512. A 9950X is significantly faster on these tasks than a 7950X would be, for example.
> I thought the 3.5 GHz and AVX-512 (at half speed) was what gave my "herd" the power, but this must not use AVX-512 at all. …

Genoa has emulated AVX-512. It runs each AVX-512 instruction in two parts, which is only slightly better than no AVX-512 at all. Turin has real AVX-512. That's why it's almost as fast as the Genoa despite the lower clock.
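Back-of-the-envelope, using the clocks quoted above and modeling "two parts" as half the 512-bit throughput per cycle (a rough sketch which ignores memory bandwidth and everything else):

# Relative per-core 512-bit SIMD throughput ~ clock (GHz) * ops per cycle.
genoa = 3.5 * 0.5   # double-pumped: one 512-bit op per two cycles
turin = 2.1 * 1.0   # native 512-bit datapath
print(f"Genoa ~ {genoa}, Turin ~ {turin}")   # Genoa ~ 1.75, Turin ~ 2.1

By that crude measure the 2.1 GHz Turin even comes out slightly ahead per core, which fits it keeping up despite the clock deficit.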
> Genoa has emulated AVX-512. It runs each AVX-512 instruction in two parts, which is only slightly better than no AVX-512 at all. …

I guess my impression is based on my experience, in that my Milan was WAY overpowered by Genoa, and Rome was also way down there. I wish I could afford a non-ES Turin, as they are faster than Genoa. And unlike in some previous competitions, my Turin is running way below 95C and 100C (CPU and VRM temps), at 60 and 65 respectively, so that's not the reason for the speed. The below pic is fully loaded with 8x8 tasks of the current competition.
> I guess my impression is based on my experience, in that my Milan was WAY overpowered by Genoa, and Rome was also way down there. …

As far as PrimeGrid LLR/LLR2/PRST/Genefer-CPU are concerned, this is what the Zen generations brought to the table:
> except the Turin runs 2.1

Ah, I missed this. Do you remember what clocks this Turin reaches when it runs something like WCG MCM? I am asking because, in a power-limited scenario as with all mid- to high-core-count EPYCs, if PrimeGrid is not "lassoed" optimally, the CPU core clocks are higher than if properly optimized. Not optimized: cores wait for RAM a lot, don't spend much energy, and are hence automatically clocked higher. Optimized: cores spend a lot of energy quickly due to heavy SIMD utilization, and are therefore clocked lower compared to more generic program code.
> Ah, I missed this. Do you remember what clocks this Turin reaches when it runs something like WCG MCM? …

I rarely have the Turin turned on, to save power in total; mostly only the 7950X and 9950X. I can't remember what speed it runs at in WCG.
> I expect to come in 2nd (from last) this challenge.

I'm surprised you're behind me. Bumblebee's doing very well in this challenge; my new hex-core Zen 4 isn't much faster.
> I kinda wish I had my own property here in UAE with a solar panel array. The one thing you can almost bet on having every single day here is sunshine. Lots and lots of it!
> And maybe a thermoelectric generator to convert the heat from the sun into electricity.
> Now if I only had a few million lying around here somewhere, I could get started on such a project

Here's a start:
> It may not be available in your area quite that way by electrical code, though.

Yeah, haven't seen that anywhere here. It's crazy how open-minded and future-interested Europeans are, though. If they had a population approaching that of China+Russia+India, the European Union would've already established a moonbase by now and would be working on starting a Mars colony.
Day 2 stats:
Rank___Credits____Username
1______22264004___Icecold
2______21656564___markfw
6______9321612____crashtech
7______7503827____cellarnoise2
11_____4695133____w a h
18_____2681471____ChelseaOilman
36_____1131428____Orange Kid
91_____463135_____waffleironhead
103____392114_____Ken_g6
105____389991_____mmonnin
107____385859_____johnnevermind
161____120644___10esseeTony
248____864________[TA]Skillz
Rank__Credits____Team
1_____71006652___TeAm AnandTech
2_____24455062___Czech National Team
3_____18396609___[H]ard|OCP
4_____14886156___SETI.Germany
I don't see a "deep sheet" on this list? Guess we will find out.

Somebody, who shall remain nameless (but who is currently in last place on our team), issued a challenge on the PrimeGrid Discord, and now they're trying to get everybody else to join one team to challenge us. I guess we'll see what happens.