einstein@home: "All-Sky Gravitational Wave search on O3 (O3AS)" extremely inefficient

StefanR5R

Elite Member
I haven't been running einstein@home since they stopped "Gamma-ray pulsar binary search #1 on GPUs (FGRPB1G)". So I need to find my way around the other GPU applications now.

At first try, I received O3AS work. It started out using an RTX 4090 at less than 50% SM utilization and less than 150 W board power. I was about to write myself an app_config.xml which would launch 2 tasks concurrently per GPU. But at about half of its completion percentage, the task ceased to use the GPU altogether! It went on to use 100% of one CPU thread instead.
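For reference, the app_config.xml I had in mind would look roughly like the minimal sketch below. The short app name is a guess on my part; the actual name has to be taken from client_state.xml or the project's application list on your own host.

<app_config>
  <app>
    <name>einstein_O3AS</name>        <!-- assumed short app name, verify locally -->
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>      <!-- 2 tasks per GPU -->
      <cpu_usage>1.0</cpu_usage>      <!-- 1 CPU thread reserved per task -->
    </gpu_versions>
  </app>
</app_config>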

Dammit! How hard are these projects trying to prevent me from keeping my apartment warm?

Trying "Binary Radio Pulsar Search (MeerKAT) (BRP7)" next.
 

StefanR5R

Elite Member
BRP7 does use the GPU almost all of the time, at least. But its heating characteristics are lacking too.

1 task per GPU: 94% SM utilization, 215…235 W power draw
2 tasks per GPU: 100% SM utilization, 210…225 W power draw

Edit: …which isn't a lot for an RTX 4090, obviously.
Gonna have to try Petri's BRP7 app.

Edit 2: with Petri's app,
1 task per GPU: 100% SM utilization, 235…255 W power draw
 

mmonnin03

Senior member
Yes, the O3AS app sucks.
BRP7 should only be run with the custom app. 1x is faster than 2x, but I haven't seen those CUDA options.
app_config.xml > E@H's site implementation
 

gsrcrxsi

Member
The O3AS app has gone through a few iterations. I think as of this post, you were already trying the latest one, but for a little background and context I'll try to explain the behavior and best practices.

After the application had been solidified, it was basically running equal time chunks for the GPU and CPU portions. For example, on a single task, a 2080 Ti was taking about 8 minutes crunching on the GPU, then another 8 minutes on the CPU for the final sort/search portion. You could essentially double your throughput by staggering the start of a second task so that the GPU part of the second task ran during the CPU portion of the first task. And you might be able to negate the drift over time by running more multiples (if you had the VRAM for it; each task took about 4 GB).

Now it's a little more complicated. The developers decided to essentially split each task into two halves: instead of running 1x 2 Hz searches, they now run 2x 1 Hz searches. They did this primarily to lower VRAM use and allow lower-end GPUs to contribute, in an effort to increase overall participation. The tasks now only need about 2 GB of VRAM, but the behavior is now GPU->CPU->GPU->CPU, so the timing of multiples might be more complicated. With something like a 4090 and 24 GB of VRAM, I'd probably try some high number of multiples like 4-8x with randomly staggered starts to see if you can combat the low GPU utilization from the CPU-only phases, as well as the random drift that will otherwise lead to undesirable start times and tasks lining up.
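If you want to automate that staggering, a rough bash sketch could look like the one below: allow the multiplicity via app_config.xml, suspend the fetched O3AS tasks, then resume them one at a time with a delay. The project URL, the O3AS name filter, and the delay value are assumptions to check against your own host.

#!/bin/bash
# Sketch: stagger the start of several O3AS tasks by resuming them one at a time.
# Assumes app_config.xml already allows the desired multiplicity and that the
# tasks start out suspended (e.g. via boinccmd --task ... suspend).
PROJ="https://einsteinathome.org/"    # assumed master URL; check boinccmd --get_project_status
DELAY=240                             # seconds between starts, tune to the GPU chunk length

mapfile -t TASKS < <(boinccmd --get_tasks | awk '/^ *name: .*O3AS/ {print $2}')
for t in "${TASKS[@]}"; do
    boinccmd --task "$PROJ" "$t" resume
    sleep "$DELAY"
done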

But in any case, Petri's BRP7 will probably still give more points and be easier to manage. 1x without MPS might be fastest (it has been on my cards, but I don't have something as powerful as a 4090). Or multiples with MPS; I like 3x with 40% active thread percentage on all my cards.
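For reference, my MPS setup is roughly the following; the directories and the 40% value are just my choices, and the service name for restarting the BOINC client depends on the distro.

# start the CUDA MPS control daemon and set a default active thread percentage
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d
echo "set_default_active_thread_percentage 40" | nvidia-cuda-mps-control

# restart the BOINC client afterwards so the science apps attach to the MPS server, e.g.:
# sudo systemctl restart boinc-client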
 

StefanR5R

Elite Member
Thanks for the insights! Some time after my post, I too noticed that O3AS isn't just switching from GPU to CPU once, but makes at least (or, according to your info, exactly) one more switch back to the GPU and another back to the CPU.

The space heater in question consists of two 4090s, driven by a 4-core/8-thread Kaby Lake CPU. The throughput gain from Hyper-Threading isn't very good, but maybe it would be reasonable to run a GPU feeder on one hyperthread and a CPU worker on the paired hyperthread. Given this, the following setup could be investigated:

GPU0, CPU0 ----> task1 using GPU _ CPU _ GPU _ CPU
GPU0, CPU4 ---------------> task2 using GPU _ CPU _ GPU _ CPU
GPU0, CPU1 ----> task3 using GPU _ CPU _ GPU _ CPU
GPU0, CPU5 ---------------> task4 using GPU _ CPU _ GPU _ CPU

GPU1, CPU2 ----> task5 using GPU _ CPU _ GPU _ CPU
GPU1, CPU6 ---------------> task6 using GPU _ CPU _ GPU _ CPU
GPU1, CPU3 ----> task7 using GPU _ CPU _ GPU _ CPU
GPU1, CPU7 ---------------> task8 using GPU _ CPU _ GPU _ CPU

Obviously, timing would be crucial, and it would depend on a suitable balance of GPU vs. CPU time within tasks as well as on consistency between different workunits. If necessary, a background script could monitor GPU utilization and suspend/resume individual tasks, as in the sketch below.
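A very rough sketch of such a watchdog, with the project URL and the O3AS name filter as assumptions, and without mapping tasks to a specific GPU yet:

#!/bin/bash
# If the least-busy GPU sits below a utilization threshold (its tasks are
# presumably all in their CPU phase), resume the first O3AS task reported by
# boinccmd. A real script would also check which tasks are suspended and
# which GPU each task is assigned to.
PROJ="https://einsteinathome.org/"
THRESHOLD=50    # % SM utilization

while sleep 30; do
    util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | sort -n | head -n1)
    if [ "$util" -lt "$THRESHOLD" ]; then
        task=$(boinccmd --get_tasks | awk '/^ *name: .*O3AS/ {print $2; exit}')
        [ -n "$task" ] && boinccmd --task "$PROJ" "$task" resume
    fi
done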

However, my actual goal on Wednesday was — and still is — to let this particular computer put out a certain heat flux consistently, doing some good science along the way, and accomplish this all with a minimum of hand-holding.

Instead of this dual-GPU space heater, I could have deployed one of my dual-CPU space heaters. However, those were not as easy to move to the other room at this time. I think I should work on making some more of my dual-socket computers more mobile than they currently are.
 

gsrcrxsi

Member
Also, the BRP7 app seems to be heavily memory-bandwidth limited, which explains why a 4090 isn't massively faster than a 3090/3080 Ti. It's faster, but not by a lot, and not by the amount you'd expect.
 

gsrcrxsi

Member
However, my actual goal on Wednesday was — and still is — to let this particular computer put out a certain heat flux consistently, doing some good science along the way, and accomplish this all with a minimum of hand-holding.

Yeah, I moved one of my systems from the garage to the living space to add heat for the winter and try to reduce natural gas use.
 

StefanR5R

Elite Member
Also, the BRP7 app seems to be heavily memory-bandwidth limited, which explains why a 4090 isn't massively faster than a 3090/3080 Ti. It's faster, but not by a lot, and not by the amount you'd expect.
I never looked into tuning my GPUs beyond reducing their power limits and, in the case of air-cooled GPUs, increasing their fan speeds. I wonder if the memory clocks on the 4090 could be cranked up from the default for use cases such as BRP7.
 

gsrcrxsi

Member
It might help marginally, but watch for memory errors if you push it too far. I pretty much only run the memory clocks at whatever offset gets them to P0 clocks: for Ampere GDDR6X that's +500, for Turing GDDR6 that's +400. Not sure how much the 4090 is penalized in P2 vs. P0.
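On Linux, that kind of offset is usually applied with nvidia-settings (it needs a running X server and the Coolbits option enabled). The attribute name, the performance-level index, and the values below are only examples, not 4090-specific recommendations, and different tools scale the offset differently, so check what your driver exposes before copying numbers across.

# apply a memory transfer-rate offset to GPU 0 at the highest performance level
nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[3]=800"
# newer drivers also expose an all-levels variant:
# nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffsetAllPerformanceLevels=800"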
 

StefanR5R

Elite Member
However, my actual goal on Wednesday was — and still is — to let this particular computer put out a certain heat flux consistently, doing some good science along the way, and accomplish this all with a minimum of hand-holding.
PS, putting out a lot of heat consistently and without any hand-holding at all — using GPUs — would obviously have been accomplished if I had chosen PrimeGrid. In these respects, number theory handily beats molecular simulations (Folding@Home) or signal processing (Einstein@Home). But (a) I already have quite a lot of PrimeGrid points, (b) none of their conjecture projects use GPUs, (c) their sieving projects have almost accomplished their goals for now (as has the Arithmetic Progressions project), and (d) what's left at PrimeGrid for GPUs are the pure Big-Prime-Finding projects, which I am not very much into.
 