Interesting to know, what do you mean by "project horizontal scalability"?
Also, it may be that the bottleneck right now is the rate at which they can produce WUs - at least the LinusTech video mentioned that.
There are 2 types of performance scaling that we test - horizontal and vertical. Let's look at one example of each, based on processing 10 jobs that, in the current setup, each take 1 hour to complete. If it helps, you can think of them as 10x 1 hour WUs for Folding@home.
The first is the one most people are familiar with - vertical scalability. We take this set of 10x 1 hour jobs, and we beef up the computer running them - say we double the clock speed and go from a 2GHz CPU to a 4GHz CPU. Ideally, each job will now take 30 minutes, and the whole set 5 hours to complete. If someone suddenly sends us 20 work units and we have vertically scaled by 2x, we should be able to complete those 20 work units in the original 10 hours. The problem with vertical scaling is that computers only get so big - you might be able to go from a 2GHz CPU to a 4GHz CPU, but there are no 8GHz CPUs. Depending on the workload, you can run into other bottlenecks in a system as well: you can go from 256GB of RAM to 512GB, but 512GB to 1TB is a lot harder; or from 50TB of storage to 100TB, but maybe not to 200TB, and so on.
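To make the arithmetic concrete, here's a toy model of vertical scaling - job time shrinks as clock speed grows, but total time still depends on how many jobs one machine has to chew through sequentially. All numbers are illustrative, taken from the example above, not real benchmarks.

```python
def total_hours(num_jobs, hours_per_job_at_2ghz, clock_ghz):
    """Sequential processing on one server; per-job time scales inversely with clock speed."""
    hours_per_job = hours_per_job_at_2ghz * (2.0 / clock_ghz)
    return num_jobs * hours_per_job

print(total_hours(10, 1.0, 2.0))  # baseline: 10 jobs x 1 hour = 10.0 hours
print(total_hours(10, 1.0, 4.0))  # doubled clock: 5.0 hours
print(total_hours(20, 1.0, 4.0))  # doubled load AND doubled clock: back to 10.0 hours
```

The last line is the key point - doubling the hardware only buys you a doubling of load before you're right back where you started, and there's no 8GHz CPU to double again.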
The second is horizontal scalability, which is the ideal we strive for in cloud applications, because we typically have lots of identical servers to throw at a problem. Given our 2GHz server working through its 10 work units, if we add a second identical server, can we process 20 work units in the same time? Can we add 4 servers and process 40? Can we add 10 servers and do 100? In a well-designed cloud application, we should be able to add as many servers as necessary to sustain any load, and do it dynamically. If our current workload is 15 work units, we should have 2 servers. If 2 hours later we have 100 work units, we should be able to immediately add 8 more servers and process the load. If, 2 hours after that, our inbound work units drop to 6, we should be able to remove 9 servers without any issues.
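The autoscaling rule above boils down to a one-liner: divide the load by per-server capacity and round up. A minimal sketch, assuming the 10-WUs-per-server capacity implied by the example figures:

```python
import math

def servers_needed(work_units, capacity_per_server=10):
    """How many identical servers does the current load require? Never scale below 1."""
    return max(1, math.ceil(work_units / capacity_per_server))

for load in (15, 100, 6):
    print(load, "WUs ->", servers_needed(load), "servers")
# 15 -> 2 servers, 100 -> 10 servers, 6 -> 1 server
```

Real autoscalers (e.g. cloud provider auto scaling groups) layer cooldowns and headroom on top of this, but the core sizing decision is the same ceiling division.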
Now, despite working for a Fortune 100 company, they don't give me the resources to test at 20x and 50x load (200 and 500 work units), so what we do instead is measure the whole application while going from 10 work units to 100, and then try to deduce whether any part of the application will fail under additional load. Maybe our application depends on a database that we can't horizontally scale. If we use 5% of the DB's capacity at 10 work units and 50% at 100 work units, we can project that the current application will sustain about 200 work units before the DB prevents adding more. In a case like this, we know the DB will eventually be a problem, but since we are unlikely to get 20x more load overnight, we will have time to address the DB issue when needed.
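That projection is just a back-of-envelope linear extrapolation: fit a line through the two measured utilization points and solve for where it hits 100%. The 5%/50% figures come from the example, and linear scaling of DB utilization with load is an assumption - real databases often degrade non-linearly well before 100%.

```python
def projected_ceiling(load1, util1, load2, util2):
    """Work units at which a linearly-scaling resource reaches 100% utilization."""
    slope = (util2 - util1) / (load2 - load1)  # utilization fraction per work unit
    intercept = util1 - slope * load1          # utilization at zero load
    return (1.0 - intercept) / slope           # solve slope*x + intercept = 1.0

print(projected_ceiling(10, 0.05, 100, 0.50))  # -> 200.0 work units
```

In practice you'd want more than two data points and some safety margin, but even this crude estimate tells you which component to fix first and roughly how much runway you have.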