AOA Forums

Gizmo 10th September, 2009 08:31 PM

Problems returning WUs?
 
Is anyone else having problems returning WUs on their Linux SMP cores?

This is the second time in the space of a week that I've had problems returning a WU. The first time I just deleted the work directory and started again. But it's happened again, and this is getting damned annoying:

For some reason, when I shut my machine down and then restart it, I get an error indicating that the thing isn't able to resume the WU, and then when it tries to contact the work server, it can't.

ThunderRd 11th September, 2009 02:59 AM

First thing I normally do if something like that happens is try to connect to the work server in my browser. If I get an "OK", I know the problem is on my end. If I don't, I check here:
Folding@Home server status

I checked the forum; the only recent complaints about server downtime were about GPU servers.
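A scripted version of that browser check, as a minimal sketch: it assumes the server answers a plain GET with a short body reading "OK" (as described above) and treats any network error as "not OK". The function names are illustrative, not part of the Folding@Home client.

```python
import urllib.request


def response_says_ok(body: str) -> bool:
    """True if the reply body is the bare 'OK' a browser would show (assumed format)."""
    return body.strip().upper() == "OK"


def check_server(url: str, timeout: float = 10.0) -> bool:
    """Fetch url and report whether it answered 'OK'; any network error counts as not OK."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Only the first few bytes matter for the heartbeat check.
            body = resp.read(64).decode("ascii", errors="replace")
    except OSError:  # URLError, HTTPError and timeouts are all OSError subclasses
        return False
    return response_says_ok(body)
```

For example, check_server("http://assign.stanford.edu:8080/") targets the assignment server that shows up later in this thread; running it as the same user the client runs as keeps the comparison fair.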

If the problem continues, try deleting the following from the fah directory:
-work
-all logfiles
-unitinfo.txt
-fahcore_n

Then restart the Deino service if you are running that client. Otherwise, make sure the mpiexec process and any running core processes are killed.

Let the client download a new WU (it will also download the current core file) and see if that helps.
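The cleanup above can be scripted. This is a sketch under assumptions: the glob patterns are a guess at a v6 Linux client layout (FAHlog.txt and FAHlog-Prev.txt for "all logfiles", FahCore_* for "fahcore_n"), so check them against your own directory, and stop the client plus any mpiexec/core processes before running it with dry_run=False.

```python
import shutil
from pathlib import Path

# Hypothetical patterns for the items listed above (assumed v6 Linux layout).
STALE_PATTERNS = ["work", "FAHlog*.txt", "unitinfo.txt", "FahCore_*"]


def find_stale(fah_dir: Path) -> list[Path]:
    """Return the paths in fah_dir matching the cleanup list, sorted for stable output."""
    found: set[Path] = set()
    for pattern in STALE_PATTERNS:
        found.update(fah_dir.glob(pattern))
    return sorted(found)


def clean(fah_dir: Path, dry_run: bool = True) -> list[Path]:
    """Delete the stale items (or, with dry_run, only list them). Stop the client first."""
    targets = find_stale(fah_dir)
    if not dry_run:
        for path in targets:
            if path.is_dir():
                shutil.rmtree(path)  # the work/ directory
            else:
                path.unlink()        # logs, unitinfo.txt, core binaries
    return targets
```

Running clean(Path("~/fah").expanduser()) first (path hypothetical) only prints what would go; client.cfg and queue.dat are left alone, matching the list above.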

Gizmo 15th September, 2009 09:35 PM

Ok, I'm having the same problem AGAIN! Here's some log output:

Quote:

Originally Posted by FAHlog.txt
[06:59:08] Completed 250000 out of 250000 steps (100%)
[06:59:10] DynamicWrapper: Finished Work Unit: sleep=10000
[06:59:20]
[06:59:20] Finished Work Unit:
[06:59:21] - Reading up to 21160944 from "work/wudata_05.trr": Read 21160944
[06:59:21] trr file hash check passed.
[06:59:21] - Reading up to 27659408 from "work/wudata_05.xtc": Read 27659408
[06:59:21] xtc file hash check passed.
[06:59:21] edr file hash check passed.
[06:59:21] logfile size: 190430
[06:59:21] Leaving Run
[06:59:24] - Writing 49160566 bytes of core data to disk...
[06:59:26] ... Done.
[06:59:36] - Shutting down core
[06:59:36]
[06:59:36] Folding@home Core Shutdown: FINISHED_UNIT
[07:02:46] CoreStatus = 64 (100)
[07:02:46] Sending work to server
[07:02:46] Project: 2677 (Run 26, Clone 78, Gen 48)
[07:02:46] + Attempting to send results [September 14 07:02:46 UTC]
[07:12:52] + Results successfully sent
[07:12:52] Thank you for your contribution to Folding@Home.
[07:12:52] + Number of Units Completed: 166
[07:12:57] - Preparing to get new work unit...
[07:12:57] + Attempting to get work packet
[07:12:57] - Connecting to assignment server
[07:12:57] + Could not connect to Assignment Server
[07:12:57] + Could not connect to Assignment Server 2
[07:12:57] + Couldn't get work instructions.
[07:12:57] - Attempt #1 to get work failed, and no other work to do.
Waiting before retry.
[07:13:10] + Attempting to get work packet
[07:13:10] - Connecting to assignment server
[07:13:10] + Could not connect to Assignment Server
[07:13:10] + Could not connect to Assignment Server 2
[07:13:10] + Couldn't get work instructions.
[07:13:10] - Attempt #2 to get work failed, and no other work to do.
Waiting before retry.
[07:13:30] + Attempting to get work packet
[07:13:30] - Connecting to assignment server
[07:13:30] + Could not connect to Assignment Server
[07:13:30] + Could not connect to Assignment Server 2
[07:13:30] + Couldn't get work instructions.
[07:13:30] - Attempt #3 to get work failed, and no other work to do.
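Reading that log, the upload itself succeeded ("+ Results successfully sent"); what fails is the request for new work from the assignment server. A small sketch that pulls those two facts out of a FAHlog.txt excerpt; the matched strings are copied from the log above, and the function name is illustrative.

```python
import re

SENT = "+ Results successfully sent"
FAILED = re.compile(r"Attempt #(\d+) to get work failed")


def summarize(log_text: str) -> dict:
    """Count successful uploads and failed work requests in a FAHlog excerpt."""
    sent = sum(1 for line in log_text.splitlines() if SENT in line)
    attempts = [int(m.group(1)) for m in FAILED.finditer(log_text)]
    return {
        "results_sent": sent,
        "get_work_failures": len(attempts),
        "last_attempt": max(attempts, default=0),
    }
```

On the excerpt above this reports one successful upload and three failed work requests, which is why the problem is getting work rather than returning it.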


ThunderRd 16th September, 2009 02:51 AM

Gizmo, let's start that client with the -verbosity 9 flag so we can see more info (mainly we want to see which server you're trying to connect to unsuccessfully). Everything is normal in that log otherwise.

If we know where it's calling home, we can check the server status page.
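With -verbosity 9, each attempt logs a "Connecting to <url>" line, and a failed attempt is followed immediately by "- Could not CosmHTTPOpen" (as in the log Gizmo posts next). A sketch that turns that into a list of failing host/port pairs to look up on the server status page; it assumes the failure line always comes right after its connect line.

```python
import re
from urllib.parse import urlsplit

# Matches e.g. "[01:57:53] Connecting to http://assign.stanford.edu:8080/"
CONNECT = re.compile(r"Connecting to (https?://\S+)")


def failed_servers(log_text: str) -> list[tuple[str, int]]:
    """(host, port) for each connect line immediately followed by a CosmHTTPOpen failure."""
    failures = []
    pending = None
    for line in log_text.splitlines():
        m = CONNECT.search(line)
        if m:
            pending = m.group(1)
            continue
        if pending and "Could not CosmHTTPOpen" in line:
            parts = urlsplit(pending)
            failures.append((parts.hostname, parts.port or 80))
        pending = None  # only the line right after a connect counts
    return failures
```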

Gizmo 16th September, 2009 03:02 AM

Quote:

Originally Posted by FAHlog.txt
[01:57:53] - Connecting to assignment server
[01:57:53] Connecting to http://assign.stanford.edu:8080/
[01:57:53] - Could not CosmHTTPOpen
[01:57:53] + Could not connect to Assignment Server
[01:57:53] Connecting to http://assign2.stanford.edu:80/
[01:57:53] - Could not CosmHTTPOpen
[01:57:53] + Could not connect to Assignment Server 2
[01:57:53] + Couldn't get work instructions.
[01:57:53] - Attempt #3 to get work failed, and no other work to do.


Ok, so when I open that server's URL in my browser, it comes back as 'Ok'. That raises the question:
"What the hell is going on?"

I'm obviously able to connect to the internet and download stuff, or I wouldn't be able to get the WUs to start with.

If I su to fah (my folding user) and wget the status page, that works as well, so the issue isn't user permissions.

ThunderRd 16th September, 2009 03:11 AM

I have seen the same thing occasionally, last time it turned out to be a problem with the ISP and a transparent proxy blocking the transfers somehow. IDK if that is the problem here or not. Can you resolve that address to an IP xxx.xxx.xxx.xxx? Then we can check the server page. Sometimes, even though you get OK status, the server is loaded heavily or it is in "reject" status and waiting is the only answer. Or the server may be in "accept" status for some reason, and it isn't assigning ATM. Again, waiting it out is the answer.
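Resolving the name the way most programs do: Python's socket.getaddrinfo calls the C library's getaddrinfo, which on glibc systems consults the "hosts:" line of /etc/nsswitch.conf. A minimal sketch limited to IPv4:

```python
import socket


def resolve(host: str) -> list[str]:
    """IPv4 addresses for host via the system resolver, sorted and de-duplicated."""
    infos = socket.getaddrinfo(host, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

At the time of this thread, ping reported 171.67.108.200 for assign.stanford.edu, so resolve("assign.stanford.edu") run as the folding user should have agreed with it.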

Here's a possible solution that's recent: Folding Forum • View topic - can not connect to assign server for work

I do know that when I see things like this, although they piss me off, they go away eventually in every case. All by themselves.

Gizmo 16th September, 2009 03:31 AM

I can ping them both by name:

Quote:

[fah@chris-lap ~]$ ping assign.stanford.edu
PING vsp10v-vz00.stanford.edu (171.67.108.200) 56(84) bytes of data.
64 bytes from vsp10v-vz00.Stanford.EDU (171.67.108.200): icmp_seq=1 ttl=51 time=77.7 ms
64 bytes from vsp10v-vz00.Stanford.EDU (171.67.108.200): icmp_seq=2 ttl=51 time=76.7 ms

[fah@chris-lap ~]$ ping assign2.stanford.edu
PING vspg6-vz7.stanford.edu (171.64.65.121) 56(84) bytes of data.
64 bytes from vspg6-vz7.Stanford.EDU (171.64.65.121): icmp_seq=1 ttl=51 time=76.8 ms
64 bytes from vspg6-vz7.Stanford.EDU (171.64.65.121): icmp_seq=2 ttl=51 time=76.7 ms

I might buy the ISP theory if it weren't for the fact that it happens both at home and at the office. At home I'm at the mercy of the ISP, but at the office I've got a dedicated T1 that comes through my own router, with my web, mail, and proxy servers on that network: Qwest doesn't jack with SQUAT or they get an earful.

I dunno if I've got some kind of library problem or what, but it sure is frustrating. This has been going on for about a week and a half now. That's a lot of power wasted.

ThunderRd 16th September, 2009 03:33 AM

Here's an idea-

Go to the folding directory, create a directory called "later" or some such, and move the work directory and the queue.dat file into it. You could delete them if you want, the finished WU already uploaded, so there's nothing of value there.

Then, restart the client with the -config flag. You will get the configuration routine, make no changes. Does it download now?
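Parking the old files (rather than deleting them) can be sketched as below. It assumes work/ and queue.dat sit directly in the folding directory; the directory name "later" follows the suggestion above, and the function name is hypothetical. Restart the client with -config afterwards as described.

```python
import shutil
from pathlib import Path


def park_queue(fah_dir: Path, later_name: str = "later") -> Path:
    """Move work/ and queue.dat into fah_dir/later_name instead of deleting them."""
    later = fah_dir / later_name
    later.mkdir(exist_ok=True)
    for name in ("work", "queue.dat"):
        src, dst = fah_dir / name, later / name
        if not src.exists():
            continue  # nothing to park
        if dst.exists():
            # shutil.move would nest work/ inside an existing later/work, so refuse.
            raise FileExistsError(f"{dst} already exists; clear it first")
        shutil.move(str(src), str(dst))
    return later
```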

EDIT: I just checked those IPs, and the servers are running normally according to the log pages, so the problem isn't at Stanford. Don't be fooled by being able to ping them, though. Servers in accept status are normally still pingable, but they won't assign work until the staff works out the problem. Accept means "accepting connections", but the accepted connections just sit there.
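Since a server in "accept" status can answer ping and even complete a TCP connection without ever replying, a more telling test sends a request and waits for at least one byte back. A sketch with three hypothetical outcome labels; the request is a generic HTTP GET, not the client's actual work-request protocol.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """'unreachable' (no TCP connection), 'no_reply' (connected but silent until
    timeout or closed), or 'replied' (at least one byte came back)."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    with sock:
        try:
            sock.sendall(f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))
            data = sock.recv(1)
        except socket.timeout:  # must precede OSError: timeout is a subclass
            return "no_reply"
        except OSError:
            return "unreachable"
    return "replied" if data else "no_reply"
```

A "no_reply" result against a server that pings fine matches the "connections just sit there" situation described above.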

Gizmo 16th September, 2009 03:50 AM

No, it doesn't, and the failure is pretty much instantaneous. I must have a borked library or something somewhere.

Hmm......I just realized that I'm running the 6.24 beta version. I don't rightly remember when I downloaded it though. It seems like it was back in August some time, but maybe I ought to back down to the 6.02 client?

ThunderRd 16th September, 2009 03:58 AM

If I were you, I'd let the client run overnight tonight and see if it has a WU in the morning.

The 6.24 client IS the 6.02 client with a few upgrades regarding how it handles EUEs and some other changes, but basically the networking part is the same. I don't think using 6.02 would be any different. It's starting to sound like some transparent proxy issue to me. I'd wait it out and see.

I'm going into a meeting now, and you're 12 hours different, so by the time I'm out you're sleeping. I'll check back with you tonight when you're back at work.

Gizmo 16th September, 2009 04:41 AM

No worse than dealing with my new owners in India. LOL.

Thanks for your help, TR.

ThunderRd 16th September, 2009 04:54 AM

I just had a thought...

If you run the config routine again, instruct the client to use a web proxy (you'll have to find one on google or something, like hidemyass.com or another one - there are many).

Then see if the client downloads. If it does, you have an ISP problem. That is how I discovered the cause of my own problem a while ago. When I reported it to the ISP, they told me there was a new transparent proxy in place and the IP filters were still being worked on. Several days later the problem was gone, and it didn't return. In my case, IIRC, I could download but not upload. WUs died on the vine, so to speak.
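The same direct-versus-proxy comparison can be made outside the client. A sketch using urllib: one opener with proxies explicitly disabled, one that routes plain HTTP through a proxy you supply (the proxy address in the test is a placeholder, not a real service). If only the proxied fetch succeeds, something between you and Stanford is interfering.

```python
import urllib.request


def direct_opener() -> urllib.request.OpenerDirector:
    """Opener with proxies explicitly disabled (ignores http_proxy in the environment)."""
    return urllib.request.build_opener(urllib.request.ProxyHandler({}))


def proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Opener that sends plain-HTTP requests through proxy_url."""
    return urllib.request.build_opener(urllib.request.ProxyHandler({"http": proxy_url}))


def fetch_status(opener: urllib.request.OpenerDirector, url: str, timeout: float = 10.0) -> str:
    """'ok' if url returned a response, otherwise the exception class name."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            resp.read(64)
    except OSError as exc:  # URLError/HTTPError/timeouts
        return type(exc).__name__
    return "ok"
```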

Gizmo 16th September, 2009 05:11 AM

There's no way that can be the problem for two reasons:

1) This is my traveling laptop: I take it between home and the office every day and I have the problem both places.

2) I have 5 other machines here folding away (1 Ubuntu, 1 Fedora, 2 Gentoo, and 1 Windows) and they are all getting WUs just fine.

It really looks like a local (laptop) issue of some kind, but I haven't the foggiest clue what it might be.

ThunderRd 16th September, 2009 08:49 AM

I didn't know you had other machines connecting successfully. Maybe a reinstall of the client is in order. You might also try reinstalling the ia32-libs package.

ThunderRd 16th September, 2009 04:48 PM

Have you had any success as yet?

Gizmo 16th September, 2009 05:10 PM

No ia32-libs package for Fedora 11. No compat-libstdc++ either. Apparently, it shouldn't be working at all.

ThunderRd 16th September, 2009 06:17 PM

Some distros include the 32-bit support, but I know for sure that Debian and its derivatives do not. If the client runs, Fedora must have it. I forget exactly what needs it: it's either the client itself or the core (not both).
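One way to see which piece is 32-bit: byte 4 of an ELF executable's header (EI_CLASS) is 1 for 32-bit and 2 for 64-bit, the same information the file command reports. A sketch; the file names in the usage note are placeholders for whatever client and core binaries sit in your folding directory.

```python
from pathlib import Path


def elf_class(path: Path) -> str:
    """'32-bit' or '64-bit' from the ELF EI_CLASS byte, 'not ELF' otherwise."""
    with open(path, "rb") as f:
        ident = f.read(5)  # 4-byte magic + EI_CLASS
    if len(ident) < 5 or ident[:4] != b"\x7fELF":
        return "not ELF"
    return {1: "32-bit", 2: "64-bit"}.get(ident[4], "unknown")
```

For example, elf_class(Path("fah6")) and elf_class(Path("FahCore_a2.exe")) (hypothetical names) would show which binary needs the 32-bit libraries.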

Gizmo 16th September, 2009 06:21 PM

Quote:

Originally Posted by FAHLog.txt
Could not CosmHTTPOpen

I think this line's got a crucial bit of information, but I haven't got the foggiest idea how to interpret it.

Gizmo 24th September, 2009 09:34 PM

Ok, after banging around in the folding forums for a few days, we finally figured out the issue:

In /etc/nsswitch.conf, I changed the following:
Quote:

hosts: files mdns4_minimal [NOTFOUND=return] dns
to be:
Quote:

hosts: files dns

And that resolved the CosmHTTPOpen problem and got me downloading WUs again.
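Why that line matters, per the nsswitch.conf(5) rules: services are tried left to right, a [STATUS=action] block modifies the service just before it, and [NOTFOUND=return] after mdns4_minimal ends the lookup there if that module reports NOTFOUND, so dns is never asked. The sketch below simulates that walk; the per-service statuses are hypothetical inputs (the thread does not establish what mdns4_minimal actually returned for the folding client), and "!STATUS" negation is not handled.

```python
import re

# Default actions from nsswitch.conf(5): only SUCCESS stops the walk.
DEFAULT_ACTIONS = {"SUCCESS": "return", "NOTFOUND": "continue",
                   "UNAVAIL": "continue", "TRYAGAIN": "continue"}


def parse_hosts_line(line: str) -> list[tuple[str, dict[str, str]]]:
    """Split a 'hosts:' line into (service, actions) pairs."""
    spec = line.split(":", 1)[1]
    entries: list[tuple[str, dict[str, str]]] = []
    for token in re.findall(r"\[[^\]]*\]|\S+", spec):
        if token.startswith("["):
            # Criteria apply to the service named just before the bracket.
            for crit in token[1:-1].split():
                status, action = crit.split("=")
                entries[-1][1][status.upper()] = action.lower()
        else:
            entries.append((token, dict(DEFAULT_ACTIONS)))
    return entries


def lookup(line: str, results: dict[str, str]) -> tuple[list[str], str]:
    """Walk the services; results maps service -> status it would return (hypothetical)."""
    consulted, status = [], "UNAVAIL"
    for service, actions in parse_hosts_line(line):
        consulted.append(service)
        status = results.get(service, "UNAVAIL")
        if actions.get(status, "continue") == "return":
            break
    return consulted, status
```

With mdns4_minimal assumed to answer NOTFOUND, the original line stops before dns while "hosts: files dns" reaches it.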

PPM, you're going to have to pedal faster to catch me! I got 5th gear back!

ThunderRd 25th September, 2009 02:44 AM

Can you add a link to your thread over there...I can't find it.

EDIT: NM, I have it:

http://foldingforum.org/viewtopic.ph...t=CosmHTTPOpen

[note to self: Gizmo called jackofall a "GOD". will direct obscure linux queries to him from now on]

@Gizmo, he's asking a followup question, don't know if you saw...


All times are GMT +1.


Copyright ©2001 - 2010, AOA Forums

