And who ever thought that a few stray ypserv processes would eat your AFS cell for lunch?
We experienced some problems with our AFS cell today, not too dissimilar to those we experienced on Saturday night. ...
This time we ended up doing a complete service restart to bring everything back, only to have it come crashing down again an hour later. Symptoms: access to AFS filespace hanging, but not timing out. Another partial shutdown and restart of our infrastructure fileservers (ifs servers) cleared up the problem, only to have it happen again. At our wits' end, we brought only one IFS server (ifs1) into service and watched it intently.
Tue Apr 26 15:52:35 2005 VL_RegisterAddrs rpc failed; will retry periodically (code=5381, err=2)
This message appeared a few times... it struck me as odd, as you'd only expect to see it on a server with no network connectivity. (The fileserver is trying to contact a db server -- the volume location server, in particular -- to let it know what IP address(es) it's at.) This made me want to look at our database servers, which I found excruciatingly painful to log in to -- in fact, one of them was spewing complaints about a lack of swap! What I found were a bunch of forked ypserv processes chewing up CPU and memory. (ypserv only forks if someone is doing a map dump, and only dumb clients do a map dump...) At this point, I whacked the ypserv processes running on all of the db servers, and things seemed to magically clear up. So much for all the worry about outside hackers and such; I'll never run NIS service off of my database servers again.
However, this brings us to the question of why the problem presented itself the way it did -- the apparent "hanging" without timeouts of our servers. An AFS fileserver can respond to a request three ways: service it, say "I'm busy, try me later", or not at all. Not responding at all causes a server timeout, and failover to a replica if there is one. However, "I'm busy, try me later" results in the client busy-waiting and constantly retrying that particular server with its request. Now, how would a slow/swamped/whacked-out database server cause this situation? When a client initiates a "connection" (using the word loosely) with a fileserver on behalf of an authenticated user, the fileserver has to query the PTS (protection) database for that user's group memberships so that it can apply ACLs appropriately. That response has to come back pretty quickly; otherwise the fileserver will respond with "I'm busy, try me later". This can quickly snowball to the point where the fileserver is swamped with requests it can't service, because the ptservers haven't responded to its CPS queries, and so it goes.
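To make the difference between "no response" and "I'm busy" concrete, here is a toy model of the client side in Python. It is only a sketch of the behavior described above, not how the real AFS client is implemented: the names Reply and fetch are made up for illustration, and the max_busy_retries cap is an assumption added so the example terminates (a real client would just keep retrying, which is the hang we saw).

```python
from enum import Enum


class Reply(Enum):
    SERVED = "served"  # the fileserver handled the request
    BUSY = "busy"      # "I'm busy, try me later"
    NONE = "none"      # no response at all (the client times out)


def fetch(servers, max_busy_retries=5):
    """Toy client: try servers in order and return (server_name, log).

    servers is a list of (name, reply_fn) pairs, where reply_fn() returns a
    Reply. max_busy_retries is a hypothetical cap so this sketch terminates.
    """
    log = []
    for name, reply_fn in servers:
        attempts = 0
        while True:
            reply = reply_fn()
            attempts += 1
            log.append((name, reply))
            if reply is Reply.SERVED:
                return name, log
            if reply is Reply.NONE:
                break  # timeout: fail over to the next replica, if any
            # BUSY: stay on the same server and retry; no failover happens
            if attempts >= max_busy_retries:
                return None, log  # stands in for "hung, never timed out"
    return None, log


# A dead server fails over to the replica...
dead_then_ok = [("ifs1", lambda: Reply.NONE), ("ifs2", lambda: Reply.SERVED)]
served_by, _ = fetch(dead_then_ok)

# ...but a busy server pins the client, and the healthy replica is never tried.
busy_then_ok = [("ifs1", lambda: Reply.BUSY), ("ifs2", lambda: Reply.SERVED)]
stuck, stuck_log = fetch(busy_then_ok)
```

In the second case every entry in stuck_log is for ifs1: a server that answers "busy" is, from the client's point of view, worse than one that is simply down.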
Keep crap off our AFS DB servers, that's the moral of the story.