There were some system problems (some noticeable, some not) on Saturday afternoon and into the night, affecting some of the core infrastructure. Later in the evening (around 9:30pm) they got a bit worse, and at that time most of the AFS fileservers (home space, data space, you name it) had their fileserver processes restarted. A few less cooperative client machines (most of the imap/pop servers, the mail delivery systems, irix2 and linux2) were hard rebooted or power-cycled at around 10:45pm, and almost everything was back in service by 11pm (except hfs5, read on).
Earlier in the day, Nagios noticed that the load on some of the 'mr' machines had climbed over 10. On examination, there were some imap processes stuck in short-wait belonging to two specific users, whose names both began with 'a.b'. Their volumes looked fine when examined from another machine, which left the problem with the root 'a.b' volume that they were mounted from. As it turned out, accesses to the 'a.b' volume were hanging from various machines, seemingly those that had been configured to prefer 'ifs2' over the other two infrastructure file servers. ifs2 was restarted at that point, and that seemed to clear up the problem.
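For anyone wanting to repeat the "which users own the stuck imap processes" check, here is a rough Python sketch. It is not what was run that day. It assumes that "short-wait" corresponds to what Linux ps reports as state D (uninterruptible sleep), that the input is the text output of `ps -eo pid,user,stat,comm`, and that the imap daemon's command name starts with "imap". The function name and the sample user names are made up for illustration.

```python
from collections import defaultdict


def stuck_by_user(ps_output, prefix="imap"):
    """Group PIDs of D-state processes by owning user.

    ps_output is assumed to be the text printed by
    `ps -eo pid,user,stat,comm` (header line first).
    prefix is an assumed command-name prefix for the imap daemon.
    """
    stuck = defaultdict(list)
    # Skip the header line, then split each row into its four columns.
    for line in ps_output.strip().splitlines()[1:]:
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # ignore malformed or truncated rows
        pid, user, stat, comm = fields
        # Assumption: a STAT value starting with "D" is the "short-wait" state.
        if stat.startswith("D") and comm.startswith(prefix):
            stuck[user].append(int(pid))
    return dict(stuck)


# Hypothetical sample data, not taken from the incident.
SAMPLE = """  PID USER     STAT COMMAND
  101 alice    D    imapd
  102 alice    Ds   imapd
  103 bob      S    imapd
  104 carol    D    sshd
"""

if __name__ == "__main__":
    print(stuck_by_user(SAMPLE))
```

On a live machine the input could come from `subprocess.run(["ps", "-eo", "pid,user,stat,comm"], capture_output=True, text=True).stdout`. If the stuck processes all belong to users whose volumes hang off the same parent volume, that parent is the next thing to look at.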
Later, however, some other servers started having problems, with some jobs getting left in short-wait, but it was not apparent at the time that any specific AFS server or volume was at fault. At the point of frustration, I restarted the fileserver processes on all of our AFS servers, which, after the expected moaning and complaining, restored access. Most of these machines had been running in production for 246 days. As mentioned above, some of the client machines (Linux...surprise) had to be hard power-cycled to recover from this problem. It figures that this would happen just as we were starting to get the system management cards on our Linux/Intel servers connected; had they been ready, I could have done this from home.
I noticed hfs5 was taking a little longer to come up. After logging in, I saw there were too many salvager processes running (it looked like 8 or 9), way too many for a 4-vice-partition machine. I killed them all off so they could restart correctly.
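The "too many salvagers" check can be written down as a quick sanity test. This is only a sketch of the rule of thumb used above (no more salvager processes than vice partitions), not an official AFS limit. It assumes the salvager shows up with "salvager" in its command name and that the server partitions are the usual /vicep* mount points. The function name is made up for illustration.

```python
import glob


def excess_salvagers(command_names, partitions, name="salvager"):
    """Return how many salvager processes exceed the partition count.

    command_names: command names of running processes (e.g. from ps).
    partitions: list of vice partition paths, e.g. glob.glob("/vicep*").
    name: assumed substring identifying a salvager process.
    """
    running = sum(1 for c in command_names if name in c)
    # Rule of thumb from the text: at most one salvager per partition.
    return max(0, running - len(partitions))


if __name__ == "__main__":
    # Hypothetical snapshot: 9 salvagers on a 4-partition server.
    commands = ["salvager"] * 9 + ["fileserver", "volserver"]
    parts = ["/vicepa", "/vicepb", "/vicepc", "/vicepd"]
    print(excess_salvagers(commands, parts))
    # On a real server the partition list could come from glob.glob("/vicep*").
```

A non-zero result is a hint that something restarted the salvage more than once, which is what it looked like on hfs5.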