As you have probably noticed, Monday (11/14) and Tuesday (11/15) saw some major email and home directory outages, which were caused by the AFS server processes on hfs10, hfs11, and hfs12 hanging.
These two outages have given us the opportunity to try to catch the software bug in action. We've narrowed it down to a few candidate areas and have started a discussion about them on the openafs-devel list. It was comforting to find out that U Michigan is seeing similar problems on one of their servers.
The current theory is this: The AFS fileserver process is multi-threaded using POSIX threads (pthreads). Sometimes, these threads need to share data or access some central table of data. To do this, a thread must first "lock" the table (waiting its turn if another thread already holds the lock), then do its work on the table, and then release the lock. This ensures that one thread can't alter the table while another one is looking at it. What appears to be happening is called a deadlock, where a thread that has a lock on a table stalls and ultimately never releases that lock. The other threads that also need to access this table can't do anything until that lock is released by the stalled thread, which of course never happens... so everything grinds to a halt. No thread can do anything because the thread holding the lock has stalled.
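To make the theory above concrete, here is a minimal sketch of the failure mode. It uses Python's threading module as a stand-in for pthreads, and every name in it (run_demo, stalled_holder, waiter) is made up for illustration; it is not the fileserver's code. One thread takes the table lock and then stalls, and the other threads give up after a short timeout so that the demo can finish (in the real hang they would wait forever).

```python
import threading

def run_demo(n_waiters=3, wait_timeout=0.2):
    """Return one bool per waiter thread: True if it managed to get the lock."""
    table_lock = threading.Lock()      # guards the shared table
    shared_table = {}
    holding = threading.Event()        # set once the holder owns the lock
    end_stall = threading.Event()      # lets the demo end the simulated stall

    def stalled_holder():
        with table_lock:
            shared_table["last_writer"] = "holder"
            holding.set()
            end_stall.wait()           # simulated stall while holding the lock

    def waiter(results, idx):
        # A real waiter would block indefinitely; the timeout is demo-only.
        got = table_lock.acquire(timeout=wait_timeout)
        results[idx] = got
        if got:
            table_lock.release()

    holder = threading.Thread(target=stalled_holder)
    holder.start()
    holding.wait()                     # make sure the holder has the lock first

    results = [None] * n_waiters
    waiters = [threading.Thread(target=waiter, args=(results, i))
               for i in range(n_waiters)]
    for t in waiters:
        t.start()
    for t in waiters:
        t.join()

    end_stall.set()                    # un-stall the holder so the demo exits
    holder.join()
    return results

if __name__ == "__main__":
    print(run_demo())
```

Every waiter reports that it could not get the lock while the holder was stalled, which is the "everything grinds to a halt" situation in miniature.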
The goal right now is to find which table or shared memory space is being locked and never unlocked... and why. We've put some extra debugging code into the areas of the code that we believe are stalling, but unfortunately this means we're going to need another crash to get any meaningful telemetry. With luck, that one crash will be the last, as we would then have the info necessary to fix the problem properly. Stay tuned...
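For the curious, this kind of instrumentation can be sketched roughly as follows. This is not the debugging code we added to the fileserver; it is a hypothetical Python illustration of the idea, and the names TrackedLock, long_held, demo, "table_a", "table_b", and "worker-1" are all invented for the example. Each lock remembers which thread took it and when, so that after a hang we can ask which locks have been held suspiciously long and by whom.

```python
import threading
import time

class TrackedLock:
    """A lock that records its current owner and when it was acquired."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.owner = None
        self.acquired_at = None

    def acquire(self, timeout=-1):
        got = self._lock.acquire(timeout=timeout)
        if got:
            self.owner = threading.current_thread().name
            self.acquired_at = time.monotonic()
        return got

    def release(self):
        self.owner = None
        self.acquired_at = None
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()

def long_held(locks, threshold):
    """List (lock name, owner thread) for locks held at least `threshold` seconds."""
    now = time.monotonic()
    report = []
    for lk in locks:
        owner, since = lk.owner, lk.acquired_at
        if owner is not None and since is not None and now - since >= threshold:
            report.append((lk.name, owner))
    return report

def demo():
    locks = [TrackedLock("table_a"), TrackedLock("table_b")]
    holding = threading.Event()
    end_stall = threading.Event()

    def stalled_worker():
        with locks[0]:
            holding.set()
            end_stall.wait()           # simulated stall while holding table_a

    worker = threading.Thread(target=stalled_worker, name="worker-1")
    worker.start()
    holding.wait()
    time.sleep(0.1)                    # let the lock age past the threshold
    report = long_held(locks, threshold=0.05)
    end_stall.set()
    worker.join()
    return report

if __name__ == "__main__":
    print(demo())
```

The report names the stuck lock and the thread sitting on it, which is exactly the telemetry we are hoping the next hang will give us.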