Of course, that whole krb/tcp thing was a bit of a red herring. That *was* a problem, and I'm sure it didn't help. But the real problem was within Sun's directory server.
It's best described in this excerpt from the "changeplan" in Duke's "krbdirp"
package:
(2) We've run into yet another set of thread safety issues directly
linked to the thread-unsafe nature of the Kerberos 5 code from
MIT. Even with careful coding within the PAM library, it's
not possible to prevent the DS thread managing krbdirp operations
from colliding with other DS threads during unsafe DNS resolution
calls. The absence of exposure of the internal mutexes used by
the SunOne DS code for single-threading unsafe DNS calls is
a show-stopper. Problems with threadsafety are less frequent
in the PAM version (1.3a) of the code, but still exist, since
the PAM module itself executes within the context of the calling
process (in this case, the DS instance). As before, problems
are not apparent on single-processor machines, presumably
because no real concurrency occurs -- the problem becomes
more obvious on multi-processor machines, and particularly on
machines with multiple slow processors (more processors = higher
concurrency = higher chance for collision; slower processors =
longer DNS resolution operations = higher chance for collision).
To address this, we propose to extract the PAM library calls out
of the plugin itself, and pull them completely out of the context
of the slapd process into a separate co-process. Looking at the
available options, it appears that our best bet for now will be
to link the slapd and this external process over either a Unix
domain socket or a Solaris "door". The latter will provide
better performance (since doors are explicitly intended for this
sort of purpose -- essentially providing RPC-like interfaces
between processes through lightweight kernel channels) but will
be non-portable to systems other than relatively recent versions
of Solaris.
So, with a bit of a rush, we modified, implemented, and tested the "krbdirp" code from Duke, with its external "door" module for password checking. It's in production now, and seems to be working just fine.
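For anyone curious what "pull the check out into a co-process" looks like, here is a rough sketch of the Unix domain socket variant. This is not the krbdirp code, and it is Python rather than C just to keep it short. Everything named here (`check_password`, `start_helper`, `ask_helper`, the tab-separated request format, the `FAKE_PASSWORDS` table) is made up for illustration; a real helper would make the PAM/Kerberos calls where the fake lookup is, and the door variant isn't shown at all.

```python
import os
import socket

# Hypothetical stand-in for the real check. An actual helper would call
# PAM/Kerberos here; this table exists only so the sketch runs.
FAKE_PASSWORDS = {"alice": "s3cret"}


def check_password(user, password):
    return FAKE_PASSWORDS.get(user) == password


def helper_loop(sock):
    """Runs in the co-process. Reads 'user<TAB>password' lines, answers OK/FAIL."""
    with sock, sock.makefile("r", encoding="utf-8") as requests:
        for line in requests:
            user, _, password = line.rstrip("\n").partition("\t")
            ok = check_password(user, password)
            sock.sendall(b"OK\n" if ok else b"FAIL\n")


def start_helper():
    """Fork the co-process and return (pid, socket the server side talks to)."""
    server_side, helper_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        # Child: the thread-unsafe work lives here, outside the server process.
        server_side.close()
        try:
            helper_loop(helper_side)
        finally:
            os._exit(0)
    helper_side.close()
    return pid, server_side


def ask_helper(sock, user, password):
    """Called from the server side: send one request, wait for one reply."""
    sock.sendall(f"{user}\t{password}\n".encode("utf-8"))
    reply = b""
    while not reply.endswith(b"\n"):
        chunk = sock.recv(64)
        if not chunk:
            raise ConnectionError("helper went away")
        reply += chunk
    return reply == b"OK\n"


if __name__ == "__main__":
    pid, sock = start_helper()
    print(ask_helper(sock, "alice", "s3cret"))
    print(ask_helper(sock, "alice", "wrong"))
    sock.close()          # helper sees EOF and exits
    os.waitpid(pid, 0)
```

The point of the design is that the helper handles one request at a time in its own address space, so whatever unsafe resolver calls it makes can't collide with the server's other threads. A multi-threaded caller would still need to serialize access to the socket (or open one connection per thread), and a real protocol would need framing that survives tabs and newlines in passwords.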
Of course, in the middle of this, hfs12 hung (like hfs11 earlier) and was restarted... They had both been running about the same time. Hopefully, we'll be off beta code on those fileservers in the next couple of weeks.