We had two service problems today which caused plenty of headaches. So much for a quiet Friday...
Problem #1
Around 9:30 this morning we began to receive reports of authentication problems with applications that used our LDAP servers for authentication. These included the UMBC VPN, WebAuth/MyUMBC and PeopleSoft. We noted our Kerberos authentication plugin for SunONE DS was erroring with:
Kerberos authentication failure: (-1765328164) Cannot resolve network address for KDC in requested realm
Which is darned odd. We were seeing this on both of our read-only directory
servers which service such applications. A quick restart of the directory
server software seemed to clear up the problem...
...until about a half hour later, when the problem started again. On both servers, at about the same time. Some basic changes were tried to see if we could isolate the problem (like calling out IP addresses in the krb5.conf instead of hostnames), but this simply moved the problem a few steps up the stack -- stuff was either not getting sent, or not getting received, from the KDC. So we began looking at the KDC -- after being a bit lost, this sort of popped out from the ipfilter logs on the KDC:
Aug 28 13:16:40 kerberos.umbc.edu ipmon[99]: [ID 702911 local0.warning] 13:16:39.670853 hme0 @0:1 b 130.85.25.23 -> 130.85.24.57 PR icmp len 20 571 icmp unreach/port for 130.85.24.57,88 - 130.85.25.23,56239 PR udp len 20 543 IN
Why should the KDC be blocking things going to port 88? The rules say to let
it through... but wait, they say to let through UDP port 88, not TCP port 88,
which, as it happens, was what the client was trying to connect to. Why the heck
was it trying to use TCP? Well, the newer versions of the Kerberos libraries
prefer to use TCP for requests over a certain size. Our KDC is an older version
that doesn't support TCP, so we were blocking (as in dropping on the floor) any
packets that came in on 88/TCP. This caused the authentication plugin to
get all backed up waiting for TCP connection-refused responses that never came.
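The difference between a port that refuses TCP connections and one that silently drops them is easy to check from a client machine. Here is a minimal sketch in Python; the function name `probe_tcp` and the 3-second timeout are my own choices for illustration, not anything taken from our plugin or from the Kerberos libraries.

```python
import socket

def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP port as 'open', 'refused', or 'filtered'.

    'refused'  -> the host answered with a reset, so clients fail fast.
    'filtered' -> nothing came back before the timeout, which is what a
                  firewall that drops packets looks like from the client.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "filtered"
    finally:
        sock.close()

# Hypothetical usage against a KDC:
#   print(probe_tcp("kerberos.umbc.edu", 88))
```

A result of "filtered" means every client that tries TCP first has to sit out its full timeout before falling back, which matches the pile-up we saw in the plugin.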
So, removing the filters on ports 88/tcp and 750/tcp from the primary
and secondary KDCs seemed to clear up that problem for the most part. I say
for the most part, as we still saw one "train-wreck" instance happen on one of
the servers after we had done this. I put in a restriction to stop about half
of the KDC authentication attempts from going through for at least one
directory "user" that didn't have a KDC principal. While this doesn't actually
solve the problem where certain conditions can cause a train wreck,
it certainly cuts down on them. We really need to verify the thread safety
of the current MIT client libraries and remove the horrible semaphore from
the authentication plugin.
Problem #2
While we were dealing with all of this, we had an AFS fileserver hang.
./rxdebug hfs11.afs.umbc.edu 7000 | more
Trying 130.85.24.122 (port 7000):
Free packets: 10, packet reclaims: 54015, calls: 320718401, used FDs: 64
not waiting for packets.
437 calls waiting for a thread
2 threads are idle
rxdebug was nice enough to help point this out. Sad that the AFS fileserver
code still has these problems, when so many eyes have been on it in the past
10 years...
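Since nobody was watching the fileservers while this happened, a small check on the rxdebug output would have caught it sooner. The sketch below only parses the two lines shown above; the function name `waiting_calls` is made up, and the regular expressions assume the wording in this one sample, so treat it as a starting point rather than a tested monitor.

```python
import re

def waiting_calls(rxdebug_output):
    """Return (calls_waiting, idle_threads) parsed from rxdebug text.

    Either value is None if its line is absent from the output.
    The patterns are based only on the sample output in this post.
    """
    waiting = re.search(r"(\d+) calls waiting for a thread", rxdebug_output)
    idle = re.search(r"(\d+) threads are idle", rxdebug_output)
    return (
        int(waiting.group(1)) if waiting else None,
        int(idle.group(1)) if idle else None,
    )

sample = """Trying 130.85.24.122 (port 7000):
Free packets: 10, packet reclaims: 54015, calls: 320718401, used FDs: 64
not waiting for packets.
437 calls waiting for a thread
2 threads are idle
"""

calls, idle = waiting_calls(sample)
print(calls, idle)
```

What counts as "too many" waiting calls is a judgment call for each site; alerting whenever the first number stays well above zero for a few minutes seems like a reasonable place to begin.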
Anyhow, hfs11 was back up and running (fully) by 2:30. Honestly, we don't
know how long it was messed up for -- all of our eyes were staring at the
directory server issue at the time.