November 2005 Archives

November 4, 2005

BFS5 content now being served from BFS1

Today I moved the volumes being served by bfs5.afs.umbc.edu to bfs1.afs, a new Sun V20z connected to our Core Storage Fabric with 1TB of mirrored disk online. This removes one more of the old Gen 2 Linux AFS servers from our machine room.

Next week, the contents of bfs3.afs will be moved to bfs1, allowing us to turn off bfs3 and its multiple A5200 JBOD arrays in an effort to conserve power and reduce the cooling load for the ECS data center.

QLogic fiber channel switches now running in PP

Two new QLogic SANbox 5200 fiber channel switches were configured and installed today in the Public Policy (PP) data center. They are awaiting the installation of the cross-campus fiber pairs, which will allow them to join the Core Storage Fabric in the ECS data center with an aggregate speed of 8Gb/second between the two buildings (2x 2Gb/s, bidirectional).

November 9, 2005

Blackboard crash

Looks like Blackboard died ~8:11 this morning.

Looks like the Tomcat process died. This happens every now and then if Blackboard is not restarted often enough. However, Blackboard was restarted 3 hours before Tomcat died, so the memory leak theory does not work here.

krb5 changes...

Been making some changes today to our Kerberos configuration -- adding support for some encryption types *other* than the slightly out-of-date DES ;)

"Out-of-date DES" is still the default until I've verified that all of the older 'aklog' binaries have been updated to support the new encryption types. The Kerberos libraries that the old builds were linked against were rather moldy and oldy, so they don't contain support for the newer enctypes that they'll be seeing. In addition, the newer builds of aklog will support addressless tickets, which is the flavor of the month in KRB5 land.

November 10, 2005

BB crash 11/10

On 11/10 at ~10:30AM Blackboard crashed, or at least slowed down to the point that users could not log in. Turns out one of the SQL Server processes was not started yesterday when the system was restarted for maintenance. This caused the database server to slow down, which in turn slowed Blackboard down until it froze. After restarting the process and running some jobs to clear up the transaction log, performance returned to normal.

November 16, 2005

Update on the AFS problems of the past few days

As you probably have noticed, Monday (11/14) and Tuesday (11/15) saw some major email and home directory outages which were caused by the AFS server processes on hfs10, hfs11, and hfs12 hanging.

These two outages have afforded us the opportunity to attempt to catch the software bug in action. We've boiled it down to a few candidate areas and have initiated discussion about these on the openafs-devel list. It was comforting to find out that U Michigan is seeing similar problems on one of their servers.

The current theory is this: The AFS fileserver process is multi-threaded using POSIX threads (pthreads). Sometimes, these threads need to share data or access some central table of data. To do this, a thread must first "lock" the table (waiting its turn if another thread already holds the lock), then do its stuff to it, and then release the lock. This keeps one thread from altering the table while another one is looking at it. What appears to be happening is a deadlock, where a thread that has a lock on a table stalls and ultimately never releases that lock. The other threads that also need to access this table can't do anything until that lock is released by the stalled thread, which of course never happens... so everything grinds to a halt. No thread can do anything because the thread holding the lock has stalled.
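For the curious, here is a minimal sketch of that failure mode. It is written in Python rather than the fileserver's C and pthreads, and it is not OpenAFS code: the table, the thread names, and the 0.2 second timeout are all made up for illustration. One thread takes the lock and stalls while holding it, and the other threads give up waiting instead of hanging forever, so the demo can actually finish.

```python
import threading


def stalled_holder(table, table_lock, has_lock, unstall):
    """Take the lock, touch the table, then stall while still holding the lock."""
    with table_lock:
        table["owner"] = "stalled_holder"
        has_lock.set()      # tell the demo that the lock is now held
        unstall.wait()      # simulated stall; in the real bug this never ends


def waiter(name, table, table_lock, results):
    """Try to update the table, giving up after a short timeout."""
    got_lock = table_lock.acquire(timeout=0.2)
    if got_lock:
        try:
            table[name] = "updated"
        finally:
            table_lock.release()
    results[name] = got_lock


def run_demo():
    table = {}                      # stand-in for a shared fileserver table
    table_lock = threading.Lock()   # guards every access to the table
    has_lock = threading.Event()
    unstall = threading.Event()
    results = {}

    holder = threading.Thread(
        target=stalled_holder, args=(table, table_lock, has_lock, unstall)
    )
    holder.start()
    has_lock.wait()                 # make sure the lock is held before anyone else tries

    waiters = [
        threading.Thread(target=waiter, args=(f"worker-{i}", table, table_lock, results))
        for i in range(3)
    ]
    for t in waiters:
        t.start()
    for t in waiters:
        t.join()

    unstall.set()                   # let the stalled thread go so the demo can exit
    holder.join()
    return results


if __name__ == "__main__":
    # Each worker maps to False: none of them could get the lock.
    print(run_demo())
```

In the real fileserver the waiting threads have no timeout, so they simply block for good, which is what the hang looks like from the outside.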

The goal right now is to find which table or shared memory space is being locked and never unlocked... and why. We've put some extra debugging code into the areas of the code which we believe are stalling, but unfortunately this means we're going to have to have another crash to get any meaningful telemetry from this. With luck, that one crash would be the last one, as we would then have the info necessary to fix the problem properly. Stay tuned...
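To give a rough idea of what "extra debugging code" can mean here, the sketch below wraps a lock so that it remembers which thread holds it and since when, then reports any lock held longer than a threshold. This is only an illustration in Python under assumed names (`TracedLock`, `report_stuck`, the table names, and the thread name are all hypothetical); it is not the instrumentation that went into the fileserver.

```python
import threading
import time


class TracedLock:
    """A lock wrapper that records which thread holds it and since when."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.holder = None
        self.acquired_at = None

    def acquire(self, timeout=-1):
        got = self._lock.acquire(timeout=timeout)
        if got:
            self.holder = threading.current_thread().name
            self.acquired_at = time.monotonic()
        return got

    def release(self):
        # Clear the bookkeeping before the lock becomes available again.
        self.holder = None
        self.acquired_at = None
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()


def report_stuck(locks, max_held_seconds):
    """Return (lock name, holder, seconds held) for locks held too long."""
    now = time.monotonic()
    stuck = []
    for lock in locks:
        holder, since = lock.holder, lock.acquired_at
        if holder is not None and since is not None and now - since > max_held_seconds:
            stuck.append((lock.name, holder, now - since))
    return stuck


def run_demo():
    # Hypothetical table names, purely for illustration.
    locks = [TracedLock("volume_table"), TracedLock("host_table")]
    holding = threading.Event()
    unstall = threading.Event()

    def worker():
        with locks[1]:
            holding.set()
            unstall.wait()          # simulated stall while holding host_table

    t = threading.Thread(target=worker, name="fileserver-thread-7")
    t.start()
    holding.wait()
    time.sleep(0.05)                # let the lock age past the threshold
    stuck = report_stuck(locks, max_held_seconds=0.01)
    unstall.set()
    t.join()
    return [(name, holder) for name, holder, _ in stuck]


if __name__ == "__main__":
    print(run_demo())
```

The catch is the same one described above: a report like this only says something useful while a thread is actually stuck, so it takes one more hang to collect it.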

November 22, 2005

New Syscore admin

Kendrick Hernandez joined the Syscore team this week! Kendrick comes to us from the Help Desk and will help with ticket resolution and administration of the servers. Welcome, Kendrick!

About November 2005

This page contains all entries posted to OIT SysCore in November 2005. They are listed from oldest to newest.

