bb restart
The Blackboard app server was restarted at 12:14 AM (Monday) to fix the video driver.
After installing the Legato backup client on two machines (syscoredb and jumpcore) this morning (and wondering why this hadn't been done already), I copied the Legato packages and install/config scripts out of Tim's work directory to /afs/umbc.edu/depts/oit/systems/legato.
Both production Blackboard servers were restarted to clear up the Java memory leak. This will hopefully keep us from crashing before the semester ends.
The new MySQL server is now doing nightly dumps of its databases at 3:30am. It creates a full dump plus an individual dump of each database in a backup directory, organized by date. This will make reconstituting the entire database server or individual databases easy to do in the event that needs to be done.
Titan.umbc.edu went down this morning. One of its CPU boards completely failed and has been removed from service. Titan is only a 20 processor machine now... This makes us very, very sad; it's so hard to hold back the tears.
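As a rough illustration of the layout described above (one full dump plus one dump per database, filed under a dated directory), here is a minimal Python sketch. It is not the actual script running on the server: the function name `dump_commands`, the backup root, and the database names are all made up for the example, and it only builds the `mysqldump` command lines rather than running them.

```python
from datetime import date
from pathlib import PurePosixPath


def dump_commands(databases, backup_root, day):
    """Build mysqldump command lines for one night's backup.

    Returns a list of argument lists; nothing is executed here.
    All names and paths are hypothetical.
    """
    target = PurePosixPath(backup_root) / day.isoformat()
    # One full dump of every database on the server.
    commands = [
        ["mysqldump", "--all-databases", f"--result-file={target / 'full.sql'}"]
    ]
    # One dump per database, so a single database can be restored on its own.
    for name in databases:
        commands.append(
            ["mysqldump", "--databases", name,
             f"--result-file={target / (name + '.sql')}"]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical database names and backup root, for illustration only.
    for cmd in dump_commands(["wiki", "tickets"], "/backup/mysql", date(2005, 5, 2)):
        print(" ".join(cmd))
```

A real job would also need to create the dated directory and supply credentials before handing these lists to something like `subprocess.run`.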
We've brought hfs10, one of the new Sun V20z fileservers, online, and are starting to migrate users to it for some good old-fashioned live load testing. Perhaps you're lucky enough to be one of them...
We've brought the second of our Sun V20z/Xraid backed fileservers online. The first, hfs10, has been performing quite well -- so we figured we'd fire up the second.
Update: Just a reminder -- *someone* configured an alternate interface on this server (an internal, 192.* address). However, this causes the AFS fileserver to register that address as a valid address in the VLDB. An entry in the server's /usr/afs/local/NetRestrict file had to be added, and the server restarted.
We've started a process to purge "really old" accounts from our system; meaning, we're clearing out the files of accounts that have been deactivated for a really long time....
hfs11 began crashing periodically at 5:30pm on Sunday, May 22, seemingly due to some memory problems -- at least that's what the logs seem to suggest. Of course, it had to wait until it was mostly full of users (including the CIO), because that's just the way these things go.
The memory was replaced around 9am. We'll see how she works now.
Update: Sun concurred that it was the memory.
FTP1's system disk failed yesterday evening. It came up fine after a power cycle, but the disk should be replaced very soon.
Between May 20th and May 23rd, we were experiencing some mail delivery delays due to two unconnected reasons.
First, mx5in seemed to have lost its time sync -- and since everything around here works via Kerberos, and Kerberos requires a relatively synchronized clock to do its thing, it wasn't able to get privileges to deliver mail into folks' accounts, and for a period over the weekend it was queuing mail.
The second reason is a bit more strange -- it appears that the AFS client software on a couple of the mail delivery boxes didn't refresh its volume location information; as we've been doing quite a few volume moves, it seems that for a small subset (probably fewer than 100) of user volumes, these servers lost track of where they are. We noticed this Monday morning, and forcibly refreshed the volume location information on all of our servers, which kicked loose the mail that was pending delivery.
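To make the first cause a bit more concrete: Kerberos rejects requests when the two clocks differ by more than a configured tolerance, which is commonly five minutes by default. The Python sketch below shows that comparison in isolation. It is only an illustration; the function name `clock_skew_ok` and the timestamps are invented for the example, and nothing here reflects how mx5in was actually checked.

```python
from datetime import datetime, timedelta, timezone

# 300 seconds is the usual Kerberos default tolerance; a site may configure
# a different value, so treat this constant as an assumption.
MAX_SKEW = timedelta(seconds=300)


def clock_skew_ok(local_time, reference_time, max_skew=MAX_SKEW):
    """Return True if the two clocks differ by no more than max_skew."""
    return abs(local_time - reference_time) <= max_skew


if __name__ == "__main__":
    # Hypothetical timestamps: a host that has drifted 12 minutes behind.
    reference = datetime(2005, 5, 21, 12, 0, tzinfo=timezone.utc)
    drifted = reference - timedelta(minutes=12)
    print(clock_skew_ok(drifted, reference))
```

A check like this could run periodically against a trusted time source and raise an alert before mail starts queuing.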
We periodically run a process which examines the space usage on our AFS home directory servers, and moves user volumes from one server to another in an attempt to balance the usage. However, just balancing on usage isn't enough. Recently, we've made some changes in the process to take into account other factors, such as the volume's average activity, in order to have servers with a load profile that is more even.
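For the curious, here is a minimal Python sketch of what balancing on more than raw usage could look like. It is not our actual balancing process: the cost formula, the weight, the greedy strategy, and the volume names are all assumptions made for the example. Each volume gets a single "cost" combining size and activity, and the planner keeps moving one volume from the most loaded server to the least loaded one while that narrows the gap.

```python
def volume_cost(size_kb, accesses_per_day, activity_weight=1.0):
    """Combine space and activity into one load figure (weight is illustrative)."""
    return size_kb + activity_weight * accesses_per_day


def plan_moves(servers, max_moves=10):
    """servers: {server_name: {volume_name: cost}}.

    Returns a list of (volume, source, destination) tuples. Greedy: move one
    volume at a time from the most loaded server to the least loaded one,
    as long as the move narrows the gap between them.
    """
    placement = {name: dict(vols) for name, vols in servers.items()}  # copy
    moves = []
    for _ in range(max_moves):
        load = {name: sum(vols.values()) for name, vols in placement.items()}
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        gap = load[src] - load[dst]
        # Moving a volume of cost c changes the gap to |gap - 2c|,
        # so the move only helps when 0 < c < gap.
        candidates = {v: c for v, c in placement[src].items() if 0 < c < gap}
        if not candidates:
            break
        # Prefer the volume whose cost is closest to half the gap.
        volume = min(candidates, key=lambda v: abs(gap - 2 * candidates[v]))
        placement[dst][volume] = placement[src].pop(volume)
        moves.append((volume, src, dst))
    return moves


if __name__ == "__main__":
    # Hypothetical volumes and costs, for illustration only.
    example = {
        "hfs10": {"u.alice": 60, "u.bob": 30, "u.carol": 10},
        "hfs11": {"u.dave": 20},
    }
    for move in plan_moves(example):
        print(move)
```

The planner only produces a list of moves; actually relocating volumes, and deciding how heavily to weight activity against space, is left out of the sketch.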
Today we experienced a mail delivery problem that has gotten us before. Basically, a machine not under our purview had been sitting with a metric ****ton of queued messages for one person on our system. The administrator of this machine "fixed" the problem, and its mail server happily began to deliver these into our system. Now, the MTA will accept bunches of messages for someone and fork off MDA processes (procmail) to deliver to the local addresses; the MDA will wait for a lockfile to deliver the message, do its thing, clear the lockfile, etc. Of course, if you've got a TON of messages being delivered, there are probably the respective TON of procmail processes waiting for their lockfiles... After a while, things begin to break down, as all of the available sendmail child processes are waiting for their respective MDAs to deliver messages to this one address... WAIT, what's that locking thing???
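For anyone wondering about that locking thing, the general idea can be sketched in a few lines of Python. This is a simplified illustration of exclusive lockfile creation, not procmail's actual implementation (which is more careful, for example on network filesystems); the function names, retry count, and delay are all made up for the example.

```python
import os
import tempfile
import time


def acquire_lock(lock_path, retries=5, delay=0.1):
    """Try to create lock_path exclusively; return True on success.

    O_CREAT | O_EXCL makes creation fail if the file already exists, so only
    one process holds the lock at a time. Everyone else sleeps and retries.
    """
    for _ in range(retries):
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(delay)  # another delivery holds the lock; wait
            continue
        os.close(fd)
        return True
    return False


def release_lock(lock_path):
    """Remove the lockfile so the next waiting delivery can proceed."""
    os.remove(lock_path)


if __name__ == "__main__":
    # Hypothetical mailbox lock in a scratch directory.
    lock = os.path.join(tempfile.mkdtemp(), "someuser.lock")
    print(acquire_lock(lock))                          # first delivery gets it
    print(acquire_lock(lock, retries=2, delay=0.01))   # second one gives up
    release_lock(lock)
```

With one mailbox and hundreds of deliveries, every process but one sits in that retry loop, which is how a single recipient can tie up every available delivery slot.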
hfs11's system was replaced last night (5/26) with that of hfs12. If it keeps crashing now, something is really wrong.
ftp1.umbc.edu's disk finally died today. Died completely; it wouldn't even show up on a SCSI bus.
This page contains all entries posted to OIT SysCore in May 2005. They are listed from oldest to newest.