hercules.rs disk failure
Hercules.rs.umbc.edu's system disk was periodically timing out on SCSI commands, and was replaced this morning.
jarjar, the remedy server, was down due to a fail(ing|ed) disk this morning from 7:50am until approximately 9:30am. The server was power-cycled, and the machine
came up cleanly. A replacement disk has been ordered, and the bad disk will be swapped out.
The blackboard servers will be rebooted late tonight 4/20. This is to apply security updates. Folks that are active at the time of reboot could notice a pause
in service of up to 2 minutes. Most folks will not notice anything.
Mr4 was down overnight due to an apparent disk problem:
Apr 19 17:07:52 mr4.umbc.edu cadp160: WARNING: Timeout on target 0 lun 0. Initiating recovery.
The disk has been replaced. No user-noticeable downtime should have occurred.
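For anyone who wants to watch for this kind of warning, here is a minimal sketch of picking SCSI timeouts out of syslog lines. The regular expression and the name `parse_scsi_timeout` are illustrative assumptions modeled on the single cadp160 line quoted above; other drivers word their warnings differently, and this is not a description of any monitoring we actually run.

```python
import re

# Pattern modeled on the one cadp160 line quoted in the entry (an assumption;
# it will not match timeout messages from other drivers).
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) (?P<driver>\w+): "
    r"WARNING: Timeout on target (?P<target>\d+) lun (?P<lun>\d+)\."
)

def parse_scsi_timeout(line):
    """Return a dict describing the timeout, or None if the line doesn't match."""
    m = TIMEOUT_RE.match(line)
    if m is None:
        return None
    info = m.groupdict()
    info["target"] = int(info["target"])
    info["lun"] = int(info["lun"])
    return info

line = ("Apr 19 17:07:52 mr4.umbc.edu cadp160: WARNING: "
        "Timeout on target 0 lun 0. Initiating recovery.")
print(parse_scsi_timeout(line))
```

Counting matches per host over a day would be one simple way to spot a disk that is on its way out before it takes the machine down.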
There were some system problems (some noticeable, and some unnoticeable) on Saturday afternoon and into the night affecting some of the core infrastructure. Later in the evening (around 9:30pm), they showed themselves as being a bit worse, and at that time most of the AFS fileservers (home space, data space, you name it) had their fileserver processes restarted. A couple of less cooperative client machines (most of the imap/pop servers, mail delivery systems, irix2 and linux2) were hard rebooted or power-cycled at around 10:45, and most everything was back in service by 11pm. ('cept hfs5, read on)
bfs4, a fileserver that primarily serves web resources and the like, was unintentionally "paused" during the console server work yesterday. Once it was noticed, it was 'unpaused', and everything seemed to recover fine. This outage affected parts of the MyUMBC environment, including webadmin and class schedules.
And who ever thought that some astray ypserv processes would eat your AFS cell for lunch?
We experienced some problems with our AFS cell today, not too dissimilar to those we experienced on Saturday night. ...
Email that passed through mx9in between the system problems yesterday and this morning has been held up in its local queue. It's clearing out now (it has a half-gig of mail to deliver!), so folks may see some delayed mail coming in from yesterday. All of the other mail delivery servers survived just fine -- there always has to be one...
We lost power to a rack this evening at around 5pm; this affected users on hfs9.afs.umbc.edu, and also took down solaris1.gl.
Power was back on quickly afterwards, but hfs9 didn't finish "salvaging" its filesystems until after 6.
The blackboard app server was restarted at 12:14AM (Monday) to fix the video driver.
Both production blackboard servers were restarted to clear up the Java memory leak. This will hopefully keep us from crashing before the semester ends.
Titan.umbc.edu went down this morning. One of its CPU boards completely failed, and has been removed from service. Titan is only a 20 processor machine now... This makes us very very sad, it's so hard to hold back the tears.
hfs11 began crashing periodically beginning at 5:30pm on Sunday May 22, seemingly due to some memory problems -- at least that's what the logs seem to suggest. Of course, it had to wait until it was mostly full of users (including the CIO), because that's just the way these things go.
The memory was replaced around 9am. We'll see how she works now.
Update: Sun concurred that it was the memory.
FTP1's system disk failed yesterday evening. It came up fine after a power cycle, but the disk should be replaced very soon.
Between May 20th and May 23rd, we were experiencing some mail delivery delays due to two unconnected reasons.
First, mx5in seemed to have lost its time sync. Everything around here works via Kerberos, and Kerberos requires a relatively synchronized clock to do its thing, so mx5in wasn't able to get privileges to deliver mail into folks' accounts, and for a period over the weekend it was queuing mail instead.
The second reason is a bit stranger -- it appears that the AFS client software on a couple of the mail delivery boxes didn't refresh its volume location information. As we've been doing quite a few volume moves, it seems that for a small subset (probably fewer than 100) of user volumes, these servers lost track of where they were. We noticed this Monday morning and forcibly refreshed the volume location information on all of our servers, which kicked loose the mail that was pending delivery.
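To illustrate the first cause: Kerberos rejects requests when the client's clock and the KDC's clock differ by more than a configured tolerance, which commonly defaults to five minutes. The sketch below is only an illustration of that check. The function name `within_skew`, the timestamps, and the assumption that a 300-second tolerance applied here are all made up for the example.

```python
from datetime import datetime, timedelta

# 300 seconds is the common default Kerberos clock-skew tolerance; whether
# that is the value configured on these servers is an assumption.
MAX_SKEW = timedelta(seconds=300)

def within_skew(client_time, kdc_time, max_skew=MAX_SKEW):
    """True if the two clocks are close enough for Kerberos to accept a request."""
    return abs(client_time - kdc_time) <= max_skew

kdc = datetime(2005, 5, 21, 12, 0, 0)                   # made-up timestamp
print(within_skew(kdc + timedelta(seconds=30), kdc))    # True: small drift
print(within_skew(kdc + timedelta(minutes=12), kdc))    # False: drifted host
```

A host in the second state can still accept mail from the outside world, but it can't authenticate to deliver it, so the mail just sits in the queue.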
Today we experienced a mail delivery problem that has gotten us before. Basically, a machine not under our purview had been sitting with a metric ****ton of queued messages for one person on our system. The administrator of this machine "fixed" the problem, and its mail server happily began to deliver these into our system. Now, the MTA will accept bunches of messages for someone and fork off MDA processes (procmail) to deliver to the local addresses; each MDA will wait for a lockfile, deliver the message, do its thing, clear the lockfile, etc. Of course, if you've got a TON of messages being delivered, there are probably a respective TON of procmail processes waiting for their lockfiles... After a while, things begin to break down, as all of the available sendmail child processes are waiting for their respective MDAs to deliver messages to this one address... WAIT, what's that locking thing???
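A rough back-of-the-envelope model of that pile-up, as a sketch only: messages for one mailbox arrive faster than the lockfile lets them be delivered, and each waiting delivery ties up one MTA child until the pool is gone. The function name, the tick-based model, and the numbers (50 children, 3 arrivals per tick) are invented for illustration and are not taken from our sendmail configuration.

```python
def ticks_until_pool_exhausted(children, arrivals_per_tick, max_ticks=10_000):
    """Count ticks until every MTA child is stuck waiting on one mailbox lock.

    Each tick, `arrivals_per_tick` messages for the same mailbox arrive and
    each occupies one free child. The lockfile serializes delivery, so only
    one delivery finishes per tick. Returns None if the pool never fills.
    """
    busy = 0
    for tick in range(1, max_ticks + 1):
        # New deliveries grab whatever children are still free.
        busy = min(children, busy + arrivals_per_tick)
        if busy >= children:
            return tick
        # The current lock holder finishes and frees its child.
        busy = max(0, busy - 1)
    return None

print(ticks_until_pool_exhausted(children=50, arrivals_per_tick=3))
print(ticks_until_pool_exhausted(children=50, arrivals_per_tick=1))
```

With one arrival per tick the lock keeps up and the pool never fills; anything faster than that and it is only a matter of time before nothing else gets delivered either.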
hfs11's system was replaced last night (5/26) with that of hfs12. If it keeps crashing now, something is really wrong.
ftp1.umbc.edu's disk finally died today. Died completely, it wouldn't even show up on a SCSI bus.
As some folks may have noticed, myUMBC has had issues on two separate occasions recently. Of course, Murphy's Law dictates that these problems must always occur after hours, and on evenings when I (as the primary developer) am out and unable to get to a computer to check it out. Anyhow..... we think the cause has been identified and are working on a permanent fix.
Our blackboard systems will be down for several minutes on 6/21 in the early AM. I will be applying several patches.
The patches scheduled for the blackboard server on 6/21 did not complete
before 8:00am. The patchwork was resumed at 6:30am today, 6/22, and was completed around 7:30.
HFS12 experienced a kernel panic related to filesystem corruption on July 14, at approximately 3:30pm. A few hours before, we had noticed evidence of some filesystem corruption that seemed to stem from work earlier in the day attempting to enable LUN masking on the backend storage attached to hfs11 & hfs12.
While no "fatal" errors were noticed on these systems when making the backend storage changes, it seems that some disk writes to the mirror pair that was being worked on were lost, resulting in out-of-sync mirrors, and hence, not-quite-right filesystem data -- depending on which mirror was being read from.
Attention:
Two of our fileservers will be down for system maintenance on
Sunday, July 17th from 8am until 11am. During this time,
Email and UNIX home directory access for those users whose data is housed on these servers will be unavailable.
During this time, these servers will be brought up to the latest patch level, and various changes will be made to their backend storage configuration.
Last night, hfs11 & hfs12 "wigged out" and started returning "busy" for all AFS fileserver requests. This particular event causes certain things on "connected" afs client machines to hang.
Oddly, these machines were rebooted at roughly the same time on Sunday, and had the same (new) version of the AFS fileserver stuff installed on them at that time. Coincidence? Probably not. Anyhow, I've installed the previous version of the volserver/fileserver software so that if they hang again and need to be restarted, they'll restart with the previous (non-hangy) software. Otherwise, a short (15 minute or so) downtime will be scheduled for later this week to switch back to the old code.
The master LDAP directory server and our Identity Management System "hub" will be down during the morning of Saturday, August 20, 2005 for software and hardware upgrades.
During this time, certain MyUMBC functionality relating to account maintenance will be unavailable. Functions affected include:
And others -- basically, anything housed on the "accounts.umbc.edu" or "webadmin.umbc.edu" websites. Other services, such as email access & delivery, UNIX shell accounts and Blackboard will continue to be available during this time.
The outage will begin as early as 3am, and last until 12 noon at the latest.
Please check this website for further updates and status.
We had two service problems today which caused much headache. So much for a quiet Friday...
Of course, that whole krb/tcp thing was a bit of a red herring. That *was* a problem, and I'm sure it didn't help. But, the real problem was within Sun's directory server.
The ListProc software on listproc.umbc.edu will be taken offline on the night of 9/11/2005 to rebuild its users database, a process which will take several hours.
Mailing lists will essentially be down during this period.
We had an outage of hfs10, 11 & 12 during the morning of 10/25. These three fileservers experienced the "thread starvation" problem which causes all clients that are accessing them to "hang". This was very quickly identified, and the fileserver processes were forcibly restarted. Unfortunately, this meant the salvager had to run on all of the volumes -- which took between 30 and 45 minutes per server to complete. Service was restored to the affected volumes by 12:45pm.
We had some more oddness on the three AFS servers that were mentioned on Tuesday(?). Basically, we were able to watch them "busy out", all pretty much at the same time -- by busy, we really don't mean busy, they were busy sitting there doing "nothing" except for waiting for something to happen.
After a few minutes, that something would happen, and they'd go about their business. Then 15 minutes or so later, the same thing would happen.
We've never noticed anything like this on our OpenAFS 1.2.13 servers, which were running fine during this time. To test out the theory that this could be a problem, we did a "special build" of the fileserver & volserver processes for Solaris 10, and installed them on hfs12. We'll be watching hfs12 to see if it exhibits the same wonkiness as it did before (we expect hfs10 & 11 will continue weirding out.) If it's fine while the others aren't, we'll 'upgrade' them to the version of the code we're running on hfs12.
Our restart of hfs12 to install the new software took much longer than expected because it didn't (or wouldn't) shut down cleanly. This could very well be related to whatever is going wrong...
Looks like Blackboard died ~8:11 this morning.
Looks like the tomcat process died. This happens every now and then if blackboard is not restarted often enough. However, Blackboard was restarted 3 hours before Tomcat died. So the memory leak theory does not work here.
On 11/10 at ~10:30AM, Blackboard crashed, or at least slowed down to the point that users could not log in. It turns out one of the SQL server processes was not started yesterday when the system was restarted for maintenance. This caused the database server to slow down, and hence slowed down blackboard until it froze. After restarting the process and running some jobs to clear up the transaction log, performance returned to normal.
As you probably have noticed, Monday (11/14) and Tuesday (11/15) saw some major email and home directory outages which were caused by the AFS server processes on hfs10, hfs11, and hfs12 hanging.
These two outages have afforded us the opportunity to attempt to catch the software bug in action, and we've boiled it down to a few candidate areas and have initiated discussion about these on the openafs-devel list. It was comforting to find out that U Michigan is seeing similar problems on one of their servers.
The current theory right now is this: The AFS fileserver process is multi-threaded using POSIX threads (pthreads). Sometimes, these threads need to share data or access some central table of data. To do this, a thread must first "lock" the table (if it isn't already locked by another thread), then do its stuff to it, and then release the lock. This makes it so that one thread can't alter the table while another one is looking at it. What appears to be happening is called a deadlock, where a thread that has a lock on a table stalls and ultimately never releases that lock. The other threads that need to also access this table can't do anything until that lock is released by the stalled thread, which of course never happens... so everything grinds to a halt. No thread can do anything because the thread holding the lock has stalled out.
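To make the theory concrete, here is a small sketch of that failure mode. It uses Python's `threading` module as a stand-in for pthreads, and every name in it (`table_lock`, `stalled_holder`, `worker`) is hypothetical; it is not the fileserver's code. One thread grabs the lock and stalls, and the others can't get it. The workers use a timeout only so the demo finishes and reports the hang instead of reproducing it.

```python
import threading
import time

table_lock = threading.Lock()   # stands in for the lock guarding a shared table
stalled = threading.Event()     # set once the "bad" thread holds the lock
results = []

def stalled_holder():
    # Takes the lock and then stalls without ever releasing it.
    table_lock.acquire()
    stalled.set()
    time.sleep(2)               # stand-in for "never"; keeps the demo finite

def worker(name):
    stalled.wait()
    # A real fileserver thread would block here forever; the timeout lets
    # the demo record the failed attempt and move on.
    got_it = table_lock.acquire(timeout=0.2)
    results.append((name, got_it))
    if got_it:
        table_lock.release()

holder = threading.Thread(target=stalled_holder, daemon=True)
workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
holder.start()
for w in workers:
    w.start()
for w in workers:
    w.join()
print(results)
```

Every worker comes back empty-handed. Take away the timeout and you have our fileserver: lots of threads, all alive, all waiting.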
The goal right now is to find which table or shared memory space is being locked and never unlocked... and why. We've put some extra debugging code into the areas of the code which we believe are stalling, but unfortunately this means we're going to have to have another crash to get any meaningful telemetry from this. With luck, that one crash would be the last one, as we would then have the info necessary to fix the problem properly. Stay tuned...
Kerberos.umbc.edu was "hung" for a time (still responding to pings, just not doing much) from 4pm on Dec 18 until approximately 9:30pm. It is unknown why it was hung, just that it was -- no diagnostics on the console or in the logs. Services that required the backwards-compatible rxKa authentication for older AFS client installs, and password changes, did not work during this time. In addition, it was noticed that the principal database on the secondary KDC hadn't been updated in some time. Logs showed that it hadn't been updated since November 8; it seems that 'cron' on kerberos2 had packed it in as well. Well, it has been up for 450+ days.
Next week on 12/28 we will move ff1-raid1 from ECS to PP. This move will mean that all AFS home storage will be redundant between buildings, a first for the campus! This means not all data is lost if the ECS building were to go up in flames.
No user-viewable disruption should be seen, as the missing array will be automatically handled by the mirroring on our Solaris AFS servers. ff1-raid2 will remain in the ECS data center. Once ff1-raid1 is back on the fibre channel fabric in PP, the mirrors will automatically resync and the AFS home server network will return to a nominal state.
Blackboard was Down 8 p.m. on 1/24 to 10 a.m. on 1/25
For a hardware upgrade that improved performance for the traditionally busier Spring semester, the UMBC Blackboard server was down from 8 p.m. Tuesday, January 24, to 10 a.m. Wednesday, January 25. This did not affect Winter Blackboard courses, which were running on a separate Blackboard server.
OIT upgraded (and upRAIDed: our storage is now RAID 50) the Blackboard application database environment to complement the Jan. 13 software upgrade. We had hoped to complete both upgrades on the 13th, but delays in receiving the hardware, configuring it, and adequately testing it required a second downtime before Spring classes start.
Our Blackboard server slowed to a halt at ~11:30AM today, 3/8. The main tomcat Java process was hung, using too much memory. I was able to get the hung tomcat process to restart, but the main page would not load until I restarted IIS.
This Saturday, 3/11, solaris.gl.umbc.edu will be down for a short time beginning at 1:30pm for an upgrade of its OS to Solaris 10.
Downtime is expected to be no longer than two hours. An update will be sent out once it is completed.
The other servers in the GL cluster will still be operational and available for use.
The following fileservers will be "restarted" at 2am on Sunday, April 30 2006:
hfs1.afs.umbc.edu
hfs2.afs.umbc.edu
hfs12.afs.umbc.edu
The service outage for users on these servers, which will last approx. 30 minutes, is to install some updated code to avoid a bug which caused hfs10.afs.umbc.edu to crash on Wednesday afternoon.
This will impact access to email & file services for users housed on these servers during the outage. We are sorry for any inconvenience that this may cause.
I restarted blackboard about 6am this morning to add yet another variable to yet another file that is supposed to make the automatic log rotation work again...
Worked on the log rotator again early this morning.
Blackboard was down from ~5:30 - ~7:00 while I tried
removing all traces of our custom auth to rule out that it is the problem.
On Tuesday 7/25 our production blackboard system
died ~4:19 in the early AM. It was brought back up
~ 7:30 AM when someone from the helpdesk arrived who
had access to the computer room and could follow my instructions.
The logs show the server could not boot correctly and
seemed to hang on boot up.
Blackboard was down from ~ 11:30PM 8/23 till 7:00am
8/24 while file system checks were run. This was a planned downtime.
After this, I discovered a bad memory DIMM and found that the server had lost half of its fans. The system was back
up before 8:00AM.
Blackboard went down again ~12:40 11/22 due to a Java memory overflow
Blackboard's Java process was using 1.7GB... The number of doom.
This means that a crash will probably happen in the next 2 hours.
I rebooted the server to avoid my pager going off at 2AM or extended
downtime if netsaint didn't catch it.
Another preemptive reboot for blackboard tonight 12/13 12/14
Will restart the system around midnight as java is about to run out of memory.
Rebooted blackboard today at 5:15AM to clear Java overflows, hopefully for the last time.
This page contains an archive of all entries posted to OIT SysCore in the Downtime category. They are listed from oldest to newest.