hfs10/11/12 outage

We had an outage of hfs10, hfs11, and hfs12 on the morning of 10/25. These three fileservers hit the "thread starvation" problem, which causes every client accessing them to hang. The problem was identified very quickly, and the fileserver processes were forcibly restarted. Unfortunately, this meant the salvager had to run on all of the volumes, which took 30 to 45 minutes per server to complete. Service was restored to the affected volumes by 12:45pm.

All three of these servers were running OpenAFS 1.4.0-rc4. Two had been running since 11/17, while the third was last restarted on 11/9, which rules out the "I've been running for this long and now I'm going to die" theory.

The rxdebug output from one of the servers while it was in its "hung" state was uninteresting. However, I've forwarded the output to the OpenAFS developers list to see what others make of it.

Since waiting for the salvaging to complete is pretty unproductive time, we used it to upgrade the fileserver and volserver binaries on these machines to 1.4.0-rc8.

About

This page contains a single entry from the blog posted on October 25, 2005 12:09 PM.
