
HFS12 fileserver problems

HFS12 experienced a kernel panic related to filesystem corruption on July 14, at approximately 3:30pm. A few hours earlier we had noticed evidence of some filesystem corruption that seemed to stem from work earlier in the day attempting to enable LUN masking on the backend storage attached to hfs11 & hfs12.
While no "fatal" errors were noticed on these systems when making the backend storage changes, it seems that some disk writes to the mirror pair that was being worked on were lost, resulting in out-of-sync mirrors and, hence, not-quite-right filesystem data -- depending on which mirror was being read from.

This shouldn't have happened: if a single write to a mirror pair was not successful, the system should have offlined that half and continued on with the working mirror. During the application of certain configuration settings, the Xserve RAID controller performs a very fast restart -- too brief to trigger a timeout, so you'd think operations would simply retry and continue on when the storage became available again.
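To make the difference between a reported write failure and a silently lost write concrete, here is a toy model in Python. It is only a sketch of the general idea, not how the Solaris md driver actually works: the `Submirror` and `Mirror` classes, the `fault` setting, and the round-robin read policy are all made up for illustration.

```python
# Toy model of a two-way mirror. This is NOT the Solaris md implementation;
# all names and the round-robin read policy are assumptions for illustration.

class Submirror:
    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True
        # Hypothetical fault modes:
        #   None     -> writes work normally
        #   "error"  -> write fails and says so
        #   "silent" -> write is acknowledged but never stored
        self.fault = None

    def write(self, block, data):
        if self.fault == "error":
            return False
        if self.fault == "silent":
            return True
        self.blocks[block] = data
        return True


class Mirror:
    def __init__(self, a, b):
        self.halves = [a, b]
        self._next = 0

    def write(self, block, data):
        for half in self.halves:
            if half.online and not half.write(block, data):
                # A reported failure takes that half offline.
                half.online = False

    def read(self, block):
        # Alternate reads across whichever halves are still online.
        online = [h for h in self.halves if h.online]
        half = online[self._next % len(online)]
        self._next += 1
        return half.blocks.get(block)


def demo(fault):
    a, b = Submirror("a"), Submirror("b")
    mirror = Mirror(a, b)
    mirror.write(0, "old")
    b.fault = fault          # storage misbehaves during the next write
    mirror.write(0, "new")
    b.fault = None           # storage is back
    return [mirror.read(0) for _ in range(4)], b.online


if __name__ == "__main__":
    print(demo("error"))
    print(demo("silent"))
```

In this model, a write that fails loudly takes the bad half offline and every read returns the new data. A write that is acknowledged but lost leaves both halves online with different contents, so what you read depends on which half services the request.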

There could be three things at play here:


  1. Perhaps during a system restart, the Xserve RAID "discards" some pending
    IO operations? (that would be bad!)

  2. The Solaris multipathing (MPxIO) stuff may have silently thrown away an
    operation when the path was no longer available. (haven't seen that before!)

  3. The Solaris md subsystem might also have some problems.

As you can see, there are several possibilities -- we'll have to continue researching this.

I would tend not to think #2 is the culprit: some work on the 13th (upgrading firmware on our switches) would have caused many "path failures" while the switches were restarted, yet no similar badness was detected on any of the systems after that event.

We will be taking downtime on Sunday to run a filesystem check to clear up any
remaining corruption (and to install some patches and really enable the LUN masking on the storage gear).


About

This page contains a single entry from the blog posted on July 14, 2005 5:19 PM.
