Last week, as many of you know, OIT's primary datacenter suffered a very brief power outage that took many of our services offline. SysCore-managed systems weathered the outage rather well; however, we took away some lessons, both good and bad, that may help us in the future.
Good: ZFS - All of OIT's AFS servers have been converted to use the Solaris ZFS filesystem. ZFS seems amazingly resilient at dealing with catastrophic failures, and it doesn't leave you with "muck" for a filesystem after coming up from an outage. Our hosts in the ECS/ENG datacenter got their data back online without manual intervention, and reattached and re-synced their mirrors almost immediately. Our server located in the Public Policy (PUP) datacenter never lost touch with its files (as a mirror half was still available in that building), and it almost instantly found its ENG mirrors and re-synced them when they became available.
Bad: the boot archive - Solaris has this thing called its "boot archive"; for some reason it was "out of sync" on all of our servers, so every one of them needed manual intervention to boot.
Good: PUP - Damn, it's nice not to have to come up from "zero". Having basic services such as Kerberos authentication, the IFS fileserver located there, the database server, YP & DNS services already up and waiting removes the "you have to boot this before you boot that" problem.
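To make the "you have to boot this before you boot that" problem concrete, here is a minimal Python sketch of working out a boot order from a dependency map. The service names and who-depends-on-whom below are assumptions made up for illustration, not our actual dependency list; the point is only that anything already running at PUP drops out of the list.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service maps to the set of services
# that must already be up before it can start. Illustrative only.
DEPS = {
    "dns": set(),
    "yp": {"dns"},
    "kerberos": {"dns"},
    "database": {"dns"},
    "fileserver": {"dns", "kerberos"},
    "wireless-auth": {"kerberos", "database"},
}

def boot_order(deps, already_up=frozenset()):
    """Return one valid boot order, skipping services that are already running."""
    order = TopologicalSorter(deps).static_order()
    return [svc for svc in order if svc not in already_up]

if __name__ == "__main__":
    # Cold start: everything has to be sequenced by hand.
    print(boot_order(DEPS))
    # With the basics already alive at a second site, little is left to order.
    print(boot_order(DEPS, already_up={"dns", "yp", "kerberos", "database"}))
```

With the basics already up at a second site, the second call leaves only the services that were actually down, which is the whole appeal of not starting from "zero".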
Bad: Wireless Access - We've only got two hard-wired systems in the ENG computer room to access our systems. We should probably bring in a Linksys WAP to fire up during these sorts of events so that we have the "break glass" network access we need to, say, bring up the systems that the wireless network authentication system relies on? ;)
Good: Not Salvaging AFS Servers - Yup. They just come up like fast. Dahm.
Bad: Nagios Is Slow - All of our servers are monitored by a single Nagios instance. Even in the best of times, when all services are up, it takes a while to cycle through all the services it monitors to give you a status report. We need to either tune the Nagios system to process its checks faster, or cascade it across multiple servers that do the actual monitoring -- bubbling their results up to a single monitoring interface.
Annoying: Xraids - The Apple Xraids seem to forget some of their configuration information after a power outage -- specifically the caching parameters that make them "go fast." Grr.
What do y'all think?
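As a rough back-of-the-envelope for the two options, the Python sketch below estimates how long one full pass over the checks takes. The model is a simplifying assumption (every check takes the same time and runs in fixed-size parallel batches), and all the numbers in the example are hypothetical, not measurements from our Nagios box.

```python
import math

def cycle_seconds(num_checks, avg_check_seconds, parallel_checks, pollers=1):
    """Estimate the time for one full pass over all service checks.

    Simplified model (an assumption, not how Nagios schedules internally):
    checks are split evenly across `pollers` monitoring servers, and each
    poller runs `parallel_checks` checks at a time in equal-length batches.
    """
    checks_per_poller = math.ceil(num_checks / pollers)
    batches = math.ceil(checks_per_poller / parallel_checks)
    return batches * avg_check_seconds

if __name__ == "__main__":
    # Hypothetical load: 2000 checks at about 2 seconds each.
    print(cycle_seconds(2000, 2, parallel_checks=10))             # one server
    print(cycle_seconds(2000, 2, parallel_checks=40))             # tuned for more parallelism
    print(cycle_seconds(2000, 2, parallel_checks=10, pollers=4))  # four pollers
```

Under this model, quadrupling the parallelism on one server and spreading the same checks over four pollers shorten the cycle by the same factor, so the choice comes down to whether a single box can actually sustain the extra concurrent checks.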