June 9, 2009

ECS Computer Room Power Outage Updates

Beginning at 3:00AM this morning, UMBC’s main computer room suffered a series of three brief but serious power fluctuations. Under normal circumstances the uninterruptible power supply (UPS) that protects the computer room would prevent power fluctuations from causing problems. Unfortunately, the UPS itself suffered a hardware failure that left the main computer room in the ECS building unprotected. As a result, all servers were abruptly shut down and rebooted. At 6:00AM we called in staff and began remediation.

Because the UPS had failed, one of the decisions we had to make this morning was whether to remain on BGE power or to transfer operations to generator power. Ultimately, we decided to transfer to generator power because severe thunderstorms were forecast for the late afternoon. If we had remained on BGE power and another brief power fluctuation had occurred, the UPS would not have protected us. The decision to transfer to generator power meant that we had to spend 2.5 hours safely shutting down the servers that had survived the original power fluctuations before we could make the transfer. We successfully moved to generator power at 11:15AM.

While a loss of power is always serious, we have traditionally fared quite well, suffering little equipment damage. Unfortunately, this power outage caused a considerable amount of damage to our equipment, which complicated the restoration of services. Some of the damage that occurred is listed below:

• Three of our four Fibre Channel switches that support file storage for AFS, Mail and Web services were destroyed. As part of our disaster recovery (DR) plan we maintain some excess capacity. DoIT staff were able to use other equipment by running new fiber cabling, and got this restored later in the afternoon.
• One of our two IBM DS4800 storage systems, each holding twenty terabytes of storage, was damaged and will need to be replaced. These storage systems provide data storage for Blackboard, Windows file shares, and Mail. As part of our DR plan, DoIT staff reconfigured the disk storage we use for mirroring data and brought this back up.
• The Host Management Console that is required by the PS Finance server was corrupted by the outage. Our staff spent several hours restoring this so that we could restore operations for the Finance database server.
• The power supply failed on the server that runs the Legato backup software. This server manages all of the nightly tape backups that occur on all systems. We are currently awaiting a replacement part from the vendor.
• The repeated up-and-down power outages caused synchronization issues with our virtual machine infrastructure, and this required us to work with our vendor, VMware, to reestablish Windows file services and Blackboard.
• The decision early Tuesday morning to move operations over to using power from the generator turned out to be prescient as we learned late in the morning that the UPS system would need to have a part flown in overnight. The vendor will be working on this on Wednesday and we will schedule a time outside normal business hours to move off generator power.

During the outage we have been using our disaster recovery plans and validating what has worked well and where we need to focus in the future.

Actions that worked well:
• Using text messaging to get word out to the campus when email was down;
• Putting up a quick web page to keep the campus informed;
• Having run “virtual” simulations of disaster scenarios; and
• Designing redundancy into the systems we deploy.

Issues we have identified that we need to address in the future:
• Lessening the inter-dependence of services. We have a number of services that would seem to be independent of one another but as a result of inter-dependencies are not. This makes it more difficult to get services restored;
• Better standardization of our file storage. We have a large amount of file storage purchased over the last four years from different vendors for cost reasons. This adds to the complexity of restoring service during an outage; and
• A simple technical change: setting our servers so that if they lose power they shut down and do not reboot on their own. This change could have lessened the synchronization issues we encountered.
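
To illustrate the first issue above, here is a minimal sketch of how service inter-dependencies could be written down and used to work out a safe restoration order. The dependency map is hypothetical and for illustration only: the service names are loosely based on the systems mentioned in this post, but the specific edges are assumptions, not a description of our actual configuration. The helper names `restore_order` and `blocked_by` are likewise made up for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map (assumption, not the real configuration):
# each service lists what must be running before it can be restored.
DEPENDS_ON = {
    "fibre-channel-switches": [],
    "ds4800-storage": [],
    "vm-infrastructure": ["ds4800-storage"],
    "afs": ["fibre-channel-switches"],
    "mail": ["fibre-channel-switches", "ds4800-storage"],
    "web": ["fibre-channel-switches", "afs"],
    "windows-file-shares": ["ds4800-storage", "vm-infrastructure"],
    "blackboard": ["ds4800-storage", "vm-infrastructure"],
}


def restore_order(depends_on):
    """Return the services in an order where dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


def blocked_by(depends_on, failed):
    """Return every service that cannot run while `failed` is down."""
    blocked = set()
    changed = True
    while changed:  # repeat until no new blocked service is found
        changed = False
        for service, deps in depends_on.items():
            if service in blocked:
                continue
            if any(d == failed or d in blocked for d in deps):
                blocked.add(service)
                changed = True
    return blocked


if __name__ == "__main__":
    print(restore_order(DEPENDS_ON))
    print(sorted(blocked_by(DEPENDS_ON, "ds4800-storage")))
```

Even in a small map like this one, a single storage failure blocks several services that look unrelated on the surface, which is the kind of hidden coupling that slows down restoration.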
