Backup and Recovery

SysCore System Backup & Recovery Strategy

The systems core group runs many servers, but we back up very few of them. This is our philosophy:

Most of our systems contain little or no native volatile data, and the ones that do have this data restricted to specific directory trees. Our system configurations, from bare disk to functioning server, are for the most part handled by custom Solaris Jumpstart scripts and cfengine configuration files. The Solaris Jumpstart environment is duplicated nightly to a disk located in the Public Policy computer room, and the configuration files used by cfengine are replicated via AFS between datacenters.

With a few exceptions, all of the binaries, programs, and files for web services, electronic mail, etc. are located in AFS. When possible, these are in read-only volumes which are replicated in an active/active fashion across datacenters.

Some systems, such as our web servers or mail servers, contain some volatile data that is *not* managed by cfengine, such as Kerberos keytabs or SSL keys. If the service is replicated, the keytab or SSL key it needs can be copied from the alternate server. For non-replicated services, a new keytab or SSL cert can be generated and signed if a backup copy is not available. (We keep copies of certain important SSL cert keys in a safe place.)
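
For the non-replicated case, regenerating the credentials might look roughly like the sketch below. It only prints the commands rather than running them. The hostname, service name, and keytab path are placeholders and not values from our environment, and it assumes an MIT-style kadmin and a stock openssl are on hand.

```shell
#!/bin/sh
# Sketch only: prints the commands for regenerating a keytab and an
# SSL key plus CSR. Nothing is executed; all names are placeholders.

regen_credentials_plan() {
    fqdn=$1        # host the service runs on (placeholder)
    service=$2     # Kerberos service name, e.g. HTTP or imap (placeholder)
    keytab=$3      # where the new keytab should be written (placeholder)

    # ktadd generates new random keys for the principal, so any older
    # keytab for the same principal stops working afterwards.
    echo "kadmin -q \"ktadd -k $keytab $service/$fqdn\""

    # New private key plus a certificate signing request to hand to the CA.
    echo "openssl req -new -newkey rsa:2048 -nodes -keyout $fqdn.key -out $fqdn.csr -subj \"/CN=$fqdn\""
}

regen_credentials_plan www.example.edu HTTP /etc/krb5.keytab
```

Because ktadd re-keys the principal, copying the keytab from the alternate server remains the better choice for a replicated service; generating a fresh one would invalidate the copy the surviving server is using.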

For servers which DO contain volatile, non-reproducible data, the filesystems or portions of filesystems which contain this data are backed up nightly with Legato Networker. The exception is the AFS filesystem, which requires its own backup system. All of the AFS filesystem contents, as well as critical volatile data from other systems, are housed on our SAN, which actively mirrors these filesystems between both data centers. Even in the event of a primary hardware failure or the loss of a datacenter, this data would still be available and could easily be re-provisioned to a replacement server.

You may be interested in these other topics:

General Procedure for System Recovery

Procedure for System Recovery of an AFS Server

  • Jumpstart the replacement hardware
  • Install the appropriate AFS server binaries (easiest to copy these from another server, see Configure_An_AFS_Fileserver for details and caveats)
  • If the storage devices are still available on the SAN, import their ZFS pools onto the new server (see the documentation on docs.sun.com if you don't know how to do this). Otherwise, follow the procedure on the page referenced above to recreate the ZFS pools.
  • Fire up /usr/afs/bin/bosserver.
  • If you had to recreate the ZFS pools, you'll have to do a full AFS server restore using /usr/afs/bin/backup diskrestore (see OpenAFS.org or the built-in help for the specifics).
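
The last three steps above could be sketched as follows. This is a dry run that prints the commands in order instead of executing them. The server name and partition list are placeholders, and whether "zpool import -a" is appropriate (as opposed to importing pools by name, or forcing with -f) depends on how the pools were last used.

```shell
#!/bin/sh
# Sketch only: prints the AFS fileserver recovery commands in order.
# The server name and partitions passed in are placeholders.

afs_recovery_plan() {
    server=$1            # replacement fileserver (placeholder)
    pools_recreated=$2   # "yes" if the ZFS pools had to be rebuilt
    shift 2              # remaining arguments: vice partitions

    if [ "$pools_recreated" != "yes" ]; then
        # SAN storage survived: list what can be imported, then import it.
        echo "zpool import"
        echo "zpool import -a"
    fi

    echo "/usr/afs/bin/bosserver"

    if [ "$pools_recreated" = "yes" ]; then
        # Empty pools mean every partition has to come back from tape.
        for part in "$@"; do
            echo "/usr/afs/bin/backup diskrestore -server $server -partition $part"
        done
    fi
}

afs_recovery_plan fs-new.example.edu no /vicepa /vicepb
```

Calling it with "yes" as the second argument prints the bosserver line followed by one diskrestore line per partition.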

Procedure for the System Recovery of the Kerberos KDC

  • Put a disk in a Sun.
  • Look at /usr/kdc on jumpcore.umbc.edu -- you'll find:
total 852246
drwxr-xr-x   3 kdc      other        512 Aug 18  2006 .
drwxr-xr-x  41 root     root        1024 Sep 11  2005 ..
drwxr-xr-x   2 kdc      other        512 Sep 11  2005 .ssh
-rw-r--r--   1 kdc      other    20818099 Aug 13  2006 root.tar.gz.pgp
-rw-r--r--   1 kdc      other    71004342 Aug 18  2006 usr.k5s.tar.gz.pgp
-rw-r--r--   1 kdc      other    344258738 Aug 13  2006 usr.tar.gz.pgp

These are pgp-encrypted tar files of the /, /usr, and /usr/k5s filesystems from that server. The pgp key used to decrypt them is stored on a USB key kept in a location known to the core systems staff. Another copy is kept at the home of the Director of Computing Infrastructure.

  • Restore the stuff to the disk, run installboot on it, and fire it up.

Note that the only time you'd have to restore the KDC is when both the KDC and the secondary KDC have been destroyed... which would suck, because that would mean both datacenters are toast. Except for their hostnames, and the fact that only one of them runs kadmind, the machines are duplicates of each other. Really. I dd'd the disks. And the master syncs its database to the slave every 15 minutes.
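
The KDC restore steps might be sketched like this. It prints the commands rather than running them, and it makes several assumptions that are not stated on this page: that gpg can stand in for whichever pgp tool encrypted the archives, that the tar files hold paths relative to each filesystem's mount point, that the new disk is mounted under a scratch directory, and that the root slice is UFS on SPARC. The mount point and device name are placeholders.

```shell
#!/bin/sh
# Sketch only: prints the commands to unpack the three KDC archives
# onto a freshly mounted disk and make it bootable.
# Assumptions (not from this page): gpg as the decryption tool,
# archives with relative paths, UFS root on SPARC.

kdc_restore_plan() {
    mnt=$1      # where the new root filesystem is mounted (placeholder)
    rawdev=$2   # raw device of the new root slice (placeholder)

    # archive:subdirectory pairs, in the order the filesystems nest.
    for pair in root.tar.gz.pgp: usr.tar.gz.pgp:/usr usr.k5s.tar.gz.pgp:/usr/k5s; do
        archive=${pair%%:*}
        sub=${pair#*:}
        echo "gpg --decrypt $archive | gzip -dc | (cd $mnt$sub && tar xf -)"
    done

    # Install the UFS boot block so the disk can boot.
    echo "installboot /usr/platform/\`uname -i\`/lib/fs/ufs/bootblk $rawdev"
}

kdc_restore_plan /a /dev/rdsk/c0t0d0s0
```

The archives are unpacked root first so that /usr and /usr/k5s exist (or are mounted) before their contents are extracted into them.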

The Worst Case Scenario

The worst case scenario is that both datacenters have been destroyed, which would include the replicas of the jumpstart servers, the AFS backup server (tapestud), and the networker server as well. Oh boy, would that stink. However, magically, you (the system administrator) and the tapes all survived. Perhaps it was just an EMP? So, here's what one might do...

  • Install Solaris and Sun Enterprise Backup Suite onto a Sun, and hook it up to the networker tape stacker that somehow survived. (we're still wondering about that)
  • Grab another server that will be the gold server for the system restore. It should have at least 1TB of disk attached to it.
  • Restore to it the contents of jumpcore-ecs.umbc.edu
  • Also restore to it the contents of /usr/afs from db1.afs.umbc.edu, and /usr/afs/backup from tapestud.afs.umbc.edu.
  • Find another machine and restore the contents of the KDC to it (see above). The encrypted KDC tar files came back when you restored jumpcore.
  • Hook the gold server up to the AFS tape stacker, which also magically survived.
  • Using the contents of /usr/afs from db1, set the gold server up as a single-server AFS db server and fileserver. (Make some filesystems for /vicepa through /vicepd with that 1TB of space you have.)
  • Use the AFS backup system to restore the contents of ifs1.afs.umbc.edu to this AFS server.
  • Now you have a working AFS "infrastructure", and can start restoring the rest of the stuff. Good luck, you'll need it.
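
The step that restores ifs1 onto the gold server could be sketched as below, using the -newserver and -newpartition options of backup diskrestore to redirect the restore. Again this only prints commands. The gold server name is a placeholder, and the one-to-one partition mapping is an assumption, since the actual partition layout of ifs1 is not listed here. A Tape Coordinator has to be running against the tape stacker before any of this will work.

```shell
#!/bin/sh
# Sketch only: prints backup diskrestore commands that redirect the
# contents of ifs1 to a different server. The gold server name and the
# partition mapping are placeholders.

ifs1_restore_plan() {
    gold=$1       # the gold server (placeholder)
    preview=$2    # "yes" appends -n to list volumes without restoring
    shift 2       # remaining arguments: partitions to restore

    flag=""
    if [ "$preview" = "yes" ]; then
        flag=" -n"
    fi

    for part in "$@"; do
        echo "/usr/afs/bin/backup diskrestore -server ifs1.afs.umbc.edu -partition $part -newserver $gold -newpartition $part$flag"
    done
}

ifs1_restore_plan gold.example.edu yes /vicepa /vicepb /vicepc /vicepd
```

Running the preview form first shows which volumes the backup database thinks belong on each partition, which is worth checking before committing hours of tape time.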

What can we do to make this better?

  • rsync the data from jumpcore-ecs to a remote site nightly
  • also include in that the 'IFS' fileserver data, as well as the data from a db server...

(at least then you won't be stuck going to tape to get the ball rolling. Not like you'd ever want to be in this situation.)
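
A nightly rsync of that kind might be set up with a crontab entry like the one this sketch prints. The source path, remote account, destination, and hour are all placeholders; nothing here reflects where the data actually lives on jumpcore-ecs.

```shell
#!/bin/sh
# Sketch only: prints a crontab line for a nightly rsync to a remote
# site. All paths, hosts, and times are placeholders.

nightly_rsync_cron() {
    src=$1     # local directory to copy (placeholder)
    dest=$2    # remote target in user@host:/path form (placeholder)
    hour=$3    # hour of the night to run, 0-23

    # -a preserves ownership, permissions, and times; --delete removes
    # files on the remote side that no longer exist locally.
    echo "0 $hour * * * rsync -a --delete -e ssh $src/ $dest/"
}

nightly_rsync_cron /export/jumpstart backup@offsite.example.edu:/backups/jumpcore-ecs 2
```

With --delete the remote copy is a mirror rather than a history, so a file removed by mistake disappears from the remote site on the next run; dropping that flag trades disk space for a little more safety.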