Backup boot disk -- partial solution to a FAQ

Rich Kulawiec (rsk@itw.com)
Thu, 27 Nov 1997 08:57:13 -0500

--Boundary_(ID_weHEg5Nke493EwCOx1oV+Q)
Content-type: text/plain; charset=us-ascii

A number of questions have come across recently that all relate
to emergency recovery scenarios; many of them involve booting
from backup disks, from CD, or from disks transplanted from another machine.

Attached to this message is an EXPERIMENTAL script which attempts to
address that issue for SunOS and Solaris systems; it's designed to
allow admins to create a backup disk (with multiple filesystems, e.g. /,
/usr, /opt, /var) that can be used to get out of trouble. I haven't
written a man page for it yet -- but it is heavily commented. Speaking
of comments, yours are invited. Once this works, I'll stash it
somewhere on the 'net, get an entry put in the FAQ for it, and hopefully
that will not only help out a number of admins, but also cut down on
the number of questions of this type.

---Rsk
Rich Kulawiec
rsk@itw.com
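
The layout the message describes (several filesystems, each landing in its own partition of one spare disk) can be illustrated with a minimal sketch. It borrows the values from the script's commented-out Solaris example, and show_plan is a hypothetical name, not something in the attachment. It only prints the intended mapping and touches no disks.

```shell
#!/bin/sh
# Sketch only: walk two space-separated lists in step, pairing each
# filesystem with a slice on the copy disk by position.
# The values are the attached script's commented-out Solaris example.
FILESYS="/ /usr /var /opt"
COPYDISK=/dev/dsk/c0t1d0s
PARTITIONS="0 6 4 7"

# show_plan (hypothetical): print "filesystem -> device" for each pair.
# Nothing is mounted, dumped, or written.
show_plan ()
{
    m=1
    for fs in $FILESYS
    do
        # Pick the m-th partition name to go with the m-th filesystem.
        part=`echo $PARTITIONS | cut -d" " -f $m`
        echo "$fs -> ${COPYDISK}${part}"
        m=`expr $m + 1`
    done
}

show_plan
```

With these values the first line printed is "/ -> /dev/dsk/c0t1d0s0" and the last is "/opt -> /dev/dsk/c0t1d0s7". If the two lists differ in length the pairing silently goes wrong, which is presumably why the script compares their word counts before doing anything else.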

--Boundary_(ID_weHEg5Nke493EwCOx1oV+Q)
Content-type: text/plain; NAME=mkbootdisk; charset=us-ascii
Content-disposition: attachment; filename=mkbootdisk

#!/bin/sh
#
# $Header: mkbootdisk,v 1.20 1997/11/24 21:47:07 rsk Exp $
#
# Copyright Rich Kulawiec 1997. GNU redistribution terms apply.
#
# mkbootdisk: create a copy of the filesystems necessary to
# get the machine up and running on a second (spare) disk.
#
# My suggestion is that you use a 2G disk in an external SCSI enclosure
# for this procedure. That way, you can walk up to any machine (of the
# same OS and architecture) that's hosed, plug in the disk, and at least
# get the machine booted -- at which point you have all the standard Unix
# tools at your disposal to diagnose and solve the problem. That's also
# large enough that you can afford to have all the documentation
# (e.g. Answerbooks) on there as well -- another resource that's handy
# to have in an emergency, when you may not have access to your network.
#
# DO NOT RUN THIS UNLESS YOU ARE EXACTLY SURE WHAT YOU
# ARE DOING. IT'S POSSIBLE TO DO A LOT OF DAMAGE VERY
# RAPIDLY WITH THIS SCRIPT. CONSIDER THIS VERSION (1.20)
# EXPERIMENTAL, INCOMPLETE, AND VERY POSSIBLY RIDDLED WITH BUGS.
#
# Is that enough of a disclaimer?
#
# With that in mind, you'll probably want to run this with "-n -v"
# before you really turn it loose.
#
# Arguments:
#
# -n Not Really; doesn't actually modify anything
# -v Verbose; explains what it's about to try to do
# -d Debug; turns on detailed debugging messages
#
# Known limitations:
#
# The chunk of the script that remakes device paths for Solaris
# is just a stub at this point.
#
# Doesn't check to make sure copy of filesystem will fit in
# the designated partition (yet).
#
# No facilities to cope with DiskSuite.
#
# Only tries to deal with SunOS and Solaris at this point; probably
# worth extending to Digital Unix, SGI Irix, etc. once some of
# the wrinkles get ironed out.
#
# Design decision #1: This isn't written in Perl because there will be
# times that it will be run on machines that don't have Perl.
#
# Design decision #2: This uses dump/ufsdump to move the bits, because
# they will be present when other utilities might or might
# not be. Also dump has code to cope with non-quiescent filesystems,
# which none of the other archiving utilities (tar, cpio, etc.) have.
#
# Design decision #3: Regardless of how critical filesystems are
# distributed on the running system (i.e. if / is on one disk, /usr
# on another, etc.) this script places all backup copies on a single
# disk. This is done because performance optimization is presumed not
# to be an issue when that disk is used for an emergency recovery.
#
# Design decision #4: newfs the copy partitions to clean them out? Maybe not.
#
# Hmmm. This is certainly the easiest and fastest way to ensure
# that they're completely empty before proceeding. But on the
# other hand, if anything causes this script to terminate after
# the newfs runs but before the dump/restore pipeline completes,
# then the copy partition won't contain a complete copy of the
# original...and if that's a / or /usr partition, we may end
# up with a system that can't be booted (from the copy).
#
# So, considering that this isn't really an attempt to provide
# a "backup", in the sense of daily tape backups, but is instead
# intended to get us out of a jam if and when the boot disk fails,
# let's accept that the copy isn't going to be one-for-one; in
# particular, because we're not cleaning out the copy before doing
# a dump on top of it, files that have been deleted in the original
# won't be deleted in the copy. We could *still* end up with
# a non-bootable disk (for example, if the copy process fails
# while writing the kernel) but the time window in which that
# could happen is much smaller this way.
#
# Design decision #5: Leave the "restoresymtable" files right where they
# end up. The "restore rf" will leave one of these in each partition
# on the copy, but (a) they don't take up that much room and (b) they
# could come in very handy in an emergency situation. Again, because
# this isn't intended to be a backup, but a way out of a problem,
# having extra information around is probably the right way to go.
#
# What you have to do before running this script:
#
# 1. Install a disk that's going to be the backup. It needs to be
# large enough to hold everything that you're going to try to copy to it.
# *This script doesn't check that, at least not yet.*
#
# 2. Use format to (a) format, if necessary, and (b) partition the
# backup disk. Again, the partitions have to be large enough to
# hold copies of the filesystems you're cloning.
#
# 3. Use newfs to create filesystems in the disk partitions on the copy.
# See design decision #4.
#
# 4. Make sure that (a) nothing is mounted on /mnt and (b) none of the
# partitions on the backup disk are mounted, period. (Recall that
# /mnt is intended for use as a "scratch" mount point; you should not
# have anything permanently mounted there.)
#
# 5. Set the appropriate variables in the section just below.
#
# Note that steps 1-5 are a one-time operation: you don't need to do them
# every time you run this script.
#
# What this script does:
#
# A. It figures out which OS it's running under.
#
# B. It checks a few key variables for clues that they might be set
# incorrectly, as well as checking that a few key files are readable.
#
# C. It checks to make sure that /mnt isn't in use (because it intends
# to use it) and that none of the filesystems it intends to use on
# the *copy* disk are mounted.
#
# D. It checks to make sure that the filesystems it's going to copy
# are listed in /etc/fstab and are locally mounted.
#
# E. It uses a dump/restore (ufsdump/ufsrestore) pipeline to copy each
# filesystem to the backup disk. If the filesystem in question is
# the root filesystem, it installs a boot block, and mangles the
# copy's /etc/fstab (/etc/vfstab) to match the layout of the copy.
# It doesn't adjust device paths just yet because that subroutine
# is just a stub, but that's clearly going to have to be done
# to make that disk functional.
#

PATH="/sbin:/bin:/usr/bin:/usr/ucb:/etc:/usr/etc"
export PATH || (echo "OOPS, this isn't sh. Desperation time. I will feed myself to sh."; sh $0; kill $$)

#
# Variables that must be set. Yes, we could prompt for these, but then
# this script couldn't be easily run from cron. Maybe we'll make them
# arguments instead of wiring them in.
#
# Pick whichever of the following sets of definitions (FILESYS, COPYDISK,
# and PARTITIONS) is closest to your system, uncomment it, and modify to suit.
#
# The filesystems listed in FILESYS will be copied to COPYDISK and be placed
# in the partitions listed in PARTITIONS.
#
# If you do not know which is closest or how to modify it, then you
# probably shouldn't even be considering running this script. For obvious
# reasons, the list in FILESYS must correspond one-to-one with the list
# in PARTITIONS.
#
# Typical for SunOS; back up / on /dev/sd2a, back up /usr on /dev/sd2g
FILESYS="/ /usr"
COPYDISK=/dev/sd2
PARTITIONS="a g"

# Typical for Solaris; back up /, /usr, /var, /opt in slices 0, 6, 4, 7
# FILESYS="/ /usr /var /opt"
# COPYDISK=/dev/dsk/c0t1d0s
# PARTITIONS="0 6 4 7"

#
# You shouldn't have to modify anything below here for normal operation.
#

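#
# Arguments passed to dump/restore. The dump key letters take their
# values in order: level 0, s (tape length, 13000 feet), d (density,
# 54000 bpi), b (blocking factor, 126), f (dump file, "-" for stdout).
# The inflated length and density should be enough to prevent dump
# from thinking that it's run out of tape.
#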

DUMPARGS="0sdbf 13000 54000 126 -"
RESTOREARGS="rf -"

#
# (hopefully) eye-catching string to flag errors with.
#

ERROR="ERROR:"

#
# Initial setting for flags
#

VERBOSE=""
NOTREALLY=""
DEBUG=""

#
# Subroutines start here. Main program is at the end.
#

#
# Figure out which OS we're running; set some variables based on it
#

determine_os_version ()
{
case "`uname -r | sed -e 's/\..*//'`" in
"4") OS=sunos
FSTAB=/etc/fstab
FSTYPE=4.2
DUMP=dump
RESTORE=restore
;;
"5") OS=solaris
FSTAB=/etc/vfstab
FSTYPE=ufs
DUMP=ufsdump
RESTORE=ufsrestore
;;
*) echo $0: $ERROR unknown OS version
exit 1
;;
esac
}

#
# Verify that whoever edited the script has set the variables
# that they're supposed to. Also verify that fstab is readable.
#

check_variables_and_files()
{
if [ -z "$FILESYS" ]; then
echo $0: $ERROR \$FILESYS isn\'t set
exit 1
fi
if [ -z "$COPYDISK" ]; then
echo $0: $ERROR \$COPYDISK isn\'t set
exit 1
fi
if [ -z "$PARTITIONS" ]; then
echo $0: $ERROR \$PARTITIONS isn\'t set
exit 1
fi
if [ `echo $FILESYS | wc -w` -ne `echo $PARTITIONS | wc -w` ]; then
echo $0: $ERROR Filesystem list in \$FILESYS \($FILESYS\) doesn\'t match partition list in \$PARTITIONS \($PARTITIONS\)
exit 1
fi
if [ ! -r $FSTAB ]; then
echo $0: $ERROR can\'t read filesystem table $FSTAB
exit 1
fi
}

#
# Verify that nothing's mounted on /mnt, and that none of the
# filesystems we're going to use on the copy disk are mounted.
#

check_mounts ()
{
if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Checking to make sure nothing is mounted on /mnt
fi

if [ ! -z "`mount | sed -n -e 's/ on \/mnt .*//p'`" ]; then
echo $0: $ERROR `mount | sed -n -e 's/ on \/mnt .*//p'` is mounted on /mnt
exit 1
fi

for i in $PARTITIONS
do
if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Checking to make sure ${COPYDISK}${i} isn\'t mounted
fi
k=`mount | sed -n -e "s#^${COPYDISK}${i} on ##p" | sed -e "s/ .*//"`

if [ -n "$DEBUG" ]; then
echo $0: $DEBUG check_mounts\(\) set k to \"$k\"
fi

if [ ! -z "$k" ]; then
echo $0: $ERROR ${COPYDISK}${i} is mounted on $k
exit 1
fi
done
}

#
# Figure out which filesystems are on which partitions.
# The grep incantation tries to ignore commented-out entries
# (by anchoring the search to the beginning of the line)
# as well as anything whose fstype isn't 4.2 (sunos) or ufs (solaris).
# The final regular expression should ensure that the filesystem name
# is surrounded by whitespace.
#
# If we can't find a partition, complain and exit.
#
# Most of the rationale for this is just to check that we haven't
# been asked to copy a filesystem that's not mounted or is
# mounted via NFS or some kind of other exception.
#

check_filesystems ()
{
for i in $FILESYS
do
j=`egrep -v "^#" $FSTAB \
| grep -w "$FSTYPE" \
| egrep "[ 	]$i[ 	]" \
| sed -e "s/[ 	].*//"`

if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Making sure that $i is in $FSTAB
fi

if [ -z "$j" ]; then
echo $0: $ERROR can\'t locate filesystem $i in $FSTAB
exit 1
fi

if [ -n "$DEBUG" ]; then
echo $0: $DEBUG Found filesystem $i on $j in $FSTAB
fi

if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Making sure that $i is mounted
fi

k=`mount | sed -n -e "s# on $i type $FSTYPE .*##p"`

if [ -z "$k" ]; then
echo $0: $ERROR filesystem $i isn\'t mounted
exit 1
fi

if [ -n "$DEBUG" ]; then
echo $0: $DEBUG found filesystem $i mounted on $k
fi
done
}

#
# Install boot block in the copied root filesystem; assume that it's
# mounted on /mnt (which it darn well ought to be when this is called).
#

install_boot_block ()
{

if [ -z "$BOOTPARTITION" ] ; then
echo $0: $ERROR \$BOOTPARTITION wasn\'t set properly
exit 1
fi

if [ -n "$DEBUG" ]; then
echo $0: $DEBUG Boot partition set to $BOOTPARTITION
fi

case "$OS" in
"sunos") if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Installing boot block on ${BOOTPARTITION} $NOTREALLY
fi
if [ -z "$NOTREALLY" ]; then
/usr/kvm/mdec/installboot -vlt /mnt/boot /usr/kvm/mdec/bootsd ${BOOTPARTITION}
fi
;;
"solaris") if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Installing boot block on ${BOOTPARTITION} $NOTREALLY
fi
if [ -z "$NOTREALLY" ]; then
/usr/sbin/installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk ${BOOTPARTITION}
fi
;;
*) echo $0: $ERROR \$OS wasn\'t set properly
exit 1
;;
esac
}

#
# For each filesystem that needs to be cloned, (a) mount the
# relevant partition on /mnt, and (b) execute a dump/restore
# pipe to copy it and (c) unmount it. If it happens to be
# the root filesystem that we're copying, then (1) install a new fstab
# while we have it mounted and (2) install a boot block as well.
# If it's a Solaris box, then (3) remake the device paths.
#

copy_filesystems ()
{

m=1

for i in $FILESYS
do
n=`echo $PARTITIONS | cut -d" " -f $m -`
if [ -n "$DEBUG" ]; then
echo $0: $DEBUG copy_filesystems\(\) set \$m to $m, \$n to $n
fi

if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Mounting ${COPYDISK}$n on /mnt $NOTREALLY
fi

if [ -z "$NOTREALLY" ]; then
mount ${COPYDISK}$n /mnt
fi

k=`mount | sed -n -e "s#${COPYDISK}$n on /mnt type $FSTYPE ##p"`

if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Checking that ${COPYDISK}$n really is mounted on /mnt
fi

if [ -z "$k" -a -z "$NOTREALLY" ]; then
echo $0: $ERROR filesystem ${COPYDISK}$n didn\'t get mounted on /mnt
exit 1
fi

if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Dumping $i and restoring it in /mnt $NOTREALLY
fi

if [ -z "$NOTREALLY" ]; then
$DUMP $DUMPARGS $i | (cd /mnt; $RESTORE $RESTOREARGS)
fi

if [ $i = "/" ]; then
if [ -n "$DEBUG" ]; then
echo $0: $DEBUG Copying root filesystem -- will do boot block and fstab
fi
BOOTPARTITION=${COPYDISK}$n
install_boot_block
mangle_fstab
if [ $OS = solaris ]; then
remake_device_paths
fi
fi

if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Unmounting ${COPYDISK}$n from /mnt $NOTREALLY
fi

if [ -z "$NOTREALLY" ]; then
umount /mnt
fi

k=`mount | sed -n -e "s#${COPYDISK}$n on /mnt type $FSTYPE ##p"`

if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Checking that ${COPYDISK}$n was unmounted from /mnt
fi

if [ ! -z "$k" -a -z "$NOTREALLY" ]; then
echo $0: $ERROR filesystem ${COPYDISK}$n didn\'t get unmounted from /mnt
exit 1
fi

m=`expr $m + 1`
done
}

#
# Modify the copy's /etc/fstab or /etc/vfstab to match the layout
# of the copy disk.
#

mangle_fstab ()
{
m=1
cp $FSTAB /tmp/fstab.clone.$$

for i in $FILESYS
do
n=`echo $PARTITIONS | cut -d" " -f $m -`

if [ -n "$DEBUG" ]; then
echo $0: $DEBUG mangle_fstab\(\) set \$m to $m, \$n to $n
fi

j=`egrep -v "^#" $FSTAB \
| grep -w "$FSTYPE" \
| egrep "[ 	]$i[ 	]" \
| sed -e "s/[ 	].*//"`

if [ -n "$DEBUG" ]; then
echo $0: $DEBUG Change filesystem $i from $j to ${COPYDISK}$n in new $FSTAB
fi

if [ $OS = sunos ]; then
sed -e "s#^$j#${COPYDISK}$n#" /tmp/fstab.clone.$$ > /tmp/fstab.temp.$$
fi
if [ $OS = solaris ]; then
RCOPYDISK=`echo $COPYDISK | sed -e "s/dsk/rdsk/"`
k=`echo $j | sed -e "s/dsk/rdsk/"`
sed -e "s#^$j#${COPYDISK}$n#" -e "s#$k#${RCOPYDISK}$n#" /tmp/fstab.clone.$$ > /tmp/fstab.temp.$$
fi

mv /tmp/fstab.temp.$$ /tmp/fstab.clone.$$
m=`expr $m + 1`
done

if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Checking number of changed lines in new fstab
fi

o=`diff $FSTAB /tmp/fstab.clone.$$ | egrep "^[<>]" | wc -l`
p=`echo $FILESYS | wc -w`
q=`expr $p \* 2`

if [ -n "$DEBUG" ]; then
echo $0: $DEBUG mangle_fstab\(\) set \$o to $o, \$p to $p, \$q to $q
fi

if [ $o -ne $q ]; then
echo $0: $ERROR Expected $q differences in fstab for $p filesystems, got $o
exit 1
fi

if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE Installing new fstab on copied root filesystem $NOTREALLY
fi

if [ -z "$NOTREALLY" ]; then
cp /mnt${FSTAB} /mnt${FSTAB}.BAK
cp /tmp/fstab.clone.$$ /mnt${FSTAB}
fi

rm -f /tmp/fstab.clone.$$
}

#
# for Solaris only, remake paths to devices on copy disk
#

remake_device_paths ()
{
if [ -n "$VERBOSE" ]; then
echo $0: $VERBOSE remake device paths $NOTREALLY

#
# NOTE: here's where Solaris trickiness comes in. We'll need to run
# some combination of:
#
# /usr/sbin/drvconfig
# /usr/sbin/devlinks
# /usr/sbin/tapes
# /usr/ucb/ucblinks
#
# and perhaps a few other related programs to get all of the device
# paths and aliases set up. Have to make sure that these use the
# *copy* disk, not the original.
#

fi
}

#
# This is actually the main part of this script
#

while [ $# -gt 0 ]
do
case "$1" in
"-v") VERBOSE="Info:"
;;
"-n") NOTREALLY="(not really)"
;;
"-d") DEBUG="Debug:"
;;
*) echo $0: $ERROR unknown flag \"$1\"
exit 1
;;
esac
shift
done

determine_os_version
check_variables_and_files
check_mounts
check_filesystems
copy_filesystems

exit 0

--Boundary_(ID_weHEg5Nke493EwCOx1oV+Q)--
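
The script's list of known limitations notes that it does not yet check whether a filesystem will fit in its destination partition. One way such a check might look is sketched below. The names kb_used, kb_total, and fits are hypothetical, and the sketch assumes a df that accepts -k and prints the usual kbytes/used/avail/capacity/mount-point columns. A SunOS 4 df may need the flag dropped. It is an illustration of the idea, not a tested addition to mkbootdisk.

```shell
#!/bin/sh
# Sketch only, not part of mkbootdisk 1.20.
# Assumes "df -k MOUNTPOINT" ends each entry with the columns
#   kbytes used avail capacity mount-point
# so fields are counted from the end of the line.

# kb_used (hypothetical): print kilobytes in use on the filesystem at $1.
kb_used ()
{
    df -k "$1" | awk 'NR > 1 { v = $(NF - 3) } END { print v }'
}

# kb_total (hypothetical): print total kilobytes of the filesystem at $1.
kb_total ()
{
    df -k "$1" | awk 'NR > 1 { v = $(NF - 4) } END { print v }'
}

# fits (hypothetical): succeed if $1 kilobytes fit in $2 kilobytes.
fits ()
{
    [ "$1" -le "$2" ]
}

# Demonstration: compare / with itself so the sketch runs anywhere.
# In mkbootdisk the destination would be /mnt, checked after the copy
# partition is mounted and before the dump/restore pipeline starts.
src=/
dst=/
if fits "`kb_used $src`" "`kb_total $dst`"; then
    echo "$src should fit on $dst"
else
    echo "$src may not fit on $dst"
fi
```

Comparing against the destination's total size is optimistic: the script deliberately does not clean out the copy partitions, so files left over from earlier runs also take up room. A stricter check could compare against the avail column instead, at the cost of false alarms when most of the existing copy would simply be overwritten.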