SUMMARY: Axil Sparc 10 Clone Crashes

Tim Evans (tkevans@eplrx7.es.dupont.com)
Fri, 21 Nov 1997 08:41:09 -0500 (EST)

Last week, I wrote:

>I have an Axil SWS310 (Sparc 10 clone) with two TMS390Z55 CPU's,
>64MB RAM, 2 added Sbus SCSI adaptors (besides the built-in SCSI
>on the system board), and two tape jukeboxes.
>
>The system runs Solaris 2.6 and Legato NetWorker version 5.0,
>doing network backups to the two jukeboxes.
>
>For the last several weeks, we have had frequent (though not
>daily) crashes, during backups. Excerpts from the messages
>file:
>
>Nov 13 08:01:47 fantasia unix: BAD TRAP: type=9 rp=fc0666f4 addr=f5be301e mmu_fsr=b36 rw=1
>Nov 13 08:01:47 fantasia unix: modunload: Data fault
>Nov 13 08:01:47 fantasia unix: kernel read fault at addr=0xf5be301e, pme=0xe200183e
>Nov 13 08:01:47 fantasia unix: MMU sfsr=b36: Bus Access Error on supv data fetch at level 3
>Nov 13 08:01:47 fantasia unix: M-Bus Timeout Error
>
>Once the system crashes, it often takes several reboots to get it to
>come up cleanly. Often, it will crash again, just prior to putting
>up the login prompt on the console. That is, the message "the system
>is ready" appears, but the system then crashes again at this point.
>
>After anywhere from one to 10 repeats of this boot-and-crash sequence,
>the system will finally make it all the way up to multi-user runlevel.
>Naturally, there aren't any backups running as the system boots,
>so this would seem to rule out the Legato package as the culprit.
>
>It may go along fine for a couple of days, including doing successful
>backups at night, but then the whole process repeats. Cleaning
>up the Legato indices then sucks up half a day's time.
>
>Our hardware support vendor seems not to be able to figure out what's
>wrong. They've replaced the system board, after which the problem
>got *worse*. On the one hand, the 'MMU' reference suggests memory,
>but the M-bus timeout error may refer to one of the CPU's. I
>don't know what the 'modunload' error refers to.
>
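
For anyone puzzling over the trap message quoted above, the sfsr value
can be decoded by hand. The following is a minimal sketch in Python,
assuming the fault status register layout of the SPARC Reference MMU
as described in the SPARC V8 manual (OW in bit 0, FAV in bit 1, FT in
bits 4:2, AT in bits 7:5, L in bits 9:8, EBE in bits 17:10). The
function and table names are made up for illustration, and the EBE
bits are implementation-specific, so the sketch reports them raw
rather than guessing at their meaning.

```python
# Sketch: decode a SPARC Reference MMU synchronous fault status value.
# Field layout assumed from the SPARC V8 manual (Reference MMU appendix).
# Names below are illustrative, not taken from any Sun header file.

FAULT_TYPES = {
    0: "none",
    1: "invalid address error",
    2: "protection error",
    3: "privilege violation error",
    4: "translation error",
    5: "access bus error",
    6: "internal error",
    7: "reserved",
}

ACCESS_TYPES = {
    0: "load from user data",
    1: "load from supervisor data",
    2: "load/execute from user instruction",
    3: "load/execute from supervisor instruction",
    4: "store to user data",
    5: "store to supervisor data",
    6: "store to user instruction",
    7: "store to supervisor instruction",
}


def decode_sfsr(value):
    """Split an SFSR value into its architectural fields."""
    return {
        "ow": bool(value & 0x1),             # overwrite: earlier fault was lost
        "fav": bool((value >> 1) & 0x1),     # fault address register is valid
        "ft": FAULT_TYPES[(value >> 2) & 0x7],
        "at": ACCESS_TYPES[(value >> 5) & 0x7],
        "level": (value >> 8) & 0x3,         # page table level of the fault
        "ebe": (value >> 10) & 0xFF,         # implementation-specific bus error bits
    }


fields = decode_sfsr(0xB36)  # the value from the messages file excerpt
for name, val in fields.items():
    print(f"{name:5} = {val}")
```

For sfsr=b36 this gives an access bus error on a load from supervisor
data at level 3, which agrees with the kernel's own "Bus Access Error
on supv data fetch at level 3" line. The nonzero EBE field is
presumably what the kernel reports as the M-Bus timeout, but the
sketch does not try to confirm that.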
Thanks to the following for their replies:

jyoung@educate.com
"martha.crocker" <martha.crocker@sat.vlsi.com>
Rich Kulawiec <rsk@itw.com>
Glenn Satchell - Uniq Professional Services <Glenn.Satchell@uniq.com.au>
Joel Lee <jlee@thomas.com>
Brian Toscano <btoscano@shell.monmouth.com>

Turns out I had two separate hardware problems, neither of which had
anything to do with the Legato NetWorker backup software. Crash
dump analyses (suggested by jyoung@educate.com) showed two things:

o The crashes-during-backup were caused by a flaky integrated
ethernet adaptor on the Sparc 10 system board. This is a
known problem, Sun bugid 1169946. Heavy network traffic,
generated by incoming backup data, causes the system to
panic.

o The crash-on-subsequent-bootup, though it occurred immediately
after the crash-during-backup, was caused by the driver for
a third-party (Rasterflex) 24-bit frame buffer. For whatever
reason, the driver was being mod-unloaded at the very last
step of multi-user bootup (just prior to printing of the
initial login prompt), and this crashed the system. It
would appear that the old Rasterflex drivers don't work
well with Solaris 2.6.
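
Two different causes producing similar-looking panics is easier to
spot if the trap messages are pulled out of the messages file and
grouped. The following is a rough sketch in Python, not what was
actually done here (the diagnosis came from crash dump analysis). It
assumes each panic is logged as a "BAD TRAP" line followed by a
"name: fault" line, as in the excerpt quoted earlier; reading the
name as the command that was running at the time is an assumption.
The function name and the sample data are made up for illustration,
with the log lines written on single lines.

```python
import re
from collections import Counter

# First line of a panic, in the form seen in the excerpt.
TRAP_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) unix: "
    r"BAD TRAP: type=(?P<type>[0-9a-f]+) rp=(?P<rp>[0-9a-f]+) "
    r"addr=(?P<addr>[0-9a-f]+) mmu_fsr=(?P<sfsr>[0-9a-f]+) rw=(?P<rw>\d+)"
)
# Following line, e.g. "... unix: modunload: Data fault".
WHO_RE = re.compile(r" unix: (?P<name>[^:\s]+): (?P<fault>.+)$")


def summarize_traps(lines):
    """Return one dict per BAD TRAP, with the next line's name and fault."""
    traps = []
    pending = None  # trap still waiting for its "name: fault" line
    for line in lines:
        line = line.rstrip("\n")
        m = TRAP_RE.match(line)
        if m:
            pending = m.groupdict()
            pending["name"] = pending["fault"] = None
            traps.append(pending)
            continue
        if pending is not None:
            w = WHO_RE.search(line)
            if w:
                pending.update(w.groupdict())
            pending = None  # only the line right after BAD TRAP is examined
    return traps


# Sample input: the first two lines of the excerpt, unwrapped.
SAMPLE = [
    "Nov 13 08:01:47 fantasia unix: BAD TRAP: type=9 rp=fc0666f4 "
    "addr=f5be301e mmu_fsr=b36 rw=1",
    "Nov 13 08:01:47 fantasia unix: modunload: Data fault",
]

traps = summarize_traps(SAMPLE)
by_name = Counter(t["name"] for t in traps)
for name, count in by_name.items():
    print(f"{count} trap(s) while running {name}")
```

To run it against a real log, pass an open file (for example
/var/adm/messages on Solaris) to summarize_traps() in place of
SAMPLE. A count split between modunload and some other name would
have hinted early on that two separate problems were involved.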

Our hardware service vendor provided an Sbus ethernet adaptor, and
I removed the Rasterflex frame buffer and its drivers. The system is
stable now, and NetWorker backups are running normally.

As a postscript, I'd like to express a personal opinion about the
response of one individual to my posting. Citing his vast
UNIX experience, Rich Kulawiec <rsk@itw.com> told me to get rid
of NetWorker and use ufsdump for backups. While Rich is certainly
entitled to his opinion, his holier-than-thou attitude was quite
objectionable. If he wants to do backups with dump, edit with cat,
and not use a graphical user interface, since he's such a grand wizard,
that's fine with me (though I wonder why he recommended *ufs*dump,
since that's clearly a suspect utility, having been invented after
the release of BSD 4.2). Members of this mailing list don't need this
sort of religious bigotry when they're trying to solve problems.

As to using ufsdump for backups, perhaps Rich can tell us all how
to meet mega-mass backup requirements (40+ hosts) in any reasonable
time window without having to manage boo-koo separate backup devices.
I abandoned ufsdump because such one-filesystem-at-a-time
backups would run for days on end, losing all semblance of
consistent backups.

-- 
Tim Evans                     |    E.I. du Pont de Nemours & Co.
tkevans@eplrx7.es.dupont.com  |    Experimental Station
(302) 695-9353/8638 (FAX)     |    P.O. Box 80357
EVANSTK AT A1 AT ESVAX        |    Wilmington, Delaware 19880-0357