SUMMARY: watchdog reset

Antonia Gomez (antonia@fib.upc.es)
Wed, 03 Dec 1997 13:51:51 +0100

Hello!

My original question:

>I know that this question was asked on this list in May, but I now have
>the same problem with an Ultra Enterprise running Solaris 2.5.1, and I
>don't know the meaning of the error "watchdog reset". What is a watchdog?
>Can anybody help me, please?
>
>Thanks in advance.

Thanks to:
Casper Dik
Johnie Stafford
Benjamin Cline
sandeep patni
Tinh Do
and
Arora, Samir

Some answers:

Hi,

This is an indication of some hardware problem, which can be caused by
a component on the motherboard overheating or by a failing device.
According to the documentation:
This message doesn't come from the kernel, but from the OpenBoot PROM
monitor, a piece of Forth software that gives you the ok prompt before
you boot UNIX. If the CPU detects a trap when traps are disabled (an
unrecoverable error), it signals a watchdog. The OpenBoot PROM monitor
detects the watchdog, issues this message, and brings down the system.

______________________________________________________________________
A watchdog reset is an unrecoverable situation that forces
the CPU to reset. Most of the time this indicates a software
problem, but it may be hardware related; some Sun field engineers
claim it is hardware related 99% of the time. It is very difficult
to diagnose. Most of the time the system will drop to the ok prompt
(I usually call it a 'crash to ok prompt'). You can run the following
commands to gather information or trace the problem:
.registers - display the internal registers
.locals - display the registers in the current register window
.psr - display the processor status register
ctrace - trace the current thread

Gather all the messages and send them to Sun for analysis, unless
you have already attended a class in core dump analysis or system
fault analysis.
Hope this helps.
Tinh.

______________________________________________________________________
A watchdog reset is an unrecoverable situation that forces the CPU
to reset. It is caused as a result of the machine trapping while
handling a trap with the "Enable Traps" bit in the Processor Status
Register (PSR) being disabled. The reason traps have been disabled
is that no other traps should occur until the first trap has been
handled. But because a second trap has occurred and the CPU cannot
handle it, the machine resets.
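
As a rough illustration of the paragraph above, here is a toy Python
model of the "trap while traps are disabled" rule. It is only a sketch:
the names ToyCpu, WatchdogReset, well_behaved and faulty, and the trap
names passed in, are made up for the example, and a real SPARC CPU does
far more than this when it takes a trap.

```python
class WatchdogReset(Exception):
    """Raised by the toy model when a trap arrives while traps are disabled."""


class ToyCpu:
    """Hypothetical, heavily simplified CPU: only the Enable Traps bit is modelled."""

    def __init__(self):
        self.enable_traps = True  # stands in for the ET bit in the PSR

    def trap(self, name, handler):
        # A trap taken while ET is off cannot be handled: the machine resets.
        if not self.enable_traps:
            raise WatchdogReset(f"trap {name!r} taken with traps disabled")
        self.enable_traps = False   # no further traps until this one is handled
        handler(self)
        self.enable_traps = True    # handler finished, traps allowed again


def well_behaved(cpu):
    """A handler that finishes without causing another trap."""


def faulty(cpu):
    """A handler that itself traps before the first trap has been handled."""
    cpu.trap("second_trap", well_behaved)


cpu = ToyCpu()
cpu.trap("first_trap", well_behaved)      # handled normally
try:
    cpu.trap("first_trap", faulty)        # second trap arrives with ET off
except WatchdogReset as exc:
    print("watchdog reset:", exc)
```

The second call is the case described above: the handler for the first
trap triggers another trap before traps have been re-enabled, so the
model "resets" instead of handling it.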

Typically this indicates a software problem, but may be hardware
related.

What to do:

If the machine dropped to the boot PROM ok prompt, there are a few
special PROM commands you can run to gather information.

.registers - Displays kernel internal registers.

.locals - Displays the registers in the current register
window.

.psr - Displays the Processor Status Register.

ctrace - Displays the trace of the current thread.

sun4d only:

On sun4d systems, if the machine automatically rebooted after a
system watchdog reset, wait until it has finished rebooting and then
run the command /usr/kvm/prtdiag to display the information that was
saved to the NVRAM.

If the machine dropped to the boot PROM ok prompt, you can run the
wd-dump command at the prompt. This displays the watchdog information
on the screen, including the address of the instruction that caused
the reset.

_____________________________________________________________________
Here's the response I sent previously. Also, there is a good
(searchable!) archive of Sun-managers summaries at
http://www.latech.edu/sunman.html

______________________________________________________________________
It could mean many things.

It could indicate a hardware error, but software errors are known to
cause this as well.

A watchdog is a timer. When the system is operating properly, it will
reset the timer every so often. If it doesn't reset the timer, the
timer will go off and a "watchdog reset" is the result. It does happen
with recursive traps (on V8 SPARCs), which are an indication of an OS
bug. It could also be caused by component failure.
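
To make the timer idea concrete, here is a minimal Python sketch of a
watchdog timer. It is an illustration only: WatchdogTimer, kick, expired
and run are hypothetical names, time is counted in simulated ticks, and
the timeout of 3 ticks is an arbitrary choice, not a value taken from
any Sun hardware.

```python
class WatchdogTimer:
    """Hypothetical watchdog: expires unless it is reset ("kicked") in time."""

    def __init__(self, timeout):
        self.timeout = timeout   # ticks allowed between two resets of the timer
        self.last_kick = 0       # tick at which the timer was last reset

    def kick(self, now):
        """Called by a healthy system to reset the timer."""
        self.last_kick = now

    def expired(self, now):
        """True once more than `timeout` ticks have passed since the last kick."""
        return now - self.last_kick > self.timeout


def run(ticks, hang_at, timeout=3):
    """Simulate `ticks` clock ticks; the system stops kicking at tick `hang_at`."""
    dog = WatchdogTimer(timeout)
    for now in range(ticks):
        if now < hang_at:        # system is still operating properly
            dog.kick(now)
        if dog.expired(now):     # nobody reset the timer in time
            return f"watchdog reset at tick {now}"
    return "no reset"


print(run(10, hang_at=10))  # healthy for the whole run
print(run(10, hang_at=5))   # hangs at tick 5, timer goes off a few ticks later
```

In the second run the simulated system stops resetting the timer at
tick 5, so the timer goes off once the timeout has passed, which is the
"watchdog reset" described above.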

Thanks.