Last reviewed: 10/05/2007

The Linux Watchdog driver API.

Copyright 2002 Christer Weingel <wingel@nano-system.com>

Some parts of this document are copied verbatim from the sbc60xxwdt
driver which is (c) Copyright 2000 Jakob Oestergaard <jakob@ostenfeld.dk>

This document describes the state of the Linux 2.4.18 kernel.

A Watchdog Timer (WDT) is a hardware circuit that can reset the
computer system in case of a software fault. You probably knew that
already.

Usually a userspace daemon notifies the kernel watchdog driver at
regular intervals, via the /dev/watchdog special device file, that
userspace is still alive. When such a notification occurs, the driver
will usually tell the hardware watchdog that everything is in order, and
that the watchdog should wait a little while longer before resetting
the system. If userspace fails (RAM error, kernel bug, whatever), the
notifications cease to occur, and the hardware watchdog will reset the
system (causing a reboot) after the timeout occurs.

The Linux watchdog API is a rather ad-hoc construction and different
drivers implement different, and sometimes incompatible, parts of it.
This file is an attempt to document the existing usage and allow
future driver writers to use it as a reference.

All drivers support the basic mode of operation, where the watchdog
activates as soon as /dev/watchdog is opened and will reboot the system
unless the watchdog is pinged within a certain time. This time is called
the timeout or margin. The simplest way to ping the watchdog is to write
some data to the device. A very simple watchdog daemon is provided as a
source file: see Documentation/watchdog/src/watchdog-simple.c
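
As an illustration of the write-based ping described above, here is a
minimal sketch of such a daemon loop. It is not a copy of
watchdog-simple.c; the function name watchdog_ping_loop and its
parameters are assumptions made for this example.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the watchdog device at `path` and ping it
 * every `interval_s` seconds by writing one byte.  Returns -1 if the
 * device cannot be opened or a write fails; otherwise it never returns.
 */
int watchdog_ping_loop(const char *path, unsigned int interval_s)
{
    int fd = open(path, O_WRONLY);  /* opening the device starts the timer */

    if (fd == -1)
        return -1;

    for (;;) {
        /* Any data written counts as a ping; avoid 'V' (Magic Close). */
        if (write(fd, "\0", 1) != 1) {
            close(fd);
            return -1;
        }
        sleep(interval_s);
    }
}
```

A caller would typically pass "/dev/watchdog" and an interval well below
the driver's timeout, for example watchdog_ping_loop("/dev/watchdog", 10).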

A more advanced daemon could, for example, check that an HTTP server is
still responding before doing the write call to ping the watchdog.

When the device is closed, the watchdog is disabled, unless the "Magic
Close" feature is supported (see below). This is not always such a
good idea, since if there is a bug in the watchdog daemon and it
crashes, the system will not reboot. Because of this, some of the
drivers support the configuration option "Disable watchdog shutdown on
close", CONFIG_WATCHDOG_NOWAYOUT. If it is set to Y when compiling
the kernel, there is no way of disabling the watchdog once it has been
started. So, if the watchdog daemon crashes, the system will reboot
after the timeout has passed. Watchdog devices also usually support
the nowayout module parameter so that this option can be controlled at
runtime.

If a driver supports "Magic Close", the driver will not disable the
watchdog unless a specific magic character 'V' has been sent to
/dev/watchdog just before closing the file. If the userspace daemon
closes the file without sending this special character, the driver
will assume that the daemon (and userspace in general) died, and will
stop pinging the watchdog without disabling it first. This will then
cause a reboot if the watchdog is not re-opened in sufficient time.
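
A deliberate shutdown under "Magic Close" might be sketched as follows.
The helper name watchdog_magic_close is an assumption for this example,
and whether the watchdog actually stops still depends on the driver and
on the nowayout setting.

```c
#include <unistd.h>

/*
 * Hypothetical helper: send the magic character 'V' and then close the
 * device, so that a driver supporting Magic Close disables the watchdog.
 * Returns 0 on success, -1 on failure.  If the write fails, the
 * descriptor is left open so the caller can keep pinging.
 */
int watchdog_magic_close(int fd)
{
    if (write(fd, "V", 1) != 1)
        return -1;

    return close(fd);
}
```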

All conforming drivers also support an ioctl API.

Pinging the watchdog using an ioctl:

All drivers that have an ioctl interface support at least one ioctl,
KEEPALIVE. This ioctl does exactly the same thing as a write to the
watchdog device, so the main loop in the above program could be
rewritten to call:

    ioctl(fd, WDIOC_KEEPALIVE, 0);

The argument to the ioctl is ignored.
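
Since the ioctl and the write are described as equivalent, a daemon
could try one and fall back to the other. The sketch below assumes a
hypothetical helper named watchdog_keepalive; it is one possible way to
combine the two, not something the drivers require.

```c
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/*
 * Hypothetical helper: ping the watchdog with WDIOC_KEEPALIVE and fall
 * back to a plain write if the ioctl fails.  Returns 0 on success and
 * -1 if both methods fail.
 */
int watchdog_keepalive(int fd)
{
    if (ioctl(fd, WDIOC_KEEPALIVE, 0) == 0)
        return 0;

    /* Write-based ping; the byte must not be 'V' (Magic Close). */
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}
```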

Setting and getting the timeout:

For some drivers it is possible to modify the watchdog timeout on the
fly with the SETTIMEOUT ioctl; those drivers have the WDIOF_SETTIMEOUT
flag set in their options field. The argument is a pointer to an integer
representing the timeout in seconds. The driver returns the real
timeout used in the same variable, and this timeout might differ from
the requested one due to limitations of the hardware.

    int timeout = 45;
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);
    printf("The timeout was set to %d seconds\n", timeout);

This example might actually print "The timeout was set to 60 seconds"
if the device has a granularity of minutes for its timeout.

Starting with the Linux 2.4.18 kernel, it is possible to query the
current timeout using the GETTIMEOUT ioctl.

    ioctl(fd, WDIOC_GETTIMEOUT, &timeout);
    printf("The timeout is %d seconds\n", timeout);
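
Because the driver may round the requested value, a caller should check
the result instead of assuming the request was honoured. The following
is a minimal sketch; the helper name watchdog_set_timeout is an
assumption for this example.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/*
 * Hypothetical helper: request a timeout of `seconds` and return the
 * timeout the driver actually uses, or -1 if SETTIMEOUT fails (for
 * example because the driver lacks WDIOF_SETTIMEOUT).
 */
int watchdog_set_timeout(int fd, int seconds)
{
    int timeout = seconds;
    int current;

    if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout) == -1)
        return -1;

    /* Prefer the value reported by GETTIMEOUT when it is available. */
    if (ioctl(fd, WDIOC_GETTIMEOUT, &current) == 0)
        return current;

    return timeout;  /* value written back by SETTIMEOUT */
}
```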

Some watchdog timers can be set to have a trigger go off before the
actual time they will reset the system. This can be done with an NMI,
interrupt, or other mechanism. This allows Linux to record useful
information (like panic information and kernel coredumps) before it
resets.

    int pretimeout = 10;
    ioctl(fd, WDIOC_SETPRETIMEOUT, &pretimeout);

Note that the pretimeout is the number of seconds before the time
when the timeout will go off. It is not the number of seconds until
the pretimeout. So, for instance, if you set the timeout to 60 seconds
and the pretimeout to 10 seconds, the pretimeout will go off in 50
seconds. Setting a pretimeout to zero disables it.
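
The arithmetic can be written down as a small function. This is only a
sketch: the name pretimeout_fires_after is hypothetical, and treating a
pretimeout that is not smaller than the timeout as invalid is an
assumption of this example, not something stated in this document.

```c
/*
 * Hypothetical helper: number of seconds after the last ping at which
 * the pretimeout fires.  Returns -1 when the pretimeout is disabled
 * (zero) or, by this example's assumption, not smaller than the timeout.
 */
int pretimeout_fires_after(int timeout, int pretimeout)
{
    if (pretimeout <= 0 || pretimeout >= timeout)
        return -1;

    return timeout - pretimeout;
}
```

With a timeout of 60 and a pretimeout of 10, the function returns 50,
matching the example in the text.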

There is also a get function for getting the pretimeout:

    ioctl(fd, WDIOC_GETPRETIMEOUT, &pretimeout);
    printf("The pretimeout is %d seconds\n", pretimeout);

Not all watchdog drivers will support a pretimeout.

Get the number of seconds before reboot:

Some watchdog drivers have the ability to report the remaining time
before the system will reboot. WDIOC_GETTIMELEFT is the ioctl
that returns the number of seconds before reboot.

    ioctl(fd, WDIOC_GETTIMELEFT, &timeleft);
    printf("The time left is %d seconds\n", timeleft);

Environmental monitoring:

All watchdog drivers are required to return more information about the
system. Some do temperature, fan and power level monitoring, and some can
tell you the reason for the last reboot of the system. The GETSUPPORT
ioctl is available to ask what the device can do:

    struct watchdog_info ident;
    ioctl(fd, WDIOC_GETSUPPORT, &ident);

The fields returned in the ident struct are:

    identity            a string identifying the watchdog driver
    firmware_version    the firmware version of the card if available
    options             flags describing what the device supports

The options field can have the following bits set, and describes what
kind of information the GETSTATUS and GETBOOTSTATUS ioctls can
return. [FIXME -- Is this correct?]
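
A program can use the options field to decide which of the optional
ioctls are worth calling. The sketch below assumes a hypothetical helper
named watchdog_supports; the flag argument is any WDIOF_* constant from
<linux/watchdog.h>.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/*
 * Hypothetical helper: return 1 if the driver advertises `flag` in its
 * options field, 0 if it does not, and -1 if GETSUPPORT itself fails.
 */
int watchdog_supports(int fd, unsigned int flag)
{
    struct watchdog_info ident;

    if (ioctl(fd, WDIOC_GETSUPPORT, &ident) == -1)
        return -1;

    return (ident.options & flag) != 0;
}
```

For example, watchdog_supports(fd, WDIOF_SETTIMEOUT) could be checked
before attempting to change the timeout.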

    WDIOF_OVERHEAT         Reset due to CPU overheat

The machine was last rebooted by the watchdog because the thermal limit
was exceeded.

    WDIOF_FANFAULT         Fan failed

A system fan monitored by the watchdog card has failed.

    WDIOF_EXTERN1          External relay 1

External monitoring relay/source 1 was triggered. Controllers intended for
real world applications include external monitoring pins that will trigger
a reset.

    WDIOF_EXTERN2          External relay 2

External monitoring relay/source 2 was triggered.

    WDIOF_POWERUNDER       Power bad/power fault

The machine is showing an undervoltage status.

    WDIOF_CARDRESET        Card previously reset the CPU

The last reboot was caused by the watchdog card.

    WDIOF_POWEROVER        Power over voltage

The machine is showing an overvoltage status. Note that if one level is
under and one over, both bits will be set. This may seem odd but makes
sense.

    WDIOF_KEEPALIVEPING    Keep alive ping reply

The watchdog saw a keepalive ping since it was last queried.

    WDIOF_SETTIMEOUT       Can set/get the timeout

    WDIOF_PRETIMEOUT       Pretimeout (in seconds), get/set

The watchdog can do pretimeouts.
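
To make a flags word readable, the bits listed above can be decoded
with a small lookup table. This is a sketch only; the table, the struct
and the function name print_watchdog_flags are assumptions for this
example, and only the flags named in this document are included.

```c
#include <stdio.h>
#include <linux/watchdog.h>

struct wd_flag_name {
    unsigned int bit;
    const char *name;
};

/* Only the flags described in this document. */
static const struct wd_flag_name wd_flag_names[] = {
    { WDIOF_OVERHEAT,      "WDIOF_OVERHEAT" },
    { WDIOF_FANFAULT,      "WDIOF_FANFAULT" },
    { WDIOF_EXTERN1,       "WDIOF_EXTERN1" },
    { WDIOF_EXTERN2,       "WDIOF_EXTERN2" },
    { WDIOF_POWERUNDER,    "WDIOF_POWERUNDER" },
    { WDIOF_CARDRESET,     "WDIOF_CARDRESET" },
    { WDIOF_POWEROVER,     "WDIOF_POWEROVER" },
    { WDIOF_KEEPALIVEPING, "WDIOF_KEEPALIVEPING" },
    { WDIOF_SETTIMEOUT,    "WDIOF_SETTIMEOUT" },
    { WDIOF_PRETIMEOUT,    "WDIOF_PRETIMEOUT" },
};

/* Print the name of every known bit set in `flags`; return the count. */
int print_watchdog_flags(unsigned int flags)
{
    int count = 0;
    size_t i;

    for (i = 0; i < sizeof(wd_flag_names) / sizeof(wd_flag_names[0]); i++) {
        if (flags & wd_flag_names[i].bit) {
            printf("%s\n", wd_flag_names[i].name);
            count++;
        }
    }
    return count;
}
```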

For those drivers that return any bits set in the options field, the
GETSTATUS and GETBOOTSTATUS ioctls can be used to ask for the current
status, and the status at the last reboot, respectively.

    int flags;
    ioctl(fd, WDIOC_GETSTATUS, &flags);

or

    ioctl(fd, WDIOC_GETBOOTSTATUS, &flags);

Note that not all devices support these two calls, and some only
support the GETBOOTSTATUS call.
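
A common use of the boot status is to find out whether the previous
reboot was triggered by the watchdog. The sketch below assumes a
hypothetical helper named watchdog_caused_last_reboot and treats a
failing ioctl as "unknown" because not every driver supports the call.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/*
 * Hypothetical helper: return 1 if WDIOF_CARDRESET is set in the boot
 * status, 0 if it is clear, and -1 if the driver does not support
 * GETBOOTSTATUS or the call fails.
 */
int watchdog_caused_last_reboot(int fd)
{
    int flags = 0;

    if (ioctl(fd, WDIOC_GETBOOTSTATUS, &flags) == -1)
        return -1;

    return (flags & WDIOF_CARDRESET) != 0;
}
```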

Some drivers can measure the temperature using the GETTEMP ioctl. The
returned value is the temperature in degrees Fahrenheit.

    int temperature;
    ioctl(fd, WDIOC_GETTEMP, &temperature);
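
Callers that work in degrees Celsius need to convert the returned value.
A minimal sketch follows, with the hypothetical helper names
fahrenheit_to_celsius and watchdog_read_celsius chosen for this example.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Convert whole degrees Fahrenheit to degrees Celsius. */
double fahrenheit_to_celsius(int fahrenheit)
{
    return (fahrenheit - 32) * 5.0 / 9.0;
}

/*
 * Hypothetical helper: read the temperature with GETTEMP and store it
 * in *celsius.  Returns 0 on success, -1 if the ioctl fails (not all
 * drivers can measure the temperature).
 */
int watchdog_read_celsius(int fd, double *celsius)
{
    int temperature;

    if (ioctl(fd, WDIOC_GETTEMP, &temperature) == -1)
        return -1;

    *celsius = fahrenheit_to_celsius(temperature);
    return 0;
}
```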

Finally the SETOPTIONS ioctl can be used to control some aspects of
the card's operation.

    ioctl(fd, WDIOC_SETOPTIONS, &options);

The following options are available:

    WDIOS_DISABLECARD    Turn off the watchdog timer
    WDIOS_ENABLECARD     Turn on the watchdog timer
    WDIOS_TEMPPANIC      Kernel panic on temperature trip

[FIXME -- better explanations]
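
As an example of SETOPTIONS, the timer could be switched on or off as
sketched below. The helper name watchdog_set_enabled is an assumption
for this example, and drivers may refuse to disable the timer (for
instance when nowayout is in effect).

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/*
 * Hypothetical helper: turn the watchdog timer on (enable != 0) or off
 * (enable == 0) through SETOPTIONS.  Returns the ioctl result: 0 on
 * success, -1 on failure.
 */
int watchdog_set_enabled(int fd, int enable)
{
    int options = enable ? WDIOS_ENABLECARD : WDIOS_DISABLECARD;

    return ioctl(fd, WDIOC_SETOPTIONS, &options);
}
```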