Note: If a power failure is currently taking place, please proceed directly to the usage section below.
After the power has returned, the shutdown levels can be used as a guide for the order to reboot machines.
ln -s ../init.d/powerfail /etc/rc3.d/S99powerfail
chkconfig --add powerfail
chkconfig powerfail on
/etc/init.d/powerfail start
Each time the remote process is started, it rechecks the configuration
level for the machine.
It reports changes on startup.
You can see where in the shutdown order the machine is configured with
the command
/lcsr/master/power/configure -v
If the machine should be put at a different level than automatically
determined, you can specify where by adding a file to
/lcsr/master/power/manual.
If you need to stop an automated shutdown which has already been initiated, see the automated shutdown section below.
Each participating machine checks periodically (currently every 15 minutes, defined by the DSLEEPMINS variable in power.conf) to see if it needs to initiate a shutdown. If some machines will need to be shut down soon but not now, you can make the checks more frequent by putting the number of seconds between checks in remote.sleeptime, e.g.,
echo 120 > /lcsr/master/power/remote.sleeptime
You need to be logged in on ns-lcsr to do this.
(It's probably not advisable to set the delay time to less than 120
seconds.)
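The interval and its suggested floor can be wrapped in a small helper. This is only a sketch: the function name set_sleeptime and the overridable POWER_DIR variable are assumptions made for illustration and are not part of the power scripts. Only the directory /lcsr/master/power and the file remote.sleeptime come from the text above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the power scripts): write the client
# check interval, refusing values under the 120-second floor suggested above.
POWER_DIR="${POWER_DIR:-/lcsr/master/power}"   # overridable, e.g. for testing

set_sleeptime() {
    secs="$1"
    # Reject anything that is not a plain non-negative integer.
    case "$secs" in
        ''|*[!0-9]*) echo "usage: set_sleeptime SECONDS" >&2; return 2 ;;
    esac
    if [ "$secs" -lt 120 ]; then
        echo "refusing: $secs is below the 120-second minimum" >&2
        return 1
    fi
    echo "$secs" > "$POWER_DIR/remote.sleeptime"
}
```

With this helper, set_sleeptime 120 writes 120 to remote.sleeptime, while set_sleeptime 60 is rejected and leaves the file untouched.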
/lcsr/master/power/down-presidents
This script will signal a shutdown for all the presidents, then tell
you which still remain up every 60 seconds until all are down.
If you want to temporarily stop emergency-powerdown-control from proceeding with the shutdown, touch emergency-powerdown-control.pause in /lcsr/master/power. If you want to permanently stop emergency-powerdown-control, just kill the process (and remove or rename any shutdown.level.n files it may have created). watch-ups will not restart emergency-powerdown-control until the UPS goes off battery and back on it again. Note: The first action emergency-powerdown-control takes is to delete emergency-powerdown-control.pause, so you will have to re-touch this file to re-pause the shutdown process if desired.
Powerdown control: emergency-powerdown-control sets the check-in time for clients to 2 minutes (but it can take up to 15 minutes for this to take effect on all clients). After 15 minutes, emergency-powerdown-control will signal shutdown for all machines at level 1 (by touching shutdown.level.1). Every 6 minutes after that (allowing time for the user warning and the shutdown of the current level to take place), the next shutdown level will be initiated. If for some reason you don't want to wait that long for the next level to be initiated, you can touch emergency-powerdown-control.proceed, which will end the wait and escalate to the next level. (You must touch emergency-powerdown-control.proceed every time you want the next level initiated immediately.)
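The timing above can be sketched in a few lines of shell. This is an illustration, not the actual emergency-powerdown-control code: the function names are hypothetical, the timeline assumes no pause and no proceed file, and removing the proceed file once it has been seen is an assumption inferred from the note that it must be touched again for each level.

```shell
#!/bin/sh
# Minutes after emergency-powerdown-control starts at which a level is
# signalled by default: level 1 at 15 minutes, then one level every 6.
level_start_minute() {
    level="$1"
    echo $(( 15 + 6 * ($level - 1) ))
}

# Wait up to max_secs for the next escalation, polling every poll_secs.
# Returns early, printing "proceed", if the .proceed file shows up;
# otherwise prints "timeout" once the full wait has elapsed.
wait_for_proceed() {
    dir="$1"; max_secs="$2"; poll_secs="${3:-5}"
    waited=0
    while [ "$waited" -lt "$max_secs" ]; do
        if [ -e "$dir/emergency-powerdown-control.proceed" ]; then
            # Assumption: the file is consumed so it applies to one level only.
            rm -f "$dir/emergency-powerdown-control.proceed"
            echo proceed
            return 0
        fi
        sleep "$poll_secs"
        waited=$(( waited + poll_secs ))
    done
    echo timeout
}
```

Under these assumptions, level_start_minute 3 gives 27, i.e. level 3 would be signalled 27 minutes after the control script starts unless someone intervenes.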
Client participation: Every so often (15 minutes by default), the client script, remote, wakes up and checks in with the central control. If the shutdown level associated with this machine has been declared, a shutdown is initiated. If users are present, a warning ("Emergency shutdown -- please save work and log off now") is sent and they are given 3 minutes to log off. If no shutdown has been signalled, the delay until the next check is determined (based on the presence or absence of remote.sleeptime) and the process goes back to sleep.
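One client check-in, as described above, might look roughly like the following. This is not the real remote script: the function names and the directory argument are hypothetical, and the user warning and the shutdown itself are reduced to a printed decision so the logic can be read on its own.

```shell
#!/bin/sh
DEFAULT_SLEEP_SECS=$(( 15 * 60 ))   # the 15-minute default from the text

# Print the delay in seconds until the next check: the contents of
# remote.sleeptime if that file exists, otherwise the default.
next_check_delay() {
    dir="$1"
    if [ -r "$dir/remote.sleeptime" ]; then
        cat "$dir/remote.sleeptime"
    else
        echo "$DEFAULT_SLEEP_SECS"
    fi
}

# Succeed only if the file for exactly this machine's level exists.
shutdown_declared() {
    dir="$1"; level="$2"
    [ -e "$dir/shutdown.level.$level" ]
}

# Print what the client would do at this check-in.
check_in() {
    dir="$1"; level="$2"
    if shutdown_declared "$dir" "$level"; then
        echo "shutdown"
    else
        echo "sleep $(next_check_delay "$dir")"
    fi
}
```

For a level 2 machine, check_in prints "sleep 900" when neither shutdown.level.2 nor remote.sleeptime exists, and "shutdown" once shutdown.level.2 has been touched.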
Staff participation: If you need to stop an automated shutdown which has already been initiated, see the automated shutdown description above. If a power failure has occurred, automated shutdown has been initiated, and you'd like it to proceed (you have 15 minutes from the initiation to decide), you can run a process to monitor machines as they're shut down:
./status -m2 &
This will periodically print out the status of many of the machines
and equipment in the machine room which must be powered off.
It also prints the status of some machines which do not take part in the
automatic shutdown and must be shut down manually.
When status indicates machines are no longer responding to ping,
wait about 30 seconds, then go to each machine and power off the machine
and any peripherals it has (e.g., disks or tape drives).
As machines are confirmed down and powered off, you can check them off
on one copy of the shutdown order (a current copy of which can be
found in the "Disaster book" -- a brown looseleaf labelled "Disaster
Config Info" in the operator's office).
Then we start up a remote script on machines not normally configured to run the emergency powerdown system. (Before this step, be sure that the file remote.suicide does not exist on /lcsr/master/power. remote.suicide is the way you signal the remote scripts on normally non-participating machines to kill themselves off.) If there are Solaris machines not running the automatic shutdown system, you can temporarily start it on those machines by running
/lcsr/master/power/emergency-run-powerdown-script
on farside as yourself, not root.
(Hopefully, you are set up to be able to rsh everywhere from farside.)
As each level is completed, touch the next shutdown.level file (e.g., shutdown.level.2) to escalate to the next shutdown level. (Note: shutdown.level.3 does not imply shutdown.level.2. Each level has its own file.)
You should also remove all shutdown.level files, although as a safety precaution the remote script will not honor a shutdown.level file created before the machine was rebooted.
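The level files can be managed with a few small helpers. These are hypothetical conveniences, not part of the power scripts; only the shutdown.level.N naming comes from the text above, and the directory is passed in explicitly rather than assumed.

```shell
#!/bin/sh
# Declare one shutdown level by touching its file.
signal_level() {
    dir="$1"; level="$2"
    touch "$dir/shutdown.level.$level"
}

# Remove every shutdown.level.* file once the machines are back up.
clear_levels() {
    dir="$1"
    rm -f "$dir"/shutdown.level.*
}

# List the declared levels, one per line, in numeric order.
declared_levels() {
    dir="$1"
    for f in "$dir"/shutdown.level.*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        echo "${f##*.}"              # keep only the trailing level number
    done | sort -n
}
```

Running declared_levels before and after clear_levels is a quick way to confirm that no stale level files are left behind.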
The command
./status -m2 -l -r &
will repeatedly print out the status of machines broken down by
shutdown level (-l) and in reverse order (-r).
This is useful to see that machines are coming back up in the proper
order.
You can keep track of what's in progress and/or done on the second
copy of the shutdown order (which is not in reverse order).
When all is done, let Don know how things went and what improvements you might like.
remote also restarts itself if the configure script has been updated, so that any machines which the modified configure would reclassify are actually reclassified.
Rob Toth points out that Suns which power themselves down will power themselves back up if power goes off and back on. It is therefore advisable to power off (in the back) Suns which have powered themselves off if there is the possibility of the UPS failing completely.