Emergency powerdown system

Note: If a power failure is currently taking place, please proceed directly to the usage section below.

This document is currently being reviewed and updated. The line "This document has been reviewed to this point." marks how far the review has gotten.

Overview

The emergency powerdown system is a set of scripts that manage the shutdown of participating machines in the machine room in the event of an extended power failure. It currently runs on Solaris and Linux (Red Hat/Fedora Core) machines. Machines configured to run this system check in periodically to see if an emergency powerdown should be performed and log the fact that they are still running. Machines are configured to shut themselves down once their priority level (1, 2, 3, etc.) is reached. Once a shutdown level is declared (by touching shutdown.level.n, to begin level n), machines at that level will issue an orderly shutdown (with a warning to any logged-in users) and power themselves off if possible. A status script can repeatedly report which systems are down or up.

After the power has returned, the shutdown levels can be used as a guide for the order to reboot machines.

Configuration

To configure a machine to take part in the emergency powerdown system:

You can then start the remote script with the command
    /etc/init.d/powerfail start
Each time the remote process is started, it rechecks the configuration level for the machine. It reports changes on startup. You can see where in the shutdown order the machine is configured with the command
    /lcsr/master/power/configure -v
If the machine should be put at a different level than automatically determined, you can specify where by adding a file to /lcsr/master/power/manual.

Usage

If you need to stop an automated shutdown which has already been initiated, see the automated shutdown section below.
Each participating machine checks every so often (currently every 15 minutes, defined by the DSLEEPMINS variable in power.conf) to see if it needs to initiate a shutdown. If some machines will need to be shut down soon but not now, you can increase the frequency by putting the number of seconds between checks in remote.sleeptime, eg,
    echo 120 > /lcsr/master/power/remote.sleeptime
You need to be logged in on ns-lcsr to do this. (It's probably not advisable to set the delay time to less than 120 seconds.)
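
The way a client might choose its next delay can be sketched as follows. This is only an illustration, not the actual remote script: the function name check_interval_seconds and the POWER_DIR variable are hypothetical, and it assumes remote.sleeptime holds a single whole number of seconds, falling back to DSLEEPMINS otherwise.

```shell
# Hypothetical sketch: pick the delay (in seconds) before the next check-in.
# POWER_DIR and check_interval_seconds are illustrative names, not part of the real system.
POWER_DIR="${POWER_DIR:-/lcsr/master/power}"
DSLEEPMINS="${DSLEEPMINS:-15}"    # default check interval in minutes, as in power.conf

check_interval_seconds() {
    override="$POWER_DIR/remote.sleeptime"
    if [ -r "$override" ]; then
        # Take the first line of the override file.
        secs=$(head -n 1 "$override")
        case "$secs" in
            ''|*[!0-9]*)
                ;;    # empty or not a plain number: fall through to the default
            *)
                if [ "$secs" -gt 0 ]; then
                    echo "$secs"
                    return 0
                fi
                ;;
        esac
    fi
    # No usable override: use the configured default, converted to seconds.
    echo $((DSLEEPMINS * 60))
}
```

With no remote.sleeptime present and DSLEEPMINS at 15, the function prints 900. After the echo 120 command above, it prints 120.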

Bringing individual machines down

Each participating machine looks for two files to see if it should shut itself down. It determines whether its level's shutdown is active by the existence of the file /lcsr/master/power/shutdown.level.n, where n is the shutdown level for that machine. Individual machines can be triggered by touching /lcsr/master/power/shutdown.level.<hostname>, where <hostname> is the host's name (with .rutgers.edu stripped off).
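
The two-file check described above can be sketched like this. It is a guess at the logic, not the real remote script: shutdown_requested and POWER_DIR are hypothetical names, and the level is passed in by the caller rather than looked up.

```shell
# Hypothetical sketch of the two-file check; names are illustrative only.
POWER_DIR="${POWER_DIR:-/lcsr/master/power}"

# Succeeds (exit status 0) if either trigger file exists for this machine.
# $1 = the machine's shutdown level
# $2 = its host name, with or without .rutgers.edu
shutdown_requested() {
    level="$1"
    short="${2%.rutgers.edu}"    # strip the domain, as described above
    [ -e "$POWER_DIR/shutdown.level.$level" ] ||
        [ -e "$POWER_DIR/shutdown.level.$short" ]
}
```

For example, "shutdown_requested 2 somehost.rutgers.edu" (somehost is a made-up name) succeeds once either shutdown.level.2 or shutdown.level.somehost has been touched.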

Bringing groups of machines down

Currently, there is only a mechanism for bringing down the presidents research cluster:
    /lcsr/master/power/down-presidents
This script will signal a shutdown for all the presidents, then tell you which still remain up every 60 seconds until all are down.

Automated shutdown

UPS Monitor: The script watch-ups checks the status of the machine room UPS every 2 minutes. Immediately on seeing that the machine room is on battery power, it initiates an emergency shutdown by running the script emergency-powerdown-control and paging staff that the automatic powerdown has been initiated. Without intervention, all participating machines should be shut down in just under one hour.
If you want to temporarily stop emergency-powerdown-control from proceeding with the shutdown, touch emergency-powerdown-control.pause in /lcsr/master/power. If you want to permanently stop emergency-powerdown-control, just kill the process (and remove or rename any shutdown.level.n files it may have created). watch-ups will not restart emergency-powerdown-control until the UPS goes off battery and back on it again. Note: The first action emergency-powerdown-control takes is to delete emergency-powerdown-control.pause, so you will have to re-touch this file to re-pause the shutdown process if desired.

Powerdown control: emergency-powerdown-control sets the check-in time for clients to 2 minutes (but it will take 15 minutes for this to take effect on all clients). After 15 minutes, emergency-powerdown-control will signal shutdown for all machines at level 1 (by touching shutdown.level.1). Every 6 minutes after that (allowing time for the user warning and shutdown for the current level to take place), the next shutdown level will be initiated. If for some reason you don't want to wait so long until the next level is initiated, you can touch emergency-powerdown-control.proceed, which will terminate the wait to escalate to the next level. (You must touch emergency-powerdown-control.proceed every time you want the next level initiated immediately.)
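
The schedule above reduces to simple arithmetic: level n is signalled 15 + 6 x (n - 1) minutes after the automated shutdown begins. A small helper makes this easy to check. The name level_start_minute is hypothetical, and the figures assume nobody touches the pause or proceed files.

```shell
# Hypothetical helper: minutes after the automated shutdown starts at which
# level $1 is signalled, assuming no pause or proceed files are used.
# 15 minutes until level 1, then 6 more minutes for each later level.
level_start_minute() {
    echo $((15 + 6 * ($1 - 1)))
}
```

For example, "level_start_minute 1" prints 15 and "level_start_minute 3" prints 27.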

Client participation: Every so often (15 minutes by default), the client script, remote, wakes up and checks in with the central control. If the shutdown level associated with this machine has been declared, a shutdown is initiated. If users are present, a warning ("Emergency shutdown -- please save work and log off now") is sent and they are given 3 minutes to log off. If no shutdown has been signalled, the delay until the next check is determined (based on the presence or absence of remote.sleeptime) and the process goes back to sleep.

Staff participation: If you need to stop an automated shutdown which has already been initiated, this is described in the automated shutdown description above. If a power failure has occurred, automated shutdown has been initiated, and you'd like it to proceed (you have 15 minutes from the initiation to decide), you can run a process to monitor machines as they're shut down:

    ./status -m2 &
This will periodically print out the status of lots of the machines and equipment in the machine room which must be powered off. It also prints the status of some machines which do not take part in the automatic shutdown and must be done manually. When status indicates machines are no longer responding to ping, wait about 30 seconds, then go to the machine and power it and any peripherals it has (eg, disks or tape drives) off. As machines are confirmed down and powered off, you can check them off on one copy of the shutdown order (a current copy of which can be found in the "Disaster book" -- a brown looseleaf labelled "Disaster Config Info" in the operator's office).
This document has been reviewed to this point.

Then we start up a remote script on machines not normally configured to run the emergency powerdown system. (Before this step, be sure that the file remote.suicide does not exist on /lcsr/master/power. remote.suicide is the way you signal the remote scripts on normally non-participating machines to kill themselves off.) If there are Solaris machines not running the automatic shutdown system, you can temporarily start it on those machines by running

    /lcsr/master/power/emergency-run-powerdown-script
on farside as yourself, not root. (Hopefully, you are set up to be able to rsh everywhere from farside.)

As each level is completed, touch the next shutdown.level file (eg, shutdown.level.2) to escalate to the next shutdown level. (Note: shutdown.level.3 does not imply shutdown.level.2. Each level has its own file.)

Bringing machines back up

If you did not power off all machines which were temporarily running the remote script, you can have them kill themselves off by touching the file remote.suicide in /lcsr/master/power.

You should also remove all shutdown.level files, although as a safety precaution the remote script will not honor a shutdown.level file created before the machine was rebooted.
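
One way such a safety check could work is to compare the trigger file's modification time against a marker file written at boot. The sketch below is an assumption about the approach, not a description of the real remote script: trigger_is_fresh and the idea of a boot marker file are both hypothetical.

```shell
# Hypothetical sketch: honor a trigger file only if it is newer than a marker
# file that is assumed to be touched when the machine boots.
# $1 = trigger file (eg, a shutdown.level file), $2 = boot marker file
trigger_is_fresh() {
    [ -e "$1" ] && [ "$1" -nt "$2" ]
}
```

A shutdown.level file left over from before the reboot is older than the marker, so the check fails and the stale trigger is ignored. Removing the old files is still the cleaner fix.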

The command

    ./status -m2 -l -r &
will repeatedly print out the status of machines broken down by shutdown level (-l) and in reverse order (-r). This is useful to see that machines are coming back up in the proper order. You can keep track of what's in progress and/or done on the second copy of the shutdown order (which is not in reverse order).

When all is done, let Don know how things went and what improvements you might like.

Testing

It is possible to test the system on a particular machine by touching the file shutdown.level.n.hostname in /lcsr/master/power, where n is the machine's shutdown level as determined by /lcsr/master/power/configure and hostname is its hostname (minus the .rutgers.edu).

Miscellaneous

The script, remote, checks itself against the copy on /lcsr/master/power, restarting itself with a new copy should the central copy be newer. So if you edit remote on /lcsr/master/power, all machines will start running the new copy the next time they check in. This is a two-edged sword. While you can fix bugs on the fly (as I was able to do during the system's first acid test), you can also kill the entire system by making a mistake in editing. Be careful with this!
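
The "is the central copy newer" test could be as simple as a modification-time comparison. The sketch below is an assumption about how such a check might look, not the code in remote: needs_restart is a hypothetical name, and the third argument is there only to show that the same comparison could be applied to another central file.

```shell
# Hypothetical sketch of the freshness test behind the self-restart.
# $1 = the copy of remote this machine is running
# $2 = the central copy of remote
# $3 = another central file to watch (eg, configure)
# Succeeds if either central file is newer than the running copy.
needs_restart() {
    [ "$2" -nt "$1" ] || [ "$3" -nt "$1" ]
}
```

Because every client acts on the newer timestamp at its next check-in, a broken edit to the central copy spreads just as quickly as a fix, which is the risk described above.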

remote also restarts itself if the configure script has been updated, so that any machine which a change to configure would reclassify is in fact reclassified.

Rob Toth points out that Suns which power themselves down will power themselves back up if power goes off and back on. It is therefore advisable to power off (in the back) Suns which have powered themselves off if there is the possibility of the UPS failing completely.

[URL: file://localhost/lcsr/master/power/emergency-powerdown-system.html or http://farside.rutgers.edu/~watrous/emergency-powerdown-system.html]

This page last updated March 21, 2013.