Slurm Power Saving Guide
Slurm provides an integrated power saving mechanism for powering down idle nodes. Nodes that remain idle for a configurable period of time can be placed in a power saving mode, which can reduce power consumption or fully power down the node. The nodes will be restored to normal operation once work is assigned to them. For example, power saving can be accomplished using a cpufreq governor that can change CPU frequency and voltage (note that the cpufreq driver must be enabled in the Linux kernel configuration). Of particular note, Slurm can power nodes up or down at a configurable rate to prevent rapid changes in power demands. For example, starting a 1000 node job on an idle cluster could result in an instantaneous surge in power demand of multiple megawatts without Slurm's support to increase power demands in a gradual fashion.
A great deal of flexibility is offered in terms of when and how idle nodes are put into or removed from power save mode. Note that the Slurm control daemon, slurmctld, must be restarted to initially enable power saving mode. Changes in the configuration parameters (e.g. SuspendTime) will take effect after modifying the slurm.conf configuration file and executing "scontrol reconfig". The following configuration parameters are available:
- SuspendTime: Nodes become eligible for power saving mode after being idle for this number of seconds. For efficient system utilization, it is recommended that the value of SuspendTime be at least as large as the sum of SuspendTimeout plus ResumeTimeout. A negative number disables power saving mode. The default value is -1 (disabled).
- SuspendRate: Maximum number of nodes to be placed into power saving mode per minute. A value of zero results in no limits being imposed. The default value is 60. Use this to prevent rapid drops in power consumption.
- ResumeRate: Maximum number of nodes to be removed from power saving mode per minute. A value of zero results in no limits being imposed. The default value is 300. Use this to prevent rapid increases in power consumption.
- SuspendProgram: Program to be executed to place nodes into power saving mode. The program executes as SlurmUser (as configured in slurm.conf). The argument to the program will be the names of nodes to be placed into power saving mode (using Slurm's hostlist expression format).
- ResumeProgram: Program to be executed to remove nodes from power saving mode. The program executes as SlurmUser (as configured in slurm.conf). The argument to the program will be the names of nodes to be removed from power saving mode (using Slurm's hostlist expression format). This program may use the scontrol show node command to ensure that a node has booted and the slurmd daemon has started. If the slurmd daemon fails to respond within the configured SlurmdTimeout value with an updated BootTime, the node will be placed in a DOWN state and the job requesting the node will be requeued. If the node isn't actually rebooted (for example, when multiple-slurmd is configured), starting slurmd with the "-b" option might be useful. For reasons of reliability, ResumeProgram may execute more than once for a node when the slurmctld daemon crashes and is restarted.
- SuspendTimeout: Maximum time permitted (in seconds) between when a node suspend request is issued and when the node shutdown is complete. At that time the node must be ready for a resume request to be issued as needed for new workload. The default value is 30 seconds.
- ResumeTimeout: Maximum time permitted (in seconds) between when a node resume request is issued and when the node is actually available for use. Nodes which fail to respond in this time frame will be marked DOWN and the jobs scheduled on the node requeued. Nodes which reboot after this time frame will be marked DOWN with a reason of "unexpected reboot." The default value is 60 seconds.
- SuspendExcNodes: List of nodes to never place in power saving mode. Use Slurm's hostlist expression format. By default, no nodes are excluded.
- SuspendExcParts: List of partitions with nodes to never place in power saving mode. Multiple partitions may be specified using a comma separator. By default, no nodes are excluded.
- BatchStartTimeout: Specifies how long to wait after a batch job start request is issued before we expect the batch job to be running on the compute node. Depending upon how nodes are returned to service, this value may need to be increased above its default value of 10 seconds.
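The sizing advice for SuspendTime and the effect of the per-minute rate limits can be checked with a little arithmetic. The following is a minimal sketch and not part of Slurm: the function names check_suspend_time and rate_intervals are hypothetical, and the second one gives only a lower bound, assuming Slurm handles exactly the maximum number of nodes in every one-minute interval.

```shell
#!/bin/bash
# Hypothetical helpers for sanity-checking power saving parameters.

# check_suspend_time SUSPEND_TIME SUSPEND_TIMEOUT RESUME_TIMEOUT
# Succeeds when SuspendTime >= SuspendTimeout + ResumeTimeout,
# which is the recommendation given for SuspendTime above.
check_suspend_time() {
    [ "$1" -ge $(( $2 + $3 )) ]
}

# rate_intervals NODES RATE
# Prints the minimum number of one-minute intervals needed to suspend or
# resume NODES nodes at RATE nodes per minute (rounded up).
# A rate of 0 means "no limit", so 0 is printed.
rate_intervals() {
    if [ "$2" -eq 0 ]; then
        echo 0
    else
        echo $(( ($1 + $2 - 1) / $2 ))
    fi
}
```

With the default ResumeRate of 300, "rate_intervals 1000 300" prints 4, so the nodes of a 1000 node job would be resumed over at least four one-minute intervals rather than all at once.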
Note that SuspendProgram and ResumeProgram execute as SlurmUser on the node where the slurmctld daemon runs (primary and backup server nodes). Use of sudo may be required for SlurmUser to power down and restart nodes. If you need to convert Slurm's hostlist expression into individual node names, the scontrol show hostnames command may prove useful. The commands used to boot or shut down nodes will depend upon your cluster management tools.
Note that SuspendProgram and ResumeProgram are not subject to any time limits. They should perform the required action, ideally verify the action (e.g. that the node has booted and the slurmd daemon has started, so that the node responds to slurmctld again) and terminate. Long running programs will be logged by slurmctld, but not aborted.
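One way a ResumeProgram might verify its action is to poll the node state reported by scontrol show node. The sketch below is illustrative only: the function names node_state and wait_for_node are hypothetical, the retry count and delay are arbitrary defaults, and the patterns treated as "not ready" (an empty state, or a state containing POWER, DOWN or a trailing "*") are assumptions that should be checked against the output of your Slurm version.

```shell
#!/bin/bash
# Sketch: poll a node's state after a resume request has been issued.

# node_state HOST
# Prints the value of the State= field from "scontrol show node HOST".
node_state() {
    scontrol show node "$1" | tr ' ' '\n' | sed -n 's/^State=//p' | head -n 1
}

# wait_for_node HOST [TRIES] [DELAY_SECONDS]
# Returns 0 once the node looks ready, 1 if it never does within TRIES checks.
wait_for_node() {
    local host="$1" tries="${2:-30}" delay="${3:-10}" state="" i=0
    while [ "$i" -lt "$tries" ]; do
        state=$(node_state "$host")
        case "$state" in
            ""|*POWER*|*DOWN*|*"*"*)
                # Assumed "not ready" markers; adjust for your installation.
                sleep "$delay" ;;
            *)
                echo "$host ready: $state"
                return 0 ;;
        esac
        i=$((i + 1))
    done
    echo "$host not ready after $tries checks (last state: $state)" >&2
    return 1
}
```

Bounding the number of checks keeps the program from running indefinitely, which matters because slurmctld logs long running programs but does not abort them.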
Also note that the stderr/out of the suspend and resume programs are not logged. If logging is desired it should be added to the scripts.
#!/bin/bash
# Example SuspendProgram
echo "`date` Suspend invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_shutdown $host
done

#!/bin/bash
# Example ResumeProgram
echo "`date` Resume invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames $1`
for host in $hosts
do
   sudo node_startup $host
done
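Because the output of these programs is not logged by Slurm, a script may want to capture the output of each command it runs. The following is a minimal sketch of one way to do that; the function name run_logged is hypothetical and not part of Slurm.

```shell
#!/bin/bash
# Sketch: run one command and append its stdout and stderr to a log file.

# run_logged LOG_FILE COMMAND [ARGS...]
# Returns the exit status of COMMAND.
run_logged() {
    local log_file="$1" rc=0
    shift
    echo "$(date) running: $*" >>"$log_file"
    # Capture both output streams; remember the exit status without aborting.
    "$@" >>"$log_file" 2>&1 || rc=$?
    echo "$(date) exit status $rc: $*" >>"$log_file"
    return "$rc"
}
```

In the example SuspendProgram, the loop body could then read: run_logged /var/log/power_save.log sudo node_shutdown "$host"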
Subject to the various rates, limits and exclusions, the power save code follows this logic:
- Identify nodes which have been idle for at least SuspendTime.
- Execute SuspendProgram with an argument of the idle node names.
- Identify the nodes which are in power save mode (a flag in the node's state field), but have been allocated to jobs.
- Execute ResumeProgram with an argument of the allocated node names.
- Once the slurmd responds, initiate the job and/or job steps allocated to it.
- If the slurmd fails to respond within the value configured for SlurmdTimeout, the node will be marked DOWN and the job requeued if possible.
- Repeat indefinitely.
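The first two steps, together with the SuspendRate limit, can be pictured with a toy model. This is not Slurm's implementation, only a sketch under stated assumptions: the function name select_idle_nodes is hypothetical, and the input is assumed to be lines of the form "node idle_seconds" rather than anything Slurm itself produces.

```shell
#!/bin/bash
# Toy model of choosing nodes to suspend (not Slurm's actual code).

# select_idle_nodes SUSPEND_TIME MAX_NODES
# Reads "node idle_seconds" pairs on stdin and prints the nodes that have been
# idle for at least SUSPEND_TIME seconds, stopping after MAX_NODES nodes.
# A negative SUSPEND_TIME disables selection; MAX_NODES of 0 means no limit.
select_idle_nodes() {
    local suspend_time="$1" max_nodes="$2" count=0 node idle
    if [ "$suspend_time" -lt 0 ]; then
        return 0
    fi
    while read -r node idle; do
        if [ -z "$node" ]; then
            continue
        fi
        if [ "$idle" -ge "$suspend_time" ]; then
            if [ "$max_nodes" -ne 0 ] && [ "$count" -ge "$max_nodes" ]; then
                break
            fi
            echo "$node"
            count=$((count + 1))
        fi
    done
}
```

For example, feeding it four nodes idle for 700, 10, 900 and 650 seconds with a SuspendTime of 600 and a limit of 2 selects only the first and third node; the fourth is eligible but has to wait for a later pass.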
The slurmctld daemon will periodically (every 10 minutes) log how many nodes are in power save mode using messages of this sort:
[May 02 15:31:25] Power save mode 0 nodes
...
[May 02 15:41:26] Power save mode 10 nodes
...
[May 02 15:51:28] Power save mode 22 nodes
Using these logs you can easily see the effect of Slurm's power saving support. You can also configure Slurm with programs that perform no action as SuspendProgram and ResumeProgram to assess the potential impact of power saving mode before enabling it.
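To follow the trend over time, the periodic messages can be extracted from the slurmctld log. The sketch below assumes the log lines look exactly like the sample messages above (a bracketed timestamp followed by "Power save mode N nodes"); the function name power_save_history is hypothetical, and the pattern may need adjusting if your log format differs.

```shell
#!/bin/bash
# Sketch: extract "timestamp node-count" pairs from slurmctld log lines.

# power_save_history
# Reads log lines on stdin; for each periodic power save message prints the
# timestamp followed by the number of nodes in power save mode.
power_save_history() {
    sed -n 's/^\[\(.*\)\] Power save mode \([0-9][0-9]*\) nodes.*$/\1 \2/p'
}
```

It would be used by redirecting the slurmctld log file (whatever path SlurmctldLogFile names on your system) into the function.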
Use of Allocations
A resource allocation request will be granted as soon as resources are selected for use, possibly before the nodes are all available for use. The launching of job steps will be delayed until the required nodes have been restored to service (it prints a warning about waiting for nodes to become available and periodically retries until they are available).
In the case of an sbatch command, the batch program will start when node zero of the allocation is ready for use and pre-processing can be performed as needed before using srun to launch job steps. Waiting for all nodes to be booted can be accomplished by adding the command "scontrol wait_job $SLURM_JOB_ID" within the script or by adding that command to the system Prolog or PrologSlurmctld as configured in slurm.conf, which would create the delay for all jobs on the system. Ensure that the Prolog exit code is zero to avoid draining the node (that is, do not return the scontrol exit code from the Prolog, since an error, for example when the job is explicitly cancelled during startup, would otherwise drain the node).
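A Prolog that waits for all nodes of a job while never propagating the scontrol exit code might look like the following minimal sketch. The function name wait_for_job_nodes is hypothetical; the sketch assumes scontrol is on the PATH of the Prolog and that SLURM_JOB_ID is set in its environment.

```shell
#!/bin/bash
# Prolog sketch: wait for every node of the job to be ready, but never fail.

# wait_for_job_nodes JOB_ID
# Runs "scontrol wait_job" and always returns 0, so that an error (for
# example a job cancelled during startup) does not drain the node.
wait_for_job_nodes() {
    local job_id="$1"
    if ! scontrol wait_job "$job_id"; then
        echo "scontrol wait_job $job_id failed; continuing anyway" >&2
    fi
    return 0
}

# Only act when running under Slurm, where SLURM_JOB_ID is set.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    wait_for_job_nodes "$SLURM_JOB_ID"
fi
```

Since the last command executed always succeeds, the script as a whole exits with status zero whether or not the wait succeeded.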
The salloc and srun commands, which create a resource allocation, automatically wait for the nodes to power up.
Execution of the salloc command also triggers execution of the Prolog script if the Alloc flag is set in PrologFlags. In this case salloc waits for script termination before returning control to the user. To not wait on the salloc side set the NoHold flag in PrologFlags. This will automatically set the Alloc flag and use the slurmd to wait for the prolog to finish instead of salloc. This flag should be used when srun is used to launch a step.
If the slurmctld daemon is terminated gracefully, it will wait up to SuspendTimeout or ResumeTimeout (whichever is larger) for any spawned SuspendProgram or ResumeProgram to terminate before the daemon terminates. If the spawned program does not terminate within that time period, the event will be logged and slurmctld will exit in order to permit another slurmctld daemon to be initiated. Synchronization problems could also occur when the slurmctld daemon crashes (a rare event) and is restarted.
In either event, the newly initiated slurmctld daemon (or the backup server) will recover saved node state information that may not accurately describe the actual node state. In the case of a failed SuspendProgram, the negative impact is limited to increased power consumption, so no special action is currently taken to execute SuspendProgram multiple times in order to ensure the node is in a reduced power mode. The case of a failed ResumeProgram is more serious in that the node could be placed into a DOWN state and/or jobs could fail. In order to minimize this risk, when the slurmctld daemon is started and a node which should be allocated to a job fails to respond, the ResumeProgram will be executed (possibly for a second time).
Booting Different Images
Slurm's PrologSlurmctld configuration parameter can identify a program to boot different operating system images for each job based upon its constraint field (or possibly comment). If you want ResumeProgram to boot various images according to job specifications, it will need to be a fairly sophisticated program and perform the following actions:
- Determine which jobs are associated with the nodes to be booted
- Determine which image is required for each job and
- Boot the appropriate image for each node
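The three actions above could be sketched roughly as follows. Everything site-specific here is an assumption: the image names and the mapping in image_for_constraint are invented for illustration, node_startup is the placeholder command from the example ResumeProgram (here assumed to accept an image name as a second argument), and only the first job reported by squeue for each node is considered.

```shell
#!/bin/bash
# Sketch: boot an image chosen from each job's constraint (hypothetical names).

# image_for_constraint CONSTRAINT
# Maps a job constraint string to an image name (illustrative table only).
image_for_constraint() {
    case "$1" in
        *rhel*)   echo "rhel-image" ;;
        *ubuntu*) echo "ubuntu-image" ;;
        *)        echo "default-image" ;;
    esac
}

# boot_nodes_for_jobs HOSTLIST
# For every node in HOSTLIST, look up the constraint of a job allocated to it
# and boot the matching image.
boot_nodes_for_jobs() {
    local host constraint image
    for host in $(scontrol show hostnames "$1"); do
        # %f prints the features (constraints) required by the job.
        constraint=$(squeue --noheader --nodelist="$host" --format='%f' | head -n 1)
        image=$(image_for_constraint "$constraint")
        sudo node_startup "$host" "$image"
    done
}
```

A real program would also need to decide what to do when several jobs share a node or when no job is found, which this sketch resolves simply by falling back to the default image.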
Last modified 22 June 2016