Slurm Elastic Computing

Overview

Slurm can support a cluster that grows and shrinks on demand, typically relying upon a service such as Amazon Elastic Compute Cloud (Amazon EC2) for resources. These resources can be combined with an existing cluster to process excess workload (cloud bursting), or they can operate as an independent self-contained cluster. Good responsiveness and throughput can be achieved while you pay only for the resources needed.

The rest of this document describes details about Slurm's infrastructure that can be used to support Elastic Computing.

Slurm's Elastic Computing logic relies heavily upon the existing power save logic. Review of Slurm's Power Saving Guide is strongly recommended. This logic runs one program when nodes are required for use and another program when those nodes are no longer required. For Elastic Computing, these programs will need to provision the resources from the cloud, notify Slurm of each node's name and network address, and later relinquish the nodes back to the cloud. Most of the changes made to Slurm to support Elastic Computing concern node addresses that can change.

Slurm Configuration

There are many ways to configure Slurm's use of resources. See the slurm.conf man page for more details about these options. Some general Slurm configuration parameters that are of interest include:

ResumeProgram
The program executed when a node has been allocated and should be made available for use. If the slurmd daemon fails to respond within the configured SlurmdTimeout value with an updated BootTime, the node will be placed in a DOWN state and the job requesting the node will be requeued. If the node isn't actually rebooted (i.e. when multiple-slurmd is configured), starting slurmd with the "-b" option might be useful.
SelectType
Generally must be "select/linear". If Slurm is configured to allocate individual CPUs to jobs rather than whole nodes (e.g. SelectType=select/cons_res rather than SelectType=select/linear), then Slurm maintains bitmaps to track the state of every CPU in the system. If the number of CPUs to be allocated on each node is not known when the slurmctld daemon is started, one must allocate whole nodes to jobs rather than individual processors. The use of "select/cons_res" requires each node to have a CPU count set and the node eventually selected must have at least that number of CPUs.
SuspendExcNodes
Nodes not subject to suspend/resume logic. This may be used to avoid suspending and resuming nodes which are not in the cloud. Alternatively, the suspend/resume programs can treat local nodes differently from nodes being provisioned from the cloud.
SuspendProgram
The program executed when a node is no longer required and can be relinquished to the cloud.
SuspendTime
The time interval that a node will be left idle before a request is made to relinquish it. Units are seconds.
TreeWidth
Since the slurmd daemons are not aware of the network addresses of other nodes in the cloud, the slurmd daemons on each node should be sent messages directly and not forward those messages between each other. To do so, configure TreeWidth to a number at least as large as the maximum node count. The value may not exceed 65533.

Some node parameters that are of interest include:

Feature
A node feature can be associated with resources acquired from the cloud and user jobs can specify their preference for resource use with the "--constraint" option.
NodeName
This is the name by which Slurm refers to the node. A name containing a numeric suffix is recommended for convenience. The NodeAddr and NodeHostname should not be set, but will be configured later using scripts.
State
Nodes which are to be added on demand should have a state of "CLOUD". These nodes will not actually appear in Slurm commands until after they are configured for use.
Weight
Each node can be configured with a weight indicating the desirability of using that resource. Nodes with lower weights are used before those with higher weights.

Nodes to be acquired on demand can be placed into their own Slurm partition. In this mode of operation, these nodes are used only if the user so requests. Note that jobs can be submitted to multiple partitions and will use resources from whichever partition permits faster initiation. The sample configuration below adds nodes from the cloud when the workload exceeds available resources. Users can explicitly request local resources or resources from the cloud by using the "--constraint" option.

# Excerpt of slurm.conf
SelectType=select/linear

SuspendProgram=/usr/sbin/slurm_suspend
ResumeProgram=/usr/sbin/slurm_resume
SuspendTime=600
SuspendExcNodes=tux[0-127]
TreeWidth=128

NodeName=tux[0-127] Weight=1 Feature=local State=UNKNOWN
NodeName=ec[0-127]  Weight=8 Feature=cloud State=CLOUD
PartitionName=debug MaxTime=1:00:00 Nodes=tux[0-32] Default=yes
PartitionName=batch MaxTime=8:00:00 Nodes=tux[0-127],ec[0-127] Default=no
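
With a configuration like the one above, a user can steer a job toward local or cloud nodes through the Feature values. The following is a minimal sketch, not part of Slurm, of a helper that builds such an sbatch command line. The function name sbatch_command and the script name job.sh are hypothetical; "--constraint" and "--partition" are standard sbatch options, and the feature and partition names come from the sample configuration.

```python
def sbatch_command(script, feature=None, partitions=()):
    """Build an sbatch argument list for the given job script.

    feature    -- node Feature to require, e.g. "cloud" or "local"
    partitions -- partitions to submit to; Slurm uses whichever
                  partition permits faster initiation
    """
    cmd = ["sbatch"]
    if feature:
        cmd.append(f"--constraint={feature}")
    if partitions:
        cmd.append("--partition=" + ",".join(partitions))
    cmd.append(script)
    return cmd


# Require cloud nodes from the "batch" partition (job.sh is a placeholder).
print(" ".join(sbatch_command("job.sh", feature="cloud", partitions=["batch"])))
```

Passing feature="local" instead would restrict the job to the tux nodes, and omitting it leaves the choice to Slurm, which prefers the lower-weight local nodes.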

Operational Details

When the slurmctld daemon starts, all nodes with a state of CLOUD will be included in its internal tables, but these node records will not be seen with user commands or used by applications until allocated to some job. Once a node is allocated, the ResumeProgram is executed and should do the following:

  1. Boot the node
  2. Configure and start Munge (depends upon configuration)
  3. Install the Slurm configuration file, slurm.conf, on the node. Note that the configuration file will generally be identical on all nodes and will not include NodeAddr or NodeHostname configuration parameters for any nodes in the cloud. Slurm commands executed on this node only need to communicate with the slurmctld daemon on the ControlMachine.
  4. Notify the slurmctld daemon of the node's hostname and network address:
    scontrol update nodename=ec0 nodeaddr=123.45.67.89 nodehostname=whatever
    Note that the node address and hostname information set by the scontrol command are preserved when the slurmctld daemon is restarted unless the "-c" (cold-start) option is used.
  5. Start the slurmd daemon on the node
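
The steps above can be sketched as a small Python skeleton for a ResumeProgram. This is an illustration under stated assumptions, not a supplied Slurm script: the provision and start_slurmd hooks are hypothetical placeholders for whatever your cloud provider and images require (steps 1 to 3 and step 5), and only the scontrol invocations are real Slurm commands. It assumes the program receives the nodes to resume as a hostlist expression, as described in the Power Saving Guide.

```python
import subprocess


def expand_hostlist(expr, run=subprocess.check_output):
    """Expand a hostlist expression such as "ec[0-2]" into node names."""
    # "scontrol show hostnames" prints one node name per line.
    return run(["scontrol", "show", "hostnames", expr], text=True).split()


def build_update_command(node, addr, hostname):
    """Build the scontrol call from step 4 for a single node."""
    return [
        "scontrol", "update",
        f"nodename={node}",
        f"nodeaddr={addr}",
        f"nodehostname={hostname}",
    ]


def resume(nodes, provision, start_slurmd, run=subprocess.check_call):
    """Bring up each node and tell slurmctld where to reach it.

    provision(node)          -- hypothetical hook: boots the instance,
                                sets up Munge and slurm.conf, and returns
                                (network address, hostname)
    start_slurmd(node, addr) -- hypothetical hook: starts slurmd remotely
    """
    for node in nodes:
        addr, hostname = provision(node)                  # steps 1-3
        run(build_update_command(node, addr, hostname))   # step 4
        start_slurmd(node, addr)                          # step 5
```

A real script would call resume(expand_hostlist(sys.argv[1]), ...) with provider-specific hooks. The run parameters exist only so the command execution can be replaced during testing.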

The SuspendProgram only needs to relinquish the node back to the cloud.

An environment variable SLURM_NODE_ALIASES contains sets of node name, communication address and hostname. The variable is set by salloc, sbatch, and srun. It is then used by srun to determine the destination for job launch communication messages. This environment variable is only set for nodes allocated from the cloud. If a job is allocated some resources from the local cluster and others from the cloud, only those nodes from the cloud will appear in SLURM_NODE_ALIASES. Each set of names and addresses is comma separated and the elements within the set are separated by colons. For example:
SLURM_NODE_ALIASES=ec0:123.45.67.8:foo,ec2:123.45.67.9:bar
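
For tools that need to read this variable themselves, the format can be parsed as in the following sketch. The names NodeAlias and parse_node_aliases are hypothetical, and the code assumes that addresses contain no colons (IPv4 addresses or plain host names), since the colon is the separator within each set.

```python
import os
from typing import NamedTuple


class NodeAlias(NamedTuple):
    name: str      # Slurm node name, e.g. "ec0"
    addr: str      # communication address
    hostname: str  # hostname reported by the node


def parse_node_aliases(value):
    """Parse a SLURM_NODE_ALIASES string into a dict keyed by node name."""
    aliases = {}
    # Sets are comma separated; skip empty pieces so "" yields {}.
    for entry in filter(None, value.split(",")):
        name, addr, hostname = entry.split(":")
        aliases[name] = NodeAlias(name, addr, hostname)
    return aliases


# Inside a job the variable is present only when cloud nodes were allocated.
cloud_nodes = parse_node_aliases(os.environ.get("SLURM_NODE_ALIASES", ""))
```

Nodes allocated from the local cluster never appear in the result, so a lookup should fall back to the node name itself when a key is missing.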

Remaining Work

  • We need scripts to provision resources from EC2.
  • The SLURM_NODE_ALIASES environment variable needs to change if a job expands (adds resources).
  • Some MPI implementations will not work due to the node naming.
  • Some tests in Slurm's test suite fail.

Last modified 15 April 2015