BlueGene/Q User and Administrator Guide

Support for BlueGene/Q systems is deprecated as of 17.11, and will be removed in a future release.

Beginning with the 17.02 release, only BlueGene/Q systems are supported by Slurm. Support for BlueGene/L and BlueGene/P systems has been removed.

Overview

This document describes the unique features of Slurm on the IBM BlueGene/Q systems. You should be familiar with Slurm's mode of operation on Linux clusters before studying the relatively few differences in BlueGene operation described in this document.

BlueGene systems have several unique features that result in a few differences in how Slurm operates on them. A BlueGene system consists of one or more base partitions or midplanes connected in a four-dimensional (AXYZ) torus. Each midplane typically includes 512 c-nodes or compute nodes, each containing two or more cores; one core is typically designated primarily for managing communications while the other cores are used primarily for computations. Each c-node can execute only one process and thus is unable to execute both the user's application and Slurm's slurmd daemon. Thus the slurmd daemon(s) execute on one or more of the BlueGene Front End Nodes. The slurmd daemons provide (almost) all of the normal Slurm services for every midplane on the system.

Internally Slurm treats each midplane as one node with a processor count equal to the number of cores on the midplane, which keeps the number of entities being managed by Slurm more reasonable.

All BlueGene systems can sub-allocate a midplane into smaller blocks, which allows more than one user job to execute on each midplane.

In the case of BlueGene/Q systems, more than one user job can also execute in each block (see AllowSubBlockAllocation option in 'man bluegene.conf').

To effectively utilize this environment, Slurm tools present the user with the view that each c-node is a separate node, so allocation requests and status information use c-node counts. Since the c-node count can be very large, the suffix "k" can be used to represent multiples of 1024 or "m" for multiples of 1,048,576 (1024 x 1024). For example, "2k" is equivalent to "2048".
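
For example, a minimal sketch of requesting c-nodes with the "k" suffix (job.sh is a hypothetical batch script):

# Request 2048 c-nodes; equivalent to --nodes=2048
sbatch --nodes=2k job.sh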

If you are running a system that is smaller than one midplane (a nodecard/nodeboard or similar), you can configure your system like this in the bluegene.conf file. Below is an example for a BlueGene/Q system:

# Excerpt from bluegene.conf file for BlueGene/Q system
...
BasePartitionNodeCnt=512
NodeCardNodeCnt=32
SubMidplaneSystem=YES
LayoutMode=STATIC
MPs=0000 type=small 32cnblocks=16
...

This will create a small block on each nodeboard in the system. If your system is different from this, adjust appropriately. The idea is that Slurm will create the smallest block possible at every possible hardware location. The system will then check for missing hardware and remove blocks that are invalid. This gets around the problem of having, for instance, the 4th nodeboard populated instead of the 1st.

User Tools

The normal set of Slurm user tools (sbatch, scancel, sinfo, squeue, and scontrol) provides all of the expected services except support for job steps, which is detailed later. The sstat command is not supported on BlueGene systems.

Four job submission options are available exclusively on BlueGene/Q systems:

--geometry       Specify job size in each dimension (e.g. 1x4x4 = 16 nodes).
--no-rotate      Disable rotation of geometry (by default, 1x4x4 could be rotated to 4x1x4).
--conn-type      Specify interconnect type between midplanes, mesh or torus. On BlueGene/Q
                 systems you can specify a different conn-type for each dimension; TTMT
                 would give you Torus in all dimensions except the Y dimension, where it
                 would be Mesh.
--mloader-image  Specify an alternative mloader image for the bluegene block. The default
                 image is used if not set.

The --nodes option with a minimum and (optionally) maximum node count continues to be available. Note that this is a c-node count.
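
For illustration, a hedged sketch combining these options (the node count, geometry, and script name are illustrative placeholders):

# Hypothetical request: 2048 c-nodes with a fixed geometry, no rotation,
# and a torus interconnect in every dimension
sbatch --nodes=2k --geometry=1x2x2x1 --no-rotate --conn-type=TTTT job.sh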

Task Launch on BlueGene/Q only

Use Slurm's srun command to launch tasks (srun uses an API interface into IBM's runjob command). Slurm job step information, including accounting, functions as expected. TotalView and other debuggers will also work with srun. If Slurm is installed and configured correctly, IBM's runjob will not work.

The srun --launcher-opts option is designed to convey options to runjob that are not available using srun alone. Node selection options are conveyed automatically by srun and cannot be overridden using this option. Two leading dashes are required when listing the runjob options, e.g., srun --launcher-opts='--mapping TEDCBA'. See the runjob man page for the list of available options.
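
A hedged example of forwarding a runjob option (the task count and application name are placeholders; --mapping is documented in the runjob man page):

srun --ntasks=32 --launcher-opts='--mapping TEDCBA' ./my_app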

Naming Conventions

The naming of midplanes includes a numeric suffix representing its coordinates with a zero origin. The suffix contains three digits on BlueGene/L and BlueGene/P systems, while four digits are required for BlueGene/Q systems. For example, "bgp012" represents the midplane whose coordinates are X=0, Y=1 and Z=2. Slurm uses an abbreviated format for describing midplanes in which the end-points of the enclosed block are in square brackets and separated by an "x". For example, "bgp[620x731]" is used to represent the eight midplanes enclosed in a block with end-points bgp620 and bgp731 (bgp620, bgp621, bgp630, bgp631, bgp720, bgp721, bgp730 and bgp731).

IMPORTANT: Slurm can support up to 36 elements in each BlueGene dimension by supporting "A-Z" as valid numbers. Slurm requires the prefix to be lower case and any letters in the suffix must always be upper case. This schema must be used in both the slurm.conf and bluegene.conf configuration files when specifying midplane/node names (the prefix is optional). This schema should also be used to specify midplanes or locations in configure mode of smap:
valid: bgl[000xC44], bgl000, bglZZZ
invalid: BGL[000xC44], BglC00, bglb00, Bglzzz

IMPORTANT: Slurm requires that all systems start with 0 in each dimension. So if you have a BlueGene/Q system and only want Slurm to run on a portion of it, you need to define the entire system and mark midplanes down in the slurm.conf file or with scontrol/sview.
For example, with a BlueGene/Q system of [0000x2333] where only [2000x2333] can be used, you could define it in your slurm.conf like this:

...
NodeName=bgq[0000x1333] state=down
NodeName=bgq[2000x2333] state=unknown
...
This would mark the midplanes not being managed as DOWN and only create blocks on the portion of the machine you want to use.

In a system configured with small blocks (any block less than a full midplane), there will be divisions in the midplane notation. On BlueGene/L and BlueGene/P systems, the midplane name may be followed by square brackets enclosing the ID numbers of the IO nodes associated with the block. For example, if there are 64 psets in a BlueGene/L configuration, "bgl012[0-15]" represents the first quarter or first 16 IO nodes of a midplane. In BlueGene/L this would be a 128 c-node block. To represent the first nodecard in the second quarter or IO nodes 16-19, the notation would be "bgl012[16-19]", or a 32 c-node block. On BlueGene/Q systems, the specific c-nodes would be identified in square brackets using their five digit coordinates. For example, "bgq0123[00000x11111]" would represent the 32 c-nodes in midplane "bgq0123" having coordinates (within that midplane) from zero to one in each of the five dimensions.

Two topology-aware graphical user interfaces are provided: smap and sview (sview provides more viewing and configuring options). See each command's man page for details. A sample of smap output is provided below showing the location of five jobs. Note the format of the list of midplanes allocated to each job. Also note that idle (unassigned) midplanes are indicated by a period. Down and drained midplanes (those not available for use) are indicated by a number sign (bg703 in the display below). The legend is for illustrative purposes only. The origin (zero in every dimension) is shown at the rear left corner of the bottom plane. Each set of four consecutive lines represents a plane in the Y dimension. Values in the X dimension increase to the right. Values in the Z dimension increase down and toward the left.

   a a a a b b d d    ID JOBID PARTITION BG_BLOCK USER   NAME ST TIME NODES BP_LIST
  a a a a b b d d     a  12345 batch     RMP0     joseph tst1 R  43:12  32k bg[000x333]
 a a a a b b c c      b  12346 debug     RMP1     chris  sim3 R  12:34   8k bg[420x533]
a a a a b b c c       c  12350 debug     RMP2     danny  job3 R   0:12   4k bg[622x733]
		      d  12356 debug     RMP3     dan    colu R  18:05   8k bg[600x731]
   a a a a b b d d    e  12378 debug     RMP4     joseph asx4 R   0:34   2k bg[612x713]
  a a a a b b d d
 a a a a b b c c
a a a a b b c c

   a a a a . . d d
  a a a a . . d d
 a a a a . . e e              Y
a a a a . . e e               |
			      |
   a a a a . . d d            0----X
  a a a a . . d d            /
 a a a a . . . .            /
a a a a . . . #            Z

If the block is in a READY state, the job will begin execution almost immediately. Otherwise the execution of the job will not actually begin until the block is in a READY state, which can require booting the block and a delay of minutes to do so. During this time the job will be in the CONFIGURING state. You can identify the block associated with your job using the command smap -Dj -c and the state of the block with the command smap -Db -c. The time to boot a block is related to its size, but should range from a few minutes to about 15 minutes for a block containing 128 midplanes (on a BlueGene/L system). Only after the block is READY will your job's output file be created and the script execution begin. If the block boot fails, Slurm will attempt to reboot several times (3) before draining the associated midplanes and aborting/requeueing the job.
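
For example, to watch the block and job state while the block boots (the job ID is a placeholder):

smap -Dj -c           # show which block is associated with each job
smap -Db -c           # show the state of each block
squeue -j 12345 -l    # the job remains in CONFIGURING until the block is READY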

The job will continue to be in a RUNNING state until the bgjob has completed and the block ownership is changed. The time to complete a bgjob has frequently been on the order of five minutes. In summary, your job may appear in Slurm as RUNNING from 15 minutes before the script actually begins until 5 minutes after it completes. These delays are the result of BlueGene infrastructure issues and are not due to anything in Slurm. These times have improved considerably on the more recent BlueGene/P and BlueGene/Q systems.

When using smap in default output mode you can scroll through the different windows using the arrow keys. The up and down arrow keys scroll the window containing the grid, and the left and right arrow keys scroll the window containing the text information.

System Administration for BlueGene/Q only

IMPORTANT: The SlurmUser defined in the slurm.conf must be added to the bgadmin group. This allows the slurmctld to access information from the system and manipulate blocks.
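
For example, assuming SlurmUser is set to "slurm" (adjust the user name to match your configuration):

# Add the SlurmUser account to the bgadmin group
usermod -a -G bgadmin slurm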

In order to make srun operate correctly with the underlying system and to ensure security for new MPI jobs, it is necessary to enable the Slurm plugin for the IBM runjob_mux. This is done by altering the bg.properties file. In the [runjob.mux] section of the bg.properties file, change the plugin option to $prefix/lib/slurm/runjob_plugin.so and also set the plugin_flags option to 0x0109 (RTLD_LAZY | RTLD_GLOBAL | RTLD_DEEPBIND), which allows the forwarding of symbols to shared objects such as the plugins Slurm uses.

[runjob.mux]
...
plugin = /usr/lib64/slurm/runjob_plugin.so
    # Path to the plugin used for communicating with a
    # job scheduler. This value can be updated by the
    # runjob_mux_refresh_config command on the
    # Login Node where a runjob_mux process runs.
...
plugin_flags = 0x0109 # RTLD_LAZY | RTLD_GLOBAL | RTLD_DEEPBIND

You also need to arrange for the runjob_mux to be run as the SlurmUser. This can be done by editing two files.

Back in your bg.properties file, alter the [master.user] section.

[master.user]
...
runjob_mux=slurm

Then in /etc/init.d/bgagent add SlurmUser to the --users line.

OPTIONS="--users bgqadmin,bgws,bgqsysdb,slurm"

After these settings are in place, flush the runjob_server and (re)start each runjob_mux running on your system.

> /bgsys/drivers/ppcfloor/sbin/master_stop binaries
stopped
> sudo /etc/init.d/bgagent restart
Shutting down bgagentd:                                    [  OK  ]
Starting bgagentd:
Startup of bgagentd completed:                             [  OK  ]
> /bgsys/drivers/ppcfloor/sbin/master_start binaries
> /bgsys/drivers/ppcfloor/sbin/bgmaster_server_refresh_config
success!
> /bgsys/drivers/ppcfloor/sbin/master_start runjob_mux
started runjob_mux
> ps aux | grep runjob_mux
slurm       25461  0.0  0.3 518528 48064 ?        Sl   13:00   0:00 runjob_mux

When a new version of Slurm is installed, it is wise to "refresh" the runjob_mux with the new plugin. This can be done in one of two ways.

  • Stopping and restarting the runjob_mux. While this option works every time, jobs running under the runjob_mux will not survive, so plan your updates accordingly.
    > /bgsys/drivers/ppcfloor/sbin/master_stop runjob_mux
    stopped runjob_mux
    > /bgsys/drivers/ppcfloor/sbin/master_start runjob_mux
    started runjob_mux
    
  • WARNING! You need at least IBM driver V1R1M1 efix 008 or this method will not work. Previous versions would load the old plugin (presumably still in memory) rather than the new one. Slurm will print its version when the plugin is loaded for validation.
    This method allows for no job loss by using the IBM runjob_mux_refresh_config command (see the sketch below). This should reload the plugin and all should be good afterwards. After doing this you may see some warning/error messages about currently running jobs not being known when they finish. This is expected and can usually be ignored.
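
    A hedged sketch of the no-job-loss refresh (run it on the Login Node where the runjob_mux process runs; the exact install path of the command is omitted here and may vary by driver installation):

    > runjob_mux_refresh_config
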
Notes about sub-block allocations:

There is a current limitation for sub-block jobs and how the system (used for I/O) and user (used for MPI) torus class routes are configured. The network device hardware has cutoff registers to prevent packets from flowing outside of the sub-block. Unfortunately, when a sub-block has a dimension of size 3, the job can attempt to send user packets outside of its sub-block. This causes it to be terminated by signal 36. To prevent this from happening, Slurm does not allow a sub-block to be used with any dimension of 3.

The current IBM API does not allow wrapping inside a midplane, meaning you cannot create a sub-block of size 2 with c-nodes in the 0 and 3 positions. Slurm will support this in the future when the underlying system allows it.

System Administration for all BlueGene Systems

The slurmctld daemon should execute on the system's service node. If an optional backup daemon is used, it must be in a location from which it is capable of executing Bridge APIs. The slurmd daemons execute the user scripts, and there must be at least one front end node configured for this purpose. Multiple front end nodes may be configured for slurmd use to improve performance and fault tolerance. Each slurmd can execute jobs for every midplane and the work will be distributed among the slurmd daemons to balance the workload. You can use the scontrol command to drain individual compute nodes as desired and return them to service.
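
For example, a hedged sketch of draining a midplane and later returning it to service (the midplane name is a placeholder):

scontrol update NodeName=bg0000 State=DRAIN Reason="hardware maintenance"
scontrol update NodeName=bg0000 State=RESUME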

The slurm.conf (configuration) file needs to have the value of InactiveLimit set to zero or not specified (it defaults to a value of zero). This is because we don't want to purge jobs prematurely if there are no job steps. The value of SelectType must be set to "select/bluegene" (which happens automatically) in order to have node selection performed using a system aware of the system's topography and interfaces. The value of TopologyPlugin must be set to "topology/none" (which happens automatically) since topology information is managed by the select/bluegene plugin. The value of Prolog should be set to the full pathname of a program that will delay execution until the job's block is ready for use by the user running the job. It is recommended that you construct a script that serves this function and calls the supplied program sbin/slurm_prolog. The value of Epilog should be set to the full pathname of a program that will wait until the job's block has relinquished the resources acquired by the job and is no longer usable by this job. It is recommended that you construct a script that serves this function and calls the supplied program sbin/slurm_epilog. The prolog and epilog programs are used to ensure proper synchronization between the slurmctld daemon, the user job, and MMCS. A multitude of other functions may also be placed into the prolog and epilog as desired (e.g. enabling/disabling user logins, purging file systems, etc.). Sample prolog and epilog scripts follow.

#!/bin/bash
# Sample BlueGene Prolog script
#
# Wait for block to be ready for this job's use
/usr/sbin/slurm_prolog

#!/bin/bash
# Sample BlueGene Epilog script
#
# Cancel job to start the termination process for this job
# and release the block
/usr/bin/scancel $SLURM_JOB_ID
#
# Wait for block to be released from this job's use
/usr/sbin/slurm_epilog

Since jobs with different geometries or other characteristics might not interfere with each other, scheduling is somewhat different on a BlueGene system than typical clusters.

Starting in version 2.4.3, SchedulerType=sched/backfill works in all modes and for all job sizes. Before this release there were issues backfilling jobs smaller than a midplane. Upgrading to at least version 2.4.3 is encouraged for better backfill behavior.

Slurm does support different partitions with an assortment of different scheduling parameters. For example, Slurm can have a partition defined for full-system jobs that is enabled to execute jobs only at certain times, while a default partition could be configured to execute jobs at other times. Jobs could still be queued in a partition that is configured in a DOWN state and scheduled to execute when it is changed to an UP state. Midplanes can also be moved between Slurm partitions either by changing the slurm.conf file and restarting the slurmctld daemon or by using the scontrol reconfig command.
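
A hedged sketch of such a configuration (the partition names, node range, and time limits are hypothetical):

# slurm.conf excerpt: a default partition plus a normally-DOWN full-system partition
PartitionName=pbatch Nodes=bg[000x733] Default=YES State=UP   MaxTime=4:00:00
PartitionName=pfull  Nodes=bg[000x733] Default=NO  State=DOWN MaxTime=12:00:00

The full-system partition could then be switched with "scontrol update PartitionName=pfull State=UP" (or State=DOWN) at the desired times, for example from cron.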

Slurm node and partition descriptions should make use of the naming conventions described above. For example, "NodeName=bg[000x733]" is used in slurm.conf to define a BlueGene/L system with 128 midplanes in an 8 by 4 by 4 matrix. The node name prefix of "bg" defined by NodeName can be anything you want, but needs to be consistent throughout the slurm.conf file. No computer is actually expected to have a hostname of "bg000" and no attempt will be made to route message traffic to this address. Starting in version 2.4, Slurm can determine how many Sockets, CoresPerSocket, and ThreadsPerCore are available on each midplane, so no configuration is needed to specify how many cores are on each midplane.

Front end nodes used for executing the slurmd daemons must also be defined in the slurm.conf file. It is recommended that at least two front end nodes be dedicated to use by the slurmd daemons for fault tolerance. For example: "FrontendName=frontend[00-03] State=UNKNOWN" is used to define four front end nodes for running slurmd daemons.

# Portion of slurm.conf for BlueGene system
InactiveLimit=0
SelectType=select/bluegene
Prolog=/usr/sbin/prolog
Epilog=/usr/sbin/epilog
#
FrontendName=frontend[00-01] State=UNKNOWN
NodeName=bg[000x733] State=UNKNOWN

It is best to minimize other work on the front end nodes executing slurmd so as to maximize their performance and minimize other risk factors.

bluegene.conf File Creation

In addition to the normal slurm.conf file, a bluegene.conf configuration file is required with information pertinent to the system. Put bluegene.conf into the Slurm configuration directory with slurm.conf. A sample file is installed as bluegene.conf.example. If a system administrator chooses not to use dynamic partitioning, they should use the smap tool to build an appropriate configuration file for static/overlap partitioning. Note that smap -Dc can be run without the Slurm daemons active to establish the initial configuration. Note that when using static partitioning, the blocks defined using smap may not overlap (except for the full-system block, which is implicitly created). See the smap man page for more information.

There are three different modes in which the system administrator can define the BlueGene partitions (or blocks) available to execute jobs: static, overlap, and dynamic. Jobs must then execute in one of the created blocks. (NOTE: Blocks are unrelated to Slurm partitions.)

The default mode of partitioning is static. In this mode, the system administrator must explicitly define each of the blocks in the bluegene.conf file. Each of these blocks is explicitly configured with either a mesh or torus interconnect. They must also not overlap, except for the implicitly defined full-system block. Note that blocks are not rebooted between jobs in this mode except when going to or from full-system jobs. Eliminating block booting can significantly improve system utilization (by eliminating boot time) and reliability.

The second mode is overlap partitioning. Overlap partitioning is very similar to static partitioning in that each block must be explicitly defined in the bluegene.conf file, but these blocks can overlap each other. In this mode it is highly recommended that none of the blocks have any passthroughs in the X-dimension associated with them. Usually this is only an issue on larger BlueGene systems. It is advisable to use this mode with extreme caution. Make sure you know what you are doing to ensure the blocks will boot without depending on the state of any midplane not included in the block.

In the two previous modes you must ensure that the midplanes defined in bluegene.conf are consistent with those defined in slurm.conf. Note the bluegene.conf file contains only the numeric coordinates of midplanes while slurm.conf contains the name prefix in addition to the numeric coordinates.

The final mode is dynamic partitioning. While dynamic partitioning was developed primarily for smaller BlueGene systems, it is commonly used on larger systems as well. One caveat of dynamic partitioning is that it may introduce fragmentation of resources. Dynamic partitioning is very capable, easy to set up, and is the default for many systems, including LLNL's Sequoia. With the advent of sub-block allocations (see the AllowSubBlockAllocation option in 'man bluegene.conf'), fragmentation has become less of a concern. Blocks need not be assigned in the bluegene.conf file for this mode.

Blocks can be freed or set in an error state using the scontrol command (i.e. "scontrol update BlockName=RMP0 state=error"). This will terminate any job on the block and set the state of the block to ERROR so that no job will run on the block. To set it back to a usable state, you can resume the block with the scontrol option state=resume (i.e. "scontrol update BlockName=RMP0 state=resume"). This is useful if you temporarily put the block in an error state and the block is really booted and ready to start jobs. You can also put the block in a free state using the state=free option. Valid states are Error, Free, Recreate, Remove, and Resume.
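
Summarized as commands (the block name RMP0 is taken from the examples above):

scontrol update BlockName=RMP0 state=error    # terminate any job on the block and mark it ERROR
scontrol update BlockName=RMP0 state=resume   # return an already-booted block to service
scontrol update BlockName=RMP0 state=free     # free the block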

Alternatively, if only part of a midplane needs to be put into an error state and it isn't already in a block of the size you need, you can set a collection of IO nodes into an error state using scontrol (i.e. "scontrol update submpname=bg000[0-3] state=error"). NOTE: Even on BlueGene/Q, where node names are given in bg0000[00000] format, this option takes an ionode name such as bg0000[0]. This will end any job on the nodes listed, create a block there, and set the state of the block to ERROR so that no job will run on it. Then resume the block when it is ready to be used again (i.e. "scontrol update BlockName=RMP0 state=resume"). This is helpful to allow other jobs to run on the unaffected nodes in the midplane.

One of these modes must be defined in the bluegene.conf file with the option LayoutMode=MODE (where MODE=STATIC, DYNAMIC or OVERLAP).

The number of c-nodes in a midplane and in a node card must be defined. This is done using the keywords MidplaneNodeCnt=NODE_COUNT and NodeCardNodeCnt=NODE_COUNT respectively in the bluegene.conf file (i.e. MidplaneNodeCnt=512 and NodeCardNodeCnt=32).

Note that the IONodesPerMP value defined in bluegene.conf represents how many ionodes are on each midplane. Slurm does not support heterogeneous ionode configurations, so if your environment is like this, place the smallest number here. For most BlueGene/L systems this value is either 8 (for IO poor systems) or 64 (for IO rich systems). For BlueGene/Q systems, values from 4 to 16 are most common.

The Images file specifications identify which images are used when booting a block and the valid images are different for each BlueGene system type (e.g. L, P and Q). Their values can change during job allocation based on input from the user. If you change the block layout, then slurmctld and slurmd should both be cold-started (without preserving any state information, "/etc/init.d/slurm startclean").

If you wish to modify the IONodesPerMP value after blocks have already been created, either modify the blocks manually or destroy them and let Slurm recreate them. Note that in addition to the blocks defined in bluegene.conf, an additional block is created containing all resources defined in all of the other defined blocks. Make use of the Slurm partition mechanism to control access to these blocks. A sample bluegene.conf file is shown below.

###############################################################################
# Global specifications for a BlueGene/L system
#
# BlrtsImage:           BlrtsImage used for creation of all blocks.
# LinuxImage:           LinuxImage used for creation of all blocks.
# MloaderImage:         MloaderImage used for creation of all blocks.
# RamDiskImage:         RamDiskImage used for creation of all blocks.
#
# You may add extra images which a user can specify from the srun
# command line (see man srun).  When adding these images you may also add
# a Groups= at the end of the image path to specify which groups can
# use the image.
#
# AltBlrtsImage:           Alternative BlrtsImage(s).
# AltLinuxImage:           Alternative LinuxImage(s).
# AltMloaderImage:         Alternative MloaderImage(s).
# AltRamDiskImage:         Alternative RamDiskImage(s).
#
# LayoutMode:           Mode in which Slurm will create blocks:
#                       STATIC:  Use defined non-overlapping blocks
#                       OVERLAP: Use defined blocks, which may overlap
#                       DYNAMIC: Create blocks as needed for each job
# MidplaneNodeCnt:      Number of c-nodes per midplane.
# NodeCardNodeCnt:      Number of c-nodes per node card.
# IONodesPerMP:         Number of ionodes per midplane, needed to
#                       determine smallest creatable block.
#
# BridgeAPILogFile:  Pathname of file in which to write the
#                    Bridge API logs.
# BridgeAPIVerbose:  How verbose the BG Bridge API logs should be
#                    0: Log only error and warning messages
#                    1: Log level 0 and information messages
#                    2: Log level 1 and basic debug messages
#                    3: Log level 2 and more debug message
#                    4: Log all messages
# DenyPassthrough:   Prevents use of passthrough ports in specific
#                    dimensions, X, Y, and/or Z, plus ALL
#
# NOTE: The bgl_serial value is set at configuration time using the
#       "--with-bgl-serial=" option. Its default value is "BGL".
###############################################################################
# These are the default images which are used if the user doesn't specify
# which image they want
BlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts
LinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf
MloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
RamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf

#Only group jette can use these images
AltBlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw2.rts Groups=jette
AltLinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage2.elf Groups=jette
AltMloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader2.rts Groups=jette
AltRamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk2.elf Groups=jette

# Since no groups are specified here any user can use them
AltBlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw3.rts
AltLinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage3.elf
AltMloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader3.rts
AltRamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk3.elf

# Another option for images would be a "You can use anything you like image" *
# This allows the user to use any image entered with no security checking
AltBlrtsImage=* Groups=da,adamb
AltLinuxImage=* Groups=da,adamb
AltMloaderImage=* Groups=da,adamb
AltRamDiskImage=*  Groups=da,adamb

LayoutMode=STATIC
MidplaneNodeCnt=512
NodeCardNodeCnt=32
IONodesPerMP=64	# An I/O rich environment
BridgeAPILogFile=/var/log/slurm/bridgeapi.log
BridgeAPIVerbose=0

#DenyPassthrough=X,Y,Z

###############################################################################
# Define the static/overlap partitions (blocks)
#
# BPs: The base partitions (midplanes) in the block using XYZ coordinates
# Type:  Connection type "MESH" or "TORUS" or "SMALL", default is "TORUS"
#        Type SMALL will divide a midplane into multiple blocks
#        based off options NodeCards and Quarters to determine type of
#        small blocks.
#
# IMPORTANT NOTES:
# * Ordering is very important for laying out switch wires.  Please create
#   blocks with smap, and once done don't move the order of blocks
#   created.
# * A block is implicitly created containing all resources on the system
# * Blocks must not overlap (except for implicitly created block)
#   This will be the case when smap is used to create a configuration file
# * All midplanes defined here must also be defined in the slurm.conf file
# * Define only the numeric coordinates of the blocks here. The prefix
#   will be based upon the name defined in slurm.conf
###############################################################################
# LEAVE NEXT LINE AS A COMMENT, Full-system block, implicitly created
# BPs=[000x001] Type=TORUS       # 1x1x2 = 2 midplanes
###############################################################################
# volume = 1x1x1 = 1
BPs=[000x000] Type=TORUS                            # 1x1x1 =  1 midplane
BPs=[001x001] Type=SMALL 32CNBlocks=4 128CNBlocks=3 # 1x1x1 = 4 nodecard-sized
						    # c-node blocks and 3
						    # quarter-midplane sized
						    # c-node blocks

The above bluegene.conf file defines multiple blocks to be created in a single midplane (see the "SMALL" option). Using this mechanism, up to 32 independent jobs each consisting of 32 c-nodes can be executed simultaneously on a one-rack BlueGene system. If defining blocks of Type=SMALL, the Slurm partition containing them as defined in slurm.conf must have the parameter OverSubscribe=force to enable scheduling of multiple jobs on what Slurm considers a single node. Slurm partitions that do not contain blocks of Type=SMALL may have the parameter OverSubscribe=no for a slight improvement in scheduler performance. As in all Slurm configuration files, parameters and values are case insensitive.
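
For example, a hedged slurm.conf sketch for a partition containing the SMALL blocks defined above (the partition name is hypothetical):

# Midplane 001 is divided into Type=SMALL blocks, so over-subscription is forced
PartitionName=psmall Nodes=bg001 Default=YES State=UP OverSubscribe=FORCE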

The valid image names on a BlueGene/P system are CnloadImage, MloaderImage, and IoloadImage. The only image name on BlueGene/Q systems is MloaderImage. Alternate images may be specified as described above for all BlueGene system types.

When slurmctld is initially started on an idle system, the blocks already defined in MMCS are read using the Bridge APIs. If these blocks do not correspond to those defined in the bluegene.conf file, the old blocks with a prefix of "RMP" are destroyed and new ones created. When a job is scheduled, the appropriate block is identified, its user set, and it is booted. Subsequent jobs use this same block without rebooting by changing the associated user field. The only time blocks should be freed and rebooted, in normal operation, is when going to or from full-system jobs (two or more blocks sharing midplanes can not be in a ready state at the same time). When this logic became available at LLNL, approximately 85 percent of block boots were eliminated and the overhead of job startup went from about 24% to about 6% of total job time. Note that blocks will remain in a ready (booted) state when the Slurm daemons are stopped. This permits Slurm daemon restarts without loss of running jobs or rebooting of blocks.

Be aware that Slurm will issue multiple block boot requests as needed (e.g. when a boot fails). If the block boot requests repeatedly fail (more than 3 times), Slurm will set the failing block to an ERROR state so as to avoid continued reboots and the likely failure of user jobs. A system administrator should address the problem before returning the block to service with scontrol.

If the slurmctld daemon is cold-started (/etc/init.d/slurm startclean or slurmctld -c) it is recommended that the slurmd daemon(s) be cold-started at the same time. Failure to do so may result in errors being reported by both slurmd and slurmctld due to blocks that previously existed being deleted.

Resource Reservations

Slurm's advance reservation mechanism can accept a node count specification as input rather than identification of specific nodes/midplanes. C-nodes can be reserved instead of full midplanes, so asking for 32 nodes will result in a reservation consisting of 32 c-nodes. You can also request specific c-nodes in your reservation (e.g. "scontrol create reservation nodes=bgq0000[00000x11111]" will result in the first nodecard of midplane bgq0000 being reserved). Multiple block sizes can also be specified and a reservation will be made that includes those block sizes (e.g. "scontrol create reservation nodecnt=4k,2k ..."). In earlier versions of Slurm, the nodes/midplanes selected for a reservation when specifying a node count might not be suitable for creating block(s) of the desired size(s).
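
A fuller, hedged example of reserving specific c-nodes (the reservation name, user, and times are placeholders):

scontrol create reservation ReservationName=maint1 Users=alice \
   StartTime=now Duration=120 Nodes=bgq0000[00000x11111]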

Debugging

All of the testing and debugging guidance provided in the Quick Start Administrator Guide applies to BlueGene systems. One can start the slurmctld and slurmd daemons in the foreground with extensive debugging to establish basic functionality. Once running in production, the configured SlurmctldLog and SlurmdLog files will provide historical system information. On BlueGene systems, there is also a BridgeAPILogFile defined in bluegene.conf which can be configured to contain detailed information about every Bridge API call issued.

Note that slurmctld log messages of the sort "Nodes bg[000x133] not responding" indicate that the slurmd daemon serving as a front-end for those midplanes is not responding (on non-BlueGene systems, the slurmd daemon actually does run on the compute nodes, so the message is more meaningful there).

Note that you can emulate a BlueGene/Q system on a stand-alone Linux system. Run configure with the --enable-bgq-emulation option. This will define "HAVE_BG", "HAVE_BGQ", and "HAVE_FRONT_END" in the config.h file. Then execute make normally. These variables will build the code as if it were running on an actual BlueGene computer, but avoid making calls to the Bridge library (which is controlled by the variable "HAVE_BG_FILES", left undefined). You can use this to test configurations, scheduling logic, etc.
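
A minimal sketch of an emulation build (any other configure options, such as an installation prefix, are omitted):

./configure --enable-bgq-emulation
make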

Last modified 7 November 2017