Slurm Switch Plugin API

Overview

This document describes Slurm switch (interconnect) plugins and the API that defines them. It is intended as a resource for programmers wishing to write their own Slurm switch plugins. Note that many of the API functions are used by only one of the daemons. For example, the slurmctld daemon builds a job step's switch credential (switch_p_build_jobinfo), while the slurmd daemon enables and disables that credential for the job step's tasks on a particular node (switch_p_job_init, etc.).

Slurm switch plugins are Slurm plugins that implement the Slurm switch or interconnect API described herein. They must conform to the Slurm Plugin API with the following specifications:

const char plugin_type[]
The major type must be "switch." The minor type can be any recognizable abbreviation for the type of switch. We recommend, for example:

  • none — A plugin that implements the API without providing any actual switch service. This is the case for Ethernet and Myrinet interconnects.

const char plugin_name[]
Some descriptive name for the plugin. There is no requirement with respect to its format.

const uint32_t plugin_version
If specified, identifies the version of Slurm used to build this plugin; any attempt to load the plugin from a different version of Slurm will result in an error. If not specified, the plugin may be loaded by Slurm commands and daemons from any version. However, this may result in difficult-to-diagnose failures due to changes in the arguments to plugin functions or changes in other Slurm functions used by the plugin.
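
As an illustration of the three identification symbols described above, the following is a minimal sketch for a "none" plugin. It assumes the usual "major/minor" form of plugin_type. DEMO_SLURM_VERSION_NUMBER and demo_is_switch_plugin() are hypothetical names used only to keep the sketch self-contained; a real plugin would normally take its version from the Slurm headers (SLURM_VERSION_NUMBER).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: placeholder for the SLURM_VERSION_NUMBER macro that the
 * Slurm headers would provide in a real build. */
#define DEMO_SLURM_VERSION_NUMBER 0x000000u

/* Major type "switch", minor type "none". */
const char plugin_type[] = "switch/none";

/* Free-form descriptive name; no format is required. */
const char plugin_name[] = "switch NONE plugin (sketch)";

/* Optional: ties the plugin to the Slurm version it was built against. */
const uint32_t plugin_version = DEMO_SLURM_VERSION_NUMBER;

/* Hypothetical helper: returns 1 if the type string has major type "switch". */
static int demo_is_switch_plugin(const char *type)
{
    return strncmp(type, "switch/", 7) == 0;
}
```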

Data Objects

The implementation must support two opaque data classes. One is used as a job step's switch "credential." This class must encapsulate all job step specific information necessary for the operation of the API specification below. The second is a node's switch state record. Both data classes are referred to in Slurm code using an anonymous pointer (void *).

API Functions

The following functions must appear. Functions which are not implemented should be stubbed.

int init (void)

Description:
Called when the plugin is loaded, before any other functions are called. Put global initialization here.

Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.

void fini (void)

Description:
Called when the plugin is removed. Clear any allocated storage here.

Returns: None.

Note: These init and fini functions are not the same as those described in the dlopen(3) system library. The C run-time system co-opts those symbols for its own initialization. The system _init() is called before the Slurm init(), and the Slurm fini() is called before the system's _fini().
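
A minimal sketch of the init/fini pair follows. The SLURM_SUCCESS and SLURM_ERROR values are defined locally as stand-ins for the Slurm headers, and demo_global_table is a hypothetical piece of global plugin state.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for the return codes from the Slurm headers. */
#define SLURM_SUCCESS 0
#define SLURM_ERROR (-1)

/* Hypothetical global state owned by the plugin. */
static int *demo_global_table = NULL;

/* Called once when the plugin is loaded. */
int init(void)
{
    demo_global_table = calloc(16, sizeof(int));
    return demo_global_table ? SLURM_SUCCESS : SLURM_ERROR;
}

/* Called once when the plugin is removed; releases what init() allocated. */
void fini(void)
{
    free(demo_global_table);
    demo_global_table = NULL;
}
```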

Global Switch State Functions

int switch_p_libstate_save (char *dir_name);

Description: Save any global switch state to a file within the specified directory. The actual file name used is plugin specific. It is recommended that the global switch state contain a magic number for validation purposes. This function is called by the slurmctld daemon on shutdown. Note that if the slurmctld daemon fails, this function will not be called. The plugin may save state independently and/or make use of the switch_p_job_step_allocated function to restore state.

Arguments: dir_name    (input) fully-qualified pathname of a directory into which user SlurmUser (as defined in slurm.conf) can create a file and write state information into that file. Cannot be NULL.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_libstate_restore(char *dir_name, bool recover);

Description: Restore any global switch state from a file within the specified directory. The actual file name used is plugin specific. It is recommended that any magic number associated with the global switch state be verified. This function is called by the slurmctld daemon on startup.

Arguments:
dir_name    (input) fully-qualified pathname of a directory containing a state information file from which user SlurmUser (as defined in slurm.conf) can read. Cannot be NULL.
recover    (input) true if restarting with state preserved, false if no state recovery.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
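
The save and restore functions above might be sketched as follows. This is only an illustration under stated assumptions: the file name, the magic number, and the single counter used as "global switch state" are all hypothetical, and resetting the state when recover is false is one possible interpretation of "no state recovery".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SLURM_SUCCESS 0                      /* stand-ins for Slurm headers */
#define SLURM_ERROR (-1)
#define DEMO_STATE_MAGIC 0x53574348u         /* hypothetical magic number */
#define DEMO_STATE_FILE "demo_switch_state"  /* hypothetical file name */

static uint32_t demo_windows_in_use = 0;     /* hypothetical global state */

/* Build "<dir_name>/<file>" into out; fails if the path does not fit. */
static int demo_state_path(char *out, size_t len, const char *dir_name)
{
    int n = snprintf(out, len, "%s/%s", dir_name, DEMO_STATE_FILE);
    if (n < 0 || (size_t)n >= len) {
        errno = ENAMETOOLONG;
        return SLURM_ERROR;
    }
    return SLURM_SUCCESS;
}

int switch_p_libstate_save(char *dir_name)
{
    char path[1024];
    uint32_t rec[2] = { DEMO_STATE_MAGIC, demo_windows_in_use };
    FILE *fp;

    assert(dir_name);                        /* documented as non-NULL */
    if (demo_state_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    fp = fopen(path, "wb");
    if (!fp)
        return SLURM_ERROR;                  /* errno set by fopen() */
    if (fwrite(rec, sizeof(rec[0]), 2, fp) != 2) {
        fclose(fp);
        errno = EIO;
        return SLURM_ERROR;
    }
    return (fclose(fp) == 0) ? SLURM_SUCCESS : SLURM_ERROR;
}

int switch_p_libstate_restore(char *dir_name, bool recover)
{
    char path[1024];
    uint32_t rec[2];
    size_t got;
    FILE *fp;

    assert(dir_name);
    if (!recover) {                          /* assumption: start clean */
        demo_windows_in_use = 0;
        return SLURM_SUCCESS;
    }
    if (demo_state_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    fp = fopen(path, "rb");
    if (!fp)
        return SLURM_ERROR;
    got = fread(rec, sizeof(rec[0]), 2, fp);
    fclose(fp);
    if (got != 2 || rec[0] != DEMO_STATE_MAGIC) {  /* verify magic number */
        errno = EINVAL;
        return SLURM_ERROR;
    }
    demo_windows_in_use = rec[1];
    return SLURM_SUCCESS;
}
```

Verifying the magic number on restore lets the plugin reject a truncated file or a file written by a different plugin rather than loading garbage state.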

int switch_p_libstate_clear (void);

Description: Clear switch state information.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

Node's Switch State Monitoring Functions

Nodes will register with current switch state information when the slurmd daemon is initiated. The slurmctld daemon will also request that slurmd supply current switch state information on a periodic basis.

int switch_p_clear_node_state (void);

Description: Initialize node state. If any switch state has previously been established for a job step, it will be cleared. This will be used to establish a "clean" state for the switch on the node upon which it is executed.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_alloc_node_info(switch_node_info_t *switch_node);

Description: Allocate storage for a node's switch state record. It is recommended that the record contain a magic number for validation purposes.

Arguments: switch_node    (output) location in which to store the location of the node's switch state record.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_build_node_info(switch_node_info_t switch_node);

Description: Fill in a previously allocated switch state record for the node on which this function is executed. It is recommended that the magic number be validated.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_pack_node_info (switch_node_info_t switch_node, Buf buffer);

Description: Pack the data associated with a node's switch state into a buffer for network transmission.

Arguments:
switch_node    (input) an existing node's switch state record.
buffer    (input/output) buffer onto which the switch state information is appended.

Returns: The number of bytes written should be returned if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_unpack_node_info (switch_node_info_t **switch_node, Buf buffer);

Description: Allocate and unpack the data associated with a node's switch state record from a buffer.

Arguments:
switch_node    (output) a node switch state record will be allocated and filled in with data read from the buffer.
buffer    (input/output) buffer from which the record's contents are read.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
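
The pack and unpack functions above can be illustrated with the round trip below. Slurm's Buf type and pack helpers are not reproduced here; demo_buf_t, demo_pack32() and demo_unpack32() are hypothetical stand-ins, and the record layout is invented for the example. The functions use demo_ names because their signatures differ from the real API.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SLURM_SUCCESS 0                 /* stand-ins for Slurm headers */
#define SLURM_ERROR (-1)
#define DEMO_NODE_MAGIC 0x4e4f4445u     /* hypothetical magic number */

/* Hypothetical stand-in for Buf: a fixed byte array with two cursors. */
typedef struct {
    uint8_t data[256];
    size_t used;       /* bytes appended so far */
    size_t read_pos;   /* next byte to unpack */
} demo_buf_t;

/* Hypothetical node switch state record. */
struct demo_node_info {
    uint32_t magic;
    uint32_t adapter_count;
};

/* Append a 32-bit value in big-endian (network) byte order. */
static int demo_pack32(uint32_t v, demo_buf_t *b)
{
    if (sizeof(b->data) - b->used < 4) {
        errno = ENOSPC;
        return SLURM_ERROR;
    }
    b->data[b->used++] = (uint8_t)(v >> 24);
    b->data[b->used++] = (uint8_t)(v >> 16);
    b->data[b->used++] = (uint8_t)(v >> 8);
    b->data[b->used++] = (uint8_t)v;
    return SLURM_SUCCESS;
}

/* Read a 32-bit big-endian value from the buffer. */
static int demo_unpack32(uint32_t *v, demo_buf_t *b)
{
    const uint8_t *p = b->data + b->read_pos;

    if (b->used - b->read_pos < 4) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    *v = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    b->read_pos += 4;
    return SLURM_SUCCESS;
}

/* Returns the number of bytes appended, or SLURM_ERROR. */
int demo_pack_node_info(const struct demo_node_info *n, demo_buf_t *b)
{
    size_t start = b->used;

    if (!n || n->magic != DEMO_NODE_MAGIC) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    if (demo_pack32(n->magic, b) || demo_pack32(n->adapter_count, b)) {
        b->used = start;                /* undo a partial write */
        return SLURM_ERROR;
    }
    return (int)(b->used - start);
}

/* Allocates a record and fills it from the buffer. */
int demo_unpack_node_info(struct demo_node_info **out, demo_buf_t *b)
{
    struct demo_node_info *n = calloc(1, sizeof(*n));

    if (!n) {
        errno = ENOMEM;
        return SLURM_ERROR;
    }
    if (demo_unpack32(&n->magic, b) || demo_unpack32(&n->adapter_count, b) ||
        n->magic != DEMO_NODE_MAGIC) {
        free(n);
        errno = EINVAL;
        return SLURM_ERROR;
    }
    *out = n;
    return SLURM_SUCCESS;
}
```

Packing field by field in a fixed byte order, rather than copying the struct, keeps the wire format independent of the padding and endianness of the sending host.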

void switch_p_free_node_info (switch_node_info_t switch_node);

Description: Release the storage associated with a node's switch state record.

Arguments: switch_node    (input/output) a previously allocated node switch state record.

Returns: None
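
The allocate, build, and free functions in this section could fit together as in the sketch below. It assumes that switch_node_info_t is an opaque pointer type, as the signatures above suggest; struct demo_node_info, its fields, and the values written by the build step are hypothetical placeholders.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SLURM_SUCCESS 0                 /* stand-ins for Slurm headers */
#define SLURM_ERROR (-1)
#define DEMO_NODE_MAGIC 0x4e4f4445u     /* hypothetical magic number */

/* Hypothetical node switch state record. */
struct demo_node_info {
    uint32_t magic;
    char node_name[64];
    uint32_t adapter_count;
};

/* Assumption: the opaque handle is a pointer to the plugin's own record. */
typedef struct demo_node_info *switch_node_info_t;

int switch_p_alloc_node_info(switch_node_info_t *switch_node)
{
    struct demo_node_info *n;

    if (!switch_node) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    n = calloc(1, sizeof(*n));
    if (!n) {
        errno = ENOMEM;
        return SLURM_ERROR;
    }
    n->magic = DEMO_NODE_MAGIC;         /* stamp the record for validation */
    *switch_node = n;
    return SLURM_SUCCESS;
}

int switch_p_build_node_info(switch_node_info_t switch_node)
{
    if (!switch_node || switch_node->magic != DEMO_NODE_MAGIC) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    /* Placeholder values; a real plugin would query the local hardware. */
    snprintf(switch_node->node_name, sizeof(switch_node->node_name),
             "%s", "demo-node");
    switch_node->adapter_count = 1;
    return SLURM_SUCCESS;
}

void switch_p_free_node_info(switch_node_info_t switch_node)
{
    if (!switch_node)
        return;
    assert(switch_node->magic == DEMO_NODE_MAGIC);
    switch_node->magic = 0;             /* helps catch use after free */
    free(switch_node);
}
```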

char * switch_p_sprintf_node_info (switch_node_info_t switch_node, char *buf, size_t size);

Description: Print the contents of a node's switch state record to a buffer.

Arguments:
switch_node    (input) a node's switch state record.
buf    (input/output) pointer to buffer into which the switch state record is to be written.
size    (input) size of buf in bytes.

Returns: Location of buffer, same as buf.
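
A short sketch of the print-to-buffer contract described above: write at most size bytes and return buf. The record layout and the output format are hypothetical, and switch_node_info_t is again assumed to be an opaque pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical node switch state record. */
struct demo_node_info {
    uint32_t magic;
    char node_name[64];
    uint32_t adapter_count;
};

/* Assumption: the opaque handle is a pointer to the plugin's own record. */
typedef struct demo_node_info *switch_node_info_t;

char *switch_p_sprintf_node_info(switch_node_info_t switch_node,
                                 char *buf, size_t size)
{
    if (!buf || size == 0)
        return buf;
    if (!switch_node) {
        buf[0] = '\0';
        return buf;
    }
    /* snprintf() truncates safely and always NUL-terminates when size > 0. */
    snprintf(buf, size, "node=%s adapters=%u",
             switch_node->node_name, (unsigned)switch_node->adapter_count);
    return buf;
}
```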

Job's Switch Credential Management Functions

int switch_p_alloc_jobinfo(switch_jobinfo_t *switch_job, uint32_t job_id, uint32_t step_id);

Description: Allocate storage for a job step's switch credential. It is recommended that the credential contain a magic number for validation purposes.

Arguments:
switch_job    (output) location in which to store the location of the job step's switch credential.
job_id    (input) the job id for this job step, or NO_VAL if not set.
step_id    (input) the step id for this job step, or NO_VAL if not set.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_build_jobinfo (switch_jobinfo_t switch_job, slurm_step_layout_t *step_layout, char *network);

Description: Build a job's switch credential. It is recommended that the credential's magic number be validated.

Arguments:
switch_job    (input/output) Job's switch credential to be updated
step_layout    (input) the layout of the step with at least the node_list, tasks and tids set.
network    (input) Job step's network specification from srun command.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

switch_jobinfo_t switch_p_copy_jobinfo (switch_jobinfo_t switch_job);

Description: Allocate storage for a job's switch credential and copy an existing credential to that location.

Arguments: switch_job    (input) an existing job step switch credential.

Returns: A newly allocated job step switch credential containing a copy of the function argument.

void switch_p_free_jobinfo (switch_jobinfo_t switch_job);

Description: Release the storage associated with a job's switch credential.

Arguments: switch_job    (input) an existing job step switch credential.

Returns: None
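
The allocate, copy, and free functions for the credential might look like the sketch below. It assumes switch_jobinfo_t is an opaque pointer type, as the signatures above suggest. struct demo_jobinfo and its "windows" array are hypothetical; the array is included only to show why the copy has to be a deep copy.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SLURM_SUCCESS 0                 /* stand-ins for Slurm headers */
#define SLURM_ERROR (-1)
#define DEMO_JOB_MAGIC 0x4a4f4249u      /* hypothetical magic number */

/* Hypothetical job step switch credential. */
struct demo_jobinfo {
    uint32_t magic;
    uint32_t job_id;
    uint32_t step_id;
    uint32_t window_count;
    uint32_t *windows;                  /* heap-allocated, may be NULL */
};

/* Assumption: the opaque handle is a pointer to the plugin's own record. */
typedef struct demo_jobinfo *switch_jobinfo_t;

int switch_p_alloc_jobinfo(switch_jobinfo_t *switch_job,
                           uint32_t job_id, uint32_t step_id)
{
    struct demo_jobinfo *j;

    if (!switch_job) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    j = calloc(1, sizeof(*j));
    if (!j) {
        errno = ENOMEM;
        return SLURM_ERROR;
    }
    j->magic = DEMO_JOB_MAGIC;
    j->job_id = job_id;
    j->step_id = step_id;
    *switch_job = j;
    return SLURM_SUCCESS;
}

switch_jobinfo_t switch_p_copy_jobinfo(switch_jobinfo_t switch_job)
{
    struct demo_jobinfo *c;
    size_t bytes;

    if (!switch_job || switch_job->magic != DEMO_JOB_MAGIC) {
        errno = EINVAL;
        return NULL;
    }
    c = malloc(sizeof(*c));
    if (!c) {
        errno = ENOMEM;
        return NULL;
    }
    *c = *switch_job;                   /* copy the scalar fields */
    c->windows = NULL;
    if (switch_job->window_count && switch_job->windows) {
        bytes = switch_job->window_count * sizeof(uint32_t);
        c->windows = malloc(bytes);     /* the copy owns its own array */
        if (!c->windows) {
            free(c);
            errno = ENOMEM;
            return NULL;
        }
        memcpy(c->windows, switch_job->windows, bytes);
    }
    return c;
}

void switch_p_free_jobinfo(switch_jobinfo_t switch_job)
{
    if (!switch_job)
        return;
    free(switch_job->windows);
    switch_job->magic = 0;              /* helps catch use after free */
    free(switch_job);
}
```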

int switch_p_pack_jobinfo (switch_jobinfo_t switch_job, Buf buffer);

Description: Pack the data associated with a job step's switch credential into a buffer for network transmission.

Arguments:
switch_job    (input) an existing job step switch credential.
buffer    (input/output) buffer onto which the credential's contents are appended.

Returns: The number of bytes written should be returned if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_unpack_jobinfo (switch_jobinfo_t **switch_job, Buf buffer);

Description: Allocate and unpack the data associated with a job's switch credential from a buffer.

Arguments:
switch_job    (output) a job step switch credential will be allocated and filled in with data read from the buffer.
buffer    (input/output) buffer from which the credential's contents are read.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_get_jobinfo (switch_jobinfo_t switch_job, int data_type, void *data);

Description: Get some specific data from a job's switch credential.

Arguments:
switch_job    (input) a job's switch credential.
data_type    (input) identification as to the type of data requested. The interpretation of this value is plugin dependent.
data    (output) filled in with the desired data. The form of this data is dependent upon the value of data_type and the plugin.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
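
Since the interpretation of data_type is left to the plugin, one plausible shape for switch_p_get_jobinfo is a switch over plugin-defined keys. The enum values and the credential layout below are hypothetical, and switch_jobinfo_t is assumed to be an opaque pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SLURM_SUCCESS 0                 /* stand-ins for Slurm headers */
#define SLURM_ERROR (-1)
#define DEMO_JOB_MAGIC 0x4a4f4249u      /* hypothetical magic number */

/* Hypothetical plugin-defined keys for data_type. */
enum demo_data_type {
    DEMO_DATA_JOB_ID = 1,
    DEMO_DATA_WINDOW_COUNT = 2
};

/* Hypothetical job step switch credential. */
struct demo_jobinfo {
    uint32_t magic;
    uint32_t job_id;
    uint32_t step_id;
    uint32_t window_count;
};

/* Assumption: the opaque handle is a pointer to the plugin's own record. */
typedef struct demo_jobinfo *switch_jobinfo_t;

int switch_p_get_jobinfo(switch_jobinfo_t switch_job, int data_type,
                         void *data)
{
    if (!switch_job || !data || switch_job->magic != DEMO_JOB_MAGIC) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    switch (data_type) {
    case DEMO_DATA_JOB_ID:
        *(uint32_t *)data = switch_job->job_id;
        return SLURM_SUCCESS;
    case DEMO_DATA_WINDOW_COUNT:
        *(uint32_t *)data = switch_job->window_count;
        return SLURM_SUCCESS;
    default:                            /* unknown key for this plugin */
        errno = EINVAL;
        return SLURM_ERROR;
    }
}
```

The caller has to know which C type each key expects behind the void pointer; in this sketch every key writes a uint32_t.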

int switch_p_job_step_complete (switch_jobinfo_t switch_job, char *nodelist);

Description: Note that the job step associated with the specified nodelist has completed execution.

Arguments:
switch_job    (input) The completed job step's switch credential.
nodelist    (input) A list of nodes on which the job step has completed. This may contain expressions to specify node ranges. (e.g. "linux[1-20]" or "linux[2,4,6,8]").

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_step_part_comp (switch_jobinfo_t switch_job, char *nodelist);

Description: Note that the job step has completed execution on the specified node list. The job step is not necessarily completed on all nodes, but switch resources associated with it on the specified nodes are no longer in use.

Arguments:
switch_job    (input) The completed job's switch credential.
nodelist    (input) A list of nodes on which the job step has completed. This may contain expressions to specify node ranges. (e.g. "linux[1-20]" or "linux[2,4,6,8]").

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

bool switch_p_part_comp (void);

Description: Indicate if the switch plugin should process partial job step completions (i.e. switch_g_job_step_part_comp). Support of partial completions is compute intensive, so it should be avoided unless switch resources are in short supply (e.g. former switch/nrt).

Returns: True if partial step completions are to be recorded. False if only full job step completions are to be noted.

void switch_p_print_jobinfo(FILE *fp, switch_jobinfo_t switch_job);

Description: Print the contents of a job's switch credential to a file.

Arguments:
fp    (input) pointer to an open file.
switch_job    (input) a job's switch credential.

Returns: None.

char *switch_p_sprint_jobinfo(switch_jobinfo_t switch_job, char *buf, size_t size);

Description: Print the contents of a job's switch credential to a buffer.

Arguments:
switch_job    (input) a job's switch credential.
buf    (input/output) pointer to buffer into which the job credential information is to be written.
size    (input) size of buf in bytes

Returns: location of buffer, same as buf.

int switch_p_get_data_jobinfo(switch_jobinfo_t switch_job, int key, void *resulting_data);

Description: Get data from a job step's switch credential.

Arguments:
switch_job    (input) a job step's switch credential.
key    (input) identification of the type of data to be retrieved from the switch credential. NOTE: The interpretation of this key is dependent upon the switch type.
resulting_data    (input/output) pointer to where the requested data should be stored.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

Node Specific Switch Management Functions

int switch_p_node_init (void);

Description: This function is run from the top level slurmd only once per slurmd run. It may be used, for instance, to perform some one-time interconnect setup or spawn an error handling thread.

Arguments: None

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_node_fini (void);

Description: This function is called once as slurmd exits (slurmd will wait for this function to return before continuing the exit process).

Arguments: None

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

Job Step Management Functions

=========================================================================
Process 1 (root)        Process 2 (root, user)  |  Process 3 (user task)
                                                |
switch_p_job_preinit                            |
fork ------------------ switch_p_job_init       |
waitpid                 setuid, chdir, etc.     |
                        fork N procs -----------+--- switch_p_job_attach
                        wait all                |    exec mpi process
                        switch_p_job_fini*      |
switch_p_job_postfini                           |
=========================================================================
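
The process structure in the diagram above can be sketched with plain fork() and wait calls. This is not the slurmd implementation: the demo_ functions are hypothetical no-op stand-ins for the plugin entry points named in the diagram, and the setuid, chdir and exec steps are only marked by comments.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SLURM_SUCCESS 0                 /* stand-ins for Slurm headers */
#define SLURM_ERROR (-1)

/* Hypothetical no-op stand-ins for the plugin entry points. */
static int demo_preinit(void)  { return SLURM_SUCCESS; }
static int demo_init(void)     { return SLURM_SUCCESS; }
static int demo_attach(int id) { (void)id; return SLURM_SUCCESS; }
static int demo_fini(void)     { return SLURM_SUCCESS; }
static int demo_postfini(void) { return SLURM_SUCCESS; }

/* Process 2: init, fork the tasks, wait for all of them, then fini.
 * Returns the exit code for this process. */
static int demo_step_process(int ntasks)
{
    int i, status, rc = 0;

    if (demo_init() != SLURM_SUCCESS)
        return 1;
    /* A real step manager would setuid(), chdir(), etc. here. */
    for (i = 0; i < ntasks; i++) {
        pid_t task = fork();
        if (task < 0) {
            rc = 1;
            break;
        }
        if (task == 0) {
            /* Process 3: attach, then a real task would exec the program. */
            _exit(demo_attach(i) == SLURM_SUCCESS ? 0 : 1);
        }
    }
    while (wait(&status) > 0) {         /* "wait all" in the diagram */
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            rc = 1;
    }
    if (demo_fini() != SLURM_SUCCESS)
        rc = 1;
    return rc;
}

/* Process 1: preinit, fork process 2, wait for it, then postfini. */
int demo_run_step(int ntasks)
{
    int status;
    pid_t child;

    if (demo_preinit() != SLURM_SUCCESS)
        return SLURM_ERROR;
    child = fork();
    if (child < 0)
        return SLURM_ERROR;
    if (child == 0)
        _exit(demo_step_process(ntasks));
    if (waitpid(child, &status, 0) < 0)
        return SLURM_ERROR;
    if (demo_postfini() != SLURM_SUCCESS)
        return SLURM_ERROR;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0)
           ? SLURM_SUCCESS : SLURM_ERROR;
}
```

In this sketch the preinit and postfini calls bracket the lifetime of the second process, which is why cleanup that needs the original privileges belongs in the postfini step.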

int switch_p_job_preinit (switch_jobinfo_t switch_job);

Description: Preinit is run as root in the first slurmd process, the so called job step manager. This function can be used to perform any initialization that needs to be performed in the same process as switch_p_job_fini().

Arguments: switch_job    (input) a job's switch credential.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_init (stepd_step_rec_t *job, uid_t uid);

Description: Initialize the interconnect on the node for a job. This function is run from the second slurmd process. Some interconnect implementations (e.g. Quadrics Elan) may require switch_p_job_init() to be executed in a different process from the one executing switch_p_job_fini().

Arguments:
job    (input) structure representing the slurmstepd's view of the job step.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_attach ( switch_jobinfo_t switch_job, char ***env, uint32_t nodeid, uint32_t procid, uint32_t nnodes, uint32_t nprocs, uint32_t rank );

Description: Attach process to interconnect (Called from within the process, so it is appropriate to set interconnect specific environment variables here).

Arguments:
switch_job    (input) a job's switch credential.
env    (input/output) the environment variables to be set upon job step initiation. Switch specific environment variables are added as needed.
nodeid    (input) zero-origin id of this node.
procid    (input) zero-origin process id local to slurmd and not equivalent to the global task id or MPI rank.
nnodes    (input) count of nodes allocated to this job step.
nprocs    (input) total count of processes or tasks to be initiated for this job step.
rank    (input) zero-origin id of this task.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
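
Because switch_p_job_attach receives the task environment by reference, a plugin can add its own variables there. The sketch below shows one way to do that with a NULL-terminated, heap-allocated array. The DEMO_SWITCH_* variable names and demo_env_append() are hypothetical, switch_jobinfo_t is reduced to a void pointer stand-in, and the helper appends without checking for an existing entry of the same name.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SLURM_SUCCESS 0                 /* stand-ins for Slurm headers */
#define SLURM_ERROR (-1)

typedef void *switch_jobinfo_t;         /* stand-in for the opaque handle */

/* Hypothetical helper: append "name=value" to a NULL-terminated array
 * that was allocated with malloc/realloc (or is NULL when empty). */
static int demo_env_append(char ***env, const char *name, unsigned value)
{
    size_t n = 0;
    size_t len = strlen(name) + 1 + 10 + 1;  /* '=', 10 digits, NUL */
    char **grown;
    char *entry;

    if (*env)
        while ((*env)[n])
            n++;
    entry = malloc(len);
    if (!entry) {
        errno = ENOMEM;
        return SLURM_ERROR;
    }
    snprintf(entry, len, "%s=%u", name, value);
    grown = realloc(*env, (n + 2) * sizeof(char *));
    if (!grown) {
        free(entry);
        errno = ENOMEM;
        return SLURM_ERROR;
    }
    grown[n] = entry;
    grown[n + 1] = NULL;
    *env = grown;
    return SLURM_SUCCESS;
}

int switch_p_job_attach(switch_jobinfo_t switch_job, char ***env,
                        uint32_t nodeid, uint32_t procid, uint32_t nnodes,
                        uint32_t nprocs, uint32_t rank)
{
    (void)switch_job;                   /* unused in this sketch */
    (void)procid;
    (void)nnodes;
    if (!env) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    if (demo_env_append(env, "DEMO_SWITCH_NODEID", nodeid) ||
        demo_env_append(env, "DEMO_SWITCH_RANK", rank) ||
        demo_env_append(env, "DEMO_SWITCH_NPROCS", nprocs))
        return SLURM_ERROR;
    return SLURM_SUCCESS;
}
```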

int switch_p_job_fini (switch_jobinfo_t switch_job);

Description: This function is run from the same process as switch_p_job_init() after all job tasks have exited. It is *not* run as root, because the process in question has already setuid to the job step owner.

Arguments: switch_job    (input) a job step's switch credential.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_postfini ( stepd_step_rec_t *job );

Description: This function is run from the initial slurmd process (same process as switch_p_job_preinit()), and is run as root. Any cleanup routines that need to be run with root privileges should be run from this function.

Arguments:
job    (input) structure representing the slurmstepd's view of the job step.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int switch_p_job_step_allocated (switch_jobinfo_t switch_job, char *nodelist);

Description: Note that the identified job step is active at restart time. This function can be used to restore global switch state information based upon job steps known to be active at restart time. Use of this function is preferred over switch state saved and restored by the switch plugin. Direct use of job step switch information eliminates the possibility of inconsistent state information between the switch and job steps.

Arguments:
switch_job    (input) a job's switch credential.
nodelist    (input) the nodes allocated to a job step.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

Job Management Suspend/Resume Functions

int switch_p_job_suspend_test(switch_jobinfo_t *switch_job);

Description: Determine if a specific job step can be preempted.

Arguments:
switch_job    (input) a job step's switch credential.

Returns: SLURM_SUCCESS if the job step can be preempted and SLURM_ERROR otherwise.

void switch_p_job_suspend_info_get(switch_jobinfo_t *switch_job, void **suspend_info);

Description: Pack any information needed for a job step to be preempted into an opaque data structure.
NOTE: Use switch_p_job_suspend_info_free() to free the opaque data structure.

Arguments:
switch_job    (input) a job step's switch credential.
suspend_info    (input/output) information needed for a job to be preempted. This should be NULL for the first call; data about job steps is added to the opaque data structure on each additional function call (i.e. for each additional job step).

void switch_p_job_suspend_info_pack(void *suspend_info, Buf buffer);

Description: Pack the information needed for a job to be preempted into a buffer.

Arguments:
suspend_info    (input) information needed for a job to be preempted, including information for all steps in that job.
buffer    (input/output) the buffer that has suspend_info added to it.

int switch_p_job_suspend_info_unpack(void **suspend_info, Buf buffer);

Description: Unpack the information needed for a job to be preempted from a buffer.
NOTE: Use switch_p_job_suspend_info_free() to free the opaque data structure.

Arguments:
suspend_info    (output) information needed for a job to be preempted, including information for all steps in that job.
buffer    (input/output) the buffer that has suspend_info extracted from it.

Returns: SLURM_SUCCESS if the suspend_info data was successfully read from buffer and SLURM_ERROR otherwise.

int switch_p_job_suspend(void *suspend_info, int max_wait);

Description: Suspend a job's use of switch resources. This may reset MPI timeout values and/or release switch resources.

Arguments:
suspend_info    (input) information needed for a job to be preempted, including information for all steps in that job.
max_wait    (input) maximum time interval to wait for the operation to complete, in seconds.

Returns: SLURM_SUCCESS if job's switch resources suspended and SLURM_ERROR otherwise.

int switch_p_job_resume(void *suspend_info, int max_wait);

Description: Resume a job's use of switch resources. This may reset MPI timeout values and/or release switch resources.

Arguments:
suspend_info    (input) information needed for a job to be resumed, including information for all steps in that job.
max_wait    (input) maximum time interval to wait for the operation to complete, in seconds.

Returns: SLURM_SUCCESS if job's switch resources resumed and SLURM_ERROR otherwise.

void switch_p_job_suspend_info_free(void *suspend_info);

Description: Free the resources allocated to store job suspend/resume information as generated by the switch_p_job_suspend_info_get() and switch_p_job_suspend_info_unpack() functions.

Arguments:
suspend_info    (input) information needed for a job to be preempted, including information for all steps in that job.
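
The accumulate-then-free pattern described for the suspend information (NULL on the first call, one call per job step, released by the free function) might be sketched as follows. The functions are given demo_ names because the record types are hypothetical; they mirror switch_p_job_suspend_info_get() and switch_p_job_suspend_info_free() only in shape.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_STEPS 32               /* hypothetical fixed capacity */

/* Hypothetical job step switch credential. */
struct demo_jobinfo {
    uint32_t job_id;
    uint32_t step_id;
};

/* Hypothetical opaque structure accumulated across calls. */
struct demo_suspend_info {
    uint32_t step_count;
    uint32_t step_ids[DEMO_MAX_STEPS];
};

/* Add one step's data; allocates the structure on the first call,
 * when *suspend_info is NULL. */
void demo_suspend_info_get(const struct demo_jobinfo *step,
                           void **suspend_info)
{
    struct demo_suspend_info *info;

    if (!step || !suspend_info)
        return;
    info = *suspend_info;
    if (!info) {
        info = calloc(1, sizeof(*info));
        if (!info)
            return;
        *suspend_info = info;
    }
    if (info->step_count < DEMO_MAX_STEPS)
        info->step_ids[info->step_count++] = step->step_id;
}

/* Release the structure built by demo_suspend_info_get(). */
void demo_suspend_info_free(void *suspend_info)
{
    free(suspend_info);
}
```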

Job Step Management Suspend/Resume Functions

int switch_p_job_step_pre_suspend (stepd_step_rec_t *jobstep);

Description: Perform any job step pre-suspend functionality (done before the application PIDs are stopped).

Arguments:
jobstep    (input) structure representing the slurmstepd's view of the job step.

Returns: SLURM_SUCCESS if the job step can be suspended and SLURM_ERROR otherwise.

int switch_p_job_step_post_suspend (stepd_step_rec_t *jobstep);

Description: Perform any job step post-suspend functionality (done after the application PIDs are stopped).

Arguments:
jobstep    (input) structure representing the slurmstepd's view of the job step.

Returns: SLURM_SUCCESS if the job step has been suspended and SLURM_ERROR otherwise.

int switch_p_job_step_pre_resume (stepd_step_rec_t *jobstep);

Description: Perform any job step pre-resume functionality (done before the application PIDs are re-started).

Arguments:
jobstep    (input) structure representing the slurmstepd's view of the job step.

Returns: SLURM_SUCCESS if the job step can be resumed and SLURM_ERROR otherwise.

int switch_p_job_step_post_resume (stepd_step_rec_t *jobstep);

Description: Perform any job step post-resume functionality (done after the application PIDs are re-started).

Arguments:
jobstep    (input) structure representing the slurmstepd's view of the job step.

Returns: SLURM_SUCCESS if the job step has been resumed and SLURM_ERROR otherwise.

Last modified 7 March 2019