Slurm Job Checkpoint Plugin Programmer Guide

Overview

This document describes Slurm job checkpoint plugins and the API that defines them. It is intended as a resource to programmers wishing to write their own Slurm job checkpoint plugins.

Slurm job checkpoint plugins are Slurm plugins that implement the Slurm API for checkpointing and restarting jobs. The plugins must conform to the Slurm Plugin API with the following specifications:

const char plugin_type[]
The major type must be "checkpoint". The minor type can be any recognizable abbreviation for the type of checkpoint mechanism. We recommend, for example:

  • none — No job checkpoint.
  • ompi — OpenMPI checkpoint (requires OpenMPI version 1.3 or higher).

const char plugin_name[]
Some descriptive name for the plugin. There is no requirement with respect to its format.

const uint32_t plugin_version
If specified, identifies the version of Slurm used to build this plugin; any attempt to load the plugin from a different version of Slurm results in an error. If not specified, the plugin may be loaded by Slurm commands and daemons from any version. However, this may result in difficult-to-diagnose failures due to changes in the arguments to plugin functions or in other Slurm functions used by the plugin.
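
As a rough illustration, the three identification symbols might be declared as follows. This is a minimal sketch: the minor type "example" and the plugin name are hypothetical, and the placeholder definition of SLURM_VERSION_NUMBER is an assumption made only so the fragment compiles outside the Slurm source tree, where the Slurm headers normally supply it.

```c
#include <stdint.h>

/* Assumption: a real build takes SLURM_VERSION_NUMBER from the Slurm
 * headers. The placeholder below exists only so this sketch compiles
 * on its own and does not correspond to any release. */
#ifndef SLURM_VERSION_NUMBER
#define SLURM_VERSION_NUMBER 0
#endif

/* "checkpoint" is the required major type; "example" is a hypothetical
 * minor type chosen for this sketch. */
const char plugin_type[] = "checkpoint/example";

/* Free-form descriptive name. */
const char plugin_name[] = "Example checkpoint plugin (sketch)";

/* Ties the plugin to the Slurm version it was built against. */
const uint32_t plugin_version = SLURM_VERSION_NUMBER;
```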

Data Objects

The implementation must maintain (though not necessarily directly export) an enumerated errno so that Slurm can discover, as far as is practical, the reason for any failed API call. Plugin-specific enumerated integer values may be used when appropriate.

These values must not be used as return values in integer-valued functions in the API. The proper error return value from integer-valued functions is SLURM_ERROR. The implementation should endeavor to provide useful and pertinent information by whatever means is practical. Successful API calls are not required to reset any errno to a known value. However, the initial value of any errno, prior to any error condition arising, should be SLURM_SUCCESS.

There is also a checkpoint-specific error code and message that may be associated with each job step.

API Functions

The following functions must appear. Functions which are not implemented should be stubbed.

int init (void)

Description:
Called when the plugin is loaded, before any other functions are called. Put global initialization here.

Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.

void fini (void)

Description:
Called when the plugin is removed. Clear any allocated storage here.

Returns: None.

Note: These init and fini functions are not the same as those described in dlopen(3). The C run-time system co-opts those symbols for its own initialization. The system _init() is called before the Slurm init(), and the Slurm fini() is called before the system's _fini().
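
A minimal sketch of the pair, assuming the plugin keeps a single piece of global state. The scratch buffer and its size are hypothetical, and the status codes are defined locally only so the fragment compiles without the Slurm headers.

```c
#include <stdlib.h>

#ifndef SLURM_SUCCESS
#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)
#endif

/* Hypothetical global state owned by the plugin. */
static char *scratch = NULL;

/* Called once when the plugin is loaded. */
int init(void)
{
    scratch = calloc(1, 256);
    return scratch ? SLURM_SUCCESS : SLURM_ERROR;
}

/* Called when the plugin is removed; releases what init() allocated. */
void fini(void)
{
    free(scratch);
    scratch = NULL;
}
```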

int slurm_ckpt_alloc_job (check_jobinfo_t *jobinfo);

Description: Allocate storage for job-step specific checkpoint data.

Argument: jobinfo (output) returns pointer to the allocated storage.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int slurm_ckpt_free_job (check_jobinfo_t jobinfo);

Description: Release storage for job-step specific checkpoint data that was previously allocated by slurm_ckpt_alloc_job.

Argument: jobinfo (input) pointer to the previously allocated storage.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
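
The allocate and free pair could look roughly like this. The fields of struct check_jobinfo are hypothetical (the plugin decides what it stores), the C library errno is used as a stand-in for the plugin's errno, and the status codes are defined locally only so the sketch compiles on its own.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#ifndef SLURM_SUCCESS
#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)
#endif

/* Hypothetical layout of the plugin's per-step checkpoint data. */
struct check_jobinfo {
    uint16_t disabled;    /* non-zero if checkpoints are disabled */
    time_t   time_stamp;  /* time of the last checkpoint event */
    uint32_t error_code;  /* checkpoint-specific error code */
    char    *error_msg;   /* checkpoint-specific error message */
};
typedef struct check_jobinfo *check_jobinfo_t;

int slurm_ckpt_alloc_job(check_jobinfo_t *jobinfo)
{
    if (jobinfo == NULL) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    *jobinfo = calloc(1, sizeof(struct check_jobinfo));
    if (*jobinfo == NULL) {
        errno = ENOMEM;
        return SLURM_ERROR;
    }
    return SLURM_SUCCESS;
}

int slurm_ckpt_free_job(check_jobinfo_t jobinfo)
{
    if (jobinfo != NULL) {
        free(jobinfo->error_msg);
        free(jobinfo);
    }
    return SLURM_SUCCESS;
}
```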

int slurm_ckpt_pack_job (check_jobinfo_t jobinfo, Buf buffer, uint16_t protocol_version);

Description: Store job-step specific checkpoint data into a buffer.

Arguments:
jobinfo (input) pointer to the previously allocated storage.
buffer (input/output) buffer to which jobinfo is appended.
protocol_version (input) communication protocol version.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int slurm_ckpt_unpack_job (check_jobinfo_t jobinfo, Buf buffer, uint16_t protocol_version);

Description: Retrieve job-step specific checkpoint data from a buffer.

Arguments:
jobinfo (output) pointer to the previously allocated storage.
buffer (input/output) buffer from which jobinfo is read.
protocol_version (input) communication protocol version.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
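
The following sketch shows the round trip that pack and unpack are expected to provide. Slurm's real Buf type and its packing helpers live in the Slurm source tree and are not reproduced here; struct mini_buf, put_bytes and get_bytes are hypothetical stand-ins, the jobinfo fields are invented for illustration, and values are copied in host byte order for simplicity.

```c
#include <stdint.h>
#include <string.h>

#ifndef SLURM_SUCCESS
#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)
#endif

/* Hypothetical stand-in for Slurm's internal Buf type. */
struct mini_buf {
    uint8_t  data[64];
    uint32_t used;    /* bytes written so far */
    uint32_t offset;  /* bytes read so far */
};
typedef struct mini_buf *Buf;

/* Hypothetical per-step checkpoint data. */
struct check_jobinfo {
    uint16_t disabled;
    uint32_t error_code;
};
typedef struct check_jobinfo *check_jobinfo_t;

static int put_bytes(Buf b, const void *src, uint32_t n)
{
    if (b->used + n > sizeof(b->data))
        return SLURM_ERROR;
    memcpy(b->data + b->used, src, n);
    b->used += n;
    return SLURM_SUCCESS;
}

static int get_bytes(Buf b, void *dst, uint32_t n)
{
    if (b->offset + n > b->used)
        return SLURM_ERROR;
    memcpy(dst, b->data + b->offset, n);
    b->offset += n;
    return SLURM_SUCCESS;
}

int slurm_ckpt_pack_job(check_jobinfo_t jobinfo, Buf buffer,
                        uint16_t protocol_version)
{
    (void) protocol_version;  /* this sketch has a single wire format */
    if (put_bytes(buffer, &jobinfo->disabled,
                  sizeof(jobinfo->disabled)) != SLURM_SUCCESS ||
        put_bytes(buffer, &jobinfo->error_code,
                  sizeof(jobinfo->error_code)) != SLURM_SUCCESS)
        return SLURM_ERROR;
    return SLURM_SUCCESS;
}

/* Fields must be read back in the same order they were packed. */
int slurm_ckpt_unpack_job(check_jobinfo_t jobinfo, Buf buffer,
                          uint16_t protocol_version)
{
    (void) protocol_version;
    if (get_bytes(buffer, &jobinfo->disabled,
                  sizeof(jobinfo->disabled)) != SLURM_SUCCESS ||
        get_bytes(buffer, &jobinfo->error_code,
                  sizeof(jobinfo->error_code)) != SLURM_SUCCESS)
        return SLURM_ERROR;
    return SLURM_SUCCESS;
}
```

A real plugin would normally branch on protocol_version so that data packed by a different Slurm version can still be read.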

check_jobinfo_t slurm_ckpt_copy_job (check_jobinfo_t jobinfo);

Description: Duplicate job-step specific checkpoint data.

Arguments:
jobinfo (input) pointer to the previously allocated storage.

Returns: copy of jobinfo if successful. NULL on failure.
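
A copy routine usually needs to duplicate any heap data the structure owns, so that the copy and the original can be freed independently. The sketch below assumes a hypothetical struct check_jobinfo holding one owned string.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-step checkpoint data with one owned string. */
struct check_jobinfo {
    uint32_t error_code;
    char    *error_msg;
};
typedef struct check_jobinfo *check_jobinfo_t;

check_jobinfo_t slurm_ckpt_copy_job(check_jobinfo_t jobinfo)
{
    check_jobinfo_t copy;

    if (jobinfo == NULL)
        return NULL;
    copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return NULL;
    *copy = *jobinfo;  /* copies the scalar fields */
    if (jobinfo->error_msg != NULL) {
        /* Deep-copy the message so each object owns its own string. */
        size_t len = strlen(jobinfo->error_msg) + 1;
        copy->error_msg = malloc(len);
        if (copy->error_msg == NULL) {
            free(copy);
            return NULL;
        }
        memcpy(copy->error_msg, jobinfo->error_msg, len);
    }
    return copy;
}
```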

int slurm_ckpt_op ( uint32_t job_id, uint32_t step_id, struct step_record *step_ptr, uint16_t op, uint16_t data, char *image_dir, time_t *event_time, uint32_t *error_code, char **error_msg );

Description: Perform some checkpoint operation on a specific job step.

Arguments:
job_id (input) identifies the job to be operated upon.
step_id (input) identifies the job step to be operated upon. May be SLURM_BATCH_SCRIPT for a batch job or NO_VAL for all steps of the specified job.
step_ptr (input) pointer to the job step to be operated upon. Used by checkpoint/aix only.
op (input) specifies the operation to be performed. Currently supported operations include CHECK_ABLE (is job step currently able to be checkpointed), CHECK_DISABLE (disable checkpoints for this job step), CHECK_ENABLE (enable checkpoints for this job step), CHECK_CREATE (create a checkpoint for this job step and continue its execution), CHECK_VACATE (create a checkpoint for this job step and terminate it), CHECK_RESTART (restart this previously checkpointed job step), and CHECK_ERROR (return checkpoint-specific error information for this job step).
data (input) operation-specific data.
image_dir (input) directory to be used to save or restore state.
event_time (output) identifies the time of a checkpoint or restart operation.
error_code (output) returns checkpoint-specific error code associated with an operation.
error_msg (output) identifies checkpoint-specific error message associated with an operation.

Returns:
SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set error_code and error_msg to appropriate values to indicate the reason for failure.
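
The dispatch on op might be organized as in the sketch below. It handles only CHECK_ENABLE, CHECK_DISABLE and CHECK_ABLE and rejects everything else, which is an arbitrary choice for illustration. The enum is a stand-in for the declarations in the Slurm headers (its numeric values are placeholders), the CKPT_EX_* codes, the ckpt_disabled flag and the report helper are hypothetical, and the use of malloc for error_msg is an assumption about who frees the message.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#ifndef SLURM_SUCCESS
#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)
#endif

/* Stand-ins for declarations normally supplied by the Slurm headers. */
enum check_opts {
    CHECK_ABLE, CHECK_DISABLE, CHECK_ENABLE, CHECK_CREATE,
    CHECK_VACATE, CHECK_RESTART, CHECK_ERROR
};
struct step_record;  /* opaque in this sketch */

/* Hypothetical checkpoint-specific error codes. */
enum { CKPT_EX_DISABLED = 1, CKPT_EX_UNSUPPORTED = 2 };

/* Hypothetical flag; a real plugin would track this per job step. */
static int ckpt_disabled = 0;

/* Hypothetical helper: fill in the output error code and message. */
static void report(uint32_t *error_code, char **error_msg,
                   uint32_t code, const char *msg)
{
    size_t len = strlen(msg) + 1;

    if (error_code != NULL)
        *error_code = code;
    if (error_msg != NULL) {
        *error_msg = malloc(len);
        if (*error_msg != NULL)
            memcpy(*error_msg, msg, len);
    }
}

int slurm_ckpt_op(uint32_t job_id, uint32_t step_id,
                  struct step_record *step_ptr, uint16_t op,
                  uint16_t data, char *image_dir, time_t *event_time,
                  uint32_t *error_code, char **error_msg)
{
    (void) job_id; (void) step_id; (void) step_ptr;
    (void) data; (void) image_dir; (void) event_time;

    switch (op) {
    case CHECK_ENABLE:
        ckpt_disabled = 0;
        return SLURM_SUCCESS;
    case CHECK_DISABLE:
        ckpt_disabled = 1;
        return SLURM_SUCCESS;
    case CHECK_ABLE:
        if (!ckpt_disabled)
            return SLURM_SUCCESS;
        report(error_code, error_msg, CKPT_EX_DISABLED,
               "checkpoints are disabled");
        return SLURM_ERROR;
    default:
        report(error_code, error_msg, CKPT_EX_UNSUPPORTED,
               "operation not supported by this sketch");
        return SLURM_ERROR;
    }
}
```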

int slurm_ckpt_comp ( struct step_record * step_ptr, time_t event_time, uint32_t error_code, char *error_msg );

Description: Note the completion of a checkpoint operation.

Arguments:
step_ptr (input/output) identifies the job step to be operated upon.
event_time (input) identifies the time that the checkpoint operation began.
error_code (input) checkpoint-specific error code associated with an operation.
error_msg (input) checkpoint-specific error message associated with an operation.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
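
A completion handler typically records the outcome on the step so that it can be reported later. The sketch below assumes a hypothetical struct step_record holding a pointer to the plugin's checkpoint data; the real structure is internal to Slurm and is not reproduced here. The field names are invented for illustration, and the C library errno stands in for the plugin's errno.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#ifndef SLURM_SUCCESS
#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)
#endif

/* Hypothetical per-step checkpoint data. */
struct check_jobinfo {
    time_t   begin_time;  /* when the completed operation began */
    uint32_t error_code;
    char    *error_msg;
};

/* Hypothetical stand-in for Slurm's internal step record. */
struct step_record {
    struct check_jobinfo *check_job;
};

int slurm_ckpt_comp(struct step_record *step_ptr, time_t event_time,
                    uint32_t error_code, char *error_msg)
{
    struct check_jobinfo *info;
    char *copy = NULL;

    if (step_ptr == NULL || step_ptr->check_job == NULL) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    info = step_ptr->check_job;

    /* Keep a private copy; the caller still owns error_msg. */
    if (error_msg != NULL) {
        size_t len = strlen(error_msg) + 1;
        copy = malloc(len);
        if (copy == NULL) {
            errno = ENOMEM;
            return SLURM_ERROR;
        }
        memcpy(copy, error_msg, len);
    }

    free(info->error_msg);  /* drop any previously recorded message */
    info->error_msg  = copy;
    info->error_code = error_code;
    info->begin_time = event_time;
    return SLURM_SUCCESS;
}
```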

int slurm_ckpt_stepd_prefork ( void *slurmd_job );

Description: Do preparation work for the checkpoint/restart support. This function is called by slurmstepd before forking the user tasks.

Arguments:
slurmd_job (input) pointer to job structure internal to slurmstepd.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int slurm_ckpt_signal_tasks ( void *slurmd_job, char *image_dir );

Description: Forward the checkpoint request to tasks managed by slurmstepd.

Arguments:
slurmd_job (input) pointer to job structure internal to slurmstepd.
image_dir (input) directory to be used to save or restore state.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.

int slurm_ckpt_restart_task ( void *slurmd_job, char *image_dir, int gtid);

Description: Restart the execution of a task from a checkpoint image. Called by slurmstepd.

Arguments:
slurmd_job (input) pointer to job structure internal to slurmstepd.
image_dir (input) directory to be used to save or restore state.
gtid (input) global task ID to be restarted.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and set the errno to an appropriate value to indicate the reason for failure.
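
As an illustration of how image_dir and gtid might be combined, the sketch below derives a per-task image path before restarting. The naming scheme "task.<gtid>.ckpt" and the helper build_image_path are hypothetical, the actual restart step is left as a comment because it depends on the checkpoint mechanism, and the C library errno stands in for the plugin's errno.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

#ifndef SLURM_SUCCESS
#define SLURM_SUCCESS 0
#define SLURM_ERROR   (-1)
#endif

/* Hypothetical helper: writes "<image_dir>/task.<gtid>.ckpt" to out. */
static int build_image_path(char *out, size_t out_len,
                            const char *image_dir, int gtid)
{
    int n;

    if (out == NULL || image_dir == NULL || gtid < 0) {
        errno = EINVAL;
        return SLURM_ERROR;
    }
    n = snprintf(out, out_len, "%s/task.%d.ckpt", image_dir, gtid);
    if (n < 0 || (size_t) n >= out_len) {
        errno = ENAMETOOLONG;  /* path did not fit in the buffer */
        return SLURM_ERROR;
    }
    return SLURM_SUCCESS;
}

int slurm_ckpt_restart_task(void *slurmd_job, char *image_dir, int gtid)
{
    char path[4096];

    (void) slurmd_job;  /* unused in this sketch */
    if (build_image_path(path, sizeof(path), image_dir, gtid)
        != SLURM_SUCCESS)
        return SLURM_ERROR;

    /* A real plugin would hand `path` to its checkpoint mechanism
     * here to restart the task from the saved image. */
    return SLURM_SUCCESS;
}
```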

Last modified 27 March 2015