The generic scheduler interface provides complete flexibility to configure the interaction of the MATLAB® client, MATLAB workers, and a third-party scheduler. The plugin scripts define how MATLAB interacts with your setup.
The following table lists the supported plugin script functions and the stage at which they are evaluated:
File Name | Stage |
independentSubmitFcn.m | Submitting an independent job |
communicatingSubmitFcn.m | Submitting a communicating job |
getJobStateFcn.m | Querying the state of a job |
cancelJobFcn.m | Canceling a job |
cancelTaskFcn.m | Canceling a task |
deleteJobFcn.m | Deleting a job |
deleteTaskFcn.m | Deleting a task |
postConstructFcn.m | After creating a parallel.cluster.Generic instance |
These plugin scripts are evaluated only if they have the expected file name and are located in the folder specified by the PluginScriptsLocation property of the cluster. For more information about how to configure a generic cluster profile, see Configure Using the Generic Scheduler Interface (MATLAB Parallel Server).
Note
The independentSubmitFcn.m file must exist to submit an independent job, and the communicatingSubmitFcn.m file must exist to submit a communicating job.
To support usage of the generic scheduler interface, plugin scripts are available for several third-party schedulers.
Each installer provides scripts for three possible submission modes:
Shared – The client can submit directly to the scheduler, and the client and the cluster nodes (or machines) have a shared file system.
Remote – The client and cluster nodes have a shared file system, but the client machine cannot submit directly to the scheduler, such as when the client utilities of the scheduler are not installed. This mode uses the ssh protocol to submit commands to the scheduler using a remote host.
Nonshared – The client and cluster nodes do not have a shared file system. This mode uses the ssh protocol to submit commands to the scheduler using a remote host, and it uses the sftp protocol to copy job and task files to the cluster file system.
Each submission mode has its own subfolder within the installation folder. This subfolder contains a README file that provides specific instructions on how to use the scripts. Before using the scripts, decide which submission mode describes your network setup.
To run the installer, download the appropriate support package for your scheduler, and open it in your MATLAB client. The installer includes a wizard to guide you through creating a cluster profile for your cluster configuration.
If you want to customize the behavior of the plugin scripts, you can set additional properties, such as AdditionalSubmitArgs. For more information, see Customize Behavior of Sample Plugin Scripts (MATLAB Parallel Server).
If none of the support packages supports your scheduler or cluster configuration, modify the scripts of one of these packages. For more information on how to write a set of plugin scripts for generic schedulers, see Writing Custom Plugin Scripts.
The sample plugin scripts use wrapper scripts to simplify the implementation of independentSubmitFcn.m and communicatingSubmitFcn.m. These scripts are not required; however, using them is good practice to make your code more readable. This table describes these scripts:
File name | Description |
independentJobWrapper.sh | Used in independentSubmitFcn.m to embed a call to the MATLAB executable with the appropriate arguments. It uses environment variables for the location of the executable and its arguments. For an example of its use, see Sample script for a SLURM scheduler. |
communicatingJobWrapper.sh | Used in communicatingSubmitFcn.m to distribute a communicating job in your cluster. This script implements the steps in Submit scheduler job to launch MPI process. For an example of its use, see Sample script for a SLURM scheduler. |
Note
When writing your own plugin scripts, it is a good practice to start by modifying one of the sample plugin scripts that most closely matches your setup (see Sample Plugin Scripts).
When you submit an independent job to a generic cluster, the independentSubmitFcn.m function executes in the MATLAB client session.
The declaration line of this function must be:
function independentSubmitFcn(cluster,job,environmentProperties)
Each task in a MATLAB independent job corresponds to a single job on your scheduler. The purpose of this function is to submit N jobs to your third-party scheduler, where N is the number of tasks in the independent job. Each of these jobs must:
Set the five environment variables required by the worker MATLAB to identify the individual task to run. For more information, see Configure the worker environment.
Call the appropriate MATLAB executable to start the MATLAB worker and run the task. For more information, see Submit scheduler jobs to run MATLAB workers.
Configure the worker environment. This table identifies the five environment variables and values that must be set on the worker MATLAB to run an individual task:
Environment Variable Name | Environment Variable Value |
PARALLEL_SERVER_DECODE_FUNCTION | 'parallel.cluster.generic.independentDecodeFcn' |
PARALLEL_SERVER_STORAGE_CONSTRUCTOR | environmentProperties.StorageConstructor |
PARALLEL_SERVER_STORAGE_LOCATION | environmentProperties.StorageLocation |
PARALLEL_SERVER_JOB_LOCATION | environmentProperties.JobLocation |
PARALLEL_SERVER_TASK_LOCATION | environmentProperties.TaskLocations{n} for the nth task |
Many schedulers support copying the client environment as part of the submission command. If so, you can set the previous environment variables in the client, so the scheduler can copy them to the worker environment. If not, you must modify your submission command to forward these variables.
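If your scheduler does not copy the client environment automatically, one way to forward the variables is to name them explicitly in the submission command. The following is a sketch for SLURM, whose sbatch command accepts an --export flag listing variables to forward from the submission environment; adjust the flag for your scheduler:

```shell
# Sketch: forward the five required variables explicitly when the
# scheduler does not copy the client environment by default.
sbatch --ntasks=1 \
  --export=PARALLEL_SERVER_DECODE_FUNCTION,PARALLEL_SERVER_STORAGE_CONSTRUCTOR,PARALLEL_SERVER_STORAGE_LOCATION,PARALLEL_SERVER_JOB_LOCATION,PARALLEL_SERVER_TASK_LOCATION \
  independentJobWrapper.sh
```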
Submit scheduler jobs to run MATLAB workers. Once the five required parameters for a given job and task are defined on a worker, the task is run by calling the MATLAB executable with suitable arguments. The MATLAB executable to call is defined in environmentProperties.MatlabExecutable. The arguments to pass are defined in environmentProperties.MatlabArguments.
Note
If you cannot submit directly to your scheduler from the client machine, see Submitting from a Remote Host for instructions on how to submit using ssh.
Sample script for a SLURM scheduler. This script shows a basic submit function for a SLURM scheduler with a shared file system. For a more complete example, see Sample Plugin Scripts.
function independentSubmitFcn(cluster,job,environmentProperties)
% Specify the required environment variables.
setenv('PARALLEL_SERVER_DECODE_FUNCTION', 'parallel.cluster.generic.independentDecodeFcn');
setenv('PARALLEL_SERVER_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor);
setenv('PARALLEL_SERVER_STORAGE_LOCATION', environmentProperties.StorageLocation);
setenv('PARALLEL_SERVER_JOB_LOCATION', environmentProperties.JobLocation);

% Specify the MATLAB executable and arguments to run on the worker.
% These are used in the independentJobWrapper.sh script.
setenv('PARALLEL_SERVER_MATLAB_EXE', environmentProperties.MatlabExecutable);
setenv('PARALLEL_SERVER_MATLAB_ARGS', environmentProperties.MatlabArguments);

for ii = 1:environmentProperties.NumberOfTasks
    % Specify the environment variable required to identify which task to run.
    setenv('PARALLEL_SERVER_TASK_LOCATION', environmentProperties.TaskLocations{ii});
    % Specify the command to submit the job to the SLURM scheduler.
    % SLURM will automatically copy environment variables to workers.
    commandToRun = 'sbatch --ntasks=1 independentJobWrapper.sh';
    [cmdFailed, cmdOut] = system(commandToRun);
end
end
The previous example submits a simple bash script, independentJobWrapper.sh, to the scheduler. The independentJobWrapper.sh script embeds the MATLAB executable and arguments using environment variables:
#!/bin/sh
# PARALLEL_SERVER_MATLAB_EXE - the MATLAB executable to use
# PARALLEL_SERVER_MATLAB_ARGS - the MATLAB args to use
exec "${PARALLEL_SERVER_MATLAB_EXE}" ${PARALLEL_SERVER_MATLAB_ARGS}
When you submit a communicating job to a generic cluster, the communicatingSubmitFcn.m function executes in the MATLAB client session.
The declaration line of this function must be:
function communicatingSubmitFcn(cluster,job,environmentProperties)
The purpose of this function is to submit a single job to your scheduler. This job must:
Set the four environment variables required by the MATLAB workers to identify the job to run. For more information, see Configure the worker environment.
Call MPI to distribute your job to N MATLAB workers. N corresponds to the maximum value specified in the NumWorkersRange property of the MATLAB job. For more information, see Submit scheduler job to launch MPI process.
Configure the worker environment. This table identifies the four environment variables and values that must be set on the worker MATLAB to run a task of a communicating job:
Environment Variable Name | Environment Variable Value |
PARALLEL_SERVER_DECODE_FUNCTION | 'parallel.cluster.generic.communicatingDecodeFcn' |
PARALLEL_SERVER_STORAGE_CONSTRUCTOR | environmentProperties.StorageConstructor |
PARALLEL_SERVER_STORAGE_LOCATION | environmentProperties.StorageLocation |
PARALLEL_SERVER_JOB_LOCATION | environmentProperties.JobLocation |
Many schedulers support copying the client environment as part of the submission command. If so, you can set the previous environment variables in the client, so the scheduler can copy them to the worker environment. If not, you must modify your submission command to forward these variables.
Submit scheduler job to launch MPI process. After you define the four required parameters for a given job, run your job by launching N worker MATLAB processes using mpiexec.
mpiexec is software shipped with Parallel Computing Toolbox™ that implements the Message Passing Interface (MPI) standard to allow communication between the worker MATLAB processes. For more information about mpiexec, see the MPICH home page.
To run your job, you must submit a job to your scheduler that executes the following steps. Note that matlabroot refers to the MATLAB installation location on your worker nodes.
Request N processes from the scheduler. N corresponds to the maximum value specified in the NumWorkersRange property of the MATLAB job.
Call mpiexec to start worker MATLAB processes. The number of worker MATLAB processes to start on each host should match the number of processes allocated by your scheduler. The mpiexec executable is located at matlabroot/bin/mw_mpiexec.
The mpiexec command automatically forwards environment variables to the launched processes. Therefore, ensure the environment variables listed in Configure the worker environment are set before running mpiexec.
To learn more about options for mpiexec, see Using the Hydra Process Manager.
Note
For a complete example of the previous steps, see the communicatingJobWrapper.sh script provided with any of the sample plugin scripts in Sample Plugin Scripts. Use this script as a starting point if you need to write your own script.
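The steps in Submit scheduler job to launch MPI process can be sketched as a minimal SLURM wrapper script. This sketch is not the shipped communicatingJobWrapper.sh: it assumes the environment variables from Configure the worker environment are forwarded by the scheduler, that PARALLEL_SERVER_MATLAB_EXE and PARALLEL_SERVER_MATLAB_ARGS are set as in the submit function examples, and that PARALLEL_SERVER_CMR holds the MATLAB installation root on the worker nodes.

```shell
#!/bin/sh
# Minimal communicating-job wrapper sketch for SLURM.
# Assumes the scheduler forwarded:
#   PARALLEL_SERVER_CMR         - matlabroot on the worker nodes
#   PARALLEL_SERVER_MATLAB_EXE  - MATLAB executable to run
#   PARALLEL_SERVER_MATLAB_ARGS - arguments for the executable
# SLURM sets SLURM_NTASKS to the number of processes granted by
# sbatch --ntasks, which matches the mpiexec process count.
exec "${PARALLEL_SERVER_CMR}/bin/mw_mpiexec" -n "${SLURM_NTASKS}" \
    "${PARALLEL_SERVER_MATLAB_EXE}" ${PARALLEL_SERVER_MATLAB_ARGS}
```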
Sample script for a SLURM scheduler. The following script shows a basic submit function for a SLURM scheduler with a shared file system. The submitted job is contained in a bash script, communicatingJobWrapper.sh. This script implements the relevant steps in Submit scheduler job to launch MPI process for a SLURM scheduler. For a more complete example, see Sample Plugin Scripts.
function communicatingSubmitFcn(cluster,job,environmentProperties)
% Specify the four required environment variables.
setenv('PARALLEL_SERVER_DECODE_FUNCTION', 'parallel.cluster.generic.communicatingDecodeFcn');
setenv('PARALLEL_SERVER_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor);
setenv('PARALLEL_SERVER_STORAGE_LOCATION', environmentProperties.StorageLocation);
setenv('PARALLEL_SERVER_JOB_LOCATION', environmentProperties.JobLocation);

% Specify the MATLAB executable and arguments to run on the worker.
% Specify the location of the MATLAB install on the cluster nodes.
% These are used in the communicatingJobWrapper.sh script.
setenv('PARALLEL_SERVER_MATLAB_EXE', environmentProperties.MatlabExecutable);
setenv('PARALLEL_SERVER_MATLAB_ARGS', environmentProperties.MatlabArguments);
setenv('PARALLEL_SERVER_CMR', cluster.ClusterMatlabRoot);

numberOfTasks = environmentProperties.NumberOfTasks;

% Specify the command to submit a job to the SLURM scheduler which
% requests as many processes as tasks in the job.
% SLURM will automatically copy environment variables to workers.
commandToRun = sprintf('sbatch --ntasks=%d communicatingJobWrapper.sh', numberOfTasks);
[cmdFailed, cmdOut] = system(commandToRun);
end
When you query the state of a job created with a generic cluster, the getJobStateFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function state = getJobStateFcn(cluster,job,state)
When using a third-party scheduler, the scheduler can have more up-to-date information about your jobs than what is available to the toolbox from the local job storage location. This is especially likely when using a nonshared file system, where the remote file system can be slow to propagate large data files back to your local data location.
To retrieve that information from the scheduler, add a function called getJobStateFcn.m to the PluginScriptsLocation of your cluster.
The state passed into this function is the state derived from the local job storage. The body of this function can then query the scheduler to determine a more accurate state for the job and return it in place of the stored state. The function you write for this purpose must return a valid value for the state of a job object. Allowed values are 'pending', 'queued', 'running', 'finished', or 'failed'.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
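As an illustration, a minimal getJobStateFcn.m for a SLURM scheduler might query squeue for each stored scheduler job ID. This is a sketch, not a shipped sample: it assumes the submit function stored the scheduler job IDs in a ClusterJobIDs field using setJobClusterData (as shown in Managing Jobs with Generic Scheduler), and it only distinguishes running jobs, returning the stored state otherwise.

```matlab
function state = getJobStateFcn(cluster, job, state)
% Sketch for a SLURM scheduler. Assumes the submit function stored the
% scheduler job IDs with setJobClusterData in a ClusterJobIDs field.
data = cluster.getJobClusterData(job);
jobIDs = data.ClusterJobIDs;
for ii = 1:numel(jobIDs)
    % squeue -h suppresses the header; -o %T prints only the state name.
    commandToRun = sprintf('squeue -j %s -h -o %%T', jobIDs{ii});
    [cmdFailed, cmdOut] = system(commandToRun);
    if ~cmdFailed && contains(cmdOut, 'RUNNING')
        % At least one scheduler job is still running.
        state = 'running';
        return
    end
end
end
```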
When you cancel a job created with a generic cluster, the cancelJobFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function OK = cancelJobFcn(cluster,job)
When you cancel a job created using the generic scheduler interface, by default this action affects only the job data in storage. To cancel the corresponding jobs on your scheduler, you must provide instructions on what to do and when to do it to the scheduler. To achieve this, add a function called cancelJobFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding jobs from the queue. The function must return a logical scalar indicating the success or failure of canceling the jobs on the scheduler.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
When you cancel a task created with a generic cluster, the cancelTaskFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function OK = cancelTaskFcn(cluster,task)
When you cancel a task created using the generic scheduler interface, by default this action affects only the task data in storage. To cancel the corresponding job on your scheduler, you must provide instructions on what to do and when to do it to the scheduler. To achieve this, add a function called cancelTaskFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding job from the scheduler queue. The function must return a logical scalar indicating the success or failure of canceling the job on the scheduler.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
When you delete a job created with a generic cluster, the deleteJobFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function deleteJobFcn(cluster,job)
When you delete a job created using the generic scheduler interface, by default this action affects only the job data in storage. To remove the corresponding jobs on your scheduler, you must provide instructions on what to do and when to do it to the scheduler. To achieve this, add a function called deleteJobFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding jobs from the scheduler queue.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
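As an illustration, a minimal deleteJobFcn.m for a SLURM scheduler could mirror the cancelJobFcn pattern and remove the scheduler jobs with scancel. This is a sketch; it assumes the submit function stored the scheduler job IDs in a ClusterJobIDs field with setJobClusterData.

```matlab
function deleteJobFcn(cluster, job)
% Sketch for a SLURM scheduler. Assumes ClusterJobIDs was stored with
% setJobClusterData at submission time.
data = cluster.getJobClusterData(job);
jobIDs = data.ClusterJobIDs;
for ii = 1:numel(jobIDs)
    % Remove the corresponding job from the SLURM queue.
    commandToRun = sprintf('scancel ''%s''', jobIDs{ii});
    [cmdFailed, cmdOut] = system(commandToRun);
end
end
```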
When you delete a task created with a generic cluster, the deleteTaskFcn.m function executes in the MATLAB client session. The declaration line of this function must be:
function deleteTaskFcn(cluster,task)
When you delete a task created using the generic scheduler interface, by default this action affects only the task data in storage. To remove the corresponding job on your scheduler, you must provide instructions on what to do and when to do it to the scheduler. To achieve this, add a function called deleteTaskFcn.m to the PluginScriptsLocation of your cluster.
The body of this function can then send a command to the scheduler, for example, to remove the corresponding job from the scheduler queue.
For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
After you create an instance of your cluster in MATLAB, the postConstructFcn.m function executes in the MATLAB client session. For example, the following line of code creates an instance of your cluster and runs the postConstructFcn function associated with the 'myProfile' cluster profile:
c = parcluster('myProfile');
The declaration line of the postConstructFcn function must be:
function postConstructFcn(cluster)
If you need to perform custom configuration of your cluster before its use, add a function called postConstructFcn.m to the PluginScriptsLocation of your cluster. The body of this function can contain any extra setup steps you require.
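As a hypothetical illustration, a postConstructFcn.m could seed a default AdditionalProperties value when the profile does not define one. The property name and value below are examples only; substitute whatever setup your site requires.

```matlab
function postConstructFcn(cluster)
% Hypothetical example: supply a site default for a property the
% profile might not define. Adjust or remove for your site.
if ~isprop(cluster.AdditionalProperties, 'RemoteJobStorageLocation')
    cluster.AdditionalProperties.RemoteJobStorageLocation = '/tmp/jobs';
end
end
```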
If you need to modify the functionality of your plugin scripts at run time, then use the AdditionalProperties property of the generic scheduler interface.
As an example, consider the SLURM scheduler. The submit command for SLURM accepts a --nodelist argument that allows you to specify the nodes to run on. You can change the value of this argument without having to modify your plugin scripts. To add this functionality, include the following code pattern in your independentSubmitFcn.m and communicatingSubmitFcn.m scripts:
% Basic SLURM submit command
submitCommand = 'sbatch';
% Check if property is defined
if isprop(cluster.AdditionalProperties, 'NodeList')
    % Add appropriate argument and value to submit string
    submitCommand = [submitCommand ' --nodelist=' cluster.AdditionalProperties.NodeList];
end
For an example of how to use this coding pattern, see the nonshared submit functions of the scripts in Sample Plugin Scripts.
With the modification to your scripts in the previous example, you can add an AdditionalProperties entry to your generic cluster profile to specify a list of nodes to use. This entry also documents the customization you added to your plugin scripts for anyone with whom you share the cluster profile.
To add the NodeList property to your cluster profile:
Start the Cluster Profile Manager from the MATLAB desktop by selecting Parallel > Manage Cluster Profiles.
Select the profile for your generic cluster, and click Edit.
Navigate to the AdditionalProperties table, and click Add.
Enter NodeList as the Name.
Set String as the Type.
Set the Value to the list of nodes.
With the modification to your scripts in Adding User Customization, you can edit the list of nodes from the MATLAB command line by setting the appropriate property of the cluster object before submitting a job:
c = parcluster;
c.AdditionalProperties.NodeList = 'gpuNodeName';
j = c.batch('myScript');
Display the AdditionalProperties object to see all currently defined properties and their values:
>> c.AdditionalProperties

ans =

  AdditionalProperties with properties:

                 ClusterHost: 'myClusterHost'
                    NodeList: 'gpuNodeName'
    RemoteJobStorageLocation: '/tmp/jobs'
The first requirement for job management is to identify the jobs on the scheduler corresponding to a MATLAB job object. When you submit a job to the scheduler, the command that does the submission in your submit function can return some data about the job from the scheduler. This data typically includes a job ID. By storing that scheduler job ID with the MATLAB job object, you can later refer to the scheduler job by this job ID when you send management commands to the scheduler. Similarly, you can store a map of MATLAB task IDs to scheduler job IDs to help manage individual tasks. The toolbox function that stores this cluster data is setJobClusterData.
This example shows how to modify the independentSubmitFcn.m function to parse the output of each command submitted to a SLURM scheduler. You can use regular expressions to extract the scheduler job ID for each task and then store it using setJobClusterData.
% Pattern to extract scheduler job ID from SLURM sbatch output
searchPattern = '.*Submitted batch job ([0-9]+).*';

jobIDs = cell(numberOfTasks, 1);
for ii = 1:numberOfTasks
    setenv('PARALLEL_SERVER_TASK_LOCATION', environmentProperties.TaskLocations{ii});
    commandToRun = 'sbatch --ntasks=1 independentJobWrapper.sh';
    [cmdFailed, cmdOut] = system(commandToRun);
    jobIDs{ii} = regexp(cmdOut, searchPattern, 'tokens', 'once');
end
% Set the job IDs on the job cluster data
cluster.setJobClusterData(job, struct('ClusterJobIDs', {jobIDs}));
This example modifies the cancelJobFcn.m function to cancel the corresponding jobs on the SLURM scheduler. The example uses getJobClusterData to retrieve the job scheduler data.
function OK = cancelJobFcn(cluster, job)
% Get the scheduler information for this job
data = cluster.getJobClusterData(job);
jobIDs = data.ClusterJobIDs;
for ii = 1:length(jobIDs)
    % Tell the SLURM scheduler to cancel the job
    commandToRun = sprintf('scancel ''%s''', jobIDs{ii});
    [cmdFailed, cmdOut] = system(commandToRun);
end
OK = true;
If the MATLAB client is unable to submit directly to your scheduler, use parallel.cluster.RemoteClusterAccess to establish a connection and run commands on a remote host. This object uses the ssh protocol, and hence requires an ssh daemon service running on the remote host. To establish a connection, you must provide either a user name and password for the remote host or a valid identity file.
The following code executes a command on a remote host, remoteHostname, as the user user.
% This will prompt for the password of user
access = parallel.cluster.RemoteClusterAccess.getConnectedAccess('remoteHostname', 'user');
% Execute a command on remoteHostname
[cmdFailed, cmdOut] = access.runCommand(commandToRun);
For an example of plugin scripts using remote host submission, see the remote submission mode in Sample Plugin Scripts.
If the MATLAB client does not have a shared file system with the cluster nodes, use parallel.cluster.RemoteClusterAccess to establish a connection and copy job and task files between the client and cluster nodes. This object uses the ssh protocol, and hence requires an ssh daemon service running on the remote host. To establish a connection, you must provide either a user name and password for the remote host or a valid identity file.
When using nonshared submission, you must specify both a local job storage location to use on the client and a remote job storage location to use on the cluster. The remote job storage location must be available to all nodes of the cluster.
parallel.cluster.RemoteClusterAccess uses file mirroring to continuously synchronize the local job and task files with those on the cluster. When file mirroring first starts, local job and task files are uploaded to the remote job storage location. As the job executes, the file mirroring continuously checks the remote job storage location for new files and updates, and copies the files to the local storage on the client. This procedure ensures the MATLAB client always has an up-to-date view of the jobs and tasks executing on the scheduler.
This example connects to the remote host, remoteHostname, as the user user, and establishes /remote/storage as the remote cluster storage location to synchronize with. It then starts file mirroring for a job, copying the local files of the job to /remote/storage on the cluster and syncing any changes back to the local machine.
% This will prompt for the password of user
access = parallel.cluster.RemoteClusterAccess.getConnectedAccessWithMirror('remoteHostname', '/remote/storage', 'user');
% Start file mirroring for a job
access.startMirrorForJob(job);
For an example of plugin scripts without a shared file system, see the nonshared submission mode in Sample Plugin Scripts.