This example shows how to send deep learning training batch jobs to a cluster so that you can continue working or close MATLAB during training.
Training deep neural networks often takes hours or days. To use time efficiently, you can train neural networks as batch jobs and fetch the results from the cluster when they are ready. You can continue working in MATLAB while computations take place or close MATLAB and obtain the results later using the Job Monitor. This example sends the parallel parameter sweep in Use parfor to Train Multiple Deep Learning Networks as a batch job. After the job is complete, you can fetch the trained networks and compare their accuracies.
Before you can run this example, you need to configure a cluster and upload your data to the cloud. In MATLAB, you can create clusters in the cloud directly from the MATLAB Desktop. On the Home tab, in the Parallel menu, select Create and Manage Clusters. In the Cluster Profile Manager, click Create Cloud Cluster. Alternatively, you can use MathWorks Cloud Center to create and access compute clusters. For more information, see Getting Started with Cloud Center. For this example, ensure that your cluster is set as the default on the MATLAB Home tab, in Parallel > Select a Default Cluster. After that, upload your data to an Amazon S3 bucket and use it directly from MATLAB. For instructions, see Upload Deep Learning Data to the Cloud. This example uses a copy of the CIFAR-10 data set that is already stored in Amazon S3.
Send a script as a batch job to the cluster by using the batch function. By default, the cluster allocates one worker to execute the contents of your script. If the code in the script can benefit from extra workers, for example because it includes automatic parallel support or a parfor loop, you must request those workers explicitly. batch uses one worker for the client running the script. You can request additional workers by using the 'Pool' name-value pair argument.
In this case, send the trainMultipleNetworks script to the cluster. This script contains the parallel parameter sweep in Use parfor to Train Multiple Deep Learning Networks. Because the script contains a parfor loop, specify 4 extra workers with the Pool name-value pair argument.
totalNumberOfWorkers = 5;
job1 = batch('trainMultipleNetworks', ...
    'Pool',totalNumberOfWorkers-1);
You can see the current status of your job in the cluster by checking the Job Monitor. In the Environment section on the Home tab, select Parallel > Monitor Jobs to open the Job Monitor.
You can submit additional jobs to the cluster. If the cluster is not available because it is running other jobs, any new job you submit remains queued until the cluster becomes available.
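The status shown in the Job Monitor can also be read from the job object itself. The following is a minimal sketch, assuming job1 is the job object returned by the batch call above and that the job captures Command Window output (the default behavior of batch).

```matlab
% Assumes job1 is the job object returned by batch earlier in this example.
% State is a character vector such as 'queued', 'running', 'finished', or 'failed'.
disp(job1.State)

% Display the Command Window output that the job has captured so far.
diary(job1)
```

Checking the state this way can be convenient in scripts, where opening the Job Monitor is not practical.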
After submitting jobs to the cluster, you can continue working in MATLAB while computations take place. If the rest of your code depends on completion of a job, block MATLAB by using the wait command. In this case, wait for the job to finish.
wait(job1);
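If you prefer not to block MATLAB indefinitely, wait also accepts a target state and a timeout in seconds and returns a logical value that indicates whether the job reached that state in time. The sketch below assumes job1 from above. The 600-second timeout is an arbitrary illustrative value, not a recommendation from this example.

```matlab
% Assumes job1 is the job object returned by batch earlier in this example.
timeoutSeconds = 600;  % hypothetical value chosen for illustration

% Block for at most timeoutSeconds while waiting for the 'finished' state.
finishedInTime = wait(job1,'finished',timeoutSeconds);

if ~finishedInTime
    % The job is still queued or running; report its current state.
    fprintf('Job %d is still %s after %d seconds.\n', ...
        job1.ID,job1.State,timeoutSeconds);
end
```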
After the job finishes, fetch the results by using the load function. In this case, fetch the trained networks from the parallel parameter sweep in the submitted script and their accuracies.
load(job1,'accuracies');
accuracies
accuracies = 4×1
0.8312
0.8276
0.8288
0.8258
load(job1,'trainedNetworks');
trainedNetworks
trainedNetworks = 4×1 cell array
{1×1 SeriesNetwork}
{1×1 SeriesNetwork}
{1×1 SeriesNetwork}
{1×1 SeriesNetwork}
To load all the variables in the batch job, use the load function without arguments.
load(job1);
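Once the variables are in the workspace, one way to compare the networks is to rank them by accuracy. The sketch below is illustrative only: it hard-codes the accuracies values displayed above so that it runs on its own, and the variable names bestAccuracy, bestIdx, and ranking are hypothetical.

```matlab
% Accuracy values copied from the output shown above, so this runs standalone.
% In practice, use the accuracies variable loaded from the job instead.
accuracies = [0.8312; 0.8276; 0.8288; 0.8258];

% Find the highest accuracy and the index of the corresponding network.
[bestAccuracy,bestIdx] = max(accuracies);

% Rank all networks from most to least accurate.
[~,ranking] = sort(accuracies,'descend');

fprintf('Best network: %d (accuracy %.4f)\n',bestIdx,bestAccuracy);

% With trainedNetworks loaded from the job, the best network would be:
% bestNetwork = trainedNetworks{bestIdx};
```

For the values shown, the first network has the highest accuracy (0.8312).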
If you close MATLAB, you can still recover the job in the cluster to fetch the results either while the computation is taking place or after the computation is complete. Before closing MATLAB, make a note of the job ID and then retrieve the job later by using the findJob function.
To retrieve a job, first create a cluster object for your cluster by using the parcluster function. Then, provide the job ID to findJob. In this case, the job ID is 1.
c = parcluster('MyClusterInTheCloud');
job = findJob(c,'ID',1);
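Because a retrieved job might still be running, it can help to check its state before loading results. The following is a minimal sketch, assuming the cluster profile name and job ID used above; the messages are illustrative.

```matlab
% Assumes a cluster profile named 'MyClusterInTheCloud' and a job with ID 1,
% as in the example above.
c = parcluster('MyClusterInTheCloud');
job = findJob(c,'ID',1);

if isempty(job)
    disp('No job with that ID was found on the cluster.');
elseif strcmp(job.State,'finished')
    % The job is complete, so its workspace variables can be loaded.
    load(job,'accuracies');
else
    % The job is still queued or running, or it ended in another state.
    fprintf('Job %d is currently %s.\n',job.ID,job.State);
end
```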
Delete the job when you are done. The job is removed from the Job Monitor.
delete(job1);
When you submit batch jobs, all the computations happen in the cluster and you can safely close MATLAB. You can check the status of your jobs by using the Job Monitor in another MATLAB session.
When a job is done, you can retrieve the results from the Job Monitor. In the Environment section on the Home tab, select Parallel > Monitor Jobs to open the Job Monitor. Then right-click a job to display the context menu. From this menu, you can:
Load the job into the workspace by clicking Show Details
Load all variables in the job by clicking Load Variables
Delete the job when you are done by clicking Delete
See also: batch (Parallel Computing Toolbox)