This example shows how to train deep networks in parallel using Experiment Manager. Running an experiment in parallel allows you to try different training configurations at the same time. You can also use MATLAB® while the training is in progress. Parallel execution requires Parallel Computing Toolbox™.
In this example, you train two networks to classify images of digits from 0 to 9. The experiment trains the networks with augmented image data produced by applying random translations and horizontal reflections to the Digits data set. Data augmentation helps prevent the networks from overfitting and memorizing the exact details of the training images. When you run the experiment, Experiment Manager starts a parallel pool and executes multiple simultaneous trials, depending on the number of parallel workers available. Each trial uses a different combination of network and training options. While you monitor the training progress, you can stop trials that appear to be underperforming.
As an alternative, you can use parfor or parfeval to train multiple networks in parallel programmatically. For more information, see Train Deep Learning Networks in Parallel.
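For instance, a minimal parfor sketch might look like the following. The helper functions layersForConfig and optionsForConfig are hypothetical placeholders for your own network and training-option definitions; the data set path is the standard location of the Digits data set that ships with MATLAB.

% Hypothetical sketch: train one network per configuration in parallel.
digitDatasetPath = fullfile(matlabroot,'toolbox','nnet','nndemos', ...
    'nndatasets','DigitDataset');
imds = imageDatastore(digitDatasetPath, ...
    'IncludeSubfolders',true,'LabelSource','foldernames');

configs = ["7 layers" "16 layers"];
trainedNets = cell(numel(configs),1);
parfor i = 1:numel(configs)
    layers = layersForConfig(configs(i));   % hypothetical helper: returns a layer array
    options = optionsForConfig(configs(i)); % hypothetical helper: returns a trainingOptions object
    trainedNets{i} = trainNetwork(imds,layers,options);
end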
First, open the example. Experiment Manager loads a project with a preconfigured experiment that you can inspect and run. To open the experiment, in the Experiment Browser pane, double-click the name of the experiment (AugmentedDataExperiment).
An experiment definition consists of a description, a table of hyperparameters, a setup function, and (optionally) a collection of metric functions to evaluate the results of the experiment. For more information, see Configure Deep Learning Experiment.
The Description box contains a textual description of the experiment. For this example, the description is:
Classification using image data augmentation to apply random translations and horizontal reflections to the Digits data set.
The Hyperparameters section specifies the strategy (Exhaustive Sweep) and hyperparameter values to use for the experiment. When you run the experiment, Experiment Manager trains the network using every combination of hyperparameter values specified in the hyperparameter table. This example uses two hyperparameters, Network and TrainingOptions.
Network specifies the network to train. The possible values for this hyperparameter are:
"7 layers"
— Simple network with 7 layers that includes one convolutional block consisting of a
, a convolution2dLayer
, and a reluLayer
maxPooling2dLayer
"16 layers"
— Network of 16 layers that includes three convolutional blocks, each consisting of a
, a convolution2dLayer
, a batchNormalizationLayer
, and a reluLayer
maxPooling2dLayer
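As an illustration, the "7 layers" architecture might look like the following sketch; the filter size, number of filters, and pooling parameters here are assumptions, so inspect the setup function for the exact values.

layers = [
    imageInputLayer([28 28 1])               % Digits images are 28-by-28 grayscale
    convolution2dLayer(3,8,'Padding','same') % convolutional block: convolution ...
    reluLayer                                % ... ReLU activation ...
    maxPooling2dLayer(2,'Stride',2)          % ... and max pooling
    fullyConnectedLayer(10)                  % one output per digit class, 0-9
    softmaxLayer
    classificationLayer];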
TrainingOptions indicates the set of options used to train the network. The possible values for this hyperparameter are:
"fast"
— Experiment Manager trains the network for a maximum of 10 epochs with an initial learning rate of 0.1.
"slow"
— Experiment Manager trains the network for a maximum of 15 epochs with an initial learning rate of 0.001.
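For reference, the "fast" configuration corresponds to a trainingOptions call along these lines; the choice of the 'sgdm' solver is an assumption, since the setup function defines the exact call.

options = trainingOptions('sgdm', ...
    'InitialLearnRate',0.1, ...  % "fast": higher learning rate
    'MaxEpochs',10);             % "fast": fewer epochs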
The Setup Function configures the training data, network architecture, and training options for the experiment. To inspect the setup function, under Setup Function, click Edit. The setup function opens in MATLAB Editor.
In this example, the input to the setup function is a structure with fields from the hyperparameter table. The setup function returns three outputs that you use to train a network for image classification problems. The setup function has three sections.
Load Image Data loads images from the Digits data set and splits this data set into training and validation sets. For the training data, this example creates an augmentedImageDatastore object by applying random translations and horizontal reflections. The validation data is stored in an imageDatastore object with no augmentation. For more information on this data set, see Image Data Sets.
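The augmentation step might resemble this sketch, where imdsTrain is assumed to be the training partition of the Digits image datastore and the translation ranges are assumptions:

imageAugmenter = imageDataAugmenter( ...
    'RandXReflection',true, ...    % horizontal reflections
    'RandXTranslation',[-3 3], ... % random translations, in pixels (range assumed)
    'RandYTranslation',[-3 3]);
augimdsTrain = augmentedImageDatastore([28 28 1],imdsTrain, ...
    'DataAugmentation',imageAugmenter);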
Define Network Architecture defines the architecture for a convolutional neural network for deep learning classification. This example trains the network you specify for the hyperparameter Network.
Specify Training Options defines a trainingOptions object for the experiment. In this example, the value you specify for the hyperparameter TrainingOptions determines the training options 'InitialLearnRate' and 'MaxEpochs'.
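Because the setup function receives the hyperparameters as a structure (called params here, with fields named after the table columns), the selection might be implemented as a simple switch; the 'sgdm' solver is again an assumption.

% params is the structure passed to the setup function.
switch params.TrainingOptions
    case "fast"
        options = trainingOptions('sgdm', ...
            'InitialLearnRate',0.1,'MaxEpochs',10);
    case "slow"
        options = trainingOptions('sgdm', ...
            'InitialLearnRate',0.001,'MaxEpochs',15);
end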
Note that Experiment Manager does not support parallel execution when you set the training option 'ExecutionEnvironment' to 'multi-gpu' or 'parallel', or when you enable the training option 'DispatchInBackground'. For more information, see Configure Deep Learning Experiment.
The Metrics section specifies optional functions that evaluate the results of the experiment. This example does not include any custom metric functions.
If you have multiple GPUs, parallel execution typically increases the speed of your experiment. For best results, before you run your experiment, start a parallel pool with as many workers as GPUs. You can check the number of available GPUs by using the gpuDeviceCount function:

numGPUs = gpuDeviceCount;
parpool(numGPUs);
However, if you have a single GPU, all workers share that GPU, so you do not obtain the training speed-up and you increase the chances of the GPU running out of memory. To continue using MATLAB while you train a deep network on a single GPU, start a parallel pool with a single worker before you run your experiment in parallel.
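For example:

parpool(1); % one worker shares the GPU; MATLAB stays responsive during training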
To run your experiment, on the Experiment Manager toolstrip, click Use Parallel and then Run. If there is no current parallel pool, Experiment Manager starts one using the default cluster profile. Experiment Manager then executes multiple simultaneous trials, depending on the number of parallel workers available. Each trial uses a different combination of hyperparameter values.
A table of results displays the accuracy and loss for each trial.
While the experiment is running, you can track its progress by displaying the training plot for each trial. Select a trial and click Training Plot.
Experiment Manager runs as many simultaneous trials as there are workers in your parallel pool. All other trials in your experiment are queued for later evaluation. While your experiment is running, you can stop a trial that is running or cancel a queued trial. In the Progress column of the results table, click the red square icon for each trial you want to stop or cancel.
For example, the validation loss for trials that use the "7 layers" network becomes undefined after only a few iterations.
Continuing the training for those trials does not produce any useful results, so you can stop those trials before the training is complete. Experiment Manager continues the training for the remaining trials.
When the training is complete, you can rerun a trial that you stopped or canceled. In the Progress column of the results table, click the green triangle icon for the trial.
Alternatively, to rerun all the trials that you canceled, in the Experiment Manager toolstrip, click Restart All Canceled.
In the Experiment Browser pane, right-click the name of the project and select Close Project. Experiment Manager closes all of the experiments and results contained in the project.
See Also: augmentedImageDatastore | batchNormalizationLayer | convolution2dLayer | Experiment Manager | maxPooling2dLayer | reluLayer | trainingOptions | gpuDeviceCount (Parallel Computing Toolbox) | parfeval (Parallel Computing Toolbox) | parfor (Parallel Computing Toolbox) | parpool (Parallel Computing Toolbox)