This example shows how to train an actor-critic (AC) agent to balance a cart-pole system modeled in MATLAB® by using asynchronous parallel training. For an example that shows how to train the agent without using parallel training, see Train AC Agent to Balance Cart-Pole System.
When you use parallel computing with AC agents, each worker generates experiences from its copy of the agent and the environment. After every N steps, the worker computes gradients from the experiences and sends the computed gradients back to the host agent. The host agent updates its parameters as follows.
For asynchronous training, the host agent applies the received gradients without waiting for all workers to send gradients, and sends the updated parameters back to the worker that provided the gradients. Then, the worker continues to generate experiences from its environment using the updated parameters.
For synchronous training, the host agent waits to receive gradients from all of the workers and updates its parameters using these gradients. The host then sends updated parameters to all the workers at the same time. Then, all workers continue to generate experiences using the updated parameters.
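As a minimal sketch, the following lines show how the ParallelizationOptions.Mode training option selects between the two schemes. The variable name opts is illustrative; the full parallel configuration used by this example appears later on this page.
opts = rlTrainingOptions('UseParallel',true);
opts.ParallelizationOptions.Mode = "async";    % asynchronous updates (used in this example)
% opts.ParallelizationOptions.Mode = "sync";   % synchronous updates: host waits for all workers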
Create a predefined environment interface for the cart-pole system. For more information on this environment, see Load Predefined Control System Environments.
env = rlPredefinedEnv("CartPole-Discrete");
env.PenaltyForFalling = -10;
Obtain the observation and action information from the environment interface.
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
Fix the random generator seed for reproducibility.
rng(0)
An AC agent approximates the long-term reward, given the observations, using a critic value function representation. To create the critic, first create a deep neural network with one input (the observation) and one output (the state value). The input size of the critic network is 4 since the environment provides 4 observations. For more information on creating a deep neural network value function representation, see Create Policy and Value Function Representations.
criticNetwork = [
    featureInputLayer(4,'Normalization','none','Name','state')
    fullyConnectedLayer(32,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(1,'Name','CriticFC')];

criticOpts = rlRepresentationOptions('LearnRate',1e-2,'GradientThreshold',1);

critic = rlValueRepresentation(criticNetwork,obsInfo,'Observation',{'state'},criticOpts);
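Optionally, you can verify the layer connectivity and input and output sizes of the critic network before creating the representation; one way is the analyzeNetwork function from Deep Learning Toolbox, shown here as an optional check rather than a required step.
% Optional: inspect the critic network layers and their sizes.
analyzeNetwork(criticNetwork)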
An AC agent decides which action to take, given observations, using an actor representation. To create the actor, create a deep neural network with one input (the observation) and one output (the action). The output size of the actor network is 2 since the agent can apply 2 force values to the environment, –10 and 10.
actorNetwork = [
    featureInputLayer(4,'Normalization','none','Name','state')
    fullyConnectedLayer(32,'Name','ActorStateFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(2,'Name','action')];

actorOpts = rlRepresentationOptions('LearnRate',1e-2,'GradientThreshold',1);

actor = rlStochasticActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'state'},actorOpts);
To create the AC agent, first specify the AC agent options using rlACAgentOptions.

agentOpts = rlACAgentOptions(...
    'NumStepsToLookAhead',32,...
    'EntropyLossWeight',0.01,...
    'DiscountFactor',0.99);
Then create the agent using the specified actor and critic representations and the agent options. For more information, see rlACAgent.
agent = rlACAgent(actor,critic,agentOpts);
To train the agent, first specify the training options. For this example, use the following options.
Run each training for at most 1000 episodes, with each episode lasting at most 500 time steps.
Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command line display (set the Verbose option).
Stop training when the agent receives an average cumulative reward greater than 500 over 10 consecutive episodes. At this point, the agent can balance the pendulum in the upright position.
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',1000,...
    'MaxStepsPerEpisode',500,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',500,...
    'ScoreAveragingWindowLength',10);
You can visualize the cart-pole system during training or simulation using the plot function.
plot(env)
To train the agent using parallel computing, specify the following training options.
Set the UseParallel option to true.
Train the agent in parallel asynchronously by setting the ParallelizationOptions.Mode option to "async".
After every 32 steps, each worker computes gradients from its experiences and sends them to the host.
The AC agent requires workers to send "gradients" to the host.
The AC agent requires StepsUntilDataIsSent to be equal to the agent option NumStepsToLookAhead (32 in this example).
trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.Mode = "async";
trainOpts.ParallelizationOptions.DataToSendFromWorkers = "gradients";
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 32;
For more information, see rlTrainingOptions.
Train the agent using the train function. Training the agent is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true. Due to the randomness of asynchronous parallel training, your training results can differ from those in the following plot, which shows the result of training with six workers.
doTraining = false;

if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load('MATLABCartpoleParAC.mat','agent');
end
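If you train the agent yourself, you can also save the resulting agent for later reuse using the save function. The file name below is illustrative.
if doTraining
    % Save the newly trained agent (file name is illustrative).
    save('MATLABCartpoleParACTrained.mat','agent');
end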
You can visualize the cart-pole system with the plot function during simulation.
plot(env)
To validate the performance of the trained agent, simulate it within the cart-pole environment. For more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)
totalReward = 500
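You can also examine how the reward accumulates over the simulated episode. The following sketch assumes the Reward field of the sim output is returned as a timeseries.
% Optional: plot the per-step reward from the simulation output
% (assumes experience.Reward is a timeseries).
figure
plot(experience.Reward)
title('Per-Step Reward During Simulation')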
[1] Mnih, Volodymyr, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. "Asynchronous Methods for Deep Reinforcement Learning." arXiv:1602.01783 [cs], June 16, 2016. https://arxiv.org/abs/1602.01783.