train

Train a reinforcement learning agent within a specified environment

Syntax

trainStats = train(agent,env,trainOpts)

Description

trainStats = train(agent,env,trainOpts) trains a reinforcement learning agent with a specified environment. After each training episode, train updates the parameters of agent to maximize the expected long-term reward of the environment. When training terminates, the agent reflects the state of training at termination.

Use the training options trainOpts to specify training parameters such as the criteria for termination of training, when to save agents, the maximum number of episodes to train, and the maximum number of steps per episode.

Examples

Train a Reinforcement Learning Agent

Configure the training parameters and train a reinforcement learning agent. Typically, before training, you must configure your environment and agent. For this example, load an environment and agent that are already configured. The environment is a discrete cart-pole environment created with rlPredefinedEnv. The agent is a policy gradient agent (rlPGAgent). For more information about the environment and agent used in this example, see Train PG Agent to Balance Cart-Pole System.

rng(0) % for reproducibility
load RLTrainExample.mat
env

env = 
  CartPoleDiscreteAction with properties:

                  Gravity: 9.8000
                 MassCart: 1
                 MassPole: 0.1000
                   Length: 0.5000
                 MaxForce: 10
                       Ts: 0.0200
    ThetaThresholdRadians: 0.2094
               XThreshold: 2.4000
      RewardForNotFalling: 1
        PenaltyForFalling: -5
                    State: [4×1 double]

agent

agent = 
  rlPGAgent with properties:

    AgentOptions: [1×1 rl.option.rlPGAgentOptions]

To train this agent, you must first specify training parameters using rlTrainingOptions. These parameters include the maximum number of episodes to train, the maximum steps per episode, and the conditions for terminating training. For this example, use a maximum of 1000 episodes and 500 steps per episode. Instruct the training to stop when the average reward over the previous five episodes reaches 500. Create a default options set and use dot notation to change some of the parameter values.

trainOpts = rlTrainingOptions;

trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 500;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 500;
trainOpts.ScoreAveragingWindowLength = 5;

During training, the train command can save candidate agents that give good results. Further configure the training options to save an agent when the episode reward reaches or exceeds 500, which is the most an episode can earn with 500 steps and a reward of 1 per step. Save the agent to a folder called savedAgents.

trainOpts.SaveAgentCriteria = "EpisodeReward";
trainOpts.SaveAgentValue = 500;
trainOpts.SaveAgentDirectory = "savedAgents";

Finally, turn off the command-line display. Turn on the Reinforcement Learning Episode Manager so you can observe the training progress visually.

trainOpts.Verbose = false;
trainOpts.Plots = "training-progress";

You are now ready to train the PG agent. For the predefined cart-pole environment used in this example, you can use plot to generate a visualization of the cart-pole system.

plot(env)

When you run this example, both this visualization and the Reinforcement Learning Episode Manager update with each training episode. Place them side by side on your screen to observe the progress, and train the agent. (This computation can take 20 minutes or more.)

trainingInfo = train(agent,env,trainOpts);

The Episode Manager shows that the training successfully reaches the termination condition of a reward of 500 averaged over the previous five episodes. At each training episode, train updates agent with the parameters learned in the previous episode. When training terminates, you can simulate the environment with the trained agent to evaluate its performance. The environment plot updates during simulation as it did during training.

simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);

During training, train saves to disk any agents that meet the condition specified with trainOpts.SaveAgentCriteria and trainOpts.SaveAgentValue. To test the performance of any of those agents, you can load the data from the data files in the folder you specified using trainOpts.SaveAgentDirectory, and simulate the environment with that agent.
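
A minimal sketch of reloading a saved candidate follows. It assumes that env is still in the workspace and that each MAT-file in savedAgents holds one agent object. The variable name inside the file is not stated here, so the code takes the first variable it finds, which is an assumption you should verify against your own files.

```matlab
% List the MAT-files that train wrote to the save folder used in this example.
folder = "savedAgents";
files = dir(fullfile(folder,"*.mat"));

if isempty(files)
    disp("No saved agents found in " + folder)
else
    % Load the first file into a struct so that no variable name is hard-coded.
    data = load(fullfile(files(1).folder,files(1).name));
    names = fieldnames(data);
    candidate = data.(names{1});   % assumption: the first variable is the agent

    % Simulate the candidate in the same environment that was used for training.
    simOptions = rlSimulationOptions('MaxSteps',500);
    experience = sim(env,candidate,simOptions);
end
```

Loading into a struct rather than directly into the workspace also avoids overwriting the agent variable that train has just updated.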

Input Arguments

`agent` — Agent
reinforcement learning agent object

Agent to train, specified as a reinforcement learning agent object, such as an rlACAgent or rlDDPGAgent object, or a custom agent. Before training, you must configure the actor and critic representations of the agent. For more information about how to create and configure agents for reinforcement learning, see Reinforcement Learning Agents.

`env` — Environment
reinforcement learning environment object

Environment in which the agent acts, specified as a reinforcement learning environment object, such as:

A predefined MATLAB® or Simulink® environment created using rlPredefinedEnv
A custom MATLAB environment you create with functions such as rlFunctionEnv or rlCreateEnvTemplate
A custom Simulink environment you create using rlSimulinkEnv

For more information about creating and configuring environments, see:

When env is a Simulink environment, calling train compiles and simulates the model associated with the environment.
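
As a short illustration of the first option, the following sketch creates the predefined discrete cart-pole environment and inspects its observation and action specifications. It assumes that the "CartPole-Discrete" keyword is available in your release.

```matlab
% Create a predefined MATLAB environment (keyword assumed to exist in your release).
env = rlPredefinedEnv("CartPole-Discrete");

% Inspect what the agent observes and which actions it can take.
obsInfo = getObservationInfo(env)
actInfo = getActionInfo(env)
```

The agent you pass to train must be built for these observation and action specifications.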

`trainOpts` — Training parameters and options
`rlTrainingOptions` object

Training parameters and options, specified as an rlTrainingOptions object. Use this argument to specify such parameters and options as:

Criteria for ending training
Criteria for saving candidate agents
How to display training progress
Options for parallel computing

For details, see rlTrainingOptions.
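
As an alternative to dot notation, the same kind of option set can be built in one call with name-value pairs. This sketch reuses the values from the example above; they are illustrative, not recommended defaults.

```matlab
% Build the option set in a single call instead of editing a default object.
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes',1000, ...
    'MaxStepsPerEpisode',500, ...
    'StopTrainingCriteria',"AverageReward", ...
    'StopTrainingValue',500, ...
    'ScoreAveragingWindowLength',5, ...
    'Verbose',false, ...
    'Plots',"training-progress");
```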

Output Arguments

`trainStats` — Training episode data
structure

Training episode data, returned as a structure containing the following fields.

`EpisodeIndex` — Episode numbers
`[1;2;…;N]`

Episode numbers, returned as the column vector [1;2;…;N], where N is the number of episodes in the training run. This vector is useful if you want to plot the evolution of other quantities from episode to episode.

`EpisodeReward` — Reward for each episode
column vector

Reward for each episode, returned in a column vector of length N. Each entry contains the reward for the corresponding episode.

`EpisodeSteps` — Number of steps in each episode
column vector

Number of steps in each episode, returned in a column vector of length N. Each entry contains the number of steps in the corresponding episode.

`AverageReward` — Average reward over the averaging window
column vector

Average reward over the averaging window specified in trainOpts, returned as a column vector of length N. Each entry contains the average reward computed at the end of the corresponding episode.

`TotalAgentSteps` — Total number of steps
column vector

Total number of agent steps in training, returned as a column vector of length N. Each entry contains the cumulative sum of the entries in EpisodeSteps up to that point.

`EpisodeQ0` — Critic estimate of long-term reward for each episode
column vector

Critic estimate of long-term reward using the current agent and the environment initial conditions, returned as a column vector of length N. Each entry is the critic estimate (Q₀) for the agent of the corresponding episode. This field is present only for agents that have critics, such as rlDDPGAgent and rlDQNAgent.

`SimulationInfo` — Information collected during simulation
structure | vector of `Simulink.SimulationOutput` objects

Information collected during the simulations performed for training, returned as:

For training in MATLAB environments, a structure containing the field SimulationError. This field is a column vector with one entry per episode. When the StopOnError option of rlTrainingOptions is "off", each entry contains any errors that occurred during the corresponding episode.
For training in Simulink environments, a vector of Simulink.SimulationOutput objects containing simulation data recorded during the corresponding episode. Recorded data for an episode includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred during the corresponding episode.
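
To illustrate how the per-episode fields relate to each other, the following sketch builds a small stand-in for trainStats with made-up numbers (not output from a real training run) and derives the cumulative step count and a trailing average. Treating AverageReward as a trailing mean that uses fewer samples during the first few episodes is an assumption about the windowing, not something this page states.

```matlab
% Hypothetical stand-in for the structure returned by train.
trainStats.EpisodeIndex  = (1:6)';
trainStats.EpisodeReward = [10; 20; 30; 40; 50; 60];
trainStats.EpisodeSteps  = [15; 25; 35; 45; 55; 65];

% Cumulative step count, as described for TotalAgentSteps.
totalSteps = cumsum(trainStats.EpisodeSteps);

% Trailing mean over the last windowLength episodes (window shrinks at the start).
windowLength = 3;
avgReward = movmean(trainStats.EpisodeReward,[windowLength-1 0]);

% EpisodeQ0 exists only for agents that have critics, so check before using it.
hasQ0 = isfield(trainStats,'EpisodeQ0');

disp(table(trainStats.EpisodeIndex,totalSteps,avgReward, ...
    'VariableNames',{'Episode','TotalSteps','AvgReward'}))
```

With real output from train, plot(trainStats.EpisodeIndex,trainStats.EpisodeReward) draws the per-episode reward curve.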

Tips

train updates the agent as training progresses. To preserve the original agent parameters for later use, save the agent to a MAT-file.
By default, calling train opens the Reinforcement Learning Episode Manager, which lets you visualize the progress of the training. The Episode Manager plot shows the reward for each episode, a running average reward value, and the critic estimate Q₀ (for agents that have critics). The Episode Manager also displays various episode and training statistics. To turn off the Reinforcement Learning Episode Manager, set the Plots option of trainOpts to "none".
If you use a predefined environment for which there is a visualization, you can use plot(env) to visualize the environment. If you call plot(env) before training, then the visualization updates during training to allow you to visualize the progress of each episode. (For custom environments, you must implement your own plot method.)
Training terminates when the conditions specified in trainOpts are satisfied. To terminate training in progress, in the Reinforcement Learning Episode Manager, click Stop Training. Because train updates the agent at each episode, you can resume training by calling train(agent,env,trainOpts) again, without losing the trained parameters learned during the first call to train.
During training, you can save candidate agents that meet conditions you specify with trainOpts. For instance, you can save any agent whose episode reward exceeds a certain value, even if the overall condition for terminating training is not yet satisfied. train stores saved agents in a MAT-file in the folder you specify with trainOpts. Saved agents can be useful, for instance, to allow you to test candidate agents generated during a long-running training process. For details about saving criteria and saving location, see rlTrainingOptions.
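
The first two tips can be sketched as follows, assuming that agent, env, and trainOpts already exist in the workspace. The file name initialAgent.mat is a placeholder, not something train requires.

```matlab
% Keep a copy of the untrained agent on disk before train modifies it.
save("initialAgent.mat","agent")     % hypothetical file name

% Train without opening the Episode Manager.
trainOpts.Plots = "none";
trainStats = train(agent,env,trainOpts);

% Later, recover the original agent under a different variable name.
saved = load("initialAgent.mat","agent");
initialAgent = saved.agent;
```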

Algorithms

In general, train performs the following iterative steps:

1. Initialize agent.
2. For each episode:
  1. Reset the environment.
  2. Get the initial observation s₀ from the environment.
  3. Compute the initial action a₀ = μ(s₀), where μ is the current policy of the agent.
  4. Set the current action to the initial action (a←a₀) and set the current observation to the initial observation (s←s₀).
  5. While the episode is not finished or terminated:
    1. Step the environment with action a to obtain the next observation s' and the reward r.
    2. Learn from the experience set (s,a,r,s').
    3. Compute the next action a' = μ(s').
    4. Update the current action with the next action (a←a') and update the current observation with the next observation (s←s').
    5. Break if the episode termination conditions defined in the environment are met.
  6. If the training termination condition defined by trainOpts is met, terminate training. Otherwise, begin the next episode.

The specifics of how train performs these computations depend on your configuration of the agent and environment. For instance, resetting the environment at the start of each episode can include randomizing initial state values, if you configure your environment to do so.
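
The control flow of the steps above can be sketched in plain MATLAB with toy stand-ins. This is not the implementation of train. The environment (resetFcn, stepFcn), the fixed policy mu, the experience buffer that replaces the learning step, and the stopValue criterion are all hypothetical and exist only to make the loop runnable.

```matlab
% Hypothetical toy problem: a scalar state that the policy drives toward zero.
resetFcn = @() 4;                                              % initial observation
stepFcn  = @(s,a) deal(s + a, -abs(s + a), abs(s + a) < 0.5);  % [s', r, isDone]
mu       = @(s) -0.5*s;                                        % fixed toy policy

maxEpisodes = 3;
maxStepsPerEpisode = 20;
stopValue = 0;                        % placeholder training termination threshold
episodeReward = zeros(maxEpisodes,1);
episodeSteps  = zeros(maxEpisodes,1);
buffer = zeros(0,4);                  % rows of (s,a,r,s')

for ep = 1:maxEpisodes
    s = resetFcn();                   % reset and get the initial observation
    a = mu(s);                        % initial action
    for k = 1:maxStepsPerEpisode
        [sNext,r,isDone] = stepFcn(s,a);     % step the environment
        buffer(end+1,:) = [s a r sNext];     % placeholder for "learn from experience"
        episodeReward(ep) = episodeReward(ep) + r;
        episodeSteps(ep) = k;
        a = mu(sNext);                % next action
        s = sNext;                    % next observation becomes current
        if isDone
            break                     % episode termination condition met
        end
    end
    if episodeReward(ep) >= stopValue
        break                         % training termination condition met
    end
end
```

In a real agent, the buffer line is where the policy parameters would change, so mu would differ from one step or episode to the next.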

Extended Capabilities

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

To train in parallel, set the UseParallel and ParallelizationOptions options in the option set trainOpts. For more information, see rlTrainingOptions.
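
A minimal configuration sketch follows. It assumes that Parallel Computing Toolbox is installed and that the ParallelizationOptions object exposes a Mode property, as described in rlTrainingOptions for your release. The "async" value is illustrative.

```matlab
% Enable parallel training on an otherwise default option set.
trainOpts = rlTrainingOptions;
trainOpts.UseParallel = true;

% Assumed property; check rlTrainingOptions for the values your release supports.
trainOpts.ParallelizationOptions.Mode = "async";
```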

Introduced in R2019a
