rlTD3Agent

Twin-delayed deep deterministic policy gradient reinforcement learning agent

Description

The twin-delayed deep deterministic policy gradient (TD3) algorithm is an actor-critic, model-free, online, off-policy reinforcement learning method that computes an optimal policy maximizing the long-term reward. The action space can only be continuous.

Use rlTD3Agent to create one of the following types of agents.

  • Twin-delayed deep deterministic policy gradient (TD3) agent with two Q-value functions. This agent mitigates overestimation of the value function by learning two Q-value functions and using the minimum of their estimates during policy updates.

  • Delayed deep deterministic policy gradient (delayed DDPG) agent with a single Q-value function. This agent is a DDPG agent with target policy smoothing and delayed policy and target updates.

For more information, see Twin-Delayed Deep Deterministic Policy Gradient Agents. For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
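For orientation, the role of the two critics can be summarized by the clipped double-Q target used in standard TD3 formulations. The following is a general sketch in conventional TD3 notation, not a statement of the exact update rules implemented by this agent:

$$y = r + \gamma \min_{i=1,2} Q_{\theta_i'}\left(s',\, \mu_{\phi'}(s') + \epsilon\right), \qquad \epsilon \sim \operatorname{clip}\left(\mathcal{N}(0,\sigma^2),\, -c,\, c\right)$$

Here $\theta_i'$ are the target critic parameters, $\mu_{\phi'}$ is the target actor, and the clipped noise $\epsilon$ implements target policy smoothing. The delayed DDPG variant uses a single critic but keeps the smoothed target action and the delayed updates.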

Creation

Description

Create Agent from Observation and Action Specifications

agent = rlTD3Agent(observationInfo,actionInfo) creates a TD3 agent for an environment with the given observation and action specifications, using default initialization options. The actor and critic representations in the agent use default deep neural networks built from the observation specification observationInfo and the action specification actionInfo.

agent = rlTD3Agent(observationInfo,actionInfo,initOpts) creates a TD3 agent for an environment with the given observation and action specifications. The agent uses default networks configured using the options specified in the initOpts object. For more information on the initialization options, see rlAgentInitializationOptions.

Create Agent from Actor and Critic Representations

agent = rlTD3Agent(actor,critics,agentOptions) creates an agent with the specified actor and critic representations. To create a:

  • TD3 agent, specify a two-element row vector of critic representations.

  • Delayed DDPG agent, specify a single critic representation.

Specify Agent Options

agent = rlTD3Agent(___,agentOptions) creates a TD3 agent and sets the AgentOptions property to the agentOptions input argument. Use this syntax after any of the input arguments in the previous syntaxes.
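For example, the following sketch creates a default-network agent and passes an options object in the same call. The obsInfo and actInfo variables and the option value are illustrative:

% Sketch: default-network TD3 agent with a custom discount factor
opts = rlTD3AgentOptions('DiscountFactor',0.995);
agent = rlTD3Agent(obsInfo,actInfo,opts);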

Input Arguments

observationInfo - Observation specifications
Observation specifications, specified as a reinforcement learning specification object or an array of specification objects defining properties such as dimensions, data type, and names of the observation signals.

You can extract observationInfo from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

actionInfo - Action specifications
Action specifications, specified as a reinforcement learning specification object defining properties such as dimensions, data type, and names of the action signals.

Since a TD3 agent operates in a continuous action space, you must specify actionInfo as an rlNumericSpec object.

You can extract actionInfo from an existing environment or agent using getActionInfo. You can also construct the specification manually using rlNumericSpec.
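For instance, specifications comparable to those used in the double integrator examples below could be constructed manually as follows. The dimensions, limits, and names are illustrative:

% Sketch: 2-by-1 continuous observation channel and a bounded scalar action
obsInfo = rlNumericSpec([2 1]);
obsInfo.Name = 'observations';
actInfo = rlNumericSpec([1 1],'LowerLimit',-2,'UpperLimit',2);
actInfo.Name = 'force';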

initOpts - Agent initialization options
Agent initialization options, specified as an rlAgentInitializationOptions object.

actor - Actor network representation
Actor network representation, specified as an rlDeterministicActorRepresentation object. For more information on creating actor representations, see Create Policy and Value Function Representations.

critics - Critic network representations
Critic network representations, specified as one of the following:

  • rlQValueRepresentation object — Create a delayed DDPG agent with a single Q-value function. This agent is a DDPG agent with target policy smoothing and delayed policy and target updates.

  • Two-element row vector of rlQValueRepresentation objects — Create a TD3 agent with two Q-value functions. The two critic networks must be unique rlQValueRepresentation objects with the same observation and action specifications. The representations can either have different structures or the same structure but with different initial parameters.

For more information on creating critic representations, see Create Policy and Value Function Representations.

Properties

AgentOptions - Agent options
Agent options, specified as an rlTD3AgentOptions object.
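For example, you can adjust an option on an existing agent through this property, as in the following sketch (the value shown is illustrative):

% Sketch: change the target smoothing factor of an existing agent
agent.AgentOptions.TargetSmoothFactor = 1e-3;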

ExperienceBuffer - Experience buffer
Experience buffer, specified as an ExperienceBuffer object. During training, the agent stores each of its experiences (S,A,R,S') in a buffer. Here:

  • S is the current observation of the environment.

  • A is the action taken by the agent.

  • R is the reward for taking action A.

  • S' is the next observation after taking action A.

For more information on how the agent samples experience from the buffer during training, see Twin-Delayed Deep Deterministic Policy Gradient Agents.
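The capacity of the buffer and the size of the mini-batches sampled from it are controlled through the agent options, for example (values illustrative):

% Sketch: configure buffer capacity and mini-batch size before creating the agent
opts = rlTD3AgentOptions('ExperienceBufferLength',1e6,'MiniBatchSize',128);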

Object Functions

train - Train reinforcement learning agents within a specified environment
sim - Simulate trained reinforcement learning agents within a specified environment
getAction - Obtain action from agent or actor representation given environment observations
getActor - Get actor representation from reinforcement learning agent
setActor - Set actor representation of reinforcement learning agent
getCritic - Get critic representation from reinforcement learning agent
setCritic - Set critic representation of reinforcement learning agent
generatePolicyFunction - Create function that evaluates trained policy of reinforcement learning agent

Examples

Create an environment with a continuous action space, and obtain its observation and action specifications. For this example, load the environment used in the example Train DDPG Agent to Control Double Integrator System. The observation from the environment is a vector containing the position and velocity of a mass. The action is a scalar representing a force applied to the mass, ranging continuously from -2 to 2 newtons.

% load predefined environment
env = rlPredefinedEnv("DoubleIntegrator-Continuous");

% obtain observation and action specifications
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

The agent creation function initializes the actor and critic networks randomly. You can ensure reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.

% rng(0)

Create a TD3 agent from the environment observation and action specifications.

agent = rlTD3Agent(obsInfo,actInfo);

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension)})
ans = 1x1 cell array
    {[0.0087]}

You can now test and train the agent within the environment.
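For instance, a training run could be set up along the following lines. The training option values are illustrative, and the train call is commented out because training can take a long time:

% Sketch: training setup (option values are illustrative)
trainOpts = rlTrainingOptions('MaxEpisodes',1000,'MaxStepsPerEpisode',500);
% trainingStats = train(agent,env,trainOpts);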

Create an environment with a continuous action space and obtain its observation and action specifications. For this example, load the environment used in the example Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation. This environment has two observations: a 50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar representing a torque ranging continuously from -2 to 2 Nm.

% load predefined environment
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");

% obtain observation and action specifications
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create an agent initialization option object, specifying that each hidden fully connected layer in the network must have 128 neurons (instead of the default number, 256).

initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);

The agent creation function initializes the actor and critic networks randomly. You can ensure reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.

% rng(0)

Create a TD3 agent from the environment observation and action specifications.

agent = rlTD3Agent(obsInfo,actInfo,initOpts);

Reduce the learning rates of the two critics to 1e-3 and 2e-3, respectively.

critic = getCritic(agent);
critic(1).Options.LearnRate = 1e-3;
critic(2).Options.LearnRate = 2e-3;
agent  = setCritic(agent,critic);

Extract the deep neural networks from the actor.

actorNet = getModel(getActor(agent));

Extract the deep neural networks from the two critics. Note that getModel(critics) only returns the first critic network.

critics = getCritic(agent);
criticNet1 = getModel(critics(1));
criticNet2 = getModel(critics(2));

Display the layers of the first critic network, and verify that each hidden fully connected layer has 128 neurons.

criticNet1.Layers
ans = 
  14x1 Layer array with layers:

     1   'concat'               Concatenation       Concatenation of 3 inputs along dimension 3
     2   'relu_body'            ReLU                ReLU
     3   'fc_body'              Fully Connected     128 fully connected layer
     4   'body_output'          ReLU                ReLU
     5   'input_1'              Image Input         50x50x1 images
     6   'conv_1'               Convolution         64 3x3x1 convolutions with stride [1  1] and padding [0  0  0  0]
     7   'relu_input_1'         ReLU                ReLU
     8   'fc_1'                 Fully Connected     128 fully connected layer
     9   'input_2'              Image Input         1x1x1 images
    10   'fc_2'                 Fully Connected     128 fully connected layer
    11   'input_3'              Image Input         1x1x1 images
    12   'fc_3'                 Fully Connected     128 fully connected layer
    13   'output'               Fully Connected     1 fully connected layer
    14   'RepresentationLoss'   Regression Output   mean-squared-error

Plot the networks of the actor and of the second critic.

plot(actorNet)

plot(criticNet2)

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
ans = 1x1 cell array
    {[0.0675]}

You can now test and train the agent within the environment.

Create an environment with a continuous action space and obtain its observation and action specifications. For this example, load the environment used in the example Train DDPG Agent to Control Double Integrator System. The observation from the environment is a vector containing the position and velocity of a mass. The action is a scalar representing a force ranging continuously from -2 to 2 newtons.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
numObs = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numAct = numel(actInfo);

Create two Q-value critic representations. First, create a critic deep neural network structure.

statePath1 = [
    featureInputLayer(numObs,'Normalization','none','Name','observation')
    fullyConnectedLayer(400,'Name','CriticStateFC1')
    reluLayer('Name','CriticStateRelu1')
    fullyConnectedLayer(300,'Name','CriticStateFC2')
    ];
actionPath1 = [
    featureInputLayer(numAct,'Normalization','none','Name','action')
    fullyConnectedLayer(300,'Name','CriticActionFC1')
    ];
commonPath1 = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu1')
    fullyConnectedLayer(1,'Name','CriticOutput')
    ];

criticNet = layerGraph(statePath1);
criticNet = addLayers(criticNet,actionPath1);
criticNet = addLayers(criticNet,commonPath1);
criticNet = connectLayers(criticNet,'CriticStateFC2','add/in1');
criticNet = connectLayers(criticNet,'CriticActionFC1','add/in2');

Create the critic representations. Use the same network structure for both critics. The TD3 agent initializes the two networks using different default parameters.

criticOptions = rlRepresentationOptions('Optimizer','adam','LearnRate',1e-3,... 
                                        'GradientThreshold',1,'L2RegularizationFactor',2e-4);
critic1 = rlQValueRepresentation(criticNet,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'action'},criticOptions);
critic2 = rlQValueRepresentation(criticNet,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'action'},criticOptions);

Create a deep neural network for the actor.

actorNet = [
    featureInputLayer(numObs,'Normalization','none','Name','observation')
    fullyConnectedLayer(400,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(300,'Name','ActorFC2')
    reluLayer('Name','ActorRelu2')
    fullyConnectedLayer(numAct,'Name','ActorFC3')                       
    tanhLayer('Name','ActorTanh1')
    ];

Create a deterministic actor representation.

actorOptions = rlRepresentationOptions('Optimizer','adam','LearnRate',1e-3,...
                                       'GradientThreshold',1,'L2RegularizationFactor',1e-5);
actor  = rlDeterministicActorRepresentation(actorNet,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'ActorTanh1'},actorOptions);

Specify agent options.

agentOptions = rlTD3AgentOptions;
agentOptions.DiscountFactor = 0.99;
agentOptions.TargetSmoothFactor = 5e-3;
agentOptions.TargetPolicySmoothModel.Variance = 0.2;
agentOptions.TargetPolicySmoothModel.LowerLimit = -0.5;
agentOptions.TargetPolicySmoothModel.UpperLimit = 0.5;

Create a TD3 agent using the actor, the critics, and the agent options.

agent = rlTD3Agent(actor,[critic1 critic2],agentOptions);

You can also create an rlTD3Agent object with a single critic. In this case, the object represents a DDPG agent with target policy smoothing and delayed policy and target updates.

delayedDDPGAgent = rlTD3Agent(actor,critic1,agentOptions);

To check your agents, use getAction to return the action from a random observation.

getAction(agent,{rand(2,1)})
ans = 1x1 cell array
    {[0.0304]}

getAction(delayedDDPGAgent,{rand(2,1)})
ans = 1x1 cell array
    {[-0.0142]}

You can now test and train either agent within the environment.

Introduced in R2020a