Deep Deterministic Policy Gradient Agents

The deep deterministic policy gradient (DDPG) algorithm is a model-free, online, off-policy reinforcement learning method. A DDPG agent is an actor-critic reinforcement learning agent that searches for an optimal policy that maximizes the expected cumulative long-term reward.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

DDPG agents can be trained in environments with the following observation and action spaces.

Observation Space	Action Space
Continuous or discrete	Continuous

During training, a DDPG agent:

Updates the actor and critic parameters at each time step during learning.
Stores past experiences using a circular experience buffer. The agent updates the actor and critic using a mini-batch of experiences randomly sampled from the buffer.
Perturbs the action chosen by the policy using a stochastic noise model at each training step.
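
The buffer and noise behavior described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the toolbox implementation: the class name `ExperienceBuffer`, the function `perturb`, the `done` flag stored with each experience, and the zero-mean Gaussian noise are all assumptions made for the example.

```python
import random
from collections import deque


class ExperienceBuffer:
    """Fixed-capacity circular buffer: once full, the oldest experience is dropped."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen discards the oldest item when a new one is appended.
        self._data = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, obs, action, reward, next_obs, done):
        self._data.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement; requires len(self) >= batch_size.
        return self._rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


def perturb(action, sigma, rng):
    """Add zero-mean Gaussian noise to each action component (assumed noise model)."""
    return [a + rng.gauss(0.0, sigma) for a in action]


# Store five experiences in a buffer of capacity three: only the last three remain.
buffer = ExperienceBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.store([float(t)], [0.0], 1.0, [float(t + 1)], False)
stored_obs = sorted(experience[0][0] for experience in buffer.sample(3))
```

Sampling uniformly from the buffer, rather than using only the most recent experience, is what lets the agent reuse past experience in each update.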

Actor and Critic Functions

To estimate the policy and value function, a DDPG agent maintains four function approximators:

Actor μ(S) — The actor takes observation S and returns the corresponding action that maximizes the long-term reward.
Target actor μ'(S) — To improve the stability of the optimization, the agent periodically updates the target actor based on the latest actor parameter values.
Critic Q(S,A) — The critic takes observation S and action A as inputs and returns the corresponding expectation of the long-term reward.
Target critic Q'(S,A) — To improve the stability of the optimization, the agent periodically updates the target critic based on the latest critic parameter values.

Q(S,A) and Q'(S,A) have the same structure and parameterization, as do μ(S) and μ'(S).

When training is complete, the trained optimal policy is stored in actor μ(S).

For more information on creating actors and critics for function approximation, see Create Policy and Value Function Representations.

Agent Creation

You can create a DDPG agent with default actor and critic representations based on the observation and action specifications from the environment. To do so, perform the following steps.

Create observation specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getObservationInfo.
Create action specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getActionInfo.
If needed, specify the number of neurons in each learnable layer or whether to use an LSTM layer. To do so, create an agent initialization option object using rlAgentInitializationOptions.
If needed, specify agent options using an rlDDPGAgentOptions object.
Create the agent using an rlDDPGAgent object.

Alternatively, you can create actor and critic representations and use these representations to create your agent. In this case, ensure that the input and output dimensions of the actor and critic representations match the corresponding action and observation specifications of the environment.

Create an actor using an rlDeterministicActorRepresentation object.
Create a critic using an rlQValueRepresentation object.
Specify agent options using an rlDDPGAgentOptions object.
Create the agent using an rlDDPGAgent object.

For more information on creating actors and critics for function approximation, see Create Policy and Value Function Representations.

Training Algorithm

DDPG agents use the following training algorithm, in which they update their actor and critic models at each time step. To configure the training algorithm, specify options using an rlDDPGAgentOptions object.

Initialize the critic Q(S,A) with random parameter values θ_Q, and initialize the target critic with the same random parameter values: $θ_{Q'} = θ_{Q}$.
Initialize the actor μ(S) with random parameter values θ_μ, and initialize the target actor with the same parameter values: $θ_{μ'} = θ_{μ}$.
For each training time step:
1. For the current observation S, select action A = μ(S) + N, where N is stochastic noise from the noise model. To configure the noise model, use the NoiseOptions option.
2. Execute action A. Observe the reward R and next observation S'.
3. Store the experience (S,A,R,S') in the experience buffer.
4. Sample a random mini-batch of M experiences (S_i,A_i,R_i,S'_i) from the experience buffer. To specify M, use the MiniBatchSize option.
5. If S'_i is a terminal state, set the value function target y_i to R_i. Otherwise, set it to:
  $y_{i} = R_{i} + γ Q' (S_{i}', μ' (S_{i}' | θ_{μ'}) | θ_{Q'})$
  The value function target is the sum of the experience reward R_i and the discounted future reward. To specify the discount factor γ, use the DiscountFactor option.
  To compute the cumulative reward, the agent first computes a next action by passing the next observation S_i' from the sampled experience to the target actor. The agent finds the cumulative reward by passing the next action to the target critic.
6. Update the critic parameters by minimizing the loss L across all sampled experiences.
  $L = \frac{1}{M} \sum_{i = 1}^{M} {(y_{i} - Q (S_{i}, A_{i} | θ_{Q}))}^{2}$
7. Update the actor parameters using the following sampled policy gradient to maximize the expected discounted reward.
  $\begin{array}{l} \nabla_{θ_{μ}} J \approx \frac{1}{M} \sum_{i = 1}^{M} G_{ai} G_{μi} \\ G_{ai} = \nabla_{A} Q (S_{i}, A | θ_{Q}) \text{ where } A = μ (S_{i} | θ_{μ}) \\ G_{μi} = \nabla_{θ_{μ}} μ (S_{i} | θ_{μ}) \end{array}$
  Here, G_ai is the gradient of the critic output with respect to the action computed by the actor network, and G_μi is the gradient of the actor output with respect to the actor parameters. Both gradients are evaluated for observation S_i.
8. Update the target actor and critic parameters depending on the target update method. For more information, see Target Update Methods.
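
As a rough illustration of steps 5 and 6, the following Python sketch computes the value function targets and the critic loss for a small mini-batch. It is not toolbox code: the function names, the `done` flag marking a terminal next state, and the toy scalar actor and critic functions are assumptions chosen so that the arithmetic is easy to check by hand.

```python
def critic_targets(batch, target_actor, target_critic, gamma):
    """Value function targets y_i for experiences (S, A, R, S', done)."""
    targets = []
    for obs, action, reward, next_obs, done in batch:
        if done:
            # Terminal next state: the target is the reward alone.
            targets.append(reward)
        else:
            # Next action from the target actor, then its value from the target critic.
            next_action = target_actor(next_obs)
            targets.append(reward + gamma * target_critic(next_obs, next_action))
    return targets


def critic_loss(batch, targets, critic):
    """Mean squared error between the targets and the critic's current estimates."""
    errors = [
        (y - critic(obs, action)) ** 2
        for (obs, action, *_), y in zip(batch, targets)
    ]
    return sum(errors) / len(errors)


# Toy scalar functions standing in for the networks (illustrative only).
target_actor = lambda s: 0.5 * s
target_critic = lambda s, a: s + a
critic = lambda s, a: 2.0 * s + a

batch = [
    (1.0, 0.0, 1.0, 2.0, False),  # non-terminal: y = 1.0 + 0.5 * (2.0 + 1.0) = 2.5
    (2.0, 1.0, 3.0, 0.0, True),   # terminal:     y = 3.0
]
targets = critic_targets(batch, target_actor, target_critic, gamma=0.5)
loss = critic_loss(batch, targets, critic)  # ((2.5 - 2.0)**2 + (3.0 - 5.0)**2) / 2
```

Only the target actor and target critic appear in the targets, while the loss is taken with respect to the current critic. This separation is the stabilizing role of the target approximators.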

For simplicity, the actor and critic updates in this algorithm show a gradient update using basic stochastic gradient descent. The actual gradient update method depends on the optimizer you specify using rlRepresentationOptions.
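
A basic gradient step of the kind assumed in the algorithm above can be sketched for the actor update in step 7. The example below is a hand-worked toy under stated assumptions, not the toolbox optimizer: the actor is a one-parameter linear function μ(S) = wS, the critic is a fixed quadratic Q(S,A) = -(A - 2S)^2 whose action gradient is written out analytically, and the learning rate is arbitrary.

```python
def critic_action_gradient(obs, action):
    """G_a: gradient of Q(S, A) = -(A - 2S)^2 with respect to A (assumed toy critic)."""
    return -2.0 * (action - 2.0 * obs)


def actor_parameter_gradient(obs):
    """G_mu: gradient of mu(S) = w * S with respect to w (assumed toy actor)."""
    return obs


def sampled_policy_gradient(w, observations):
    """Average of G_a * G_mu over the mini-batch, with A taken from the actor."""
    total = 0.0
    for obs in observations:
        action = w * obs  # A = mu(S_i)
        total += critic_action_gradient(obs, action) * actor_parameter_gradient(obs)
    return total / len(observations)


observations = [1.0, 2.0]
w = 0.0
first_gradient = sampled_policy_gradient(w, observations)

# Plain gradient ascent on the actor parameter, since the actor maximizes Q.
learning_rate = 0.1
for _ in range(20):
    w += learning_rate * sampled_policy_gradient(w, observations)
```

For this toy critic the best action is A = 2S, so the actor parameter approaches w = 2. In practice an automatic differentiation framework computes both gradients, and the optimizer may differ from plain gradient ascent.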

Target Update Methods

DDPG agents update their target actor and critic parameters using one of the following target update methods.

Smoothing — Update the target parameters at every time step using smoothing factor τ. To specify the smoothing factor, use the TargetSmoothFactor option.
$\begin{array}{l} θ_{Q'} = τ θ_{Q} + (1 - τ) θ_{Q'} \text{ (critic parameters)} \\ θ_{μ'} = τ θ_{μ} + (1 - τ) θ_{μ'} \text{ (actor parameters)} \end{array}$
Periodic — Update the target parameters periodically without smoothing (TargetSmoothFactor = 1). To specify the update period, use the TargetUpdateFrequency parameter.
Periodic Smoothing — Update the target parameters periodically with smoothing.

To configure the target update method, create an rlDDPGAgentOptions object, and set the TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.

Update Method	`TargetUpdateFrequency`	`TargetSmoothFactor`
Smoothing (default)	`1`	Less than `1`
Periodic	Greater than `1`	`1`
Periodic smoothing	Greater than `1`	Less than `1`
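
The three methods in the table differ only in the values of two settings, which the following Python sketch tries to make concrete. It is a simplified illustration, not the toolbox implementation: the function `update_target` and its arguments are hypothetical names that mirror TargetSmoothFactor and TargetUpdateFrequency, parameters are plain lists of numbers, and the rule that an update happens when the step count is a multiple of the update frequency is an assumption.

```python
def update_target(target_params, params, step, smooth_factor, update_frequency):
    """Return the new target parameters after training step `step` (1-based)."""
    if step % update_frequency != 0:
        # Not an update step: the target parameters stay unchanged.
        return list(target_params)
    tau = smooth_factor
    # tau < 1 blends the latest and target values; tau = 1 copies the latest values.
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(params, target_params)]


params = [1.0, 4.0]
target = [0.0, 2.0]

# Smoothing: update at every step with a smoothing factor below 1.
smoothed = update_target(target, params, step=1, smooth_factor=0.5, update_frequency=1)

# Periodic: no change between update steps, full copy on the update step.
skipped = update_target(target, params, step=3, smooth_factor=1.0, update_frequency=4)
copied = update_target(target, params, step=4, smooth_factor=1.0, update_frequency=4)

# Periodic smoothing: blended update, but only on the update step.
periodic_smoothed = update_target(target, params, step=4, smooth_factor=0.5, update_frequency=4)
```

The same function is applied separately to the critic and actor parameters.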

References

[1] Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. “Continuous Control with Deep Reinforcement Learning.” ArXiv:1509.02971 [Cs, Stat], September 9, 2015. https://arxiv.org/abs/1509.02971.
