A reinforcement learning policy is a mapping that selects the action that the agent takes based on observations from the environment. During training, the agent tunes the parameters of its policy representation to maximize the expected cumulative long-term reward.
Reinforcement learning agents estimate policies and value functions using function approximators called actor and critic representations respectively. The actor represents the policy that selects the best action to take, based on the current observation. The critic represents the value function that estimates the expected cumulative long-term reward for the current policy.
Before creating an agent, you must create the required actor and critic representations using deep neural networks, linear basis functions, or lookup tables. The type of function approximators you use depends on your application.
For more information on agents, see Reinforcement Learning Agents.
The Reinforcement Learning Toolbox™ software supports the following types of representations:
V(S|θV) — Critics that estimate the expected cumulative long-term reward based on a given observation S. You can create these critics using rlValueRepresentation.
Q(S,A|θQ) — Critics that estimate the expected cumulative long-term reward based on a given observation S and action A. You can create these critics using rlQValueRepresentation.
Q(S|θQ) — Multi-output critics that estimate the expected cumulative long-term reward for all possible discrete actions Ai given observation S. You can create these critics using rlQValueRepresentation.
μ(S|θμ) — Actors that select an action based on a given observation S. You can create these actors using either rlDeterministicActorRepresentation or rlStochasticActorRepresentation.
Each representation uses a function approximator with a corresponding set of parameters (θV, θQ, θμ), which are computed during the learning process.
For systems with a limited number of discrete observations and discrete actions, you can store value functions in a lookup table. For systems that have many discrete observations and actions, and for observation and action spaces that are continuous, storing a table entry for every observation and action is impractical. For such systems, you can represent your actors and critics using deep neural networks or custom (linear in the parameters) basis functions.
The following table summarizes the way in which you can use the four representation objects available with the Reinforcement Learning Toolbox software, depending on the action and observation spaces of your environment, and on the approximator and agent that you want to use.
Representations vs. Approximators and Agents
Representation | Supported Approximators | Observation Space | Action Space | Supported Agents |
---|---|---|---|---|
Value function critic, V(S), which you create using rlValueRepresentation | Table | Discrete | Not applicable | PG, AC, PPO |
Value function critic, V(S), which you create using rlValueRepresentation | Deep neural network or custom basis function | Discrete or continuous | Not applicable | PG, AC, PPO |
Q-value function critic, Q(S,A), which you create using rlQValueRepresentation | Table | Discrete | Discrete | Q, DQN, SARSA |
Q-value function critic, Q(S,A), which you create using rlQValueRepresentation | Deep neural network or custom basis function | Discrete or continuous | Discrete | Q, DQN, SARSA |
Q-value function critic, Q(S,A), which you create using rlQValueRepresentation | Deep neural network or custom basis function | Discrete or continuous | Continuous | DDPG, TD3 |
Multi-output Q-value function critic, Q(S,A), which you create using rlQValueRepresentation | Deep neural network or custom basis function | Discrete or continuous | Discrete | Q, DQN, SARSA |
Deterministic policy actor, π(S), which you create using rlDeterministicActorRepresentation | Deep neural network or custom basis function | Discrete or continuous | Continuous | DDPG, TD3 |
Stochastic policy actor, π(S), which you create using rlStochasticActorRepresentation | Deep neural network or custom basis function | Discrete or continuous | Discrete | PG, AC, PPO |
Stochastic policy actor, π(S), which you create using rlStochasticActorRepresentation | Deep neural network | Discrete or continuous | Continuous | PG, AC, PPO, SAC |
For more information on agents, see Reinforcement Learning Agents.
Representations based on lookup tables are appropriate for environments with a limited number of discrete observations and actions. You can create two types of lookup table representations:
Value tables, which store rewards for corresponding observations
Q-tables, which store rewards for corresponding observation-action pairs
To create a table representation, first create a value table or Q-table using the rlTable function. Then, create a representation for the table using either an rlValueRepresentation or rlQValueRepresentation object. To configure the learning rate and optimization used by the representation, use an rlRepresentationOptions object.
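For example, the following sketch creates a Q-table critic for an environment env with discrete observation and action spaces. The environment variable and the learning rate value are assumptions chosen for illustration.
% Get the discrete observation and action specifications from the environment.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
% Create a Q-table spanning all observation-action pairs.
qTable = rlTable(obsInfo,actInfo);
% Create the critic representation, configuring its learning rate.
opt = rlRepresentationOptions('LearnRate',0.01);
critic = rlQValueRepresentation(qTable,obsInfo,actInfo,opt);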
You can create actor and critic function approximators using deep neural networks. Doing so requires Deep Learning Toolbox™ software.
The dimensions of your actor and critic networks must match the corresponding action and observation specifications from the training environment object. To obtain the action and observation dimensions for environment env, use the getActionInfo and getObservationInfo functions, respectively. Then access the Dimensions property of the specification objects.
actInfo = getActionInfo(env);
actDimensions = actInfo.Dimensions;
obsInfo = getObservationInfo(env);
obsDimensions = obsInfo.Dimensions;
Networks for value function critics (such as the ones used in AC, PG, or PPO agents) must take only observations as inputs and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment observation specifications. For more information, see rlValueRepresentation.
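For example, the following sketch shows a value function critic network for a hypothetical environment with a 4-element observation vector. The layer sizes and names are illustrative assumptions.
% Value function critic network: observation in, scalar value out.
% Assumes obsInfo describes a [4 1] observation (see getObservationInfo above).
valueNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(32,'Name','CriticFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(1,'Name','CriticOutput')];
critic = rlValueRepresentation(valueNetwork,obsInfo,'Observation',{'observation'});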
Networks for single-output Q-value function critics (such as the ones used in Q, DQN, SARSA, DDPG, TD3, and SAC agents) must take both observations and actions as inputs, and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment specifications for both observations and actions. For more information, see rlQValueRepresentation.
Networks for multi-output Q-value function critics (such as those used in Q, DQN, and SARSA agents) take only observations as inputs and must have a single output layer with output size equal to the number of discrete actions. For these networks, the dimensions of the input layers must match the dimensions of the environment observation specifications. For more information, see rlQValueRepresentation.
For actor networks, the dimensions of the input layers must match the dimensions of the environment observation specifications.
Networks used in actors with a discrete action space (such as the ones in PG, AC, and PPO agents) must have a single output layer with an output size equal to the number of possible discrete actions.
Networks used in deterministic actors with a continuous action space (such as the ones in DDPG and TD3 agents) must have a single output layer with an output size matching the dimension of the action space defined in the environment action specification.
Networks used in stochastic actors with a continuous action space (such as the ones in PG, AC, PPO, and SAC agents) must have a single output layer with output size having twice the dimension of the action space defined in the environment action specification. These networks must have two separate paths, the first producing the mean values (which must be scaled to the output range) and the second producing the standard deviations (which must be non-negative).
For more information, see rlDeterministicActorRepresentation and rlStochasticActorRepresentation.
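As an illustrative sketch, the following code builds a deterministic actor network for a hypothetical environment with a 4-element observation and a scalar action bounded in [-2, 2]. The layer sizes, names, and action bounds are assumptions chosen for illustration.
% Deterministic actor network: observation in, bounded continuous action out.
% The tanhLayer bounds the output to (-1,1); the scalingLayer rescales it to (-2,2).
actorNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(1,'Name','ActorFC2')
    tanhLayer('Name','ActorTanh')
    scalingLayer('Name','ActorScaling','Scale',2)];
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'ActorScaling'});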
Deep neural networks consist of a series of interconnected layers. The following table lists some common deep learning layers used in reinforcement learning applications. For a full list of available layers, see List of Deep Learning Layers.
Layer | Description |
---|---|
featureInputLayer | Inputs feature data and applies normalization. |
imageInputLayer | Inputs vectors and 2-D images and applies normalization. |
sigmoidLayer | Applies a sigmoid function to the input such that the output is bounded in the interval (0,1). |
tanhLayer | Applies a hyperbolic tangent activation layer to the input. |
reluLayer | Sets any input values that are less than zero to zero. |
fullyConnectedLayer | Multiplies the input vector by a weight matrix, and adds a bias vector. |
convolution2dLayer | Applies sliding convolutional filters to the input. |
additionLayer | Adds the outputs of multiple layers together. |
concatenationLayer | Concatenates inputs along a specified dimension. |
sequenceInputLayer | Provides input sequence data to a network. |
lstmLayer | Applies a Long Short-Term Memory layer to the input. Supported for DQN and PPO agents. |
The bilstmLayer and batchNormalizationLayer layers are not supported for reinforcement learning.
The Reinforcement Learning Toolbox software provides the following layers, which contain no tunable parameters (that is, parameters that change during training).
Layer | Description |
---|---|
scalingLayer | Applies a linear scale and bias to an input array. This layer is useful for scaling and shifting the outputs of nonlinear layers, such as tanhLayer and sigmoidLayer. |
quadraticLayer | Creates a vector of quadratic monomials constructed from the elements of the input array. This layer is useful when you need an output that is some quadratic function of its inputs, such as for an LQR controller. |
softplusLayer | Implements the softplus activation Y = log(1 + e^X), which ensures that the output is always positive. This is a smoothed version of the rectified linear unit (ReLU). |
You can also create your own custom layers. For more information, see Define Custom Deep Learning Layers.
For reinforcement learning applications, you construct your deep neural network by connecting a series of layers for each input path (observations or actions) and for each output path (estimated rewards or actions). You then connect these paths together using the connectLayers function.
You can also create your deep neural network using the Deep Network Designer app. For an example, see Create Agent Using Deep Network Designer and Train Using Image Observations.
When you create a deep neural network, you must specify names for the first layer of each input path and the final layer of the output path.
The following code creates and connects the following input and output paths:
An observation input path, observationPath, with the first layer named 'observation'.
An action input path, actionPath, with the first layer named 'action'.
An estimated value function output path, commonPath, which takes the outputs of observationPath and actionPath as inputs. The final layer of this path is named 'output'.
observationPath = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticObsFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticObsFC2')];
actionPath = [
    imageInputLayer([1 1 1],'Normalization','none','Name','action')
    fullyConnectedLayer(24,'Name','CriticActFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticObsFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActFC1','add/in2');
For all observation and action input paths, you must specify an imageInputLayer as the first layer in the path.
You can view the structure of your deep neural network using the plot function.
plot(criticNetwork)
For PG and AC agents, the final output layers of your deep neural network actor representation are a fullyConnectedLayer and a softmaxLayer. When you specify the layers for your network, you must specify the fullyConnectedLayer and you can optionally specify the softmaxLayer. If you omit the softmaxLayer, the software automatically adds one for you.
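As an illustrative sketch, assuming a 4-element observation and two possible discrete actions (layer sizes and names chosen arbitrarily), such an actor might be built as follows.
% Discrete-action actor network for a PG or AC agent. The final
% fullyConnectedLayer has one output per discrete action; the softmaxLayer
% converts these values to action probabilities and can be omitted.
actorNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(2,'Name','ActorFC2')
    softmaxLayer('Name','ActorProb')];
actor = rlStochasticActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'observation'});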
Determining the number, type, and size of layers for your deep neural network representation can be difficult and is application dependent. However, the most critical component in deciding the characteristics of the function approximator is whether it is able to approximate the optimal policy or discounted value function for your application, that is, whether it has layers that can correctly learn the features of your observation, action, and reward signals.
Consider the following tips when constructing your network.
For continuous action spaces, bound actions with a tanhLayer followed by a scalingLayer, if necessary.
Deep dense networks with reluLayer layers can be fairly good at approximating many different functions. Therefore, they are often a good first choice.
Start with the smallest possible network that you think can approximate the optimal policy or value function.
When you approximate strong nonlinearities or systems with algebraic constraints, adding more layers is often better than increasing the number of outputs per layer. In general, the ability of the approximator to represent more complex functions grows only polynomially in the size of the layers, but grows exponentially with the number of layers. In other words, more layers allow approximating more complex and nonlinear compositional functions, although this generally requires more data and longer training times. Networks with fewer layers can require exponentially more units to successfully approximate the same class of functions, and might fail to learn and generalize correctly.
For on-policy agents (the ones that learn only from experience collected while following the current policy), such as AC and PG agents, parallel training works better if your networks are large (for example, a network with two hidden layers with 32 nodes each, which has a few hundred parameters). On-policy parallel updates assume each worker updates a different part of the network, such as when they explore different areas of the observation space. If the network is small, the worker updates can correlate with each other and make training unstable.
To create a critic representation for your deep neural network, use an rlValueRepresentation or rlQValueRepresentation object. To create an actor representation for your deep neural network, use an rlDeterministicActorRepresentation or rlStochasticActorRepresentation object. To configure the learning rate and optimization used by the representation, use an rlRepresentationOptions object.
For example, create a Q-value representation object for the critic network criticNetwork, specifying a learning rate of 0.0001. When you create the representation, pass the environment action and observation specifications to the rlQValueRepresentation object, and specify the names of the network layers to which the observations and actions are connected (in this case 'observation' and 'action').
opt = rlRepresentationOptions('LearnRate',0.0001);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'action'},opt);
When you create your deep neural network and configure your representation object, consider using the following approach as a starting point.
Start with the smallest possible network and a high learning rate (0.01). Train this initial network to see if the agent converges quickly to a poor policy or acts in a random manner. If either of these issues occurs, rescale the network by adding more layers or more outputs on each layer. Your goal is to find a network structure that is just big enough, does not learn too fast, and shows signs of learning (an improving trajectory of the reward graph) after an initial training period.
Once you settle on a good network architecture, a low initial learning rate can allow you to see if the agent is on the right track, and help you check that your network architecture is satisfactory for the problem. A low learning rate makes tuning the parameters much easier, especially for difficult problems.
Also, consider the following tips when configuring your deep neural network representation.
Be patient with DDPG and DQN agents, since they might not learn anything for some time during the early episodes, and they typically show a dip in cumulative reward early in the training process. Eventually, they can show signs of learning after the first few thousand episodes.
For DDPG and DQN agents, promoting exploration of the agent is critical.
For agents with both actor and critic networks, set the initial learning rates of both representations to the same value. For some problems, setting the critic learning rate to a higher value than that of the actor can improve learning results.
When creating representations for use with a PPO or DQN agent, you can use recurrent neural networks. These networks are deep neural networks with a sequenceInputLayer input layer and at least one layer that has hidden state information, such as an lstmLayer. They can be especially useful when the environment has states that cannot be included in the observation vector. For more information and examples, see rlValueRepresentation, rlQValueRepresentation, rlDeterministicActorRepresentation, and rlStochasticActorRepresentation.
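As a minimal sketch, assuming a 4-element observation and four possible discrete actions (layer sizes chosen arbitrarily), a recurrent multi-output Q-value critic network might be built as follows.
% Recurrent multi-output Q-value critic: sequence of observations in,
% one Q-value per discrete action out.
recurrentCriticNetwork = [
    sequenceInputLayer(4,'Normalization','none','Name','observation')
    fullyConnectedLayer(32,'Name','CriticFC1')
    reluLayer('Name','CriticRelu1')
    lstmLayer(16,'OutputMode','sequence','Name','CriticLSTM')
    fullyConnectedLayer(4,'Name','CriticOutput')];
critic = rlQValueRepresentation(recurrentCriticNetwork,obsInfo,actInfo,...
    'Observation',{'observation'});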
Custom (linear in the parameters) basis function approximators have the form f = W'B, where W is a weight array and B is the column vector output of a custom basis function that you must create. The learnable parameters of a linear basis function representation are the elements of W.
For value function critic representations (such as the ones used in AC, PG, or PPO agents), f is a scalar value, so W must be a column vector with the same length as B, and B must be a function of the observation. For more information, see rlValueRepresentation.
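For example, the following sketch creates a value function critic from a custom quadratic basis function, assuming obsInfo describes a 2-element observation vector. The basis choice and the initial weights are illustrative assumptions.
% Custom basis function B(obs) and initial weight vector W0 (same length as B).
basisFcn = @(obs) [obs(1); obs(2); obs(1)*obs(2); obs(1)^2; obs(2)^2];
W0 = zeros(5,1);
critic = rlValueRepresentation({basisFcn,W0},obsInfo);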
For single-output Q-value function critic representations (such as the ones used in Q, DQN, SARSA, DDPG, TD3, and SAC agents), f is a scalar value, so W must be a column vector with the same length as B, and B must be a function of both the observation and action. For more information, see rlQValueRepresentation.
For multi-output Q-value function critic representations with discrete action spaces (such as those used in Q, DQN, and SARSA agents), f is a vector with as many elements as the number of possible actions. Therefore, W must be a matrix with as many columns as the number of possible actions and as many rows as the length of B. B must be only a function of the observation. For more information, see rlQValueRepresentation.
For actors with a discrete action space (such as the ones in PG, AC, and PPO agents), f must be a column vector with length equal to the number of possible discrete actions.
For deterministic actors with a continuous action space (such as the ones in DDPG and TD3 agents), the dimensions of f must match the dimensions of the agent action specification, which is either a scalar or a column vector.
Stochastic actors with continuous action spaces cannot rely on custom basis functions (they can only use neural network approximators, due to the need to enforce positivity for the standard deviations).
For any actor representation, W must have as many columns as the number of elements in f, and as many rows as the number of elements in B. B must be only a function of the observation.
For more information, see rlDeterministicActorRepresentation and rlStochasticActorRepresentation.
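As another sketch, assuming obsInfo describes a 2-element observation and actInfo a scalar continuous action, a custom basis deterministic actor might be created as follows. The basis function and initial weights are illustrative assumptions.
% Custom basis function for the actor and initial weights W0.
% W0 has as many rows as elements of B and as many columns as action elements.
actorBasisFcn = @(obs) [obs(1); obs(2); 1];
W0 = zeros(3,1);
actor = rlDeterministicActorRepresentation({actorBasisFcn,W0},obsInfo,actInfo);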
For an example that trains a custom agent that uses a linear basis function representation, see Train Custom LQR Agent.
Once you create your actor and critic representations, you can create a reinforcement learning agent that uses these representations. For example, create a PG agent using a given actor representation and a baseline critic representation.
agentOpts = rlPGAgentOptions('UseBaseline',true);
agent = rlPGAgent(actor,baseline,agentOpts);
For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
You can obtain the actor and critic representations from an existing agent using getActor and getCritic, respectively.
You can also set the actor and critic of an existing agent using setActor and setCritic, respectively. When you specify a representation for an existing agent using these functions, the input and output layers of the specified representation must match the observation and action specifications of the original agent.
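For example, a minimal sketch, assuming agent is an existing agent object (such as one created with rlDQNAgent):
critic = getCritic(agent);                 % extract the critic representation
params = getLearnableParameters(critic);   % inspect its learnable parameters
agent = setCritic(agent,critic);           % reassign the (possibly modified) critic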