A reinforcement learning policy is a mapping that selects an action to take based on observations from the environment. During training, the agent tunes the parameters of its policy representation to maximize the long-term reward.
Depending on the type of reinforcement learning agent you are using, you define actor and critic function approximators, which the agent uses to represent and train its policy. The actor represents the policy that selects the best action to take. The critic represents the value function that estimates the long-term reward for the current policy. Depending on your application and selected agent, you can define policy and value functions using deep neural networks, linear basis functions, or lookup tables.
For more information on agents, see Reinforcement Learning Agents.
Depending on the type of agent you are using, Reinforcement Learning Toolbox™ software supports the following types of function approximators:
V(S|θ_V) — Critics that estimate the expected long-term reward based on a given observation S
Q(S,A|θ_Q) — Critics that estimate the expected long-term reward based on a given observation S and action A
Q(S,A_i|θ_Q) — Critics that estimate the expected long-term reward for all possible discrete actions given observation S
μ(S|θ_μ) — Actors that select an action based on a given observation S
Each function approximator has a corresponding set of parameters (θ_V, θ_Q, θ_μ), which are computed during the learning process.
For systems with a limited number of discrete observations and discrete actions, you can store value functions in a lookup table. For systems that have many discrete observations and actions, and for observation and action spaces that are continuous, storing a separate value for every observation or observation-action combination is impractical. For such systems, you can represent your actors and critics using deep neural networks or linear basis functions.
You can create two types of table representations:
Value tables, which store rewards for corresponding observations
Q-tables, which store rewards for corresponding observation-action pairs
To create a table representation, first create a value table or Q-table using the rlTable function. Then, create a representation for the table using either an rlValueRepresentation or rlQValueRepresentation object. To configure the learning rate and optimization used by the representation, use an rlRepresentationOptions object.
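For example, the following sketch creates a Q-table critic for an environment env that is assumed to have discrete observation and action spaces (the learning rate shown is illustrative):
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
qTable = rlTable(obsInfo,actInfo);        % table over all observation-action pairs
critic = rlQValueRepresentation(qTable,obsInfo,actInfo);
critic.Options.LearnRate = 1;             % illustrative learning rate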
You can create actor and critic function approximators using deep neural network representations. Doing so uses Deep Learning Toolbox™ software features.
The dimensions of your actor and critic networks must match the corresponding action and observation specifications from the training environment object. To obtain the action and observation dimensions for environment env, use the getActionInfo and getObservationInfo functions, respectively. Then access the Dimensions property of the specification objects.
actInfo = getActionInfo(env);
actDimensions = actInfo.Dimensions;
obsInfo = getObservationInfo(env);
obsDimensions = obsInfo.Dimensions;
For critic networks that take only observations as inputs, such as those used in AC or PG agents, the dimensions of the input layers must match the dimensions of the environment observation specifications. The output layer of the critic must be a single scalar representing the value function.
For critic networks that take both observations and actions as inputs, such as those used in DQN or DDPG agents, the dimensions of the input layers must match the dimensions of the corresponding environment observation and action specifications.
For actor networks, the dimensions of the input layers must match the dimensions of the environment observation specifications. The required output size depends on the action space, as shown in the sketch after this list. If the actor has a:
Discrete action space, then its output size must equal the number of discrete actions.
Continuous action space, then its output size must be a scalar or vector value, as defined in the action specification.
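For example, the following sketch derives the required actor output size from the action specification obtained earlier (the variable names follow the preceding code; the check against rlFiniteSetSpec is an assumption about how the action space was defined):
if isa(actInfo,'rlFiniteSetSpec')               % discrete action space
    actorOutputSize = numel(actInfo.Elements);  % one output per discrete action
else                                            % continuous action space
    actorOutputSize = prod(actDimensions);      % match the action vector length
end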
Deep neural networks consist of a series of interconnected layers. The following table lists some common deep learning layers used in reinforcement learning applications. For a full list of available layers, see List of Deep Learning Layers (Deep Learning Toolbox).
Layer | Description |
---|---|
imageInputLayer | Input vectors and 2-D images, and normalize the data. |
tanhLayer | Apply a hyperbolic tangent activation layer to the layer inputs. |
reluLayer | Set any input values that are less than zero to zero. |
fullyConnectedLayer | Multiply the input vector by a weight matrix, and add a bias vector. |
convolution2dLayer | Apply sliding convolutional filters to the input. |
additionLayer | Add the outputs of multiple layers together. |
concatenationLayer | Concatenate inputs along a specified dimension. |
The bilstmLayer and batchNormalizationLayer layers are not supported for reinforcement learning.
You can also create your own custom layers. For more information, see Define Custom Deep Learning Layers (Deep Learning Toolbox). Reinforcement Learning Toolbox software provides the following custom layers.
Layer | Description |
---|---|
scalingLayer | Linearly scale and bias an input array. This layer is useful for scaling and shifting the outputs of nonlinear layers, such as tanhLayer and sigmoid. |
quadraticLayer | Create vector of quadratic monomials constructed from the elements of the input array. This layer is useful when you need an output that is some quadratic function of its inputs, such as for an LQR controller. |
softplusLayer | Implement the softplus activation Y = log(1 + e^X), which ensures that the output is always positive. |
The custom layers do not contain tunable parameters; that is, they do not change during training.
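For example, a scalingLayer is commonly placed after a tanhLayer to map the bounded tanh output onto the action range. The following sketch assumes an action range of [-2, 2]:
boundedPath = [
    tanhLayer('Name','ActorTanh')                   % output in [-1, 1]
    scalingLayer('Name','ActorScaling','Scale',2)]; % rescale to [-2, 2]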
For reinforcement learning applications, you construct your deep neural network by connecting a series of layers for each input path (observations or actions) and for each output path (estimated rewards or actions). You then connect these paths together using the connectLayers function.
You can also create your deep neural network using the Deep Network Designer app. For an example, see Create Agent Using Deep Network Designer and Train Using Image Observations.
When you create a deep neural network, you must specify names for the first layer of each input path and the final layer of the output path.
The following code creates and connects the following input and output paths:
An observation input path, observationPath, with the first layer named 'observation'.
An action input path, actionPath, with the first layer named 'action'.
An estimated value function output path, commonPath, which takes the outputs of observationPath and actionPath as inputs. The final layer of this path is named 'output'.
observationPath = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticObsFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticObsFC2')];
actionPath = [
    imageInputLayer([1 1 1],'Normalization','none','Name','action')
    fullyConnectedLayer(24,'Name','CriticActFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticObsFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActFC1','add/in2');
For all observation and action input paths, you must specify an imageInputLayer as the first layer in the path.
You can view the structure of your deep neural network using the plot function.
plot(criticNetwork)
For PG and AC agents, the final output layers of your deep neural network actor representation are a fullyConnectedLayer and a softmaxLayer. When you specify the layers for your network, you must specify the fullyConnectedLayer, and you can optionally specify the softmaxLayer. If you omit the softmaxLayer, the software automatically adds one for you.
Determining the number, type, and size of layers for your deep neural network representation can be difficult and is application dependent. However, the most critical component for any function approximator is whether the function is able to approximate the optimal policy or discounted value function for your application; that is, whether it has layers that can correctly learn the features of your observation, action, and reward signals.
Consider the following tips when constructing your network.
For continuous action spaces, bound actions with a tanhLayer followed by a scalingLayer, if necessary.
Deep dense networks with reluLayer layers can be fairly good at approximating many different functions. Therefore, they are often a good first choice.
When you approximate strong nonlinearities or systems with algebraic constraints, adding more layers is often better than increasing the number of outputs per layer. In general, the ability of the network to represent complex functions grows only polynomially with the number of outputs per layer, but grows exponentially with the number of layers.
For on-policy agents, such as AC and PG agents, parallel training works better if your networks are large (for example, a network with two hidden layers with 32 nodes each, which has a few hundred parameters). On-policy parallel updates assume each worker updates a different part of the network, such as when they explore different areas of the observation space. If the network is small, the worker updates can correlate with each other and make training unstable.
To create a critic representation for your deep neural network, use an rlValueRepresentation or rlQValueRepresentation object. To create an actor representation for your deep neural network, use an rlDeterministicActorRepresentation or rlStochasticActorRepresentation object. To configure the learning rate and optimization used by the representation, use an rlRepresentationOptions object.
For example, create a Q-value representation object for the critic network criticNetwork, specifying a learning rate of 0.0001. When you create the representation, pass the environment action and observation specifications to the rlQValueRepresentation object, and specify the names of the network layers to which the actions and observations are connected.
opt = rlRepresentationOptions('LearnRate',0.0001);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'action'},opt);
When you create your deep neural network and configure your representation object, consider using one of the following approaches as a starting point.
Start with the smallest possible network and a high learning rate (0.01). Train this initial network to see whether the agent converges quickly to a poor policy or acts in a random manner. If either of these issues occurs, rescale the network by adding more layers or more outputs on each layer. Your goal is to find a network structure that is just big enough, does not learn too fast, and shows signs of learning (an improving trajectory of the reward graph) after an initial training period.
A low initial learning rate can allow you to see if the agent is on the right track, and help you check that your network architecture is satisfactory for the problem. For difficult problems, tuning parameters is much easier once you settle on a good network architecture.
Also, consider the following tips when configuring your deep neural network representation.
Be patient with DDPG and DQN agents, since they might not learn anything for some time during the early episodes, and they typically show a dip in cumulative reward early in the training process. Eventually, they can show signs of learning after the first few thousand episodes.
For DDPG and DQN agents, promoting exploration of the agent is critical.
For agents with both actor and critic networks, set the initial learning rates of both representations to the same value. For some problems, setting the critic learning rate to a higher value than that of the actor can improve learning results.
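For example, you might configure separate rlRepresentationOptions objects for the actor and critic, giving the critic the higher learning rate (the values shown are illustrative):
criticOpts = rlRepresentationOptions('LearnRate',1e-3);
actorOpts = rlRepresentationOptions('LearnRate',1e-4);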
When creating representations for use with a PPO or DQN agent, you can use recurrent neural networks. These networks are deep neural networks with at least one layer that has hidden state information, such as an lstmLayer. For more information and examples, see rlValueRepresentation, rlQValueRepresentation, rlDeterministicActorRepresentation, and rlStochasticActorRepresentation.
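As a sketch, a recurrent critic path for a DQN agent might replace the imageInputLayer with a sequenceInputLayer and include an lstmLayer (the observation size of 4 and the two discrete actions are assumptions for illustration):
recurrentCritic = [
    sequenceInputLayer(4,'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticFC1')
    reluLayer('Name','CriticRelu1')
    lstmLayer(8,'OutputMode','sequence','Name','CriticLSTM')
    fullyConnectedLayer(2,'Name','output')];      % one Q value per discrete action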
Linear basis function representations have the form f = W'B, where W is a weight array and B is the column vector output of a custom basis function. The learnable parameters of a linear basis function representation are the elements of W.
For critic representations, f is a scalar value and W is a column vector with the same length as B.
For actor representations with a:
Continuous action space, the dimensions of f match the dimensions of the agent action specification, which is either a scalar or a column vector.
Discrete action space, f is a column vector with length equal to the number of discrete actions.
For actor representations, the number of columns in W equals the number of elements in f.
To create a linear basis function representation, first create a custom basis function that returns a column vector. The signature of the basis function depends on the type of agent you are creating. For more information, see rlValueRepresentation, rlQValueRepresentation, rlDeterministicActorRepresentation, and rlStochasticActorRepresentation.
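For example, the following sketch creates a Q-value critic from a hand-written basis function, assuming obsInfo and actInfo were obtained from the environment, a 4-element observation, and a scalar action (the basis itself is purely illustrative):
basisFcn = @(obs,act) [obs(:); act(:); 1];   % hypothetical basis returning a column vector
W0 = zeros(6,1);                             % initial weights, same length as the basis output
critic = rlQValueRepresentation({basisFcn,W0},obsInfo,actInfo);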
For an example that trains a custom agent that uses a linear basis function representation, see Train Custom LQR Agent.
Once you create your actor and critic representations, you can create a reinforcement learning agent that uses these representations. For example, create a PG agent using a given actor representation and baseline critic representation.
agentOpts = rlPGAgentOptions('UseBaseline',true);
agent = rlPGAgent(actor,baseline,agentOpts);
For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
You can obtain the actor and critic representations from an existing agent using getActor and getCritic, respectively.
You can also set the actor and critic of an existing agent using setActor and setCritic, respectively. When you specify a representation using these functions, the input and output layers of the specified representation must match the observation and action specifications of the original agent.
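For example, the following sketch retrieves the critic from an existing agent, lowers its learning rate, and assigns the modified representation back to the agent (the learning-rate value is illustrative):
critic = getCritic(agent);
critic.Options.LearnRate = 1e-4;   % assumes the representation exposes its rlRepresentationOptions
agent = setCritic(agent,critic);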