A reinforcement learning policy is a mapping that selects an action to take based on observations from the environment. During training, the agent tunes the parameters of its policy representation to maximize the long-term reward.
Depending on the type of reinforcement learning agent you are using, you define actor and critic function approximators, which the agent uses to represent and train its policy. The actor represents the policy that selects the best action to take. The critic represents the value function that estimates the long-term reward for the current policy. Depending on your application and selected agent, you can define policy and value functions using deep neural networks, linear basis functions, or lookup tables.
For more information on agents, see Reinforcement Learning Agents.
Depending on the type of agent you are using, Reinforcement Learning Toolbox™ software supports the following types of function approximators:
V(S|θ_V) — Critics that estimate the expected long-term reward based on a given observation S
Q(S,A|θ_Q) — Critics that estimate the expected long-term reward based on a given observation S and action A
Q(S,A_i|θ_Q) — Critics that estimate the expected long-term reward for all possible discrete actions given observation S
μ(S|θ_μ) — Actors that select an action based on a given observation S
Each function approximator has a corresponding set of parameters (θ_V, θ_Q, θ_μ), which are computed during the learning process.
For systems with a limited number of discrete observations and discrete actions, you can store value functions in a lookup table. For systems that have many discrete observations and actions, and for observation and action spaces that are continuous, storing a separate value for every observation or observation-action combination is impractical. For such systems, you can represent your actors and critics using deep neural networks or linear basis functions.
You can create two types of table representations:
Value tables, which store rewards for corresponding observations
Q-tables, which store rewards for corresponding observation-action pairs
To create a table representation, first create a value table or Q-table using the rlTable function. Then, create a representation for the table using either an rlValueRepresentation or rlQValueRepresentation object. To configure the learning rate and optimization used by the representation, use an rlRepresentationOptions object.
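For example, the following sketch creates a Q-table critic for an environment env that is assumed to have discrete observation and action spaces (the learning rate shown is illustrative):
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
qTable = rlTable(obsInfo,actInfo);        % table over all observation-action pairs
critic = rlQValueRepresentation(qTable,obsInfo,actInfo);
critic.Options.LearnRate = 1;             % illustrative learning rate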
You can create actor and critic function approximators using deep neural network representations. Doing so uses Deep Learning Toolbox™ software features.
The dimensions of your actor and critic networks must match the corresponding action and observation specifications from the training environment object. To obtain the action and observation dimensions for environment env, use the getActionInfo and getObservationInfo functions, respectively. Then access the Dimensions property of the specification objects.
actInfo = getActionInfo(env);
actDimensions = actInfo.Dimensions;
obsInfo = getObservationInfo(env);
obsDimensions = obsInfo.Dimensions;
For critic networks that take only observations as inputs, such as those used in AC or PG agents, the dimensions of the input layers must match the dimensions of the environment observation specifications. The output layer of the critic must be a single scalar representing the value function.
For critic networks that take both observations and actions as inputs, such as those used in DQN or DDPG agents, the dimensions of the input layers must match the dimensions of the corresponding environment observation and action specifications.
For actor networks, the dimensions of the input layers must match the dimensions of the environment observation specifications. The required output size depends on the action space, as shown in the sketch after this list. If the actor has a:
Discrete action space, then its output size must equal the number of discrete actions.
Continuous action space, then its output size must be a scalar or vector value, as defined in the action specification.
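For example, the following sketch derives the required actor output size from the action specification obtained earlier (the variable names follow the preceding code; the check against rlFiniteSetSpec is an assumption about how the action space was defined):
if isa(actInfo,'rlFiniteSetSpec')               % discrete action space
    actorOutputSize = numel(actInfo.Elements);  % one output per discrete action
else                                            % continuous action space
    actorOutputSize = prod(actDimensions);      % match the action vector length
end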
Deep neural networks consist of a series of interconnected layers. The following table lists some common deep learning layers used in reinforcement learning applications. For a full list of available layers, see List of Deep Learning Layers (Deep Learning Toolbox).
Layer | Description |
---|---|
imageInputLayer | Input vectors and 2-D images, and normalize the data. |
tanhLayer | Apply a hyperbolic tangent activation layer to the layer inputs. |
reluLayer | Set any input values that are less than zero to zero. |
fullyConnectedLayer | Multiply the input vector by a weight matrix, and add a bias vector. |
convolution2dLayer | Apply sliding convolutional filters to the input. |
additionLayer | Add the outputs of multiple layers together. |
concatenationLayer | Concatenate inputs along a specified dimension. |
The bilstmLayer and batchNormalizationLayer layers are not supported for reinforcement learning.
You can also create your own custom layers. For more information, see Define Custom Deep Learning Layers (Deep Learning Toolbox). Reinforcement Learning Toolbox software provides the following custom layers.
Layer | Description |
---|---|
scalingLayer | Linearly scale and bias an input array. This layer is useful for scaling and shifting the outputs of nonlinear layers, such as tanhLayer and sigmoid. |
quadraticLayer | Create vector of quadratic monomials constructed from the elements of the input array. This layer is useful when you need an output that is some quadratic function of its inputs, such as for an LQR controller. |
softplusLayer | Implement the softplus activation Y = log(1 + e^X), which ensures that the output is always positive. |
The custom layers do not contain tunable parameters; that is, they do not change during training.
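For example, a scalingLayer is commonly placed after a tanhLayer to map the bounded tanh output onto the action range. The following sketch assumes an action range of [-2, 2]:
boundedPath = [
    tanhLayer('Name','ActorTanh')                   % output in [-1, 1]
    scalingLayer('Name','ActorScaling','Scale',2)]; % rescale to [-2, 2]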
For reinforcement learning applications, you construct your deep neural network by connecting a series of layers for each input path (observations or actions) and for each output path (estimated rewards or actions). You then connect these paths together using the connectLayers function.
You can also create your deep neural network using the Deep Network Designer app. For an example, see Create Agent Using Deep Network Designer and Train Using Image Observations.
When you create a deep neural network, you must specify names for the first layer of each input path and the final layer of the output path.
The following code creates and connects the following input and output paths:
An observation input path, observationPath, with the first layer named 'observation'.
An action input path, actionPath, with the first layer named 'action'.
An estimated value function output path, commonPath, which takes the outputs of observationPath and actionPath as inputs. The final layer of this path is named 'output'.
observationPath = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticObsFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticObsFC2')];
actionPath = [
    imageInputLayer([1 1 1],'Normalization','none','Name','action')
    fullyConnectedLayer(24,'Name','CriticActFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticObsFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActFC1','add/in2');
For all observation and action input paths, you must specify an imageInputLayer as the first layer in the path.
You can view the structure of your deep neural network using the plot function.
plot(criticNetwork)
For PG and AC agents, the final output layers of your deep neural network actor representation are a fullyConnectedLayer and a softmaxLayer. When you specify the layers for your network, you must specify the fullyConnectedLayer, and you can optionally specify the softmaxLayer. If you omit the softmaxLayer, the software automatically adds one for you.
Determining the number, type, and size of layers for your deep neural network representation can be difficult and is application dependent. However, the most critical component for any function approximator is whether the function is able to approximate the optimal policy or discounted value function for your application; that is, whether it has layers that can correctly learn the features of your observation, action, and reward signals.
Consider the following tips when constructing your network.
For continuous action spaces, bound actions with a tanhLayer followed by a scalingLayer, if necessary.
Deep dense networks with reluLayer layers can be fairly good at approximating many different functions. Therefore, they are often a good first choice.
When you approximate strong nonlinearities or systems with algebraic constraints, adding more layers is often better than increasing the number of outputs per layer. In general, the ability of the network to represent complex functions grows only polynomially with the number of outputs per layer, but grows exponentially with the number of layers.
For on-policy agents, such as AC and PG agents, parallel training works better if your networks are large (for example, a network with two hidden layers with 32 nodes each, which has a few hundred parameters). On-policy parallel updates assume each worker updates a different part of the network, such as when they explore different areas of the observation space. If the network is small, the worker updates can correlate with each other and make training unstable.
To create a critic representation for your deep neural network, use an rlValueRepresentation or rlQValueRepresentation object. To create an actor representation for your deep neural network, use an rlDeterministicActorRepresentation or rlStochasticActorRepresentation object. To configure the learning rate and optimization used by the representation, use an rlRepresentationOptions object.
For example, create a Q-value representation object for the critic network criticNetwork, specifying a learning rate of 0.0001. When you create the representation, pass the environment action and observation specifications to the rlQValueRepresentation object, and specify the names of the network layers to which the actions and observations are connected.
opt = rlRepresentationOptions('LearnRate',0.0001);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'action'},opt);
When you create your deep neural network and configure your representation object, consider using one of the following approaches as a starting point.
Start with the smallest possible network and a high learning rate (0.01). Train this initial network to see whether the agent converges quickly to a poor policy or acts in a random manner. If either of these issues occurs, rescale the network by adding more layers or more outputs on each layer. Your goal is to find a network structure that is just big enough, does not learn too fast, and shows signs of learning (an improving trajectory of the reward graph) after an initial training period.
A low initial learning rate can allow you to see if the agent is on the right track, and help you check that your network architecture is satisfactory for the problem. For difficult problems, tuning parameters is much easier once you settle on a good network architecture.
Also, consider the following tips when configuring your deep neural network representation.
Be patient with DDPG and DQN agents, since they might not learn anything for some time during the early episodes, and they typically show a dip in cumulative reward early in the training process. Eventually, they can show signs of learning after the first few thousand episodes.
For DDPG and DQN agents, promoting exploration of the agent is critical.
For agents with both actor and critic networks, set the initial learning rates of both representations to the same value. For some problems, setting the critic learning rate to a higher value than that of the actor can improve learning results.
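For example, you might configure separate rlRepresentationOptions objects for the actor and critic, giving the critic the higher learning rate (the values shown are illustrative):
criticOpts = rlRepresentationOptions('LearnRate',1e-3);
actorOpts = rlRepresentationOptions('LearnRate',1e-4);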
When creating representations for use with a PPO or DQN agent, you can use recurrent neural networks. These networks are deep neural networks with at least one layer that has hidden state information, such as an lstmLayer. For more information and examples, see rlValueRepresentation, rlQValueRepresentation, rlDeterministicActorRepresentation, and rlStochasticActorRepresentation.
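As a sketch, a recurrent critic path for a DQN agent might replace the imageInputLayer with a sequenceInputLayer and include an lstmLayer (the observation size of 4 and the two discrete actions are assumptions for illustration):
recurrentCritic = [
    sequenceInputLayer(4,'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticFC1')
    reluLayer('Name','CriticRelu1')
    lstmLayer(8,'OutputMode','sequence','Name','CriticLSTM')
    fullyConnectedLayer(2,'Name','output')];      % one Q value per discrete action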
Linear basis function representations have the form f = W'B, where W is a weight array and B is the column vector output of a custom basis function. The learnable parameters of a linear basis function representation are the elements of W.
For critic representations, f is a scalar value and W is a column vector with the same length as B.
For actor representations with a:
Continuous action space, the dimensions of f match the dimensions of the agent action specification, which is either a scalar or a column vector.
Discrete action space, f is a column vector with length equal to the number of discrete actions.
For actor representations, the number of columns in W equals the number of elements in f.
To create a linear basis function representation, first create a custom basis function that returns a column vector. The signature of the basis function depends on the type of agent you are creating. For more information, see rlValueRepresentation, rlQValueRepresentation, rlDeterministicActorRepresentation, and rlStochasticActorRepresentation.
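For example, the following sketch creates a Q-value critic from a hand-written basis function, assuming obsInfo and actInfo were obtained from the environment, a 4-element observation, and a scalar action (the basis itself is purely illustrative):
basisFcn = @(obs,act) [obs(:); act(:); 1];   % hypothetical basis returning a column vector
W0 = zeros(6,1);                             % initial weights, same length as the basis output
critic = rlQValueRepresentation({basisFcn,W0},obsInfo,actInfo);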
For an example that trains a custom agent that uses a linear basis function representation, see Train Custom LQR Agent.
Once you create your actor and critic representations, you can create a reinforcement learning agent that uses these representations. For example, create a PG agent using a given actor representation and baseline critic representation.
agentOpts = rlPGAgentOptions('UseBaseline',true);
agent = rlPGAgent(actor,baseline,agentOpts);
For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
You can obtain the actor and critic representations from an existing agent using getActor and getCritic, respectively.
You can also set the actor and critic of an existing agent using setActor and setCritic, respectively. When you specify a representation using these functions, the input and output layers of the specified representation must match the observation and action specifications of the original agent.
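For example, the following sketch retrieves the critic from an existing agent, lowers its learning rate, and assigns the modified representation back to the agent (the learning-rate value is illustrative):
critic = getCritic(agent);
critic.Options.LearnRate = 1e-4;   % assumes the representation exposes its rlRepresentationOptions
agent = setCritic(agent,critic);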