A reinforcement learning policy is a mapping that selects the action that the agent takes based on observations from the environment. During training, the agent tunes the parameters of its policy representation to maximize the expected cumulative long-term reward.
Reinforcement learning agents estimate policies and value functions using function approximators called actor and critic representations respectively. The actor represents the policy that selects the best action to take, based on the current observation. The critic represents the value function that estimates the expected cumulative long-term reward for the current policy.
Before creating an agent, you must create the required actor and critic representations using deep neural networks, linear basis functions, or lookup tables. The type of function approximators you use depends on your application.
For more information on agents, see Reinforcement Learning Agents.
The Reinforcement Learning Toolbox™ software supports the following types of representations:
V(S|θV) — Critics that estimate the expected cumulative long-term reward based on a given observation S. You can create these critics using rlValueRepresentation.
Q(S,A|θQ) — Critics that estimate the expected cumulative long-term reward based on a given observation S and action A. You can create these critics using rlQValueRepresentation.
Q(S|θQ) — Multi-output critics that estimate the expected cumulative long-term reward for all possible discrete actions Ai given observation S. You can create these critics using rlQValueRepresentation.
μ(S|θμ) — Actors that select an action based on a given observation S. You can create these actors using either rlDeterministicActorRepresentation or rlStochasticActorRepresentation.
Each representation uses a function approximator with a corresponding set of parameters (θV, θQ, θμ), which are computed during the learning process.
For systems with a limited number of discrete observations and discrete actions, you can store value functions in a lookup table. For systems that have many discrete observations and actions, and for observation and action spaces that are continuous, storing a table entry for every observation and action is impractical. For such systems, you can represent your actors and critics using deep neural networks or custom (linear in the parameters) basis functions.
The following table summarizes the way in which you can use the four representation objects available with the Reinforcement Learning Toolbox software, depending on the action and observation spaces of your environment, and on the approximator and agent that you want to use.
Representations vs. Approximators and Agents
Representation | Supported Approximators | Observation Space | Action Space | Supported Agents |
---|---|---|---|---|
Value function critic, V(S), which you create using rlValueRepresentation | Table | Discrete | Not applicable | PG, AC, PPO |
Value function critic, V(S), which you create using rlValueRepresentation | Deep neural network or custom basis function | Discrete or continuous | Not applicable | PG, AC, PPO |
Q-value function critic, Q(S,A), which you create using rlQValueRepresentation | Table | Discrete | Discrete | Q, DQN, SARSA |
Q-value function critic, Q(S,A), which you create using rlQValueRepresentation | Deep neural network or custom basis function | Discrete or continuous | Discrete | Q, DQN, SARSA |
Q-value function critic, Q(S,A), which you create using rlQValueRepresentation | Deep neural network or custom basis function | Discrete or continuous | Continuous | DDPG, TD3 |
Multi-output Q-value function critic, Q(S,A), which you create using rlQValueRepresentation | Deep neural network or custom basis function | Discrete or continuous | Discrete | Q, DQN, SARSA |
Deterministic policy actor, π(S), which you create using rlDeterministicActorRepresentation | Deep neural network or custom basis function | Discrete or continuous | Continuous | DDPG, TD3 |
Stochastic policy actor, π(S), which you create using rlStochasticActorRepresentation | Deep neural network or custom basis function | Discrete or continuous | Discrete | PG, AC, PPO |
Stochastic policy actor, π(S), which you create using rlStochasticActorRepresentation | Deep neural network | Discrete or continuous | Continuous | PG, AC, PPO, SAC |
For more information on agents, see Reinforcement Learning Agents.
Representations based on lookup tables are appropriate for environments with a limited number of discrete observations and actions. You can create two types of lookup table representations:
Value tables, which store rewards for corresponding observations
Q-tables, which store rewards for corresponding observation-action pairs
To create a table representation, first create a value table or Q-table using the rlTable function. Then, create a representation for the table using either an rlValueRepresentation or rlQValueRepresentation object. To configure the learning rate and optimization used by the representation, use an rlRepresentationOptions object.
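For example, the following sketch creates a Q-table critic for an environment env with discrete observation and action spaces. The environment variable and the learning rate value are assumptions chosen for illustration.
% Get the discrete observation and action specifications from the environment.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
% Create a Q-table spanning all observation-action pairs.
qTable = rlTable(obsInfo,actInfo);
% Create the critic representation, configuring its learning rate.
opt = rlRepresentationOptions('LearnRate',0.01);
critic = rlQValueRepresentation(qTable,obsInfo,actInfo,opt);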
You can create actor and critic function approximators using deep neural networks. Doing so requires Deep Learning Toolbox™ software.
The dimensions of your actor and critic networks must match the corresponding action and observation specifications from the training environment object. To obtain the action and observation dimensions for environment env, use the getActionInfo and getObservationInfo functions, respectively. Then access the Dimensions property of the specification objects.
actInfo = getActionInfo(env);
actDimensions = actInfo.Dimensions;
obsInfo = getObservationInfo(env);
obsDimensions = obsInfo.Dimensions;
Networks for value function critics (such as the ones used in AC, PG, or PPO agents) must take only observations as inputs and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment observation specifications. For more information, see rlValueRepresentation.
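For example, the following sketch shows a value function critic network for a hypothetical environment with a 4-element observation vector. The layer sizes and names are illustrative assumptions.
% Value function critic network: observation in, scalar value out.
% Assumes obsInfo describes a [4 1] observation (see getObservationInfo above).
valueNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(32,'Name','CriticFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(1,'Name','CriticOutput')];
critic = rlValueRepresentation(valueNetwork,obsInfo,'Observation',{'observation'});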
Networks for single-output Q-value function critics (such as the ones used in Q, DQN, SARSA, DDPG, TD3, and SAC agents) must take both observations and actions as inputs, and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment specifications for both observations and actions. For more information, see rlQValueRepresentation.
Networks for multi-output Q-value function critics (such as those used in Q, DQN, and SARSA agents) take only observations as inputs and must have a single output layer with output size equal to the number of discrete actions. For these networks, the dimensions of the input layers must match the dimensions of the environment observation specifications. For more information, see rlQValueRepresentation.
For actor networks, the dimensions of the input layers must match the dimensions of the environment observation specifications.
Networks used in actors with a discrete action space (such as the ones in PG, AC, and PPO agents) must have a single output layer with an output size equal to the number of possible discrete actions.
Networks used in deterministic actors with a continuous action space (such as the ones in DDPG and TD3 agents) must have a single output layer with an output size matching the dimension of the action space defined in the environment action specification.
Networks used in stochastic actors with a continuous action space (such as the ones in PG, AC, PPO, and SAC agents) must have a single output layer with output size having twice the dimension of the action space defined in the environment action specification. These networks must have two separate paths, the first producing the mean values (which must be scaled to the output range) and the second producing the standard deviations (which must be non-negative).
For more information, see rlDeterministicActorRepresentation and rlStochasticActorRepresentation.
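As an illustrative sketch, the following code builds a deterministic actor network for a hypothetical environment with a 4-element observation and a scalar action bounded in [-2, 2]. The layer sizes, names, and action bounds are assumptions chosen for illustration.
% Deterministic actor network: observation in, bounded continuous action out.
% The tanhLayer bounds the output to (-1,1); the scalingLayer rescales it to (-2,2).
actorNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(1,'Name','ActorFC2')
    tanhLayer('Name','ActorTanh')
    scalingLayer('Name','ActorScaling','Scale',2)];
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'ActorScaling'});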
Deep neural networks consist of a series of interconnected layers. The following table lists some common deep learning layers used in reinforcement learning applications. For a full list of available layers, see List of Deep Learning Layers.
Layer | Description |
---|---|
featureInputLayer | Inputs feature data and applies normalization. |
imageInputLayer | Inputs vectors and 2-D images and applies normalization. |
sigmoidLayer | Applies a sigmoid function to the input such that the output is bounded in the interval (0,1). |
tanhLayer | Applies a hyperbolic tangent activation layer to the input. |
reluLayer | Sets any input values that are less than zero to zero. |
fullyConnectedLayer | Multiplies the input vector by a weight matrix, and adds a bias vector. |
convolution2dLayer | Applies sliding convolutional filters to the input. |
additionLayer | Adds the outputs of multiple layers together. |
concatenationLayer | Concatenates inputs along a specified dimension. |
sequenceInputLayer | Provides input sequence data to a network. |
lstmLayer | Applies a Long Short-Term Memory layer to the input. Supported for DQN and PPO agents. |
The bilstmLayer and batchNormalizationLayer layers are not supported for reinforcement learning.
The Reinforcement Learning Toolbox software provides the following layers, which contain no tunable parameters (that is, parameters that change during training).
Layer | Description |
---|---|
scalingLayer | Applies a linear scale and bias to an input array. This layer is useful for scaling and shifting the outputs of nonlinear layers, such as tanhLayer and sigmoidLayer. |
quadraticLayer | Creates a vector of quadratic monomials constructed from the elements of the input array. This layer is useful when you need an output that is some quadratic function of its inputs, such as for an LQR controller. |
softplusLayer | Implements the softplus activation Y = log(1 + e^X), which ensures that the output is always positive. This is a smoothed version of the rectified linear unit (ReLU). |
You can also create your own custom layers. For more information, see Define Custom Deep Learning Layers.
For reinforcement learning applications, you construct your deep neural network by connecting a series of layers for each input path (observations or actions) and for each output path (estimated rewards or actions). You then connect these paths together using the connectLayers function.
You can also create your deep neural network using the Deep Network Designer app. For an example, see Create Agent Using Deep Network Designer and Train Using Image Observations.
When you create a deep neural network, you must specify names for the first layer of each input path and the final layer of the output path.
The following code creates and connects the following input and output paths:
An observation input path, observationPath, with the first layer named 'observation'.
An action input path, actionPath, with the first layer named 'action'.
An estimated value function output path, commonPath, which takes the outputs of observationPath and actionPath as inputs. The final layer of this path is named 'output'.
observationPath = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticObsFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticObsFC2')];
actionPath = [
    imageInputLayer([1 1 1],'Normalization','none','Name','action')
    fullyConnectedLayer(24,'Name','CriticActFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticObsFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActFC1','add/in2');
For all observation and action input paths, you must specify an imageInputLayer as the first layer in the path.
You can view the structure of your deep neural network using the plot function.
plot(criticNetwork)
For PG and AC agents, the final output layers of your deep neural network actor representation are a fullyConnectedLayer and a softmaxLayer. When you specify the layers for your network, you must specify the fullyConnectedLayer and you can optionally specify the softmaxLayer. If you omit the softmaxLayer, the software automatically adds one for you.
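As an illustrative sketch, assuming a 4-element observation and two possible discrete actions (layer sizes and names chosen arbitrarily), such an actor might be built as follows.
% Discrete-action actor network for a PG or AC agent. The final
% fullyConnectedLayer has one output per discrete action; the softmaxLayer
% converts these values to action probabilities and can be omitted.
actorNetwork = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','ActorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(2,'Name','ActorFC2')
    softmaxLayer('Name','ActorProb')];
actor = rlStochasticActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'observation'});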
Determining the number, type, and size of layers for your deep neural network representation can be difficult and is application dependent. However, the most critical component in deciding the characteristics of the function approximator is whether it is able to approximate the optimal policy or discounted value function for your application, that is, whether it has layers that can correctly learn the features of your observation, action, and reward signals.
Consider the following tips when constructing your network.
For continuous action spaces, bound actions with a tanhLayer followed by a scalingLayer, if necessary.
Deep dense networks with reluLayer layers can be fairly good at approximating many different functions. Therefore, they are often a good first choice.
Start with the smallest possible network that you think can approximate the optimal policy or value function.
When you approximate strong nonlinearities or systems with algebraic constraints, adding more layers is often better than increasing the number of outputs per layer. In general, the ability of the approximator to represent more complex functions grows only polynomially in the size of the layers, but grows exponentially with the number of layers. In other words, more layers allow approximating more complex and nonlinear compositional functions, although this generally requires more data and longer training times. Networks with fewer layers can require exponentially more units to successfully approximate the same class of functions, and might fail to learn and generalize correctly.
For on-policy agents (the ones that learn only from experience collected while following the current policy), such as AC and PG agents, parallel training works better if your networks are large (for example, a network with two hidden layers with 32 nodes each, which has a few hundred parameters). On-policy parallel updates assume each worker updates a different part of the network, such as when they explore different areas of the observation space. If the network is small, the worker updates can correlate with each other and make training unstable.
To create a critic representation for your deep neural network, use an rlValueRepresentation or rlQValueRepresentation object. To create an actor representation for your deep neural network, use an rlDeterministicActorRepresentation or rlStochasticActorRepresentation object. To configure the learning rate and optimization used by the representation, use an rlRepresentationOptions object.
For example, create a Q-value representation object for the critic network criticNetwork, specifying a learning rate of 0.0001. When you create the representation, pass the environment action and observation specifications to the rlQValueRepresentation object, and specify the names of the network layers to which the observations and actions are connected (in this case 'observation' and 'action').
opt = rlRepresentationOptions('LearnRate',0.0001);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'action'},opt);
When you create your deep neural network and configure your representation object, consider using the following approach as a starting point.
Start with the smallest possible network and a high learning rate (0.01). Train this initial network to see if the agent converges quickly to a poor policy or acts in a random manner. If either of these issues occurs, rescale the network by adding more layers or more outputs on each layer. Your goal is to find a network structure that is just big enough, does not learn too fast, and shows signs of learning (an improving trajectory of the reward graph) after an initial training period.
Once you settle on a good network architecture, a low initial learning rate can allow you to see if the agent is on the right track, and help you check that your network architecture is satisfactory for the problem. A low learning rate makes tuning the parameters much easier, especially for difficult problems.
Also, consider the following tips when configuring your deep neural network representation.
Be patient with DDPG and DQN agents, since they might not learn anything for some time during the early episodes, and they typically show a dip in cumulative reward early in the training process. Eventually, they can show signs of learning after the first few thousand episodes.
For DDPG and DQN agents, promoting exploration of the agent is critical.
For agents with both actor and critic networks, set the initial learning rates of both representations to the same value. For some problems, setting the critic learning rate to a higher value than that of the actor can improve learning results.
When creating representations for use with a PPO or DQN agent, you can use recurrent neural networks. These networks are deep neural networks with a sequenceInputLayer input layer and at least one layer that has hidden state information, such as an lstmLayer. They can be especially useful when the environment has states that cannot be included in the observation vector. For more information and examples, see rlValueRepresentation, rlQValueRepresentation, rlDeterministicActorRepresentation, and rlStochasticActorRepresentation.
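As a minimal sketch, assuming a 4-element observation and four possible discrete actions (layer sizes chosen arbitrarily), a recurrent multi-output Q-value critic network might be built as follows.
% Recurrent multi-output Q-value critic: sequence of observations in,
% one Q-value per discrete action out.
recurrentCriticNetwork = [
    sequenceInputLayer(4,'Normalization','none','Name','observation')
    fullyConnectedLayer(32,'Name','CriticFC1')
    reluLayer('Name','CriticRelu1')
    lstmLayer(16,'OutputMode','sequence','Name','CriticLSTM')
    fullyConnectedLayer(4,'Name','CriticOutput')];
critic = rlQValueRepresentation(recurrentCriticNetwork,obsInfo,actInfo,...
    'Observation',{'observation'});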
Custom (linear in the parameters) basis function approximators have the form f = W'B, where W is a weight array and B is the column vector output of a custom basis function that you must create. The learnable parameters of a linear basis function representation are the elements of W.
For value function critic representations (such as the ones used in AC, PG, or PPO agents), f is a scalar value, so W must be a column vector with the same length as B, and B must be a function of the observation. For more information, see rlValueRepresentation.
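For example, the following sketch creates a value function critic from a custom quadratic basis function, assuming obsInfo describes a 2-element observation vector. The basis choice and the initial weights are illustrative assumptions.
% Custom basis function B(obs) and initial weight vector W0 (same length as B).
basisFcn = @(obs) [obs(1); obs(2); obs(1)*obs(2); obs(1)^2; obs(2)^2];
W0 = zeros(5,1);
critic = rlValueRepresentation({basisFcn,W0},obsInfo);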
For single-output Q-value function critic representations (such as the ones used in Q, DQN, SARSA, DDPG, TD3, and SAC agents), f is a scalar value, so W must be a column vector with the same length as B, and B must be a function of both the observation and action. For more information, see rlQValueRepresentation.
For multi-output Q-value function critic representations with discrete action spaces (such as those used in Q, DQN, and SARSA agents), f is a vector with as many elements as the number of possible actions. Therefore, W must be a matrix with as many columns as the number of possible actions and as many rows as the length of B. B must be only a function of the observation. For more information, see rlQValueRepresentation.
For actors with a discrete action space (such as the ones in PG, AC, and PPO agents), f must be a column vector with length equal to the number of possible discrete actions.
For deterministic actors with a continuous action space (such as the ones in DDPG and TD3 agents), the dimensions of f must match the dimensions of the agent action specification, which is either a scalar or a column vector.
Stochastic actors with continuous action spaces cannot rely on custom basis functions (they can only use neural network approximators, due to the need to enforce positivity for the standard deviations).
For any actor representation, W must have as many columns as the number of elements in f, and as many rows as the number of elements in B. B must be only a function of the observation.
For more information, see rlDeterministicActorRepresentation and rlStochasticActorRepresentation.
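As another sketch, assuming obsInfo describes a 2-element observation and actInfo a scalar continuous action, a custom basis deterministic actor might be created as follows. The basis function and initial weights are illustrative assumptions.
% Custom basis function for the actor and initial weights W0.
% W0 has as many rows as elements of B and as many columns as action elements.
actorBasisFcn = @(obs) [obs(1); obs(2); 1];
W0 = zeros(3,1);
actor = rlDeterministicActorRepresentation({actorBasisFcn,W0},obsInfo,actInfo);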
For an example that trains a custom agent that uses a linear basis function representation, see Train Custom LQR Agent.
Once you create your actor and critic representations, you can create a reinforcement learning agent that uses these representations. For example, create a PG agent using a given actor representation and a baseline critic representation.
agentOpts = rlPGAgentOptions('UseBaseline',true);
agent = rlPGAgent(actor,baseline,agentOpts);
For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
You can obtain the actor and critic representations from an existing agent using getActor and getCritic, respectively.
You can also set the actor and critic of an existing agent using setActor and setCritic, respectively. When you specify a representation for an existing agent using these functions, the input and output layers of the specified representation must match the observation and action specifications of the original agent.
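For example, a minimal sketch, assuming agent is an existing agent object (such as one created with rlDQNAgent):
critic = getCritic(agent);                 % extract the critic representation
params = getLearnableParameters(critic);   % inspect its learnable parameters
agent = setCritic(agent,critic);           % reassign the (possibly modified) critic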