Reinforcement Learning Agents

The goal of reinforcement learning is to train an agent to complete a task within an uncertain environment. The agent receives observations and a reward from the environment and sends actions to the environment. The reward is a measure of how successful an action is with respect to completing the task goal.

The agent contains two components: a policy and a learning algorithm.

  • The policy is a mapping that selects actions based on the observations from the environment. Typically, the policy is a function approximator with tunable parameters, such as a deep neural network.

  • The learning algorithm continuously updates the policy parameters based on the actions, observations, and rewards. The goal of the learning algorithm is to find an optimal policy that maximizes the expected cumulative long-term reward received during the task.

Depending on the learning algorithm, an agent maintains one or more parameterized function approximators for training the policy. Approximators can be used in two ways.

  • Critics — For a given observation and action, a critic returns as output the expected value of the cumulative long-term reward for the task.

  • Actors — For a given observation, an actor returns as output the action that maximizes the expected cumulative long-term reward.

Agents that use only critics to select their actions rely on an indirect policy representation. These agents are also referred to as value-based, and they use an approximator to represent a value function or Q-value function. In general, these agents work better with discrete action spaces but can become computationally expensive for continuous action spaces.

Agents that use only actors to select their actions rely on a direct policy representation. These agents are also referred to as policy-based. The policy can be either deterministic or stochastic. In general, these agents are simpler and can handle continuous action spaces, though the training algorithm can be sensitive to noisy measurements and can converge on local minima.

Agents that use both an actor and a critic are referred to as actor-critic agents. In these agents, during training, the actor learns the best action to take using feedback from the critic (instead of using the reward directly). At the same time, the critic learns the value function from the rewards so that it can properly criticize the actor. In general, these agents can handle both discrete and continuous action spaces.

Built-In Agents

Reinforcement Learning Toolbox™ software provides the following built-in agents. You can train these agents in environments with either continuous or discrete observation spaces and the action spaces listed below.

The following tables summarize the type, action space, and required representations for each built-in agent.

Built-in Agents: Type and Action Space

Agent | Type | Action Space
Q-Learning Agents (Q) | Value-Based | Discrete
Deep Q-Network Agents (DQN) | Value-Based | Discrete
SARSA Agents | Value-Based | Discrete
Policy Gradient Agents (PG) | Policy-Based | Discrete or continuous
Actor-Critic Agents (AC) | Actor-Critic | Discrete or continuous
Proximal Policy Optimization Agents (PPO) | Actor-Critic | Discrete or continuous
Deep Deterministic Policy Gradient Agents (DDPG) | Actor-Critic | Continuous
Twin-Delayed Deep Deterministic Policy Gradient Agents (TD3) | Actor-Critic | Continuous
Soft Actor-Critic Agents (SAC) | Actor-Critic | Continuous

Built-in Agents: Representation that You Must Use with Each Agent

Representation | Q, DQN, SARSA | PG | AC, PPO | SAC | DDPG, TD3
Value function critic V(S), which you create using rlValueRepresentation | - | X (if baseline is used) | X | - | -
Q-value function critic Q(S,A), which you create using rlQValueRepresentation | X | - | - | X | X
Deterministic policy actor π(S), which you create using rlDeterministicActorRepresentation | - | - | - | - | X
Stochastic policy actor π(S), which you create using rlStochasticActorRepresentation | - | X | X | X | -
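
For example, the following sketch (not taken from the toolbox documentation; the observation and action specifications are hypothetical placeholders) creates a Q-value function critic with rlQValueRepresentation and builds a DQN agent from it:

    % Hypothetical specifications: 4 continuous observations, 2 discrete actions.
    obsInfo = rlNumericSpec([4 1]);
    actInfo = rlFiniteSetSpec([-1 1]);

    % Multi-output Q-network: one Q-value output per discrete action.
    net = [
        featureInputLayer(4,'Name','state')
        fullyConnectedLayer(24)
        reluLayer
        fullyConnectedLayer(numel(actInfo.Elements),'Name','qValues')];

    % Create the critic representation and a DQN agent that uses it.
    critic = rlQValueRepresentation(net,obsInfo,actInfo,'Observation',{'state'});
    agent = rlDQNAgent(critic);

Because the network has one output per discrete action, the critic returns the Q-values of all actions for a given observation in a single forward pass.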

Agent with default networks — All agents except Q-Learning and SARSA support default networks for actors and critics. You can create an agent with default actor and critic representations based on the observation and action specifications from the environment. To do so, perform the following steps; a minimal sketch follows the list.

  1. Create observation specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getObservationInfo.

  2. Create action specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getActionInfo.

  3. If needed, specify the number of neurons in each learnable layer or whether to use an LSTM layer. To do so, create an agent initialization option object using rlAgentInitializationOptions.

  4. If needed, specify agent options by creating an options object set for the specific agent.

  5. Create the agent using the corresponding agent creation function. The resulting agent contains the appropriate actor and critic representations listed in the table above. The actor and critic use default agent-specific deep neural networks as internal approximators.
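
The following is a minimal sketch of these steps, assuming the predefined 'CartPole-Discrete' environment that ships with the toolbox and a PPO agent; the option values are arbitrary examples, and steps 3 and 4 can be omitted entirely:

    % Steps 1-2: obtain observation and action specifications from an environment.
    env = rlPredefinedEnv('CartPole-Discrete');
    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    % Step 3 (optional): set the size of the hidden layers in the default networks.
    initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);

    % Step 4 (optional): set agent-specific options.
    agentOpts = rlPPOAgentOptions('ExperienceHorizon',512);

    % Step 5: create a PPO agent with default actor and critic networks.
    agent = rlPPOAgent(obsInfo,actInfo,initOpts,agentOpts);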

For more information on creating actor and critic function approximators, see Create Policy and Value Function Representations.

Choose the Type of Agent

When choosing an agent, a best practice is to start with a simpler (and faster to train) algorithm that is compatible with your action and observation spaces. You can then try progressively more complicated algorithms if the simpler ones do not perform as desired. A short sketch after the following list illustrates this starting point for a continuous action space.

  • Discrete action and observation space — For environments with a discrete action and observation space, the Q-learning and SARSA agents are the simplest compatible agents, followed by DQN and then PPO.

    (Figure: Q-learning / SARSA → DQN → PPO)

  • Discrete action space and continuous observation space — For environments with a discrete action space and a continuous observation space, DQN is the simplest compatible agent, followed by PPO.

    (Figure: DQN → PPO)

  • Continuous action space — For environments with both a continuous action and observation space, DDPG is the simplest compatible agent, followed by TD3, PPO, and SAC. For such environments, try DDPG first. In general:

    • TD3 is an improved, more complex version of DDPG.

    • PPO has more stable updates but requires more training.

    • SAC is an improved, more complex version of DDPG that generates stochastic policies.

    (Figure: DDPG → TD3, PPO, SAC)
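
For example, for a continuous action space you might start with a default-network DDPG agent and move on to TD3, PPO, or SAC only if it underperforms. A minimal sketch, assuming the predefined 'CartPole-Continuous' environment:

    % Start with the simplest compatible agent (DDPG) using default networks.
    env = rlPredefinedEnv('CartPole-Continuous');
    agent = rlDDPGAgent(getObservationInfo(env),getActionInfo(env));

Because DDPG, TD3, PPO, and SAC all support default networks, trying a more complex agent later mostly amounts to changing the creation function (for example, rlTD3Agent or rlSACAgent) and its options object.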

Custom Agents

You can also train policies using other learning algorithms by creating a custom agent. To do so, you create a subclass of a custom agent class, defining the agent behavior using a set of required and optional methods. For more information, see Custom Agents.
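
For illustration, here is a minimal, hypothetical skeleton of such a subclass (assuming the rl.agent.CustomAgent abstract class and its required methods getActionImpl, getActionWithExplorationImpl, and learnImpl): an agent that picks random actions from a discrete action space and performs no learning.

    classdef RandomDiscreteAgent < rl.agent.CustomAgent
        % Hypothetical example agent: selects random actions from a discrete
        % action space and does not learn. A real agent would update its
        % policy parameters inside learnImpl.
        methods
            function obj = RandomDiscreteAgent(obsInfo,actInfo)
                % Call the abstract class constructor, then store the
                % observation and action specifications.
                obj = obj@rl.agent.CustomAgent();
                obj.ObservationInfo = obsInfo;
                obj.ActionInfo = actInfo;
            end
        end
        methods (Access = protected)
            function action = getActionImpl(obj,~)
                % Action returned when simulating or deploying the agent.
                action = randomAction(obj);
            end
            function action = getActionWithExplorationImpl(obj,~)
                % Action returned during training (exploration).
                action = randomAction(obj);
            end
            function action = learnImpl(obj,~)
                % Update the policy from the latest experience (omitted
                % here) and return the next action to take.
                action = randomAction(obj);
            end
        end
        methods (Access = private)
            function action = randomAction(obj)
                % Pick one of the discrete actions at random.
                elements = obj.ActionInfo.Elements;
                action = elements(randi(numel(elements)));
            end
        end
    end

You would then construct the agent from your environment specifications, for example agent = RandomDiscreteAgent(getObservationInfo(env),getActionInfo(env)), and use it with sim or train like a built-in agent.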
