Train DDPG Agent for Adaptive Cruise Control

This example shows how to train a deep deterministic policy gradient (DDPG) agent for adaptive cruise control (ACC) in Simulink®. For more information on DDPG agents, see Deep Deterministic Policy Gradient Agents.

Simulink Model

The reinforcement learning environment for this example consists of the simple longitudinal dynamics of an ego car and a lead car. The training goal is to make the ego car travel at a set velocity while maintaining a safe distance from the lead car by controlling longitudinal acceleration and braking. This example uses the same vehicle model as the Adaptive Cruise Control System Using Model Predictive Control (Model Predictive Control Toolbox) example.

Specify the initial position and velocity for the two vehicles.

x0_lead = 50;   % initial position for lead car (m)
v0_lead = 25;   % initial velocity for lead car (m/s)
x0_ego = 10;    % initial position for ego car (m)
v0_ego = 20;    % initial velocity for ego car (m/s)

Specify the standstill default spacing (m), time gap (s), and driver-set velocity (m/s).

D_default = 10;
t_gap = 1.4;
v_set = 30;

Considering the physical limitations of the vehicle dynamics, the acceleration is constrained to the range [-3,2] (m/s^2).

amin_ego = -3;
amax_ego = 2;

Define the sample time Ts and simulation duration Tf in seconds.

Ts = 0.1;
Tf = 60;

Open the model.

mdl = 'rlACCMdl';
open_system(mdl)
agentblk = [mdl '/RL Agent'];

For this model:

  • The acceleration action signal from the agent to the environment is from -3 to 2 m/s^2.

  • The reference velocity for the ego car, Vref, is defined as follows. If the relative distance is less than the safe distance, the ego car tracks the minimum of the lead car velocity and the driver-set velocity. In this manner, the ego car maintains some distance from the lead car. If the relative distance is greater than the safe distance, the ego car tracks the driver-set velocity. In this example, the safe distance is defined as a linear function of the ego car longitudinal velocity V; that is, D_default + t_gap*V. The safe distance determines the reference tracking velocity for the ego car. (This selection logic is sketched in the code that follows this list.)

  • The observations from the environment are the velocity error e = Vref - Vego, its integral ∫e dt, and the ego car longitudinal velocity V.

  • The simulation terminates when the longitudinal velocity of the ego car is less than 0, or when the relative distance between the lead car and the ego car becomes less than 0.

  • The reward r_t, provided at every time step t, is:

r_t = -(0.1*e_t^2 + u_(t-1)^2) + M_t

where u_(t-1) is the control input from the previous time step. The logical value M_t = 1 if the velocity error satisfies e_t^2 <= 0.25; otherwise, M_t = 0.
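
The following MATLAB sketch is not part of the Simulink model; it only illustrates the reference-velocity selection and reward computation described above. The variables d_rel, v_lead, v_ego, and u_prev are hypothetical stand-ins for the corresponding model signals.

% Illustrative sketch only; d_rel, v_lead, v_ego, and u_prev are hypothetical signals.
d_safe = D_default + t_gap*v_ego;    % safe distance grows with ego velocity
if d_rel < d_safe
    v_ref = min(v_lead,v_set);       % keep some distance from the lead car
else
    v_ref = v_set;                   % track the driver-set velocity
end
e = v_ref - v_ego;                   % velocity error (one of the observations)
M = double(e^2 <= 0.25);             % bonus when the squared error is small
r = -(0.1*e^2 + u_prev^2) + M;       % reward at the current time step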

Create Environment Interface

Create a reinforcement learning environment interface for the model.

% create the observation info
observationInfo = rlNumericSpec([3 1],'LowerLimit',-inf*ones(3,1),'UpperLimit',inf*ones(3,1));
observationInfo.Name = 'observations';
observationInfo.Description = 'information on velocity error and ego velocity';
% action Info
actionInfo = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
actionInfo.Name = 'acceleration';
% define environment
env = rlSimulinkEnv(mdl,agentblk,observationInfo,actionInfo);
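
Optionally, you can confirm the specifications attached to the environment by querying them back from the interface.

% optional check: retrieve the specifications from the environment interface
getObservationInfo(env)
getActionInfo(env)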

To randomize the initial position of the lead car at the start of each training episode, specify an environment reset function using an anonymous function handle. The function localResetFcn is defined at the end of this example.

% randomize the initial position of the lead car
env.ResetFcn = @(in)localResetFcn(in);

Fix the random generator seed for reproducibility.

rng('default')

Create DDPG Agent

A DDPG agent approximates the long-term reward given observations and actions using a critic value function representation. To create the critic, first create a deep neural network with two inputs, the state and action, and one output. For more information on creating a neural network value function representation, see Create Policy and Value Function Representations.

L = 48; % number of neurons
statePath = [
    imageInputLayer([3 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(L,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(L,'Name','fc2')
    additionLayer(2,'Name','add')
    reluLayer('Name','relu2')
    fullyConnectedLayer(L,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(1,'Name','fc4')];

actionPath = [
    imageInputLayer([1 1 1],'Normalization','none','Name','action')
    fullyConnectedLayer(L, 'Name', 'fc5')];

criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork, actionPath);
    
criticNetwork = connectLayers(criticNetwork,'fc5','add/in2');

View the critic network configuration.

plot(criticNetwork)

Specify options for the critic representation using rlRepresentationOptions.

criticOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1,'L2RegularizationFactor',1e-4);

Create the critic representation using the specified neural network and options. You must also specify the action and observation info for the critic, which you obtain from the environment interface. For more information, see rlQValueRepresentation.

critic = rlQValueRepresentation(criticNetwork,observationInfo,actionInfo,...
    'Observation',{'observation'},'Action',{'action'},criticOptions);

A DDPG agent decides which action to take given observations using an actor representation. To create the actor, first create a deep neural network with one input, the observation, and one output, the action.

Construct the actor similarly to the critic. For more information, see rlDeterministicActorRepresentation.

actorNetwork = [
    imageInputLayer([3 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(L,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(L,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(L,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(1,'Name','fc4')
    tanhLayer('Name','tanh1')
    scalingLayer('Name','ActorScaling1','Scale',2.5,'Bias',-0.5)];

actorOptions = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1,'L2RegularizationFactor',1e-4);
actor = rlDeterministicActorRepresentation(actorNetwork,observationInfo,actionInfo,...
    'Observation',{'observation'},'Action',{'ActorScaling1'},actorOptions);
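
The tanhLayer bounds the network output to the interval [-1,1], and the scalingLayer maps that interval onto the acceleration range [-3,2], because 2.5*(-1) - 0.5 = -3 and 2.5*(1) - 0.5 = 2. You can verify the mapping at the command line.

% the scaling layer computes Scale.*x + Bias, mapping [-1,1] to [-3,2]
2.5*[-1 1] - 0.5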

To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.

agentOptions = rlDDPGAgentOptions(...
    'SampleTime',Ts,...
    'TargetSmoothFactor',1e-3,...
    'ExperienceBufferLength',1e6,...
    'DiscountFactor',0.99,...
    'MiniBatchSize',64);
agentOptions.NoiseOptions.Variance = 0.6;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
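
These noise settings give slowly decaying exploration. Assuming the Ornstein-Uhlenbeck noise variance is multiplied by (1 - VarianceDecayRate) at each agent sample time, a rough estimate of how long the exploration persists is:

% rough estimate, assuming geometric decay of the noise variance per agent step
halfLifeSteps = log(0.5)/log(1 - agentOptions.NoiseOptions.VarianceDecayRate)  % roughly 6.9e4 steps
halfLifeEpisodes = halfLifeSteps/ceil(Tf/Ts)                                   % roughly 115 episodes of 600 steps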

Then, create the DDPG agent using the specified actor representation, critic representation, and agent options. For more information, see rlDDPGAgent.

agent = rlDDPGAgent(actor,critic,agentOptions);
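
As an optional sanity check (not part of the original workflow), you can request an action from the untrained agent for a random observation; the returned acceleration should lie within the [-3, 2] m/s^2 action range.

% optional check: query the untrained agent with a random observation
% (depending on the toolbox release, the output may be returned in a cell array)
act = getAction(agent,{rand(3,1)})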

Train Agent

To train the agent, first specify the training options. For this example, use the following options:

  • Run each training session for at most 5000 episodes, with each episode lasting at most 600 time steps.

  • Display the training progress in the Episode Manager dialog box.

  • Stop training when the agent receives an episode reward greater than 260.

For more information, see rlTrainingOptions.

maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeReward',...
    'StopTrainingValue',260);

Train the agent using the train function. This is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;

if doTraining    
    % Train the agent.
    trainingStats = train(agent,env,trainingOpts);
else
    % Load pretrained agent for the example.
    load('SimulinkACCDDPG.mat','agent')       
end

Simulate DDPG Agent

To validate the performance of the trained agent, uncomment the following commands and simulate the agent within the Simulink environment. For more information on agent simulation, see rlSimulationOptions and sim.

% simOptions = rlSimulationOptions('MaxSteps',maxsteps);
% experience = sim(env,agent,simOptions);

To demonstrate the trained agent using deterministic initial conditions, simulate the model in Simulink.

x0_lead = 80;
sim(mdl)

The following plots show the simulation results when the lead car is initially 70 m ahead of the ego car.

  • In the first 28 seconds, the relative distance is greater than the safe distance (bottom plot), so the ego car tracks the set velocity (middle plot). To speed up and reach the set velocity, the acceleration is positive (top plot).

  • From 28 to 60 seconds, the relative distance is less than the safe distance (bottom plot), so the ego car tracks the minimum of the lead velocity and the set velocity. From 28 to 36 seconds, the lead velocity is less than the set velocity (middle plot). To slow down and track the lead car velocity, the acceleration is negative (top plot). From 36 to 60 seconds, the ego car adjusts its acceleration to track the reference velocity closely (middle plot). Within this time interval, the ego car tracks the lead velocity from 36 to 43 seconds and from 52 to 60 seconds, and tracks the set velocity from 43 to 52 seconds.

Close the Simulink model.

bdclose(mdl)

Local Function

function in = localResetFcn(in)
% reset the initial position of the lead car to a random integer between 41 and 100 m
in = setVariable(in,'x0_lead',40+randi(60,1,1));
end
