This example shows how to train a proximal policy optimization (PPO) agent with a discrete action space to land a rocket on the ground. For more information on PPO agents, see Proximal Policy Optimization Agents.
The environment in this example is a 3-DOF rocket represented by a circular disc with mass. The rocket has two thrusters for forward and rotational motion. Gravity acts vertically downwards, and there are no aerodynamic drag forces. The training goal is to make the rocket land on the ground at a specified location.
For this environment:
Motion of the rocket is bounded in X (horizontal axis) from -100 to 100 meters and Y (vertical axis) from 0 to 120 meters.
The goal position is at (0,0) meters and the goal orientation is 0 radians.
The maximum thrust applied by each thruster is 8.5 N.
The sample time is 0.1 seconds.
The observations from the environment are the rocket position (x, y), orientation (θ), velocity (ẋ, ẏ), angular velocity (θ̇), and a sensor reading that detects a rough landing (-1), soft landing (1), or airborne (0) condition. The observations are normalized between -1 and 1 (the scaling idea is sketched after this list).
At the beginning of every episode, the rocket starts from a random initial position and orientation. The altitude is always reset to 100 meters.
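The observation scaling is performed inside the environment. The following sketch shows one possible form of that scaling, assuming the position bounds listed above and a scaling of the orientation by π; the actual scaling is internal to the RocketLander class.

% Sketch of the observation scaling (assumed form; the actual scaling is
% implemented inside the RocketLander class).
x = 25; y = 90; theta = 0.2;    % example raw position (m) and orientation (rad)
xNorm = x/100;                  % x in [-100,100] m -> [-1,1]
yNorm = (y - 60)/60;            % y in [0,120] m    -> [-1,1]
thetaNorm = theta/pi;           % orientation       -> approximately [-1,1]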
The reward r_t provided at every time step is a function of the following quantities (an illustrative sketch of a shaped reward built from them follows this list).
x_t, y_t, ẋ_t, and ẏ_t are the positions and velocities of the rocket along the x and y axes.
d̂_t = sqrt(x_t² + y_t²)/d_max is the normalized distance of the rocket from the goal position.
v̂_t = sqrt(ẋ_t² + ẏ_t²)/v_max is the normalized speed of the rocket.
d_max and v_max are the maximum distances and speeds.
θ_t is the orientation with respect to the vertical axis.
L_t and R_t are the action values for the left and right thrusters.
r_sparse is a sparse reward for a soft landing with horizontal and vertical velocities less than 0.5 m/s.
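The exact reward expression is implemented inside the RocketLander class and is not reproduced here. The following sketch only illustrates how the quantities above could be combined into a shaped reward; the structure and coefficients are assumptions, not the reward used by the environment.

% Illustrative reward sketch only -- not the reward implemented in RocketLander.
dHat = 0.3; vHat = 0.2; theta = 0.1;   % example normalized distance, speed, and angle
L = 0.5; R = 0.5;                      % example thruster action values
isSoftLanding = false;                 % example sensor condition

shaping = 1 - (dHat + vHat)/2;         % reward approaching the goal slowly
upright = cos(theta);                  % reward keeping the rocket vertical
fuelPenalty = -0.05*(L + R);           % penalize thruster usage
sparseBonus = 100*isSoftLanding;       % sparse soft-landing bonus (assumed magnitude)

r = shaping + upright + fuelPenalty + sparseBonus;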
Create a MATLAB environment for the rocket lander using the RocketLander class.
env = RocketLander;
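Optionally, you can verify that the custom environment is consistent with its observation and action specifications by using the validateEnvironment function, which runs a short internal simulation.

% Optional consistency check of the custom environment.
validateEnvironment(env)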
Obtain the observation and action specifications from the environment.
actionInfo = getActionInfo(env);
observationInfo = getObservationInfo(env);
numObs = observationInfo.Dimension(1);
numAct = numel(actionInfo.Elements);
Set the sample time for the environment.
Ts = 0.1;
Fix the random generator seed for reproducibility.
rng(0)
The PPO agent in this example operates on a discrete action space. At every time step, the agent selects one action pair [L, R] from a finite set defined by the environment, where L and R are the normalized thrust values for the left and right thrusters.
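The discrete action set itself is defined inside the RocketLander class. As a hypothetical illustration only, a comparable space of thrust pairs could be specified as follows; the thrust levels 0, 0.5, and 1 are assumed values, not necessarily those used by the environment.

% Hypothetical illustration -- the actual action set is defined in RocketLander.
thrustLevels = [0 0.5 1];                   % assumed normalized thrust levels
[LGrid,RGrid] = ndgrid(thrustLevels,thrustLevels);
pairs = num2cell([LGrid(:) RGrid(:)],2);    % cell array of [L R] pairs
actionSpecSketch = rlFiniteSetSpec(pairs);  % discrete action specification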
To estimate the policy and value function, the agent maintains function approximators for the actor and critic, which are modeled using deep neural networks. The training can be sensitive to the initial network weights and biases, and results can vary with different sets of values. The network weights are randomly initialized to small values in this example.
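The inline weight and bias expressions in the network definitions below are equivalent to the following hypothetical helper functions, shown here only to make the initialization scheme explicit; they are not part of the example code.

% Hypothetical helpers equivalent to the inline initializers used below:
% uniform weights in [-0.5,0.5) scaled by sqrt(2/fanIn), and small constant biases.
initWeights = @(numOut,numIn) sqrt(2/numIn)*(rand(numOut,numIn)-0.5);
initBias    = @(numOut) 1e-3*ones(numOut,1);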
Create the critic deep neural network, which takes the observation vector as its input and has a single output. The output of the critic network is the estimate of the discounted long-term reward for the input observations.
criticLayerSizes = [400 300];
actorLayerSizes = [400 300];

criticNetwork = [
    featureInputLayer(numObs,'Normalization','none','Name','observation')
    fullyConnectedLayer(criticLayerSizes(1),'Name','CriticFC1', ...
        'Weights',sqrt(2/numObs)*(rand(criticLayerSizes(1),numObs)-0.5), ...
        'Bias',1e-3*ones(criticLayerSizes(1),1))
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(criticLayerSizes(2),'Name','CriticFC2', ...
        'Weights',sqrt(2/criticLayerSizes(1))*(rand(criticLayerSizes(2),criticLayerSizes(1))-0.5), ...
        'Bias',1e-3*ones(criticLayerSizes(2),1))
    reluLayer('Name','CriticRelu2')
    fullyConnectedLayer(1,'Name','CriticOutput', ...
        'Weights',sqrt(2/criticLayerSizes(2))*(rand(1,criticLayerSizes(2))-0.5), ...
        'Bias',1e-3)];
Create the critic representation.
criticOpts = rlRepresentationOptions('LearnRate',1e-4);
critic = rlValueRepresentation(criticNetwork,observationInfo,'Observation',{'observation'},criticOpts);
Create the actor deep neural network, which takes the observation vector as its input and has one output per discrete action pair. The outputs of the actor network are the probabilities of taking each possible action pair. Each action pair contains normalized action values for each thruster. The environment step function scales these values to determine the actual thrust values.
actorNetwork = [
    featureInputLayer(numObs,'Normalization','none','Name','observation')
    fullyConnectedLayer(actorLayerSizes(1),'Name','ActorFC1', ...
        'Weights',sqrt(2/numObs)*(rand(actorLayerSizes(1),numObs)-0.5), ...
        'Bias',1e-3*ones(actorLayerSizes(1),1))
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(actorLayerSizes(2),'Name','ActorFC2', ...
        'Weights',sqrt(2/actorLayerSizes(1))*(rand(actorLayerSizes(2),actorLayerSizes(1))-0.5), ...
        'Bias',1e-3*ones(actorLayerSizes(2),1))
    reluLayer('Name','ActorRelu2')
    fullyConnectedLayer(numAct,'Name','Action', ...
        'Weights',sqrt(2/actorLayerSizes(2))*(rand(numAct,actorLayerSizes(2))-0.5), ...
        'Bias',1e-3*ones(numAct,1))
    softmaxLayer('Name','actionProb')];
Create the actor using a stochastic actor representation.
actorOpts = rlRepresentationOptions('LearnRate',1e-4);
actor = rlStochasticActorRepresentation(actorNetwork,observationInfo,actionInfo,...
    'Observation',{'observation'},actorOpts);
Specify the agent hyperparameters using an rlPPOAgentOptions object.
agentOpts = rlPPOAgentOptions(...
    'ExperienceHorizon',600,...
    'ClipFactor',0.02,...
    'EntropyLossWeight',0.01,...
    'MiniBatchSize',128,...
    'NumEpoch',3,...
    'AdvantageEstimateMethod','gae',...
    'GAEFactor',0.95,...
    'SampleTime',Ts,...
    'DiscountFactor',0.997);
For these hyperparameters:
The agent collects experiences until it reaches the experience horizon of 600 steps or the episode terminates, and then trains on mini-batches of 128 experiences for 3 epochs (see the rough bookkeeping after this list).
To improve training stability, use an objective function clip factor of 0.02.
A discount factor value of 0.997 encourages long-term rewards.
Using the generalized advantage estimate (GAE) method with a GAE factor of 0.95 reduces the variance in the critic output.
An EntropyLossWeight of 0.01 enhances exploration during training.
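As a rough sanity check on these settings, the following back-of-the-envelope calculation shows approximately how many gradient updates occur per experience horizon and the effective lookahead implied by the discount factor.

% Rough bookkeeping for the settings above (approximate figures only).
horizon = 600;                                   % ExperienceHorizon
batch   = 128;                                   % MiniBatchSize
epochs  = 3;                                     % NumEpoch
gamma   = 0.997;                                 % DiscountFactor

updatesPerHorizon = ceil(horizon/batch)*epochs   % about 15 gradient updates
effectiveHorizon  = 1/(1-gamma)                  % about 333 steps of lookahead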
Create the PPO agent.
agent = rlPPOAgent(actor,critic,agentOpts);
To train the PPO agent, specify the following training options.
Run the training for at most 20000 episodes, with each episode lasting at most 600 time steps.
Stop the training when the average reward over 100 consecutive episodes is 430 or more.
Save a copy of the agent for each episode where the episode reward is 700 or more.
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',20000,...
    'MaxStepsPerEpisode',600,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',430,...
    'ScoreAveragingWindowLength',100,...
    'SaveAgentCriteria',"EpisodeReward",...
    'SaveAgentValue',700);
Train the agent using the train function. Due to the complexity of the environment, the training process is computationally intensive and takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false.
doTraining = false;
if doTraining
    trainingStats = train(agent,env,trainOpts);
else
    load('rocketLanderAgent.mat');
end
An example training session is shown below. The actual results may vary because of randomness in the training process.
Plot the rocket lander environment to visualize the simulation.
plot(env)
Simulate the trained agent within the environment. For more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions('MaxSteps',600);
simOptions.NumSimulations = 5;   % simulate the environment 5 times
experience = sim(env,agent,simOptions);
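After the simulations finish, you can inspect the logged experiences. For example, assuming the default experience format in which the reward of each run is logged as a timeseries, you can compute the total reward per simulation as follows.

% Total reward obtained in each of the 5 simulations (assumes the default
% experience format, where Reward is logged as a timeseries).
totalRewards = arrayfun(@(e) sum(e.Reward.Data), experience)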