This example demonstrates a multi-agent collaborative-competitive task in which you train three proximal policy optimization (PPO) agents to explore all areas within a grid-world environment.
Multi-agent training is supported for Simulink® environments only. As shown in this example, if you define your environment behavior using a MATLAB® System object, you can incorporate it into a Simulink environment using a MATLAB System (Simulink) block.
The environment in this example is a 12x12 grid world containing obstacles, with unexplored cells marked in white and obstacles marked in black. Three robots, represented by red, green, and blue circles, move through the environment. Three PPO agents with discrete action spaces control the robots. To learn more about PPO agents, see Proximal Policy Optimization Agents.
The agents provide one of five possible movement actions (WAIT, UP, DOWN, LEFT, or RIGHT) to their respective robots. The robots determine whether an action is legal or illegal. For example, moving LEFT when a robot is located next to the left boundary of the environment is illegal. Similarly, actions that would cause a collision with an obstacle or another robot are illegal and draw penalties. The environment dynamics are deterministic: robots execute legal actions with 100% probability and illegal actions with 0% probability. The overall goal is to explore all cells as quickly as possible.
At each time step, an agent observes the state of the environment through a set of four images that identify the cells with obstacles, current position of the robot that is being controlled, position of other robots, and cells that have been explored during the episode. These images are combined to create a 4-channel 12x12 image observation set. The following figure shows an example of what the agent controlling the green robot observes for a given time step.
For the grid world environment:
The search area is a 12x12 grid with obstacles.
The observation for each agent is a 12x12x4 image.
The discrete action set is a set of five actions (WAIT=0, UP=1, DOWN=2, LEFT=3, RIGHT=4).
The simulation terminates when the grid is fully explored or the maximum number of steps is reached.
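The legality rule described above can be sketched as follows. This is an illustrative check only, not the actual implementation in GridWorld.m; the helper name isLegalMove and the row/column movement deltas are assumptions for this sketch.

```
% Illustrative sketch of the action legality check (not the actual GridWorld.m code).
% pos is a [row col] robot position, act is one of 0:4 (WAIT,UP,DOWN,LEFT,RIGHT),
% obsMat lists obstacle cells, and otherPos lists the other robots' positions.
function legal = isLegalMove(pos,act,obsMat,otherPos)
    moves = [0 0; -1 0; 1 0; 0 -1; 0 1];  % assumed row/col deltas per action
    next = pos + moves(act+1,:);
    inBounds = all(next >= 1) && all(next <= 12);
    hitsObstacle = inBounds && ismember(next,obsMat,'rows');
    hitsRobot = inBounds && ismember(next,otherPos,'rows');
    legal = inBounds && ~hitsObstacle && ~hitsRobot;
end
```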
At each time step, agents receive the following rewards and penalties.
+1 for moving to a previously unexplored cell (white).
-0.5 for an illegal action (attempting to move outside the boundary or colliding with another robot or an obstacle).
-0.05 for an action that results in movement (movement cost).
-0.1 for an action that results in no motion (lazy penalty).
If the grid is fully explored, +200 times the coverage contribution of that robot during the episode (the ratio of cells it explored to the total number of cells).
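The reward rules above can be summarized as a single per-step function. This is a hedged sketch of the logic, not the environment's actual reward code; the function name stepReward and its input flags are assumptions for illustration.

```
% Illustrative sketch of the per-step reward logic (the actual logic lives in GridWorld.m).
function r = stepReward(moved,wasIllegal,cellWasUnexplored,gridDone,coverageRatio)
    r = 0;
    if cellWasUnexplored
        r = r + 1;                  % reward for exploring a new cell
    end
    if wasIllegal
        r = r - 0.5;                % illegal action penalty
    end
    if moved
        r = r - 0.05;               % movement cost
    else
        r = r - 0.1;                % lazy penalty for no motion
    end
    if gridDone
        r = r + 200*coverageRatio;  % terminal bonus scaled by coverage contribution
    end
end
```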
Define the locations of obstacles within the grid using a matrix of indices. The first column contains the row indices, and the second column contains the column indices.
obsMat = [4 3; 5 3; 6 3; 7 3; 8 3; 9 3; 5 11; 6 11; 7 11; 8 11; 5 12; 6 12; 7 12; 8 12];
Initialize the robot positions.
sA0 = [2 2]; sB0 = [11 4]; sC0 = [3 12]; s0 = [sA0; sB0; sC0];
Specify the sample time, simulation time, and maximum number of steps per episode.
Ts = 0.1; Tf = 100; maxsteps = ceil(Tf/Ts);
Open the Simulink model.
mdl = "rlAreaCoverage";
open_system(mdl)
The GridWorld block is a MATLAB System block representing the training environment. The System object for this environment is defined in GridWorld.m.
In this example, the agents are homogeneous and have the same observation and action specifications. Create the observation and action specifications for the environment. For more information, see rlNumericSpec and rlFiniteSetSpec.
% Define observation specifications.
obsSize = [12 12 4];
oinfo = rlNumericSpec(obsSize);
oinfo.Name = 'observations';
% Define action specifications.
numAct = 5;
actionSpace = {0,1,2,3,4};
ainfo = rlFiniteSetSpec(actionSpace);
ainfo.Name = 'actions';
Specify the block paths for the agents.
blks = mdl + ["/Agent A (Red)","/Agent B (Green)","/Agent C (Blue)"];
Create the environment interface, specifying the same observation and action specifications for all three agents.
env = rlSimulinkEnv(mdl,blks,{oinfo,oinfo,oinfo},{ainfo,ainfo,ainfo});
Specify a reset function for the environment. The reset function resetMap ensures that the robots start from random initial positions at the beginning of each episode. Random initialization makes the agents robust to different starting positions and improves training convergence.
env.ResetFcn = @(in) resetMap(in, obsMat);
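For reference, a Simulink environment reset function receives a Simulink.SimulationInput object and returns it after setting workspace variables. The following is a sketch of that general pattern, not the resetMap.m implementation that ships with this example; the variable name 's0' and the sampling logic are assumptions.

```
% Sketch of the general reset-function pattern (the actual code is in resetMap.m).
function in = exampleReset(in,obsMat)
    % Sample three distinct start cells that do not overlap obstacles.
    free = setdiff((1:144)',sub2ind([12 12],obsMat(:,1),obsMat(:,2)));
    cells = free(randperm(numel(free),3));
    [r,c] = ind2sub([12 12],cells);
    in = setVariable(in,'s0',[r c]);  % 's0' workspace variable name is assumed
end
```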
PPO agents rely on actor and critic representations to learn the optimal policy. In this example, the agents use deep neural networks as function approximators for the actor and critic. The actor and critic have similar network structures, with convolutional and fully connected layers. The critic outputs a scalar estimate of the state value. The actor outputs the probability of taking each of the five actions. For more information, see rlValueRepresentation and rlStochasticActorRepresentation.
Set the random seed for reproducibility.
rng(0)
Create the actor and critic representations using the following steps.
Create the actor and critic deep neural networks.
Specify representation options for the actor and critic. In this example, specify the learning rates and the gradient thresholds. For more information, see rlRepresentationOptions.
Create the actor and critic representation objects.
Use the same network structure and representation options for all three agents.
for idx = 1:3
    % Create the actor deep neural network.
    actorNetwork = [
        imageInputLayer(obsSize,'Normalization','none','Name','observations')
        convolution2dLayer(8,16,'Name','conv1','Stride',1,'Padding',1,'WeightsInitializer','he')
        reluLayer('Name','relu1')
        convolution2dLayer(4,8,'Name','conv2','Stride',1,'Padding','same','WeightsInitializer','he')
        reluLayer('Name','relu2')
        fullyConnectedLayer(256,'Name','fc1','WeightsInitializer','he')
        reluLayer('Name','relu3')
        fullyConnectedLayer(128,'Name','fc2','WeightsInitializer','he')
        reluLayer('Name','relu4')
        fullyConnectedLayer(64,'Name','fc3','WeightsInitializer','he')
        reluLayer('Name','relu5')
        fullyConnectedLayer(numAct,'Name','output')
        softmaxLayer('Name','action')];
    % Create the critic deep neural network.
    criticNetwork = [
        imageInputLayer(obsSize,'Normalization','none','Name','observations')
        convolution2dLayer(8,16,'Name','conv1','Stride',1,'Padding',1,'WeightsInitializer','he')
        reluLayer('Name','relu1')
        convolution2dLayer(4,8,'Name','conv2','Stride',1,'Padding','same','WeightsInitializer','he')
        reluLayer('Name','relu2')
        fullyConnectedLayer(256,'Name','fc1','WeightsInitializer','he')
        reluLayer('Name','relu3')
        fullyConnectedLayer(128,'Name','fc2','WeightsInitializer','he')
        reluLayer('Name','relu4')
        fullyConnectedLayer(64,'Name','fc3','WeightsInitializer','he')
        reluLayer('Name','relu5')
        fullyConnectedLayer(1,'Name','output')];
    % Specify representation options for the actor and critic.
    actorOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
    criticOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
    % Create the actor and critic representations.
    actor(idx) = rlStochasticActorRepresentation(actorNetwork,oinfo,ainfo,...
        'Observation',{'observations'},actorOpts);
    critic(idx) = rlValueRepresentation(criticNetwork,oinfo,...
        'Observation',{'observations'},criticOpts);
end
Specify the agent options using rlPPOAgentOptions. Use the same options for all three agents. During training, the agents collect experiences until they reach the experience horizon of 128 steps, and then train on mini-batches of 64 experiences. An objective function clip factor of 0.2 improves training stability, and a discount factor of 0.995 encourages long-term rewards.
opt = rlPPOAgentOptions(...
    'ExperienceHorizon',128,...
    'ClipFactor',0.2,...
    'EntropyLossWeight',0.01,...
    'MiniBatchSize',64,...
    'NumEpoch',3,...
    'AdvantageEstimateMethod','gae',...
    'GAEFactor',0.95,...
    'SampleTime',Ts,...
    'DiscountFactor',0.995);
Create the agents using the defined actors, critics, and options.
agentA = rlPPOAgent(actor(1),critic(1),opt);
agentB = rlPPOAgent(actor(2),critic(2),opt);
agentC = rlPPOAgent(actor(3),critic(3),opt);
agents = [agentA,agentB,agentC];
Specify the following options for training the agents.
Run the training for at most 1000 episodes, with each episode lasting at most maxsteps (1000) time steps.
Stop the training of an agent when its average reward over 100 consecutive episodes is 80 or more.
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',1000,...
    'MaxStepsPerEpisode',maxsteps,...
    'Plots','training-progress',...
    'ScoreAveragingWindowLength',100,...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',80);
To train multiple agents, pass an array of agents to the train function. The order of the agents in the array must match the order of the agent block paths specified during environment creation. Doing so ensures that each agent object is linked to the appropriate action and observation specifications in the environment.
Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load pretrained agent parameters by setting doTraining to false. To train the agents yourself, set doTraining to true.
doTraining = false;
if doTraining
    stats = train(agents,env,trainOpts);
else
    load('rlAreaCoverageParameters.mat');
    setLearnableParameters(agentA,agentAParams);
    setLearnableParameters(agentB,agentBParams);
    setLearnableParameters(agentC,agentCParams);
end
The following figure shows a snapshot of the training progress. You can expect different results due to randomness in the training process.
Simulate the trained agents within the environment. For more information on agent simulation, see rlSimulationOptions and sim.
rng(0) % Reset the random seed for reproducible simulation.
simOpts = rlSimulationOptions('MaxSteps',maxsteps);
experience = sim(env,agents,simOpts);
The agents successfully cover the entire grid world.