Proximal Policy Optimization (PPO)
The Proximal Policy Optimization algorithm combines ideas from A2C (having multiple workers) and TRPO (it uses a trust region to improve the actor). The idea is that, after an update, the new policy should not be too far from the old policy. To enforce this, PPO uses clipping to avoid too large an update.
Original paper: https://arxiv.org/abs/1707.06347
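For reference, below is a minimal NumPy sketch of the clipped surrogate objective described in the paper; the names ratio, advantage, and clip_eps are illustrative and not part of the NEORL API.

import numpy as np

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Illustrative PPO clipped objective (to be maximized).
    ratio     : pi_new(a|s) / pi_old(a|s) for each sampled action
    advantage : advantage estimate for the same samples
    clip_eps  : clipping parameter (the cliprange argument of PPO2)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    #the pessimistic (element-wise minimum) bound keeps the new policy close to the old one
    return np.mean(np.minimum(unclipped, clipped))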
What can you use?
Multi processing: ✔️
Discrete spaces: ✔️
Continuous spaces: ✔️
Mixed Discrete/Continuous spaces: ✔️
Parameters
class neorl.rl.baselines.ppo2.PPO2(policy, env, gamma=0.99, n_steps=128, ent_coef=0.01, learning_rate=0.00025, vf_coef=0.5, max_grad_norm=0.5, lam=0.95, nminibatches=4, noptepochs=4, cliprange=0.2, verbose=0, seed=None, _init_setup_model=True)

Proximal Policy Optimization algorithm
- Parameters
policy – (ActorCriticPolicy or str) The policy model to use (e.g. MlpPolicy)
env – (NEORL environment or Gym environment) The environment to learn with PPO, either use the NEORL method CreateEnvironment (see below) or construct your custom Gym environment
gamma – (float) Discount factor
n_steps – (int) The number of steps to run for each environment per update (i.e. batch size is n_steps * n_env where n_env is number of environment copies running in parallel)
ent_coef – (float) Entropy coefficient for the loss calculation
learning_rate – (float or callable) The learning rate; it can also be a callable schedule (see the sketch after this parameter list)
vf_coef – (float) Value function coefficient for the loss calculation
max_grad_norm – (float) The maximum value for the gradient clipping
lam – (float) Factor for trade-off of bias vs variance for Generalized Advantage Estimator
nminibatches – (int) Number of training minibatches per update. For recurrent policies, the number of environments run in parallel should be a multiple of nminibatches.
noptepochs – (int) Number of epochs when optimizing the surrogate
cliprange – (float or callable) Clipping parameter, it can be a function
verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
seed – (int) Seed for the pseudo-random generators (python, numpy, tensorflow). If None (default), use random seed.
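A minimal sketch of passing a callable schedule instead of a fixed learning_rate. It assumes the stable-baselines convention that the schedule receives the remaining training progress (a fraction decaying from 1.0 at the start to 0.0 at the end); this is an assumption about the NEORL wrapper, and the fitness function and bounds below are illustrative only.

from neorl import PPO2
from neorl import MlpPolicy
from neorl import CreateEnvironment

def sphere(individual):
    #toy fitness function used only to build a small environment
    return sum(x**2 for x in individual)

bounds={'x1': ['float', -10, 10], 'x2': ['float', -10, 10]}
env=CreateEnvironment(method='ppo', fit=sphere, bounds=bounds, mode='min')

def lr_schedule(progress_remaining):
    #assumed signature: progress_remaining decays from 1.0 (start) to 0.0 (end)
    return 2.5e-4 * progress_remaining

#cliprange accepts a callable in the same way
ppo = PPO2(MlpPolicy, env=env, learning_rate=lr_schedule, n_steps=16, seed=1)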
learn(total_timesteps, callback=None, log_interval=1, tb_log_name='PPO2', reset_num_timesteps=True)

Return a trained model.
- Parameters
total_timesteps – (int) The total number of samples to train on
callback – (Union[callable, [callable], BaseCallback]) function called at every step with the state of the algorithm. It takes the local and global variables. If it returns False, training is aborted. When the callback inherits from BaseCallback, you will have access to additional stages of the training (training start/end); please read the documentation for more details (a minimal callback sketch is given after this method's description).
log_interval – (int) The number of timesteps before logging.
tb_log_name – (str) the name of the run for tensorboard log
reset_num_timesteps – (bool) whether or not to reset the current timestep number (used in logging)
- Returns
(BaseRLModel) the trained model
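A minimal early-stopping callback sketch, assuming the stable-baselines convention that the callable receives the training loop's local and global variables and that returning False aborts training; the use of locals_['self'] and num_timesteps is an assumption about those internals.

def early_stop(locals_, globals_):
    #locals_['self'] is assumed to be the PPO2 model being trained
    model = locals_['self']
    #keep training while fewer than 1000 timesteps were collected; False aborts
    return model.num_timesteps < 1000

#usage: ppo.learn(total_timesteps=5000, callback=early_stop)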
classmethod load(load_path, env=None, custom_objects=None, **kwargs)

Load the model from file
- Parameters
load_path – (str or file-like) the saved parameter location
env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)
custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable is present in this dictionary as a key, it will not be deserialized and the corresponding item will be used instead. Similar to custom_objects in keras.models.load_model. Useful when you have an object in the file that cannot be deserialized.
kwargs – extra arguments to change the model when loading
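A minimal load sketch; the counterpart save method and the file name are assumptions (saving is not documented on this page but follows the stable-baselines convention).

from neorl import PPO2

#assumes a model was trained and saved earlier, e.g. ppo.save('ppo_sphere.pkl')
model = PPO2.load('ppo_sphere.pkl')   #env=None is fine when only prediction is needed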
predict(observation, state=None, mask=None, deterministic=False)

Get the model's action from an observation
- Parameters
observation – (np.ndarray) the input observation
state – (np.ndarray) The last states (can be None, used in recurrent policies)
mask – (np.ndarray) The last masks (can be None, used in recurrent policies)
deterministic – (bool) Whether or not to return deterministic actions.
- Returns
(np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
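A short usage sketch for predict, assuming env and ppo are the environment and trained model from the Example section below.

obs = env.reset()                                      #get an initial observation from the environment
action, state = ppo.predict(obs, deterministic=True)   #state is None for non-recurrent policies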
class neorl.rl.make_env.CreateEnvironment(method, fit, bounds, ncores=1, mode='max', episode_length=50)

A class to construct a fitness environment for algorithms that follow the reinforcement learning approach to optimization
- Parameters
method – (str) the supported algorithms, choose either: dqn, ppo, acktr, acer, a2c
fit – (function) the fitness function
bounds – (dict) input parameter type and lower/upper bounds in dictionary form. Example: bounds={'x1': ['int', 1, 4], 'x2': ['float', 0.1, 0.8], 'x3': ['float', 2.2, 6.2]}
ncores – (int) number of parallel processors
mode – (str) problem type, either min for a minimization problem or max for maximization (RL defaults to max)
episode_length – (int) number of individuals to evaluate before resetting the environment to a random initial guess
class neorl.utils.neorlcalls.RLLogger(check_freq=1, plot_freq=None, n_avg_steps=10, pngname='history', save_model=False, model_name='bestmodel.pkl', save_best_only=True, verbose=False)

Callback for logging data (x, y) of RL algorithms, compatible with: A2C, ACER, ACKTR, DQN, PPO
- Parameters
check_freq – (int) logging frequency, e.g. 1 will record every time step
plot_freq – (int) frequency of plotting the fitness progress (if None, the plotter is deactivated)
n_avg_steps – (int) if plot_freq is NOT None, then this is the number of timesteps to group to draw statistics for the plotter (e.g. 10 will group every 10 time steps to estimate min, max, mean, and std)
pngname – (str) name of the plot that will be saved if plot_freq is NOT None
save_model – (bool) whether or not to save the RL neural network model (the model is saved every check_freq)
model_name – (str) name of the model to be saved if save_model=True
save_best_only – (bool) if save_model=True, then this flag only saves the model if the fitness value improves
verbose – (bool) print updates to the screen
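A sketch of a more verbose logger that also plots the fitness progress and saves the best model; the file names are illustrative.

from neorl import RLLogger

cb=RLLogger(check_freq=1, plot_freq=10, n_avg_steps=10,
            pngname='ppo_history', save_model=True,
            model_name='ppo_best.pkl', save_best_only=True, verbose=True)
#pass it to learn, e.g. ppo.learn(total_timesteps=2000, callback=cb)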
Example
Train a PPO agent to optimize the 5-D sphere function
from neorl import PPO2
from neorl import MlpPolicy
from neorl import RLLogger
from neorl import CreateEnvironment
def Sphere(individual):
    """Sphere test objective function.
    F(x) = sum_{i=1}^d xi^2
    d=1,2,3,...
    Range: [-100,100]
    Minima: 0
    """
    return sum(x**2 for x in individual)

nx=5
bounds={}
for i in range(1,nx+1):
    bounds['x'+str(i)]=['float', -10, 10]

if __name__=='__main__':  #use this "if" block for parallel PPO!
    #create an environment class
    env=CreateEnvironment(method='ppo', fit=Sphere,
                          bounds=bounds, mode='min', episode_length=50)
    #create a callback function to log data
    cb=RLLogger(check_freq=1)
    #create an RL object based on the env object
    ppo = PPO2(MlpPolicy, env=env, n_steps=12, seed=1)
    #optimize the environment class
    ppo.learn(total_timesteps=2000, callback=cb)
    #print the best results
    print('--------------- PPO results ---------------')
    print('The best value of x found:', cb.xbest)
    print('The best value of y found:', cb.rbest)
Notes
PPO is one of the most popular RL algorithms due to its robustness. PPO runs in parallel and supports all types of spaces.
PPO shows sensitivity to n_steps, vf_coef, ent_coef, and lam. It is always good to consider tuning these hyperparameters before using PPO for optimization. In particular, n_steps is considered the most important parameter to tune for PPO. Always start with a small n_steps and increase as needed.

For PPO, always ensure that ncores * n_steps is divisible by nminibatches. For example, if nminibatches=4, then the setting ncores=12 / n_steps=5 works, while ncores=5 / n_steps=5 will fail. For tuning purposes, it is recommended to choose ncores divisible by nminibatches so that you can change n_steps more freely (see the quick check after these notes).

The cost of PPO equals the total_timesteps in the learn function, where the original fitness function will be accessed total_timesteps times.

See how PPO is used to solve two common combinatorial problems in TSP and KP.
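A quick check of the divisibility rule above, using the default nminibatches=4 and the two settings mentioned in the note.

nminibatches = 4                       #PPO2 default
print((12 * 5) % nminibatches == 0)    #ncores=12, n_steps=5 -> 60 % 4 == 0, valid setting
print((5 * 5) % nminibatches == 0)     #ncores=5,  n_steps=5 -> 25 % 4 != 0, will fail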
Acknowledgment
Thanks to our fellows in stable-baselines, as we used their standalone RL implementation, which serves as a baseline for building advanced neuroevolution algorithms.
Hill, Ashley, et al. “Stable baselines.” (2018).