Actor Critic using Kronecker-Factored Trust Region (ACKTR)

Actor Critic using Kronecker-Factored Trust Region (ACKTR) applies Kronecker-factored approximate curvature (K-FAC) to trust-region optimization. K-FAC allows more efficient inversion of the covariance matrix of the gradient, and ACKTR also extends the natural policy gradient algorithm to optimize the value function via a Gauss-Newton approximation.

Original paper: https://arxiv.org/abs/1708.05144
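
In schematic form (a standard summary of the natural-gradient and K-FAC relations, not a verbatim excerpt from the paper), the policy parameters are updated with the gradient preconditioned by the inverse Fisher matrix, and K-FAC approximates each layer's Fisher block as a Kronecker product so that the inverse factorizes into two small matrices:

\[
\theta \leftarrow \theta + \eta\, F^{-1} \nabla_\theta J(\theta),
\qquad
F = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top}\right]
\]

\[
F_\ell \approx A_{\ell-1} \otimes S_\ell
\quad\Longrightarrow\quad
F_\ell^{-1} \approx A_{\ell-1}^{-1} \otimes S_\ell^{-1}
\]

where \(A_{\ell-1}\) is the second moment of the inputs to layer \(\ell\) and \(S_\ell\) is the second moment of the gradients with respect to that layer's pre-activations; inverting two small factors per layer is far cheaper than inverting the full curvature matrix.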

What can you use?

  • Multi processing: ✔️

  • Discrete spaces: ✔️

  • Continuous spaces: ✔️

  • Mixed Discrete/Continuous spaces: ✔️

Parameters

class neorl.rl.baselines.acktr.ACKTR(policy, env, gamma=0.99, n_steps=20, ent_coef=0.01, vf_coef=0.25, vf_fisher_coef=1.0, learning_rate=0.25, max_grad_norm=0.5, kfac_clip=0.001, lr_schedule='linear', verbose=0, seed=None, _init_setup_model=True)[source]

The ACKTR (Actor Critic using Kronecker-Factored Trust Region) model class

Parameters
  • policy – (ActorCriticPolicy or str) The policy model to use (e.g. MlpPolicy)

  • env – (NEORL environment or Gym environment) The environment to learn with ACKTR; either use the NEORL method CreateEnvironment (see below) or construct your custom Gym environment

  • gamma – (float) Discount factor

  • n_steps – (int) The number of steps to run for each environment

  • ent_coef – (float) The weight for the entropy loss

  • vf_coef – (float) The weight for the loss on the value function

  • vf_fisher_coef – (float) The weight for the fisher loss on the value function

  • learning_rate – (float) The initial learning rate for the RMS prop optimizer

  • max_grad_norm – (float) The clipping value for the maximum gradient

  • kfac_clip – (float) gradient clipping based on the Kullback-Leibler divergence (the trust-region constraint of the K-FAC update)

  • lr_schedule – (str) The type of scheduler for the learning rate update (‘linear’, ‘constant’, ‘double_linear_con’, ‘middle_drop’ or ‘double_middle_drop’)

  • verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug

  • seed – (int) Seed for the pseudo-random generators (python, numpy, tensorflow). If None (default), use random seed.
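
A minimal construction sketch based on the signature above; the hyperparameter values are illustrative only, and env is assumed to be a NEORL/Gym environment (e.g. built with CreateEnvironment, see below):

from neorl import ACKTR
from neorl import MlpPolicy

#env is assumed to exist (e.g. from CreateEnvironment below); values are illustrative
acktr = ACKTR(policy=MlpPolicy,
              env=env,
              gamma=0.99,           #discount factor
              n_steps=12,           #steps collected per environment before each update
              ent_coef=0.01,        #entropy bonus weight
              vf_coef=0.25,         #value-function loss weight
              learning_rate=0.25,   #initial learning rate
              lr_schedule='linear', #learning-rate schedule
              seed=1)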

learn(total_timesteps, callback=None, log_interval=100, tb_log_name='ACKTR', reset_num_timesteps=True)[source]

Return a trained model.

Parameters
  • total_timesteps – (int) The total number of samples to train on

  • callback – (Union[callable, [callable], BaseCallback]) function called at every step with the state of the algorithm. It takes the local and global variables. If it returns False, training is aborted. When the callback inherits from BaseCallback, you will have access to additional stages of the training (training start/end); please read the documentation for more details.

  • log_interval – (int) The number of timesteps before logging.

  • tb_log_name – (str) the name of the run for tensorboard log

  • reset_num_timesteps – (bool) whether or not to reset the current timestep number (used in logging)

Returns

(BaseRLModel) the trained model

classmethod load(load_path, env=None, custom_objects=None, **kwargs)

Load the model from file

Parameters
  • load_path – (str or file-like) the saved parameter location

  • env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)

  • custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable is present in this dictionary as a key, it will not be deserialized and the corresponding item will be used instead. Similar to custom_objects in keras.models.load_model. Useful when you have an object in file that can not be deserialized.

  • kwargs – extra arguments to change the model when loading

predict(observation, state=None, mask=None, deterministic=False)

Get the model’s action from an observation

Parameters
  • observation – (np.ndarray) the input observation

  • state – (np.ndarray) The last states (can be None, used in recurrent policies)

  • mask – (np.ndarray) The last masks (can be None, used in recurrent policies)

  • deterministic – (bool) Whether or not to return deterministic actions.

Returns

(np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)

save(save_path, cloudpickle=False)[source]

Save the current parameters to file

Parameters
  • save_path – (str or file-like) The save location

  • cloudpickle – (bool) Use older cloudpickle format instead of zip-archives.
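
A short sketch of the save/load/predict cycle using the methods above; the trained acktr object and the env object are assumed to come from a prior learn call, as in the example further below:

from neorl import ACKTR

#save the trained parameters to disk
acktr.save('acktr_sphere_model')

#reload later; env may be None if only predict() is needed
model = ACKTR.load('acktr_sphere_model', env=env)

#query the loaded policy for an action on a fresh observation
obs = env.reset()
action, _state = model.predict(obs, deterministic=True)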

class neorl.rl.make_env.CreateEnvironment(method, fit, bounds, ncores=1, mode='max', episode_length=50)[source]

A module to construct a fitness environment for the algorithms that follow a reinforcement learning approach to optimization

Parameters
  • method – (str) the supported algorithms, choose either: dqn, ppo, acktr, acer, a2c.

  • fit – (function) the fitness function

  • bounds – (dict) input parameter type and lower/upper bounds in dictionary form. Example: bounds={'x1': ['int', 1, 4], 'x2': ['float', 0.1, 0.8], 'x3': ['float', 2.2, 6.2]}

  • ncores – (int) number of parallel processors

  • mode – (str) problem type, either min for a minimization problem or max for a maximization problem (RL defaults to max)

  • episode_length – (int) number of individuals to evaluate before resetting the environment to a random initial guess.
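
A short sketch of building an environment with mixed integer/continuous bounds for ACKTR; the fitness function my_fit is a placeholder you would replace with your own:

from neorl import CreateEnvironment

def my_fit(individual):
    #placeholder fitness: sum of the inputs
    return sum(individual)

#mixed discrete/continuous space, following the bounds format above
bounds={'x1': ['int', 1, 4], 'x2': ['float', 0.1, 0.8], 'x3': ['float', 2.2, 6.2]}

env=CreateEnvironment(method='acktr', fit=my_fit, bounds=bounds,
                      ncores=1, mode='max', episode_length=50)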

class neorl.utils.neorlcalls.RLLogger(check_freq=1, plot_freq=None, n_avg_steps=10, pngname='history', save_model=False, model_name='bestmodel.pkl', save_best_only=True, verbose=False)[source]

Callback for logging data of RL algorithms (x,y), compatible with: A2C, ACER, ACKTR, DQN, PPO

Parameters
  • check_freq – (int) logging frequency, e.g. 1 will record every time step

  • plot_freq – (int) frequency of plotting the fitness progress (if None, plotter is deactivated)

  • n_avg_steps – (int) if plot_freq is NOT None, then this is the number of timesteps to group to draw statistics for the plotter (e.g. 10 will group every 10 time steps to estimate min, max, mean, and std).

  • pngname – (str) name of the plot that will be saved if plot_freq is NOT None.

  • save_model – (bool) whether or not to save the RL neural network model (model is saved every check_freq)

  • model_name – (str) name of the model to be saved if save_model=True

  • save_best_only – (bool) if save_model = True, then this flag only saves the model if the fitness value improves.

  • verbose – (bool) print updates to the screen
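
A small configuration sketch of a logger that also plots progress and checkpoints the best network; the parameter values are illustrative only:

from neorl import RLLogger

#log every step, plot every 100 steps, and keep only the best model
cb=RLLogger(check_freq=1,
            plot_freq=100,
            n_avg_steps=10,
            pngname='acktr_history',
            save_model=True,
            model_name='acktr_best.pkl',
            save_best_only=True,
            verbose=True)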

Example

Train an ACKTR agent to optimize the 5-D sphere function

from neorl import ACKTR
from neorl import MlpPolicy
from neorl import RLLogger
from neorl import CreateEnvironment

def Sphere(individual):
    """Sphere test objective function.
       F(x) = sum_{i=1}^d xi^2
       d=1,2,3,...
       Range: [-100,100]
       Minimum: 0
    """
    return sum(x**2 for x in individual)

nx=5
bounds={}
for i in range(1,nx+1):
    bounds['x'+str(i)]=['float', -10, 10]

if __name__=='__main__':  #use this "if" block for parallel ACKTR!

    #create an environment class
    env=CreateEnvironment(method='acktr', fit=Sphere,
                          bounds=bounds, mode='min', episode_length=50)
    #create a callback function to log data
    cb=RLLogger(check_freq=1)
    #create an ACKTR object based on the env object
    acktr = ACKTR(MlpPolicy, env=env, n_steps=12, seed=1)
    #optimize the environment class
    acktr.learn(total_timesteps=2000, callback=cb)
    #print the best results
    print('--------------- ACKTR results ---------------')
    print('The best value of x found:', cb.xbest)
    print('The best value of y found:', cb.rbest)

Notes

  • ACKTR belongs to the actor-critic family of reinforcement learning algorithms. It uses a Kronecker-factored approximation of the curvature and trust-region updates to increase the efficiency of gradient-based policy search. ACKTR is parallelizable and supports all types of spaces.

  • ACKTR shows sensitivity to n_steps, vf_fisher_coef, vf_coef, and learning_rate. It is always good to tune these hyperparameters for the problem at hand before using ACKTR for optimization. In particular, n_steps is considered the most important parameter to tune for ACKTR: always start with a small n_steps and increase as needed (a simple sweep is sketched after these notes).

  • The computational cost of ACKTR equals total_timesteps in the learn function, i.e. the fitness function will be evaluated total_timesteps times.

  • See how ACKTR is used to solve two common combinatorial problems in TSP and KP.
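
As a concrete illustration of the n_steps advice above, a simple (and intentionally small) sweep is sketched below; the candidate values and the budget are arbitrary, and the problem setup mirrors the sphere example:

from neorl import ACKTR, MlpPolicy, RLLogger, CreateEnvironment

def Sphere(individual):
    #5-D sphere function, as in the example above
    return sum(x**2 for x in individual)

bounds={'x'+str(i): ['float', -10, 10] for i in range(1,6)}

if __name__=='__main__':
    env=CreateEnvironment(method='acktr', fit=Sphere,
                          bounds=bounds, mode='min', episode_length=50)
    results={}
    for n_steps in [8, 12, 24]:    #start small and increase as needed
        cb=RLLogger(check_freq=1)
        agent=ACKTR(MlpPolicy, env=env, n_steps=n_steps, seed=1)
        agent.learn(total_timesteps=2000, callback=cb)
        results[n_steps]=cb.rbest  #best fitness reached with this setting
    print('Best fitness per n_steps:', results)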

Acknowledgment

Thanks to our fellows at stable-baselines: we use their standalone RL implementation as a baseline upon which advanced neuroevolution algorithms are leveraged.

Hill, Ashley, et al. “Stable baselines.” (2018).