RL-informed Evolution Strategies (PPO-ES)

In the first step, the Proximal Policy Optimization (PPO) algorithm searches an RL environment built around the fitness function to collect a pool of candidate individuals. In the second step, the best PPO individuals guide evolution strategies (ES): RL individuals are randomly introduced into the ES population to enrich its diversity. The user first runs the PPO search and then ES; the best results of both stages are reported.
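
The sketch below is only a conceptual illustration of the second step, not the NEORL implementation: a hypothetical helper randomly replaces a few members of an ES population with the top-ranked PPO samples.

import random

def inject_rl_individuals(es_population, rl_samples, npop_rl):
    #illustrative only: overwrite npop_rl randomly chosen ES members
    #with the best RL/PPO individuals to refresh population diversity
    new_pop = list(es_population)
    for i, rl_ind in zip(random.sample(range(len(new_pop)), npop_rl), rl_samples):
        new_pop[i] = rl_ind
    return new_pop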

Original papers:

  • Radaideh, M. I., & Shirvan, K. (2021). Rule-based reinforcement learning methodology to inform evolutionary algorithms for constrained optimization of engineering applications. Knowledge-Based Systems, 217, 106836.

  • Radaideh, M. I., Forget, B., & Shirvan, K. (2021). Large-scale design optimisation of boiling water reactor bundles with neuroevolution. Annals of Nuclear Energy, 160, 108355.

What can you use?

  • Multiprocessing: ✔️

  • Discrete spaces: ✔️

  • Continuous spaces: ✔️

  • Mixed Discrete/Continuous spaces: ✔️

Parameters

class neorl.hybrid.ppoes.PPOES(mode, fit, env, bounds, npop=60, npop_rl=6, init_pop_rl=True, hyperparam={}, seed=None)[source]

A PPO-informed ES Neuroevolution module

Parameters
  • mode – (str) problem type, either min for a minimization problem or max for a maximization problem

  • fit – (function) the fitness function to be used with ES

  • env – (NEORL environment or Gym environment) the environment to learn with PPO; either use the NEORL method CreateEnvironment (see below) or construct your own custom Gym environment (a minimal sketch is given after the CreateEnvironment class below).

  • bounds – (dict) input parameter type and lower/upper bounds in dictionary form. Example: bounds={'x1': ['int', 1, 4], 'x2': ['float', 0.1, 0.8], 'x3': ['float', 2.2, 6.2]}

  • npop – (int) population size of ES

  • npop_rl – (int) number of RL/PPO individuals to use in the ES population (npop_rl < npop)

  • init_pop_rl – (bool) flag to initialize ES population with PPO individuals

  • hyperparam – (dict) dictionary of ES hyperparameters (cxpb, cxmode, mutpb, alpha, mu, smin, smax) and PPO hyperparameters (n_steps, gamma, learning_rate, ent_coef, vf_coef, lam, cliprange, max_grad_norm, nminibatches, noptepochs); an illustrative combined dictionary is shown after this list.

  • seed – (int) random seed for sampling
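
As noted in the hyperparam entry above, ES and PPO hyperparameters share a single flat dictionary, and keys that are not provided fall back to defaults. The values below are illustrative, not tuned recommendations:

h = {'cxpb': 0.7,            #ES: crossover probability
     'mutpb': 0.2,           #ES: mutation probability
     'n_steps': 32,          #PPO: steps per PPO update
     'gamma': 0.99,          #PPO: discount factor
     'learning_rate': 3e-4}  #PPO: learning rate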

evolute(ngen, ncores=1, verbose=False)[source]

This function evolutes the ES algorithm for a number of generations, with guidance from the RL individuals.

Parameters
  • ngen – (int) number of generations to evolute

  • ncores – (int) number of parallel processors to use with ES

  • verbose – (bool) print statistics to screen

Returns

(tuple) (best individual, best fitness, and a list of fitness history)

learn(total_timesteps, rl_filter=100, verbose=False)[source]

This function starts the learning of the PPO algorithm for a number of timesteps to create individuals for the evolutionary search.

Parameters
  • total_timesteps – (int) number of timesteps to run

  • rl_filter – (int) number of top individuals to keep from the full RL search

  • verbose – (bool) print statistics to screen

Returns

(dataframe) dataframe of individuals/fitness sorted from best to worst

class neorl.rl.make_env.CreateEnvironment(method, fit, bounds, ncores=1, mode='max', episode_length=50)[source]

A module to construct a fitness environment for algorithms that follow the reinforcement learning approach to optimization

Parameters
  • method – (str) the RL algorithm to construct the environment for, choose one of: dqn, ppo, acktr, acer, a2c.

  • fit – (function) the fitness function

  • bounds – (dict) input parameter type and lower/upper bounds in dictionary form. Example: bounds={'x1': ['int', 1, 4], 'x2': ['float', 0.1, 0.8], 'x3': ['float', 2.2, 6.2]}

  • ncores – (int) number of parallel processors

  • mode – (str) problem type, either min for a minimization problem or max for maximization (RL defaults to max)

  • episode_length – (int) number of individuals to evaluate before resetting the environment to a random initial guess.
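
Alternatively, the env argument of PPOES accepts a custom Gym environment instead of CreateEnvironment. The class below is only a minimal sketch that wraps the Sphere fitness, assuming the classic gym.Env API used by stable-baselines (reset returns the observation; step returns observation, reward, done, info); the observation choice and reward shaping here are illustrative and may differ from NEORL's internals.

import gym
import numpy as np
from gym.spaces import Box

class SphereEnv(gym.Env):
    #illustrative wrapper of a fitness function as a Gym environment
    def __init__(self, d=5, lb=-100.0, ub=100.0):
        self.action_space = Box(low=lb, high=ub, shape=(d,), dtype=np.float32)
        self.observation_space = Box(low=lb, high=ub, shape=(d,), dtype=np.float32)
        self.state = np.zeros(d, dtype=np.float32)

    def step(self, action):
        fitness = float(np.sum(np.asarray(action)**2))   #Sphere fitness
        self.state = np.asarray(action, dtype=np.float32)
        reward = -fitness    #negate: PPO maximizes reward, Sphere is minimized
        return self.state, reward, False, {}

    def reset(self):
        self.state = self.observation_space.sample()
        return self.state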

Example

Train a PPO-ES agent to optimize the 5-D sphere function

from neorl import PPOES
from neorl import CreateEnvironment

def Sphere(individual):
    """Sphere test objective function.
            F(x) = sum_{i=1}^d xi^2
            d=1,2,3,...
            Range: [-100,100]
            Minima: 0
    """
    y=sum(x**2 for x in individual)
    return y


#Setup the parameter space (d=5)
nx=5
BOUNDS={}
for i in range(1,nx+1):
    BOUNDS['x'+str(i)]=['float', -100, 100]
    

if __name__=='__main__':  #use this block for parallel PPO!
    #create an environment class for RL/PPO
    env=CreateEnvironment(method='ppo', fit=Sphere, ncores=1,  
                          bounds=BOUNDS, mode='min', episode_length=50)
    
    #change hyperparameters of PPO/ES if you like (defaults should be good to start with)
    h={'cxpb': 0.8,
       'mutpb': 0.2,
       'n_steps': 24,
       'lam': 1.0}
    
    #Important: `mode` in CreateEnvironment and `mode` in PPOES must be consistent
    #fit must be passed again for ES and must be the same function used in env
    ppoes=PPOES(mode='min', fit=Sphere, 
                env=env, npop_rl=4, init_pop_rl=True, 
                bounds=BOUNDS, hyperparam=h, seed=1)
    #first run RL for some timesteps
    rl=ppoes.learn(total_timesteps=2000, verbose=True)
    #second run ES, which will use RL data for guidance
    ppoes_x, ppoes_y, ppoes_hist=ppoes.evolute(ngen=20, ncores=1, verbose=True) #ncores for ES
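
    #after both stages, inspect the results (continuation of the example above)
    print('Best individual:', ppoes_x)
    print('Best fitness:', ppoes_y)
    print(rl.head())    #top PPO individuals/fitness returned by learn()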