Deep Q Learning (DQN)

Deep Q Network (DQN) and its extensions (Double-DQN, Dueling-DQN, Prioritized Experience Replay).

Original papers:

  • DQN: https://arxiv.org/abs/1312.5602

  • Double-Q Learning: https://arxiv.org/abs/1509.06461

  • Dueling DQN: https://arxiv.org/abs/1511.06581

  • Prioritized Experience Replay: https://arxiv.org/abs/1511.05952

What can you use?

  • Multi processing: ❌

  • Discrete spaces: ✔️

  • Continuous spaces: ❌

  • Mixed Discrete/Continuous spaces: ❌

Parameters

class neorl.rl.baselines.deepq.DQN(policy, env, gamma=0.99, learning_rate=0.0005, buffer_size=50000, exploration_fraction=0.1, eps_final=0.02, eps_init=1.0, train_freq=1, batch_size=32, learning_starts=1000, target_network_update_freq=500, prioritized_replay=True, verbose=0, seed=None, _init_setup_model=True)[source]

The DQN model class

Parameters
  • policy – (DQNPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, LnMlpPolicy, …)

  • env – (NEORL environment or Gym environment) The environment to learn with DQN, either use the NEORL method CreateEnvironment (see below) or construct your custom Gym environment

  • gamma – (float) discount factor

  • learning_rate – (float) learning rate for adam optimizer

  • buffer_size – (int) size of the replay buffer

  • exploration_fraction – (float) fraction of entire training period over which the exploration rate is annealed

  • eps_final – (float) final value of random action probability (e.g. 0.05)

  • eps_init – (float) initial value of random action probability (e.g. 1.0)

  • train_freq – (int) update the model every train_freq steps.

  • batch_size – (int) size of the batch sampled from the replay buffer for training

  • learning_starts – (int) how many steps of the model to collect transitions for before learning starts

  • target_network_update_freq – (int) update the target network every target_network_update_freq steps.

  • prioritized_replay – (bool) if True, a prioritized experience replay buffer will be used.

  • verbose – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug

  • seed – (int) Seed for the pseudo-random generators (python, numpy, tensorflow). If None (default), use random seed.
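For illustration, below is a minimal sketch of constructing the agent with non-default exploration and replay settings. The environment env is assumed to be a discrete-action NEORL or Gym environment created beforehand (e.g. with CreateEnvironment, documented below); the specific hyperparameter values are illustrative only.

from neorl import DQN, DQNPolicy

#assumption: env is a discrete-action environment created beforehand
dqn = DQN(DQNPolicy, env=env,
          gamma=0.99,
          learning_rate=5e-4,
          buffer_size=20000,
          exploration_fraction=0.25,         #anneal epsilon over 25% of training
          eps_init=1.0,
          eps_final=0.05,
          train_freq=1,
          batch_size=32,
          target_network_update_freq=500,
          prioritized_replay=True,
          seed=1)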

learn(total_timesteps, callback=None, log_interval=100, tb_log_name='DQN', reset_num_timesteps=True, replay_wrapper=None)[source]

Return a trained model.

Parameters
  • total_timesteps – (int) The total number of samples to train on

  • callback – (Union[callable, [callable], BaseCallback]) function called at every step with the state of the algorithm. It takes the local and global variables. If it returns False, training is aborted. When the callback inherits from BaseCallback, you will have access to additional stages of the training (training start/end); please read the documentation for more details.

  • log_interval – (int) The number of timesteps before logging.

  • tb_log_name – (str) the name of the run for tensorboard log

  • reset_num_timesteps – (bool) whether or not to reset the current timestep number (used in logging)

Returns

(BaseRLModel) the trained model
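As a sketch of the plain-function callback form described above, the function below aborts training after a fixed number of calls; the 500-call limit and the dqn object are assumptions for illustration (dqn is an agent constructed as in the earlier sketch).

#assumption: dqn is a DQN object constructed beforehand
n_calls = 0
def stop_early(_locals, _globals):
    """Illustrative callback: abort training after 500 calls."""
    global n_calls
    n_calls += 1
    return n_calls < 500    #returning False aborts training

model = dqn.learn(total_timesteps=2000, callback=stop_early, log_interval=100)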

classmethod load(load_path, env=None, custom_objects=None, **kwargs)

Load the model from file

Parameters
  • load_path – (str or file-like) the saved parameter location

  • env – (Gym Environment) the new environment to run the loaded model on (can be None if you only need prediction from a trained model)

  • custom_objects – (dict) Dictionary of objects to replace upon loading. If a variable is present in this dictionary as a key, it will not be deserialized and the corresponding item will be used instead. Similar to custom_objects in keras.models.load_model. Useful when you have an object in file that can not be deserialized.

  • kwargs – extra arguments to change the model when loading
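A short sketch of restoring a saved agent for prediction only; the file name is hypothetical, and env=None is allowed since the model is used only for prediction, as noted above.

from neorl import DQN

#assumption: 'dqn_sphere.pkl' was produced earlier by save() (file name is illustrative)
trained = DQN.load('dqn_sphere.pkl', env=None)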

predict(observation, state=None, mask=None, deterministic=True)[source]

Get the model’s action from an observation

Parameters
  • observation – (np.ndarray) the input observation

  • state – (np.ndarray) The last states (can be None, used in recurrent policies)

  • mask – (np.ndarray) The last masks (can be None, used in recurrent policies)

  • deterministic – (bool) Whether or not to return deterministic actions.

Returns

(np.ndarray, np.ndarray) the model’s action and the next state (used in recurrent policies)
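A minimal sketch of querying the greedy action for a single observation; it assumes the environment follows the standard Gym API so that reset() returns a valid observation, and that dqn is a trained agent.

#assumption: env follows the Gym API (reset() returns an observation)
obs = env.reset()
action, _states = dqn.predict(obs, deterministic=True)
print('action selected:', action)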

save(save_path, cloudpickle=False)[source]

Save the current parameters to file

Parameters
  • save_path – (str or file-like) The save location

  • cloudpickle – (bool) Use older cloudpickle format instead of zip-archives.
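A short sketch of checkpointing a trained agent to disk; the file name is illustrative, and the saved file can later be restored with DQN.load (see above).

#assumption: dqn has already been trained with learn()
dqn.save('dqn_sphere.pkl')    #file name is illustrative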

class neorl.rl.make_env.CreateEnvironment(method, fit, bounds, ncores=1, mode='max', episode_length=50)[source]

A module to construct a fitness environment for the algorithms that follow a reinforcement learning approach to optimization

Parameters
  • method – (str) the RL algorithm in use; choose one of: dqn, ppo, acktr, acer, a2c.

  • fit – (function) the fitness function

  • bounds – (dict) input parameter type and lower/upper bounds in dictionary form. Example: bounds={'x1': ['int', 1, 4], 'x2': ['float', 0.1, 0.8], 'x3': ['float', 2.2, 6.2]}

  • ncores – (int) number of parallel processors

  • mode – (str) problem type, either min for a minimization problem or max for a maximization problem (RL defaults to max)

  • episode_length – (int) number of individuals to evaluate before resetting the environment to a random initial guess.

class neorl.utils.neorlcalls.RLLogger(check_freq=1, plot_freq=None, n_avg_steps=10, pngname='history', save_model=False, model_name='bestmodel.pkl', save_best_only=True, verbose=False)[source]

Callback for logging the data (x, y) of RL algorithms, compatible with: A2C, ACER, ACKTR, DQN, PPO

Parameters
  • check_freq – (int) logging frequency, e.g. 1 will record every time step

  • plot_freq – (int) frequency of plotting the fitness progress (if None, plotter is deactivated)

  • n_avg_steps – (int) if plot_freq is NOT None, then this is the number of timesteps to group to draw statistics for the plotter (e.g. 10 will group every 10 time steps to estimate min, max, mean, and std).

  • pngname – (str) name of the plot that will be saved if plot_freq is NOT None.

  • save_model – (bool) whether or not to save the RL neural network model (model is saved every check_freq)

  • model_name – (str) name of the model to be saved if save_model=True

  • save_best_only – (bool) if save_model = True, then this flag only saves the model if the fitness value improves.

  • verbose – (bool) print updates to the screen
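As a sketch beyond the minimal RLLogger(check_freq=1) used in the example below, the call here also enables the plotter and model checkpointing; all arguments are documented parameters of RLLogger, while the file names are illustrative.

from neorl import RLLogger

cb = RLLogger(check_freq=1,              #log every time step
              plot_freq=100,             #plot fitness progress every 100 steps
              n_avg_steps=10,            #group 10 steps per plotted statistic
              pngname='dqn_history',     #illustrative plot name
              save_model=True,
              model_name='best_dqn.pkl', #illustrative model file name
              save_best_only=True,       #only overwrite when fitness improves
              verbose=False)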

Example

Train a DQN agent to optimize the 5-D discrete sphere function

from neorl import DQN
from neorl import DQNPolicy
from neorl import RLLogger
from neorl import CreateEnvironment

def Sphere(individual):
    """Sphere test objective function.
       F(x) = sum_{i=1}^d xi^2
       d=1,2,3,...
       Range: [-100,100]
       Minima: 0
    """
    #print(individual)
    return sum(x**2 for x in individual)

nx=5
bounds={}
for i in range(1, nx+1):
    bounds['x'+str(i)] = ['int', -100, 100]

#create an environment class
env=CreateEnvironment(method='dqn', 
                      fit=Sphere, 
                      bounds=bounds, 
                      mode='min', 
                      episode_length=50)
#create a callback function to log data
cb=RLLogger(check_freq=1)
#create a RL object based on the env object
dqn = DQN(DQNPolicy, env=env, seed=1)
#optimize the environment class
dqn.learn(total_timesteps=2000, callback=cb)
#print the best results
print('--------------- DQN results ---------------')
print('The best value of x found:', cb.xbest)
print('The best value of y found:', cb.rbest)

Notes

  • DQN is the most limited RL algorithm in the package: it has no multiprocessing support and is restricted to discrete spaces. Nevertheless, DQN is considered the first deep RL algorithm and lies at the heart of many others.

  • For a parallel RL algorithm with Q-value support similar to DQN, use ACER.

  • DQN shows sensitivity to exploration_fraction, train_freq, and target_network_update_freq. It is always good to tune these hyperparameters before using DQN for optimization.

  • Activating prioritized_replay seems to improve DQN performance.

  • The cost for DQN equals total_timesteps in the learn function, i.e. the original fitness function will be evaluated total_timesteps times.

  • See how DQN is used to solve two common combinatorial problems in TSP and KP.

Acknowledgment

Thanks to our fellows at stable-baselines, as we used their standalone RL implementation as a baseline upon which advanced neuroevolution algorithms are built.

Hill, Ashley, et al. “Stable baselines.” (2018).