Deep Q-Learning Implementations

This page documents the implemented deep Q-learning algorithms, agents, and models. Everything up to Rainbow, except noisy nets, can be run by using Categorical-DQN with the options for double-DQN, dueling heads, n-step returns, and prioritized replay.

DQN

class rlpyt.algos.dqn.dqn.DQN(discount=0.99, batch_size=32, min_steps_learn=50000, delta_clip=1.0, replay_size=1000000, replay_ratio=8, target_update_tau=1, target_update_interval=312, n_step_return=1, learning_rate=0.00025, OptimCls=torch.optim.Adam, optim_kwargs=None, initial_optim_state_dict=None, clip_grad_norm=10.0, eps_steps=1000000, double_dqn=False, prioritized_replay=False, pri_alpha=0.6, pri_beta_init=0.4, pri_beta_final=1.0, pri_beta_steps=50000000, default_priority=None, ReplayBufferCls=None, updates_per_sync=1)

Bases: rlpyt.algos.base.RlAlgorithm

DQN algorithm training from a replay buffer, with options for Double-DQN, n-step returns, and prioritized replay.

__init__(discount=0.99, batch_size=32, min_steps_learn=50000, delta_clip=1.0, replay_size=1000000, replay_ratio=8, target_update_tau=1, target_update_interval=312, n_step_return=1, learning_rate=0.00025, OptimCls=torch.optim.Adam, optim_kwargs=None, initial_optim_state_dict=None, clip_grad_norm=10.0, eps_steps=1000000, double_dqn=False, prioritized_replay=False, pri_alpha=0.6, pri_beta_init=0.4, pri_beta_final=1.0, pri_beta_steps=50000000, default_priority=None, ReplayBufferCls=None, updates_per_sync=1)

Saves input arguments.

delta_clip selects the Huber loss; if None, uses MSE.

replay_ratio determines the ratio of data-consumption to data-generation. For example, original DQN sampled 4 environment steps between each training update with batch-size 32, for a replay ratio of 8.

initialize(agent, n_itr, batch_spec, mid_batch_reset, examples, world_size=1, rank=0)

Stores input arguments and initializes the replay buffer and optimizer. Use in non-async runners. Computes the number of gradient updates per optimization iteration as (replay_ratio * sampler_batch_size / training_batch_size).
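
As a rough illustration of that arithmetic only (the helper name and the rounding/clamping below are not rlpyt's; this is a sketch, not the library's code):

    # Illustrative sketch of the updates-per-iteration calculation (not rlpyt code).
    def updates_per_optimize(replay_ratio, sampler_batch_size, training_batch_size):
        # e.g. replay_ratio=8, sampler batch of 4 transitions, training batch of 32
        # -> 8 * 4 / 32 = 1 gradient update per optimization iteration.
        return max(1, round(replay_ratio * sampler_batch_size / training_batch_size))

    print(updates_per_optimize(8, 4, 32))    # -> 1
    print(updates_per_optimize(8, 256, 32))  # -> 64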

async_initialize(agent, sampler_n_itr, batch_spec, mid_batch_reset, examples, world_size=1)

Used in async runner only; returns replay buffer allocated in shared memory, does not instantiate optimizer.

optim_initialize(rank=0)

Called in initialize() or by the async runner after forking samplers.

initialize_replay_buffer(examples, batch_spec, async_=False)

Allocates the replay buffer using examples and the fields in the SamplesToBuffer namedarraytuple. Uses frame-wise buffers, so that only unique frames are stored, using less memory (observations are usually the 4 most recent frames, with only the newest frame distinct from the previous observation).
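
A minimal sketch of the frame-wise storage idea; shapes, names, and the omission of episode boundaries and circular wrap-around are simplifications, not rlpyt's buffer API:

    import numpy as np

    # Store each unique frame once; rebuild the stacked observation when sampling.
    T, H, W, n_frames = 1000, 84, 84, 4
    frames = np.zeros((T, H, W), dtype=np.uint8)   # one slot per unique frame

    def observation(t):
        # The stacked observation at step t is just the last n_frames frames,
        # so memory grows roughly with T rather than T * n_frames.
        return frames[t - n_frames + 1:t + 1]      # shape [n_frames, H, W]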

optimize_agent(itr, samples=None, sampler_itr=None)

Extracts the needed fields from input samples and stores them in the replay buffer. Then samples from the replay buffer to train the agent by gradient updates (with the number of updates determined by replay ratio, sampler batch size, and training batch size). If using prioritized replay, updates the priorities for sampled training batches.

samples_to_buffer(samples)

Defines how to add data from sampler into the replay buffer. Called in optimize_agent() if samples are provided to that method. In asynchronous mode, will be called in the memory_copier process.

loss(samples)

Computes the Q-learning loss, based on: 0.5 * (Q - target_Q) ^ 2. Implements regular DQN or Double-DQN for computing target_Q values using the agent’s target network. Computes the Huber loss using delta_clip, or if None, uses MSE. When using prioritized replay, multiplies losses by importance sampling weights.

Input samples have leading batch dimension [B,..] (but not time).

Calls the agent to compute forward pass on training inputs, and calls agent.target() to compute target values.

Returns loss and TD-absolute-errors for use in prioritization.

Warning

If not using mid_batch_reset, the sampler will only reset environments between iterations, so some samples in the replay buffer will be invalid. This case is not supported here currently.
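
A minimal PyTorch sketch of the target computation and Huber loss described above, assuming plain [B]-leading tensors passed as function arguments rather than rlpyt's samples namedarraytuple; names and the exact reduction are illustrative:

    import torch

    def dqn_loss(q_net, target_net, obs, action, return_, next_obs, done,
                 discount=0.99, n_step=1, double_dqn=False, delta_clip=1.0):
        # Q-value of the action actually taken, leading batch dim [B].
        q = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target_qs = target_net(next_obs)                 # [B, A]
            if double_dqn:
                # Online network selects the action; target network evaluates it.
                next_a = q_net(next_obs).argmax(dim=1, keepdim=True)
                target_q = target_qs.gather(1, next_a).squeeze(1)
            else:
                target_q = target_qs.max(dim=1).values
            y = return_ + (1. - done.float()) * discount ** n_step * target_q
        delta = y - q
        if delta_clip is not None:                           # Huber loss
            losses = torch.where(delta.abs() <= delta_clip,
                                 0.5 * delta ** 2,
                                 delta_clip * (delta.abs() - 0.5 * delta_clip))
        else:                                                # plain MSE
            losses = 0.5 * delta ** 2
        return losses.mean(), delta.abs()   # abs TD-errors for prioritization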

class rlpyt.agents.dqn.epsilon_greedy.EpsilonGreedyAgentMixin(eps_init=1, eps_final=0.01, eps_final_min=None, eps_itr_min=50, eps_itr_max=1000, eps_eval=0.001, *args, **kwargs)

Mixin class to operate all epsilon-greedy agents. Includes epsilon annealing, switching between sampling and evaluation epsilons, and vector-valued epsilons. The agent subclass must use a compatible epsilon-greedy distribution.

__init__(eps_init=1, eps_final=0.01, eps_final_min=None, eps_itr_min=50, eps_itr_max=1000, eps_eval=0.001, *args, **kwargs)

Saves input arguments. Setting eps_final_min to a value other than None will use vector-valued, log-spaced epsilon.

collector_initialize(global_B=1, env_ranks=None)

For vector-valued epsilon, the agent inside the sampler worker process must initialize with its own epsilon values.

make_vec_eps(global_B, env_ranks)

Construct log-spaced epsilon values and select local assignments from the global number of sampler environment instances (for SyncRl and AsyncRl).
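
A minimal sketch of log-spaced assignment across environments; the function name and the use of numpy.logspace are illustrative assumptions, not rlpyt's exact formula:

    import numpy as np

    def vector_eps(eps_final, eps_final_min, global_B, env_ranks):
        # One epsilon per environment instance, log-spaced between the smallest
        # and largest final values (Ape-X / R2D2-style exploration schedule).
        global_eps = np.logspace(np.log10(eps_final_min), np.log10(eps_final), global_B)
        return global_eps[env_ranks]   # local slice for this sampler worker

    # e.g. 8 environments globally, this worker owns ranks 2 and 3:
    print(vector_eps(0.1, 0.01, 8, [2, 3]))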

sample_mode(itr)

Extend method to set epsilon for sampling (including annealing).

eval_mode(itr)

Extend method to set epsilon for evaluation, using 1 for pre-training eval.

class rlpyt.agents.dqn.dqn_agent.DqnAgent(eps_init=1, eps_final=0.01, eps_final_min=None, eps_itr_min=50, eps_itr_max=1000, eps_eval=0.001, *args, **kwargs)

Bases: rlpyt.agents.dqn.epsilon_greedy.EpsilonGreedyAgentMixin, rlpyt.agents.base.BaseAgent

Standard agent for DQN algorithms with epsilon-greedy exploration.

__call__(observation, prev_action, prev_reward)

Returns Q-values for states/observations (with grad).

initialize(env_spaces, share_memory=False, global_B=1, env_ranks=None)

Along with standard initialization, creates vector-valued epsilon for exploration, if applicable, with a different epsilon for each environment instance.

step(observation, prev_action, prev_reward)

Computes Q-values for states/observations and selects actions by epsilon-greedy (no grad).

target(observation, prev_action, prev_reward)

Returns the target Q-values for states/observations.

update_target(tau=1)

Copies the model parameters into the target model.
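
A minimal sketch of such an update in PyTorch: tau=1 is a hard copy, while tau<1 would give a soft (Polyak-averaged) update; the function name is illustrative:

    import torch

    @torch.no_grad()
    def update_target(model, target_model, tau=1.0):
        # tau=1 copies the online parameters outright; tau<1 blends them in slowly.
        for p, tp in zip(model.parameters(), target_model.parameters()):
            tp.mul_(1.0 - tau).add_(p, alpha=tau)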

class rlpyt.models.dqn.atari_dqn_model.AtariDqnModel(image_shape, output_size, fc_sizes=512, dueling=False, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None)

Bases: torch.nn.Module

Standard convolutional network for DQN. 2-D convolution for multiple video frames per observation, feeding an MLP for Q-value outputs for the action set.

__init__(image_shape, output_size, fc_sizes=512, dueling=False, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None)

Instantiates the neural network according to arguments; network defaults stored within this method.

forward(observation, prev_action, prev_reward)

Compute action Q-value estimates from input state. Infers leading dimensions of input: can be [T,B], [B], or []; returns outputs with the same leading dims. Convolution layers process as [T*B, image_shape[0], image_shape[1], ..., image_shape[-1]], with T=1, B=1 when not given. Expects uint8 images in [0,255] and converts them to float32 in [0,1] (to minimize image data storage and transfer). Used in both the sampler and the algorithm (both via the agent).
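
A standalone sketch of the leading-dimension handling and image normalization described above; the conv/head arguments are placeholders, and rlpyt implements this with its own helper utilities:

    import torch

    def forward_sketch(conv, head, observation, image_shape):
        # Accept observations with leading dims [T,B], [B], or none at all.
        img = observation.type(torch.float) / 255.        # uint8 [0,255] -> float [0,1]
        lead_dim = img.dim() - len(image_shape)            # 0, 1, or 2 leading dims
        lead_shape = img.shape[:lead_dim]
        img = img.reshape(-1, *image_shape)                # fold T,B into one batch axis
        q = head(conv(img).reshape(img.shape[0], -1))      # [T*B, n_actions]
        return q.reshape(*lead_shape, -1)                  # restore original leading dims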

Categorical-DQN

class rlpyt.algos.dqn.cat_dqn.CategoricalDQN(V_min=-10, V_max=10, **kwargs)

Bases: rlpyt.algos.dqn.dqn.DQN

Distributional DQN with fixed probability bins for the Q-value of each action, a.k.a. categorical.

__init__(V_min=-10, V_max=10, **kwargs)

Standard __init__() plus Q-value limits; the agent configures the number of atoms (bins).

loss(samples)

Computes the Distributional Q-learning loss, based on projecting the discounted rewards + target Q-distribution into the current Q-domain, with cross-entropy loss.

Returns loss and KL-divergence-errors for use in prioritization.
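
A minimal sketch of the categorical projection and cross-entropy, assuming the action-selected distributions have already been gathered, and omitting Double-DQN action selection and prioritized-replay weighting; tensor names and shapes are illustrative:

    import torch

    def categorical_loss(p, target_p, return_, done, z, discount, n_step, V_min, V_max):
        # p: [B, n_atoms] probs of taken actions (online net, with grad)
        # target_p: [B, n_atoms] probs of greedy next actions (target net)
        # z: [n_atoms] atom support, e.g. torch.linspace(V_min, V_max, n_atoms)
        B, n_atoms = p.shape
        delta_z = (V_max - V_min) / (n_atoms - 1)
        with torch.no_grad():
            # Bellman-backed-up atom locations, clamped to the support.
            Tz = return_.unsqueeze(1) + (1. - done.float()).unsqueeze(1) * \
                discount ** n_step * z.unsqueeze(0)
            Tz = Tz.clamp(V_min, V_max)
            b = (Tz - V_min) / delta_z                    # fractional atom index
            l, u = b.floor().long(), b.ceil().long()
            # Distribute target probability mass onto neighboring atoms.
            m = torch.zeros_like(target_p)
            m.scatter_add_(1, l, target_p * (u.float() - b))
            m.scatter_add_(1, u, target_p * (b - l.float()))
            # If b lands exactly on an atom (l == u), both weights above are zero;
            # put the full mass on that atom instead.
            m.scatter_add_(1, l, target_p * (l == u).float())
        # Cross-entropy between projected target m and predicted distribution p.
        losses = -(m * torch.log(p.clamp(min=1e-8))).sum(dim=1)
        return losses.mean(), losses.detach()   # per-sample errors for prioritization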

class rlpyt.agents.dqn.catdqn_agent.CatDqnAgent(n_atoms=51, **kwargs)

Bases: rlpyt.agents.dqn.dqn_agent.DqnAgent

Agent for Categorical DQN algorithm.

__init__(n_atoms=51, **kwargs)

Standard init, and set the number of probability atoms (bins).

step(observation, prev_action, prev_reward)

Compute the discrete distribution for the Q-value for each action for each state/observation (no grad).

class rlpyt.models.dqn.atari_catdqn_model.AtariCatDqnModel(image_shape, output_size, n_atoms=51, fc_sizes=512, dueling=False, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None)

Bases: torch.nn.Module

2-D convolutional network feeding into an MLP with n_atoms outputs per action, representing a discrete probability distribution of Q-values.

__init__(image_shape, output_size, n_atoms=51, fc_sizes=512, dueling=False, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None)

Instantiates the neural network according to arguments; network defaults stored within this method.

forward(observation, prev_action, prev_reward)

Returns the probability masses num_atoms x num_actions for the Q-values for each state/observation, using softmax output nonlinearity.

Recurrent DQN (R2D1)

class rlpyt.algos.dqn.r2d1.R2D1(discount=0.997, batch_T=80, batch_B=64, warmup_T=40, store_rnn_state_interval=40, min_steps_learn=100000, delta_clip=None, replay_size=1000000, replay_ratio=1, target_update_interval=2500, n_step_return=5, learning_rate=0.0001, OptimCls=torch.optim.Adam, optim_kwargs=None, initial_optim_state_dict=None, clip_grad_norm=80.0, eps_steps=1000000, double_dqn=True, prioritized_replay=True, pri_alpha=0.6, pri_beta_init=0.9, pri_beta_final=0.9, pri_beta_steps=50000000, pri_eta=0.9, default_priority=None, input_priorities=True, input_priority_shift=None, value_scale_eps=0.001, ReplayBufferCls=None, updates_per_sync=1)

Bases: rlpyt.algos.dqn.dqn.DQN

Recurrent-replay DQN with options for: Double-DQN, dueling architecture, n-step returns, and prioritized replay.

__init__(discount=0.997, batch_T=80, batch_B=64, warmup_T=40, store_rnn_state_interval=40, min_steps_learn=100000, delta_clip=None, replay_size=1000000, replay_ratio=1, target_update_interval=2500, n_step_return=5, learning_rate=0.0001, OptimCls=torch.optim.Adam, optim_kwargs=None, initial_optim_state_dict=None, clip_grad_norm=80.0, eps_steps=1000000, double_dqn=True, prioritized_replay=True, pri_alpha=0.6, pri_beta_init=0.9, pri_beta_final=0.9, pri_beta_steps=50000000, pri_eta=0.9, default_priority=None, input_priorities=True, input_priority_shift=None, value_scale_eps=0.001, ReplayBufferCls=None, updates_per_sync=1)

Saves input arguments.

Parameters: store_rnn_state_interval (int) – store the RNN state only once per this many steps, to reduce memory usage; replay sequences will only begin at steps with stored recurrent state.

Note

Typically run with store_rnn_state_interval equal to the sampler’s batch_T, 40. Then every 40 steps can be the beginning of a replay sequence, and will be guaranteed to start with a valid RNN state. Only reset the RNN state (and env) at the end of the sampler batch, so that the beginnings of episodes are trained on.

initialize_replay_buffer(examples, batch_spec, async_=False)

Similar to DQN but uses replay buffers which return sequences, and stores the agent’s recurrent state.

optimize_agent(itr, samples=None, sampler_itr=None)

Similar to DQN, except it allows computing the priorities of new samples as they enter the replay buffer (input priorities), instead of only once they are used in training (important because the replay ratio is quite low, about 1, so uninformative samples must be avoided).

compute_input_priorities(samples)

Used when putting new samples into the replay buffer. Computes n-step TD-errors using recorded Q-values from the online network and value scaling. Weights the max and the mean TD-error over each sequence to make a single priority value for that sequence.

Note

Although the original R2D2 implementation used the entire 80-step sequence to compute the input priorities, we ran R2D1 with 40 time-step sample batches, and so computed the priority for each 80-step training sequence based on one of the two 40-step halves. Algorithm argument input_priority_shift determines which 40-step half is used as the priority for the 80-step sequence. (Since this method might get executed by alternating memory copiers in async mode, don’t carry internal state here, do all computation with only the samples available in input. Could probably reduce to one memory copier and keep state there, if needed.)
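
A minimal sketch of the max/mean mixture used to collapse per-step TD-errors into one priority per sequence; eta corresponds to pri_eta above, and the function name is illustrative:

    import torch

    def sequence_priorities(abs_td_errors, eta=0.9):
        # abs_td_errors: [T, B] absolute n-step TD-errors for B sequences.
        # R2D2-style mixture: eta * max + (1 - eta) * mean, one priority per sequence.
        return eta * abs_td_errors.max(dim=0).values + (1 - eta) * abs_td_errors.mean(dim=0)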

loss(samples)

Samples have leading Time and Batch dimensions [T,B,..]. Move all samples to device first, and then slice for sub-sequences. Use the same init_rnn_state for agent and target; start both at the same t. Warm up the RNN state first on the warmup subsequence, then train on the remaining subsequence.

Returns loss (usually MSE, not Huber), TD-error absolute values, and new sequence-wise priorities, based on the weighted sum of max and mean TD-error over the sequence.

value_scale(x)

Value scaling function to handle raw rewards across games (not clipped).

inv_value_scale(z)

Invert the value scaling.
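
A minimal sketch of an invertible value-scaling pair of this form, following the R2D2 paper's h and its closed-form inverse; eps corresponds to value_scale_eps above, and the code is not copied from rlpyt:

    import torch

    def value_scale(x, eps=1e-3):
        # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x
        return x.sign() * ((x.abs() + 1.).sqrt() - 1.) + eps * x

    def inv_value_scale(z, eps=1e-3):
        # Closed-form inverse of h.
        return z.sign() * ((((1. + 4. * eps * (z.abs() + 1. + eps)).sqrt() - 1.)
                            / (2. * eps)) ** 2 - 1.)

    x = torch.tensor([-5., 0., 3., 100.])
    print(inv_value_scale(value_scale(x)))   # recovers x up to float precision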

class rlpyt.agents.dqn.r2d1_agent.R2d1AgentBase(eps_init=1, eps_final=0.01, eps_final_min=None, eps_itr_min=50, eps_itr_max=1000, eps_eval=0.001, *args, **kwargs)

Bases: rlpyt.agents.dqn.dqn_agent.DqnAgent

Base agent for recurrent DQN (to add recurrent mixin).

step(observation, prev_action, prev_reward)

Computes Q-values for states/observations and selects actions by epsilon-greedy (no grad). Advances RNN state.

class rlpyt.agents.dqn.r2d1_agent.R2d1Agent(*args, **kwargs)

Bases: rlpyt.agents.base.RecurrentAgentMixin, rlpyt.agents.dqn.r2d1_agent.R2d1AgentBase

R2D1 agent.

class rlpyt.models.dqn.atari_r2d1_model.AtariR2d1Model(image_shape, output_size, fc_size=512, lstm_size=512, head_size=512, dueling=False, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None)

Bases: torch.nn.Module

2D convolutional neural network (for multiple video frames per observation) feeding into an LSTM and MLP output for Q-value outputs for the action set.

__init__(image_shape, output_size, fc_size=512, lstm_size=512, head_size=512, dueling=False, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None)

Instantiates the neural network according to arguments; network defaults stored within this method.

Miscellaneous

class rlpyt.models.dqn.dueling.DuelingHeadModel(input_size, hidden_sizes, output_size, grad_scale=0.7071067811865476)

Bases: torch.nn.Module

Model component for dueling DQN. For each state Q-value, uses a scalar output for the mean (bias), and a vector output for the relative advantages associated with each action, so the Q-values are computed as: Mean + (Advantages - mean(Advantages)). Uses a shared bias for all Advantage outputs. Gradient scaling can be applied, affecting preceding layers in the backward pass.

forward(input)

Computes Q-values through value and advantage heads; applies gradient scaling.

advantage(input)

Computes shared-bias advantages.
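
A simplified sketch of the dueling combination Q = V + (A - mean(A)), omitting rlpyt's shared advantage bias and gradient scaling; layer sizes and names are illustrative:

    import torch
    import torch.nn as nn

    class DuelingHeadSketch(nn.Module):
        # Simplified dueling head: separate value and advantage MLPs,
        # combined as Q = V + (A - mean(A)).
        def __init__(self, input_size, hidden_size, n_actions):
            super().__init__()
            self.advantage = nn.Sequential(nn.Linear(input_size, hidden_size), nn.ReLU(),
                                           nn.Linear(hidden_size, n_actions))
            self.value = nn.Sequential(nn.Linear(input_size, hidden_size), nn.ReLU(),
                                       nn.Linear(hidden_size, 1))

        def forward(self, x):
            a = self.advantage(x)                        # [B, n_actions]
            v = self.value(x)                            # [B, 1]
            return v + (a - a.mean(dim=-1, keepdim=True))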

class rlpyt.models.dqn.dueling.DistributionalDuelingHeadModel(input_size, hidden_sizes, output_size, n_atoms, grad_scale=0.7071067811865476)

Bases: torch.nn.Module

Model component for Dueling Distributional (Categorical) DQN, like DuelingHeadModel, but handles n_atoms outputs for each state-action Q-value distribution.

class rlpyt.models.dqn.atari_catdqn_model.DistributionalHeadModel(input_size, layer_sizes, output_size, n_atoms)

Bases: torch.nn.Module

An MLP head which reshapes output to [B, output_size, n_atoms].
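
A minimal sketch of such a head; layer sizes and names are illustrative, not rlpyt's defaults:

    import torch
    import torch.nn as nn

    class DistributionalHeadSketch(nn.Module):
        def __init__(self, input_size, hidden_size, output_size, n_atoms):
            super().__init__()
            self.output_size, self.n_atoms = output_size, n_atoms
            self.mlp = nn.Sequential(nn.Linear(input_size, hidden_size), nn.ReLU(),
                                     nn.Linear(hidden_size, output_size * n_atoms))

        def forward(self, x):
            # One logit per (action, atom) pair, reshaped to [B, output_size, n_atoms].
            return self.mlp(x).view(-1, self.output_size, self.n_atoms)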