Q-Value Policy Gradient Implementations

This page documents algorithms, agents, and models implemented for Q-value policy gradient methods. (Much of the functionality around training and replay buffers looks similar to DQN.)

Deep Deterministic Policy Gradient (DDPG)

class rlpyt.algos.qpg.ddpg.DDPG(discount=0.99, batch_size=64, min_steps_learn=10000, replay_size=1000000, replay_ratio=64, target_update_tau=0.01, target_update_interval=1, policy_update_interval=1, learning_rate=0.0001, q_learning_rate=0.001, OptimCls=torch.optim.Adam, optim_kwargs=None, initial_optim_state_dict=None, clip_grad_norm=100000000.0, q_target_clip=1000000.0, n_step_return=1, updates_per_sync=1, bootstrap_timelimit=True, ReplayBufferCls=None)

Bases: rlpyt.algos.base.RlAlgorithm

Deep deterministic policy gradient algorithm, training from a replay buffer.

__init__(discount=0.99, batch_size=64, min_steps_learn=10000, replay_size=1000000, replay_ratio=64, target_update_tau=0.01, target_update_interval=1, policy_update_interval=1, learning_rate=0.0001, q_learning_rate=0.001, OptimCls=torch.optim.Adam, optim_kwargs=None, initial_optim_state_dict=None, clip_grad_norm=100000000.0, q_target_clip=1000000.0, n_step_return=1, updates_per_sync=1, bootstrap_timelimit=True, ReplayBufferCls=None)

Saves input arguments.

initialize(agent, n_itr, batch_spec, mid_batch_reset, examples, world_size=1, rank=0)

Stores input arguments and initializes replay buffer and optimizer. Use in non-async runners. Computes number of gradient updates per optimization iteration as (replay_ratio * sampler-batch-size / training-batch-size).
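
A quick arithmetic sketch of that computation (the sampler batch size of 256 steps per iteration is an assumed example, not a default):

    # Sketch of the updates-per-iteration calculation.
    replay_ratio = 64          # DDPG default above
    training_batch_size = 64   # DDPG default above
    sampler_batch_size = 256   # assumed example: e.g. 8 envs x 32 steps

    updates_per_optimize = round(
        replay_ratio * sampler_batch_size / training_batch_size)
    print(updates_per_optimize)  # -> 256 gradient updates per iteration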

async_initialize(agent, sampler_n_itr, batch_spec, mid_batch_reset, examples, world_size=1)

Used in async runner only; returns replay buffer allocated in shared memory, does not instantiate optimizer.

optim_initialize(rank=0)

Called in initialize() or by the async runner after forking the sampler.

initialize_replay_buffer(examples, batch_spec, async_=False)

Allocates the replay buffer from the example data, with the fields of the SamplesToBuffer namedarraytuple.

optimize_agent(itr, samples=None, sampler_itr=None)

Extracts the needed fields from input samples and stores them in the replay buffer. Then samples from the replay buffer to train the agent by gradient updates (with the number of updates determined by replay ratio, sampler batch size, and training batch size).

samples_to_buffer(samples)

Defines how to add data from sampler into the replay buffer. Called in optimize_agent() if samples are provided to that method.

mu_loss(samples, valid)

Computes the mu (policy) loss as the negative of the Q-value evaluated at the mu model's action output, so that minimizing the loss performs gradient ascent on Q.
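
A minimal PyTorch sketch of this policy loss (illustrative only; the tensor names and validity masking are simplified assumptions, not the exact rlpyt code):

    import torch

    def mu_loss_sketch(agent, agent_inputs, valid=None):
        """Deterministic policy-gradient loss: ascend Q(s, mu(s))."""
        # q_at_mu passes the observation through the mu model and then the Q
        # model, keeping gradients so they reach the mu model's parameters.
        q = agent.q_at_mu(*agent_inputs)
        if valid is None:
            return -q.mean()                     # minimize -Q == maximize Q
        return -(q * valid).sum() / valid.sum()  # masked mean over valid steps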

q_loss(samples, valid)

Constructs the n-step Q-learning loss using target Q. Input samples have leading batch dimension [B,..] (but not time).
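
A hedged sketch of the n-step target construction (field names such as return_ and done_n loosely follow the replay samples; the exact masking and clipping in rlpyt may differ):

    import torch

    def q_loss_sketch(agent, samples, discount=0.99, n_step=1, q_target_clip=1e6):
        """0.5 * (y - Q(s, a))^2 with an n-step bootstrapped target."""
        q = agent.q(*samples.agent_inputs, samples.action)
        with torch.no_grad():
            # Bootstrap from the target networks: Q_target(s', mu_target(s')).
            target_q = agent.target_q_at_mu(*samples.target_inputs)
            # samples.return_ holds the discounted n-step return sum_k gamma^k * r_{t+k}.
            y = samples.return_ + (1 - samples.done_n.float()) * discount ** n_step * target_q
            y = y.clamp(-q_target_clip, q_target_clip)
        return 0.5 * (y - q) ** 2  # per-sample losses; averaged (with validity mask) by the caller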

class rlpyt.agents.qpg.ddpg_agent.DdpgAgent(ModelCls=<class 'rlpyt.models.qpg.mlp.MuMlpModel'>, QModelCls=<class 'rlpyt.models.qpg.mlp.QofMuMlpModel'>, model_kwargs=None, q_model_kwargs=None, initial_model_state_dict=None, initial_q_model_state_dict=None, action_std=0.1, action_noise_clip=None)

Bases: rlpyt.agents.base.BaseAgent

Agent for deep deterministic policy gradient algorithm.

__init__(ModelCls=<class 'rlpyt.models.qpg.mlp.MuMlpModel'>, QModelCls=<class 'rlpyt.models.qpg.mlp.QofMuMlpModel'>, model_kwargs=None, q_model_kwargs=None, initial_model_state_dict=None, initial_q_model_state_dict=None, action_std=0.1, action_noise_clip=None)

Saves input arguments; default network sizes are specified here.

initialize(env_spaces, share_memory=False, global_B=1, env_ranks=None)

Instantiates the mu and q models, along with their target_mu and target_q counterparts.

q(observation, prev_action, prev_reward, action)

Compute Q-value for input state/observation and action (with grad).

q_at_mu(observation, prev_action, prev_reward)

Compute Q-value for input state/observation, through the mu_model (with grad).

target_q_at_mu(observation, prev_action, prev_reward)

Compute target Q-value for input state/observation, through the target mu_model.

step(observation, prev_action, prev_reward)

Computes the distribution parameter (mu) for the input state/observation and returns an action sampled from a Gaussian centered at mu.
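
Roughly, the exploration step amounts to the following (a sketch assuming a standalone mu model, a fixed action_std, and actions bounded in [-1, 1]; the real agent wraps this in its Gaussian distribution object):

    import torch

    def ddpg_step_sketch(mu_model, observation, action_std=0.1, noise_clip=None):
        """Deterministic action plus clipped Gaussian exploration noise."""
        with torch.no_grad():
            mu = mu_model(observation)                  # deterministic policy output
            noise = action_std * torch.randn_like(mu)   # exploration noise
            if noise_clip is not None:
                noise = noise.clamp(-noise_clip, noise_clip)
            return (mu + noise).clamp(-1, 1)            # keep action within bounds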

class rlpyt.models.qpg.mlp.MuMlpModel(observation_shape, hidden_sizes, action_size, output_max=1)

Bases: torch.nn.Module

MLP neural net for action mean (mu) output for DDPG agent.

__init__(observation_shape, hidden_sizes, action_size, output_max=1)

Instantiate neural net according to inputs.

class rlpyt.models.qpg.mlp.QofMuMlpModel(observation_shape, hidden_sizes, action_size)

Bases: torch.nn.Module

Q portion of the model for DDPG, an MLP.

__init__(observation_shape, hidden_sizes, action_size)

Instantiate neural net according to inputs.
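
Putting the DDPG pieces together follows the usual rlpyt pattern of sampler + algorithm + agent + runner. A minimal serial-mode sketch (the environment id, batch sizes, step counts, and logging names are placeholder choices, not prescribed values):

    from rlpyt.samplers.serial.sampler import SerialSampler
    from rlpyt.envs.gym import gym_make
    from rlpyt.algos.qpg.ddpg import DDPG
    from rlpyt.agents.qpg.ddpg_agent import DdpgAgent
    from rlpyt.runners.minibatch_rl import MinibatchRlEval
    from rlpyt.utils.logging.context import logger_context

    def build_and_train(env_id="Hopper-v3", run_ID=0):
        sampler = SerialSampler(
            EnvCls=gym_make,
            env_kwargs=dict(id=env_id),
            eval_env_kwargs=dict(id=env_id),
            batch_T=1,                 # timesteps per sampler iteration
            batch_B=1,                 # number of parallel environments
            eval_n_envs=2,
            eval_max_steps=int(1e4),
        )
        algo = DDPG()                  # defaults as documented above
        agent = DdpgAgent()
        runner = MinibatchRlEval(
            algo=algo,
            agent=agent,
            sampler=sampler,
            n_steps=int(1e6),
            log_interval_steps=int(1e4),
        )
        with logger_context("ddpg_example", run_ID, "ddpg_hopper", dict(env_id=env_id)):
            runner.train()

    if __name__ == "__main__":
        build_and_train()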

Twin Delayed Deep Deterministic Policy Gradient (TD3)

class rlpyt.algos.qpg.td3.TD3(batch_size=100, replay_ratio=100, target_update_tau=0.005, target_update_interval=2, policy_update_interval=2, mu_learning_rate=0.001, q_learning_rate=0.001, **kwargs)

Bases: rlpyt.algos.qpg.ddpg.DDPG

Twin delayed deep deterministic policy gradient algorithm.

__init__(batch_size=100, replay_ratio=100, target_update_tau=0.005, target_update_interval=2, policy_update_interval=2, mu_learning_rate=0.001, q_learning_rate=0.001, **kwargs)

Saves input arguments.

q_loss(samples, valid)

Computes the MSE Q-loss for each of the twin Q-values against the minimum of the twin target Q-values.
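
A sketch of the twin-critic loss (illustrative; the target-policy smoothing noise is applied inside the agent's target_q_at_mu(), described below):

    import torch

    def td3_q_loss_sketch(agent, samples, discount=0.99, n_step=1):
        """MSE loss for both critics against a shared min-of-twins target."""
        q1, q2 = agent.q(*samples.agent_inputs, samples.action)
        with torch.no_grad():
            target_q1, target_q2 = agent.target_q_at_mu(*samples.target_inputs)
            target_q = torch.min(target_q1, target_q2)  # clipped double-Q estimate
            y = samples.return_ + (1 - samples.done_n.float()) * discount ** n_step * target_q
        return 0.5 * ((y - q1) ** 2 + (y - q2) ** 2)    # per-sample; reduced by the caller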

class rlpyt.agents.qpg.td3_agent.Td3Agent(pretrain_std=0.5, target_noise_std=0.2, target_noise_clip=0.5, initial_q2_model_state_dict=None, **kwargs)

Bases: rlpyt.agents.qpg.ddpg_agent.DdpgAgent

Agent for TD3 algorithm, using two Q-models and two target Q-models.

__init__(pretrain_std=0.5, target_noise_std=0.2, target_noise_clip=0.5, initial_q2_model_state_dict=None, **kwargs)

Saves input arguments.

q(observation, prev_action, prev_reward, action)

Compute twin Q-values for state/observation and input action (with grad).

target_q_at_mu(observation, prev_action, prev_reward)

Compute twin target Q-values for state/observation, through target mu model.
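
The target-policy smoothing can be sketched as follows (assuming standalone target models and actions bounded in [-1, 1]; the noise constants match the agent defaults above):

    import torch

    def td3_target_q_sketch(target_mu, target_q1, target_q2, next_obs,
                            target_noise_std=0.2, target_noise_clip=0.5):
        """Evaluate the twin target critics at a noise-smoothed target action."""
        with torch.no_grad():
            mu = target_mu(next_obs)
            noise = (target_noise_std * torch.randn_like(mu)).clamp(
                -target_noise_clip, target_noise_clip)   # clipped smoothing noise
            target_action = (mu + noise).clamp(-1, 1)    # keep action in bounds
            return target_q1(next_obs, target_action), target_q2(next_obs, target_action)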

Soft Actor Critic (SAC)

class rlpyt.algos.qpg.sac.SAC(discount=0.99, batch_size=256, min_steps_learn=10000, replay_size=1000000, replay_ratio=256, target_update_tau=0.005, target_update_interval=1, learning_rate=0.0003, fixed_alpha=None, OptimCls=torch.optim.Adam, optim_kwargs=None, initial_optim_state_dict=None, action_prior='uniform', reward_scale=1, target_entropy='auto', reparameterize=True, clip_grad_norm=1000000000.0, n_step_return=1, updates_per_sync=1, bootstrap_timelimit=True, ReplayBufferCls=None)

Bases: rlpyt.algos.base.RlAlgorithm

Soft actor critic algorithm, training from a replay buffer.

__init__(discount=0.99, batch_size=256, min_steps_learn=10000, replay_size=1000000, replay_ratio=256, target_update_tau=0.005, target_update_interval=1, learning_rate=0.0003, fixed_alpha=None, OptimCls=torch.optim.Adam, optim_kwargs=None, initial_optim_state_dict=None, action_prior='uniform', reward_scale=1, target_entropy='auto', reparameterize=True, clip_grad_norm=1000000000.0, n_step_return=1, updates_per_sync=1, bootstrap_timelimit=True, ReplayBufferCls=None)

Saves input arguments.

initialize(agent, n_itr, batch_spec, mid_batch_reset, examples, world_size=1, rank=0)

Stores input arguments and initializes replay buffer and optimizer. Use in non-async runners. Computes number of gradient updates per optimization iteration as (replay_ratio * sampler-batch-size / training-batch-size).

optim_initialize(rank=0)

Called in initialize() or by the async runner after forking the sampler.

initialize_replay_buffer(examples, batch_spec, async_=False)

Allocates the replay buffer from the example data, with the fields of the SamplesToBuffer namedarraytuple.

optimize_agent(itr, samples=None, sampler_itr=None)

Extracts the needed fields from input samples and stores them in the replay buffer. Then samples from the replay buffer to train the agent by gradient updates (with the number of updates determined by replay ratio, sampler batch size, and training batch size).

samples_to_buffer(samples)

Defines how to add data from sampler into the replay buffer. Called in optimize_agent() if samples are provided to that method.

loss(samples)

Computes losses for twin Q-values against the min of twin target Q-values and an entropy term. Computes reparameterized policy loss, and loss for tuning entropy weighting, alpha.

Input samples have leading batch dimension [B,..] (but not time).
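
A compact sketch of the three loss terms in reparameterized form (variable names are illustrative; reward scaling, action priors, and validity masking used by rlpyt are omitted):

    import torch

    def sac_losses_sketch(agent, samples, alpha, target_entropy,
                          discount=0.99, n_step=1):
        """Returns (q_loss, pi_loss, alpha_loss) for one minibatch."""
        # Critic loss: twin Qs regress toward an entropy-regularized soft target.
        q1, q2 = agent.q(*samples.agent_inputs, samples.action)
        with torch.no_grad():
            next_action, next_log_pi, _ = agent.pi(*samples.target_inputs)
            target_q = torch.min(*agent.target_q(*samples.target_inputs, next_action))
            y = samples.return_ + (1 - samples.done_n.float()) * discount ** n_step * (
                target_q - alpha * next_log_pi)
        q_loss = 0.5 * ((y - q1) ** 2 + (y - q2) ** 2).mean()

        # Actor loss: reparameterized action sample, gradient through min of twin Qs.
        new_action, log_pi, _ = agent.pi(*samples.agent_inputs)
        min_q = torch.min(*agent.q(*samples.agent_inputs, new_action))
        pi_loss = (alpha * log_pi - min_q).mean()

        # Temperature loss: tune alpha so the policy entropy tracks target_entropy.
        alpha_loss = -(alpha * (log_pi.detach() + target_entropy)).mean()
        return q_loss, pi_loss, alpha_loss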

class rlpyt.agents.qpg.sac_agent.SacAgent(ModelCls=<class 'rlpyt.models.qpg.mlp.PiMlpModel'>, QModelCls=<class 'rlpyt.models.qpg.mlp.QofMuMlpModel'>, model_kwargs=None, q_model_kwargs=None, v_model_kwargs=None, initial_model_state_dict=None, action_squash=1.0, pretrain_std=0.75)

Bases: rlpyt.agents.base.BaseAgent

Agent for SAC algorithm, including action-squashing, using twin Q-values.

__init__(ModelCls=<class 'rlpyt.models.qpg.mlp.PiMlpModel'>, QModelCls=<class 'rlpyt.models.qpg.mlp.QofMuMlpModel'>, model_kwargs=None, q_model_kwargs=None, v_model_kwargs=None, initial_model_state_dict=None, action_squash=1.0, pretrain_std=0.75)

Saves input arguments; network defaults stored within.

q(observation, prev_action, prev_reward, action)

Compute twin Q-values for state/observation and input action (with grad).

target_q(observation, prev_action, prev_reward, action)

Compute twin target Q-values for state/observation and input action.

pi(observation, prev_action, prev_reward)

Compute action log-probabilities for state/observation, and sample new action (with grad). Uses the special sample_loglikelihood() method of the Gaussian distribution, which accounts for the action squashing when computing the log-likelihood.
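
The squashing correction performed by sample_loglikelihood() amounts to the standard change-of-variables term for a tanh transform. A minimal standalone sketch (not the rlpyt Gaussian distribution class itself):

    import math
    import torch

    def squashed_gaussian_sample_sketch(mean, log_std):
        """Reparameterized sample through tanh, with corrected log-likelihood."""
        std = log_std.exp()
        noise = torch.randn_like(mean)
        pre_tanh = mean + std * noise                 # reparameterized Gaussian sample
        action = torch.tanh(pre_tanh)                 # squash into (-1, 1)
        # Log-density of the unsquashed Gaussian sample.
        log_pi = (-0.5 * noise ** 2 - log_std
                  - 0.5 * math.log(2 * math.pi)).sum(dim=-1)
        # Change-of-variables correction: subtract log|d tanh(x)/dx| = log(1 - tanh(x)^2).
        log_pi = log_pi - torch.log(1 - action ** 2 + 1e-6).sum(dim=-1)
        return action, log_pi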

class rlpyt.models.qpg.mlp.PiMlpModel(observation_shape, hidden_sizes, action_size)

Action distribution MLP model for the SAC agent.