Policy Gradient Implementations
This page documents the implemented policy gradient / actor-critic algorithms, agents, and models.
Algorithms
-
class
rlpyt.algos.pg.base.
PolicyGradientAlgo
Bases:
rlpyt.algos.base.RlAlgorithm
Base policy gradient / actor-critic algorithm, which includes the initialization procedure and the processing of data samples to compute advantages.
-
initialize
(agent, n_itr, batch_spec, mid_batch_reset=False, examples=None, world_size=1, rank=0) Build the torch optimizer and store other input attributes. The parameters
batch_spec
and
examples
are unused.
-
process_returns
(samples) Compute bootstrapped returns and advantages from a minibatch of samples. Uses either discounted returns (if
self.gae_lambda==1
) or generalized advantage estimation. Masks out invalid samples according to
mid_batch_reset
or for a recurrent agent. Optionally normalizes advantages.
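The two return estimators mentioned above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not rlpyt's implementation: the function names are hypothetical, inputs are Python lists for a single trajectory instead of torch tensors, and `dones[t]` is assumed to mean that the episode ended at step `t`.

```python
def discounted_returns(rewards, dones, bootstrap_value, discount):
    """Discounted returns, bootstrapped from the value after the last step."""
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        # A terminal step cuts off everything that follows it.
        running = rewards[t] + discount * (0.0 if dones[t] else running)
        returns[t] = running
    return returns


def gae_advantages(rewards, values, dones, bootstrap_value, discount, gae_lambda):
    """Generalized advantage estimation; returns (advantages, returns)."""
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error at step t.
        delta = rewards[t] + discount * not_done * next_value - values[t]
        next_advantage = delta + discount * gae_lambda * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def normalize_advantages(advantages, eps=1e-6):
    """Shift to zero mean and scale to (approximately) unit standard deviation."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    return [(a - mean) / (var ** 0.5 + eps) for a in advantages]
```

With `gae_lambda=1`, the returns produced by `gae_advantages` coincide with those of `discounted_returns`, which matches the special case noted in the description.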
-
-
class
rlpyt.algos.pg.a2c.
A2C
(discount=0.99, learning_rate=0.001, value_loss_coeff=0.5, entropy_loss_coeff=0.01, OptimCls=<sphinx.ext.autodoc.importer._MockObject object>, optim_kwargs=None, clip_grad_norm=1.0, initial_optim_state_dict=None, gae_lambda=1, normalize_advantage=False) Bases:
rlpyt.algos.pg.base.PolicyGradientAlgo
Advantage Actor Critic algorithm (synchronous). Trains the agent by taking one gradient step on each iteration of samples, with advantages computed by generalized advantage estimation.
-
__init__
(discount=0.99, learning_rate=0.001, value_loss_coeff=0.5, entropy_loss_coeff=0.01, OptimCls=<sphinx.ext.autodoc.importer._MockObject object>, optim_kwargs=None, clip_grad_norm=1.0, initial_optim_state_dict=None, gae_lambda=1, normalize_advantage=False) Saves the input settings.
-
optimize_agent
(itr, samples) Train the agent on input samples, by one gradient step.
-
loss
(samples) Computes the training loss: policy_loss + value_loss + entropy_loss.
Policy loss: log-likelihood of actions * advantages (negated, so that minimizing the loss raises the likelihood of advantageous actions)
Value loss: 0.5 * (estimated_value - return) ^ 2
Organizes agent inputs from training samples, calls the agent instance to run forward pass on training data, and uses the
agent.distribution
to compute likelihoods and entropies. Valid for feedforward or recurrent agents.
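The three loss terms described above can be combined as in the sketch below. It is a plain-Python illustration, not rlpyt's code: the function name is hypothetical, inputs are lists of per-sample scalars, and the averaging over samples and the sign of the entropy term (a bonus that lowers the loss) are assumptions.

```python
def a2c_loss(log_likelihoods, entropies, values, returns, advantages,
             value_loss_coeff=0.5, entropy_loss_coeff=0.01):
    """Return policy_loss + value_loss + entropy_loss for one batch."""
    n = len(returns)
    # Policy term: negated so that gradient descent increases the
    # log-likelihood of actions with positive advantage.
    policy_loss = -sum(ll * adv for ll, adv in zip(log_likelihoods, advantages)) / n
    # Value term: 0.5 * squared error between value estimate and return.
    value_loss = value_loss_coeff * sum(
        0.5 * (v - r) ** 2 for v, r in zip(values, returns)) / n
    # Entropy term: higher entropy lowers the loss (assumed sign convention).
    entropy_loss = -entropy_loss_coeff * sum(entropies) / n
    return policy_loss + value_loss + entropy_loss
```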
-
-
class
rlpyt.algos.pg.ppo.
PPO
(discount=0.99, learning_rate=0.001, value_loss_coeff=1.0, entropy_loss_coeff=0.01, OptimCls=<sphinx.ext.autodoc.importer._MockObject object>, optim_kwargs=None, clip_grad_norm=1.0, initial_optim_state_dict=None, gae_lambda=1, minibatches=4, epochs=4, ratio_clip=0.1, linear_lr_schedule=True, normalize_advantage=False) Bases:
rlpyt.algos.pg.base.PolicyGradientAlgo
Proximal Policy Optimization algorithm. Trains the agent by taking multiple epochs of gradient steps on minibatches of the training data at each iteration, with advantages computed by generalized advantage estimation. Uses clipped likelihood ratios in the policy loss.
-
__init__
(discount=0.99, learning_rate=0.001, value_loss_coeff=1.0, entropy_loss_coeff=0.01, OptimCls=<sphinx.ext.autodoc.importer._MockObject object>, optim_kwargs=None, clip_grad_norm=1.0, initial_optim_state_dict=None, gae_lambda=1, minibatches=4, epochs=4, ratio_clip=0.1, linear_lr_schedule=True, normalize_advantage=False) Saves input settings.
-
initialize
(*args, **kwargs) Extends base
initialize()
to initialize learning rate schedule, if applicable.
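As an illustration of what a linear learning rate schedule computes, the sketch below decays the rate from its initial value at iteration 0 toward zero at the final iteration. The function name and the exact decay formula are assumptions made for this example; the page itself does not specify the schedule's shape beyond the `linear_lr_schedule` flag.

```python
def linear_lr(base_lr, itr, n_itr):
    """Learning rate at iteration `itr`, decayed linearly over `n_itr` iterations."""
    # Assumed form: full rate at itr=0, zero at itr=n_itr.
    return base_lr * (n_itr - itr) / n_itr
```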
-
optimize_agent
(itr, samples) Train the agent, for multiple epochs over minibatches taken from the input samples. Organizes agent inputs from the training data, and moves them to device (e.g. GPU) up front, so that minibatches are formed within device, without further data transfer.
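The epoch and minibatch structure can be pictured with the index generator below. It is a hedged sketch rather than rlpyt's iterator: the function name is hypothetical, indices are reshuffled once per epoch, and any remainder that does not fill a whole minibatch is dropped, both of which are assumptions.

```python
import random


def iterate_minibatches(batch_size, minibatches, epochs, rng=None):
    """Yield lists of sample indices, `minibatches` per epoch, for `epochs` epochs."""
    if rng is None:
        rng = random.Random(0)
    mb_size = batch_size // minibatches
    for _ in range(epochs):
        indices = list(range(batch_size))
        rng.shuffle(indices)  # new random order every epoch
        # Samples beyond mb_size * minibatches are skipped in this sketch.
        for start in range(0, mb_size * minibatches, mb_size):
            yield indices[start:start + mb_size]
```

With the default `minibatches=4` and `epochs=4`, this yields 16 index lists per training iteration, one gradient step each.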
-
loss
(agent_inputs, action, return_, advantage, valid, old_dist_info, init_rnn_state=None) Compute the training loss: policy_loss + value_loss + entropy_loss.
Policy loss: min(likelihood_ratio * advantage, clip(likelihood_ratio, 1-eps, 1+eps) * advantage)
Value loss: 0.5 * (estimated_value - return) ^ 2
Calls the agent to compute forward pass on training data, and uses the
agent.distribution
to compute likelihoods and entropies. Valid for feedforward or recurrent agents.
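The clipped policy term can be sketched as follows. This is an illustration in plain Python, not rlpyt's implementation: the function name is hypothetical, the likelihood ratio is formed from per-sample log-probabilities under the new and old policies, and the result is negated and averaged so that it can be minimized, which are assumptions about the convention.

```python
import math


def ppo_policy_loss(new_log_probs, old_log_probs, advantages, ratio_clip=0.1):
    """Negated mean of the clipped surrogate objective."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # likelihood ratio new / old
        clipped = min(max(ratio, 1.0 - ratio_clip), 1.0 + ratio_clip)
        # Taking the minimum removes any incentive to move the ratio
        # outside [1 - eps, 1 + eps] in the direction favored by the advantage.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

For a positive advantage the ratio's contribution is capped at `1 + ratio_clip`; for a negative advantage the unclipped (more pessimistic) term is kept when the ratio grows.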
-
Agents
Continuous Actions
-
class
rlpyt.agents.pg.gaussian.
GaussianPgAgent
(ModelCls=None, model_kwargs=None, initial_model_state_dict=None) Bases:
rlpyt.agents.base.BaseAgent
Agent for policy gradient algorithm using Gaussian action distribution.
-
__call__
(observation, prev_action, prev_reward) Performs forward pass on training data, for algorithm.
-
initialize
(env_spaces, share_memory=False, global_B=1, env_ranks=None) Extends base method to build Gaussian distribution.
-
step
(observation, prev_action, prev_reward) Compute policy’s action distribution from inputs, and sample an action. Calls the model to produce mean, log_std, and value estimate. Moves inputs to device and returns outputs back to CPU, for the sampler. (no grad)
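Sampling from a diagonal Gaussian given the model's mean and log_std outputs can be sketched as below. The helper names are hypothetical and the code uses plain Python lists with the standard `random` module rather than torch tensors, so it only illustrates the arithmetic.

```python
import math
import random


def sample_gaussian_action(mean, log_std, rng):
    """Draw one action: mean + exp(log_std) * standard normal noise, per dimension."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]


def gaussian_log_likelihood(action, mean, log_std):
    """Log-density of `action` under the diagonal Gaussian, summed over dimensions."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        z = (a - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total
```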
-
value
(observation, prev_action, prev_reward) Compute the value estimate for the environment state, e.g. for the bootstrap value, V(s_{T+1}), in the sampler. (no grad)
-
-
class
rlpyt.agents.pg.gaussian.
RecurrentGaussianPgAgentBase
(ModelCls=None, model_kwargs=None, initial_model_state_dict=None) Bases:
rlpyt.agents.base.BaseAgent
-
__call__
(observation, prev_action, prev_reward, init_rnn_state) Performs forward pass on training data, for algorithm (requires recurrent state input).
-
step
(observation, prev_action, prev_reward) Compute policy’s action distribution from inputs, and sample an action. Calls the model to produce mean, log_std, value estimate, and next recurrent state. Moves inputs to device and returns outputs back to CPU, for the sampler. Advances the recurrent state of the agent. (no grad)
-
value
(observation, prev_action, prev_reward) Compute the value estimate for the environment state using the currently held recurrent state, without advancing the recurrent state, e.g. for the bootstrap value V(s_{T+1}), in the sampler. (no grad)
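The difference between a call that advances the recurrent state and one that only reads it can be shown with a toy class. Everything here is hypothetical: the class, its scalar "recurrent state", and the arithmetic in `_model` are stand-ins chosen so the behavior is easy to check, and none of it reflects the real agent's internals.

```python
class ToyRecurrentAgent:
    """Minimal stand-in for an agent that holds a recurrent state between steps."""

    def __init__(self):
        self.prev_rnn_state = 0.0

    def _model(self, observation, rnn_state):
        # Toy recurrence: the next state mixes the old state with the observation.
        next_state = 0.5 * rnn_state + observation
        value = 2.0 * next_state
        return value, next_state

    def step(self, observation):
        value, next_state = self._model(observation, self.prev_rnn_state)
        self.prev_rnn_state = next_state  # advance the held state
        return value

    def value(self, observation):
        # Uses the held state but discards the new one, so repeated calls agree.
        value, _ = self._model(observation, self.prev_rnn_state)
        return value
```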
-
-
class
rlpyt.agents.pg.gaussian.
RecurrentGaussianPgAgent
(*args, **kwargs) Bases:
rlpyt.agents.base.RecurrentAgentMixin,
rlpyt.agents.pg.gaussian.RecurrentGaussianPgAgentBase
-
class
rlpyt.agents.pg.gaussian.
AlternatingRecurrentGaussianPgAgent
(*args, **kwargs) Bases:
rlpyt.agents.base.AlternatingRecurrentAgentMixin,
rlpyt.agents.pg.gaussian.RecurrentGaussianPgAgentBase
-
class
rlpyt.agents.pg.mujoco.
MujocoMixin
Mixin class defining which environment interface properties are given to the model. Supports observation normalization, including with multiple GPUs.
-
make_env_to_model_kwargs
(env_spaces) Extract observation_shape and action_size.
-
-
class
rlpyt.agents.pg.mujoco.
MujocoFfAgent
(ModelCls=<class 'rlpyt.models.pg.mujoco_ff_model.MujocoFfModel'>, **kwargs) Bases:
rlpyt.agents.pg.mujoco.MujocoMixin,
rlpyt.agents.pg.gaussian.GaussianPgAgent
-
__init__
(ModelCls=<class 'rlpyt.models.pg.mujoco_ff_model.MujocoFfModel'>, **kwargs)
-
-
class
rlpyt.agents.pg.mujoco.
MujocoLstmAgent
(ModelCls=<class 'rlpyt.models.pg.mujoco_lstm_model.MujocoLstmModel'>, **kwargs) Bases:
rlpyt.agents.pg.mujoco.MujocoMixin,
rlpyt.agents.pg.gaussian.RecurrentGaussianPgAgent
-
__init__
(ModelCls=<class 'rlpyt.models.pg.mujoco_lstm_model.MujocoLstmModel'>, **kwargs)
-
Discrete Actions
-
class
rlpyt.agents.pg.categorical.
CategoricalPgAgent
(ModelCls=None, model_kwargs=None, initial_model_state_dict=None) Bases:
rlpyt.agents.base.BaseAgent
Agent for policy gradient algorithm using categorical action distribution. Same as
GaussianPgAgent
and related classes, except uses
Categorical
distribution, and has a different interface to the model (model here outputs discrete probabilities in place of means and log_stds, while both output the value estimate).
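For comparison with the Gaussian case, the sketch below shows what a categorical distribution does with the model's discrete probabilities: sample an action index, and compute the log-likelihood and entropy used in the losses. The function names and the small `eps` guard against `log(0)` are assumptions for this plain-Python example.

```python
import math
import random


def sample_categorical(probs, rng):
    """Draw an action index with probability proportional to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def categorical_log_likelihood(action, probs, eps=1e-8):
    """Log-probability of the chosen action index."""
    return math.log(probs[action] + eps)


def categorical_entropy(probs, eps=1e-8):
    """Entropy of the distribution, in nats."""
    return -sum(p * math.log(p + eps) for p in probs)
```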
-
class
rlpyt.agents.pg.atari.
AtariMixin
Mixin class defining which environment interface properties are given to the model.
-
make_env_to_model_kwargs
(env_spaces) Extract image shape and action size.
-
-
class
rlpyt.agents.pg.atari.
AtariFfAgent
(ModelCls=<class 'rlpyt.models.pg.atari_ff_model.AtariFfModel'>, **kwargs) Bases:
rlpyt.agents.pg.atari.AtariMixin,
rlpyt.agents.pg.categorical.CategoricalPgAgent
-
__init__
(ModelCls=<class 'rlpyt.models.pg.atari_ff_model.AtariFfModel'>, **kwargs)
-
-
class
rlpyt.agents.pg.atari.
AtariLstmAgent
(ModelCls=<class 'rlpyt.models.pg.atari_lstm_model.AtariLstmModel'>, **kwargs) Bases:
rlpyt.agents.pg.atari.AtariMixin,
rlpyt.agents.pg.categorical.RecurrentCategoricalPgAgent
-
__init__
(ModelCls=<class 'rlpyt.models.pg.atari_lstm_model.AtariLstmModel'>, **kwargs)
-
Models
-
class
rlpyt.models.pg.mujoco_ff_model.
MujocoFfModel
(observation_shape, action_size, hidden_sizes=None, hidden_nonlinearity=<sphinx.ext.autodoc.importer._MockObject object>, mu_nonlinearity=<sphinx.ext.autodoc.importer._MockObject object>, init_log_std=0.0, normalize_observation=False, norm_obs_clip=10, norm_obs_var_clip=1e-06) Bases:
torch.nn.Module
Model commonly used in Mujoco locomotion agents: an MLP which outputs distribution means, a separate parameter for the learned log_std, and a separate MLP for the state-value estimate.
-
__init__
(observation_shape, action_size, hidden_sizes=None, hidden_nonlinearity=<sphinx.ext.autodoc.importer._MockObject object>, mu_nonlinearity=<sphinx.ext.autodoc.importer._MockObject object>, init_log_std=0.0, normalize_observation=False, norm_obs_clip=10, norm_obs_var_clip=1e-06) Instantiate neural net modules according to inputs.
-
forward
(observation, prev_action, prev_reward) Compute mean, log_std, and value estimate from input state. Infers leading dimensions of input: can be [T,B], [B], or []; returns outputs with the same leading dims. Intermediate feedforward layers process as [T*B,H], with T=1,B=1 when not given. Used both in sampler and in algorithm (both via the agent).
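The leading-dimension handling can be sketched on shapes alone. The two helpers below are hypothetical and work on shape tuples rather than tensors: the first works out how many leading dims are present and what T and B are, and the second gives the output shape after the flattened [T*B, ...] result is restored to the input's leading dims.

```python
def leading_dims_from_shape(shape, data_ndim):
    """Return (lead, T, B, data_shape) for an input with 0, 1, or 2 leading dims.

    `data_ndim` is the number of trailing dims belonging to one observation.
    """
    lead = len(shape) - data_ndim
    assert lead in (0, 1, 2), "expected leading dims [], [B], or [T,B]"
    if lead == 2:
        T, B = shape[0], shape[1]
    elif lead == 1:
        T, B = 1, shape[0]
    else:
        T, B = 1, 1
    return lead, T, B, tuple(shape[lead:])


def restored_shape(flat_shape, lead, T, B):
    """Shape of an output of shape (T*B, ...) after restoring the leading dims."""
    tail = tuple(flat_shape[1:])
    if lead == 2:
        return (T, B) + tail
    if lead == 1:
        return (B,) + tail
    return tail
```

For example, a [20, 8, 17] observation batch with one data dim gives T=20 and B=8; the layers see 160 rows, and a (160, 6) output is restored to (20, 8, 6).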
-
-
class
rlpyt.models.pg.mujoco_lstm_model.
MujocoLstmModel
(observation_shape, action_size, hidden_sizes=None, lstm_size=256, nonlinearity=<sphinx.ext.autodoc.importer._MockObject object>, normalize_observation=False, norm_obs_clip=10, norm_obs_var_clip=1e-06) Bases:
torch.nn.Module
Recurrent model for Mujoco locomotion agents: an MLP into an LSTM which outputs distribution means, log_std, and state-value estimate.
-
__init__
(observation_shape, action_size, hidden_sizes=None, lstm_size=256, nonlinearity=<sphinx.ext.autodoc.importer._MockObject object>, normalize_observation=False, norm_obs_clip=10, norm_obs_var_clip=1e-06)
-
forward
(observation, prev_action, prev_reward, init_rnn_state) Compute mean, log_std, and value estimate from input state. Infers leading dimensions of input: can be [T,B], [B], or []; returns outputs with the same leading dims. Intermediate feedforward layers process as [T*B,H], and recurrent layers as [T,B,H], with T=1,B=1 when not given. Used both in sampler and in algorithm (both via the agent). Also returns the next RNN state.
-
-
class
rlpyt.models.pg.atari_ff_model.
AtariFfModel
(image_shape, output_size, fc_sizes=512, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None) Bases:
torch.nn.Module
Feedforward model for Atari agents: a convolutional network feeding an MLP with outputs for action probabilities and state-value estimate.
-
__init__
(image_shape, output_size, fc_sizes=512, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None) Instantiate neural net module according to inputs.
-
forward
(image, prev_action, prev_reward) Compute action probabilities and value estimate from input state. Infers leading dimensions of input: can be [T,B], [B], or []; returns outputs with the same leading dims. Convolution layers process as [T*B, *image_shape], with T=1,B=1 when not given. Expects uint8 images in [0,255] and converts them to float32 in [0,1] (to minimize image data storage and transfer). Used both in sampler and in algorithm (both via the agent).
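The storage argument behind keeping images as uint8 until the forward pass can be illustrated with the standard `array` module. This is a hypothetical sketch (the helper name is made up and real code would convert tensors on the device): each pixel takes 1 byte while stored and 4 bytes once converted to float32.

```python
from array import array


def to_unit_float(image_u8):
    """Convert uint8 pixel values in [0, 255] to float32 values in [0, 1]."""
    return array("f", (p / 255.0 for p in image_u8))


frame = array("B", [0, 128, 255])  # stored and transferred as 1 byte per pixel
scaled = to_unit_float(frame)      # converted only when the model consumes it
```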
-
-
class
rlpyt.models.pg.atari_lstm_model.
AtariLstmModel
(image_shape, output_size, fc_sizes=512, lstm_size=512, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None) Bases:
torch.nn.Module
Recurrent model for Atari agents: a convolutional network into an FC layer into an LSTM which outputs action probabilities and state-value estimate.
-
__init__
(image_shape, output_size, fc_sizes=512, lstm_size=512, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None) Instantiate neural net module according to inputs.
-
forward
(image, prev_action, prev_reward, init_rnn_state) Compute action probabilities and value estimate from input state. Infers leading dimensions of input: can be [T,B], [B], or []; returns outputs with the same leading dims. Convolution layers process as [T*B, *image_shape], with T=1,B=1 when not given. Expects uint8 images in [0,255] and converts them to float32 in [0,1] (to minimize image data storage and transfer). Recurrent layers process as [T,B,H]. Used both in sampler and in algorithm (both via the agent). Also returns the next RNN state.
-