Policy Gradient Implementations

This page documents the implemented policy gradient / actor-critic algorithms, agents, and models.

Algorithms

class rlpyt.algos.pg.base.PolicyGradientAlgo

Bases: rlpyt.algos.base.RlAlgorithm

Base policy gradient / actor-critic algorithm, which includes initialization procedure and processing of data samples to compute advantages.

initialize(agent, n_itr, batch_spec, mid_batch_reset=False, examples=None, world_size=1, rank=0)

Build the torch optimizer and store other input attributes. Params batch_spec and examples are unused.

process_returns(samples)

Compute bootstrapped returns and advantages from a minibatch of samples. Uses plain discounted returns if self.gae_lambda == 1, otherwise generalized advantage estimation. Masks out invalid samples according to mid_batch_reset or for recurrent agents. Optionally normalizes advantages.
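
For reference, a minimal sketch of this computation with plain PyTorch tensors; the function name and argument layout below are illustrative, not the rlpyt internals:

    import torch

    def returns_and_advantages(reward, value, done, bootstrap_value,
                               discount=0.99, gae_lambda=1.0):
        """reward, value, done: [T, B]; bootstrap_value: [B], i.e. V(s_{T+1})."""
        T, B = reward.shape
        not_done = 1. - done.float()
        advantage = torch.zeros(T, B)
        if gae_lambda == 1.0:
            # Plain discounted returns minus the value baseline.
            return_ = torch.zeros(T, B)
            next_return = bootstrap_value
            for t in reversed(range(T)):
                next_return = reward[t] + discount * not_done[t] * next_return
                return_[t] = next_return
            advantage = return_ - value
        else:
            # Generalized advantage estimation.
            next_value = bootstrap_value
            gae = torch.zeros(B)
            for t in reversed(range(T)):
                delta = reward[t] + discount * not_done[t] * next_value - value[t]
                gae = delta + discount * gae_lambda * not_done[t] * gae
                advantage[t] = gae
                next_value = value[t]
            return_ = advantage + value
        return return_, advantage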

class rlpyt.algos.pg.a2c.A2C(discount=0.99, learning_rate=0.001, value_loss_coeff=0.5, entropy_loss_coeff=0.01, OptimCls=torch.optim.Adam, optim_kwargs=None, clip_grad_norm=1.0, initial_optim_state_dict=None, gae_lambda=1, normalize_advantage=False)

Bases: rlpyt.algos.pg.base.PolicyGradientAlgo

Advantage Actor Critic algorithm (synchronous). Trains the agent by taking one gradient step on each iteration of samples, with advantages computed by generalized advantage estimation.

__init__(discount=0.99, learning_rate=0.001, value_loss_coeff=0.5, entropy_loss_coeff=0.01, OptimCls=torch.optim.Adam, optim_kwargs=None, clip_grad_norm=1.0, initial_optim_state_dict=None, gae_lambda=1, normalize_advantage=False)

Saves the input settings.

optimize_agent(itr, samples)

Train the agent on input samples, by one gradient step.

loss(samples)

Computes the training loss: policy_loss + value_loss + entropy_loss. Policy loss: log-likelihood of actions * advantages. Value loss: 0.5 * (estimated_value - return) ^ 2. Organizes agent inputs from training samples, calls the agent instance to run a forward pass on the training data, and uses agent.distribution to compute likelihoods and entropies. Valid for feedforward or recurrent agents.
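
A hedged sketch of how these terms combine, assuming an rlpyt-style distribution object with log_likelihood and entropy methods; the helper and argument names below are illustrative:

    import torch

    def valid_mean(x, valid):
        return (x * valid).sum() / valid.sum()

    def a2c_loss(distribution, dist_info, value, action, return_, advantage,
                 valid, value_loss_coeff=0.5, entropy_loss_coeff=0.01):
        logli = distribution.log_likelihood(action, dist_info)   # [T, B]
        pi_loss = -valid_mean(logli * advantage, valid)          # policy gradient term
        value_loss = value_loss_coeff * valid_mean(0.5 * (value - return_) ** 2, valid)
        entropy_loss = -entropy_loss_coeff * valid_mean(distribution.entropy(dist_info), valid)
        return pi_loss + value_loss + entropy_loss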

class rlpyt.algos.pg.ppo.PPO(discount=0.99, learning_rate=0.001, value_loss_coeff=1.0, entropy_loss_coeff=0.01, OptimCls=torch.optim.Adam, optim_kwargs=None, clip_grad_norm=1.0, initial_optim_state_dict=None, gae_lambda=1, minibatches=4, epochs=4, ratio_clip=0.1, linear_lr_schedule=True, normalize_advantage=False)

Bases: rlpyt.algos.pg.base.PolicyGradientAlgo

Proximal Policy Optimization algorithm. Trains the agent by taking multiple epochs of gradient steps on minibatches of the training data at each iteration, with advantages computed by generalized advantage estimation. Uses clipped likelihood ratios in the policy loss.

__init__(discount=0.99, learning_rate=0.001, value_loss_coeff=1.0, entropy_loss_coeff=0.01, OptimCls=torch.optim.Adam, optim_kwargs=None, clip_grad_norm=1.0, initial_optim_state_dict=None, gae_lambda=1, minibatches=4, epochs=4, ratio_clip=0.1, linear_lr_schedule=True, normalize_advantage=False)

Saves input settings.

initialize(*args, **kwargs)

Extends base initialize() to initialize learning rate schedule, if applicable.

optimize_agent(itr, samples)

Train the agent for multiple epochs over minibatches taken from the input samples. Organizes agent inputs from the training data and moves them to the device (e.g. GPU) up front, so that minibatches are formed on-device without further data transfer.
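
A rough sketch of that loop for a feedforward agent; the dict-of-tensors layout and function names are assumptions for illustration, not the exact rlpyt code:

    import torch

    def ppo_updates(loss_fn, optimizer, params, data, epochs=4, minibatches=4,
                    clip_grad_norm=1.0):
        """data: dict of on-device tensors, each with a flattened leading batch dim."""
        example = next(iter(data.values()))
        batch_size, device = example.shape[0], example.device
        mb_size = batch_size // minibatches
        for _ in range(epochs):
            perm = torch.randperm(batch_size, device=device)
            for start in range(0, batch_size, mb_size):
                idxs = perm[start:start + mb_size]   # minibatch formed by indexing on-device data
                optimizer.zero_grad()
                loss = loss_fn(**{k: v[idxs] for k, v in data.items()})
                loss.backward()
                torch.nn.utils.clip_grad_norm_(params, clip_grad_norm)
                optimizer.step()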

loss(agent_inputs, action, return_, advantage, valid, old_dist_info, init_rnn_state=None)

Compute the training loss: policy_loss + value_loss + entropy_loss. Policy loss: min(likelihood_ratio * advantage, clip(likelihood_ratio, 1-eps, 1+eps) * advantage). Value loss: 0.5 * (estimated_value - return) ^ 2. Calls the agent to compute a forward pass on the training data, and uses agent.distribution to compute likelihoods and entropies. Valid for feedforward or recurrent agents.
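
A minimal sketch of the clipped policy term, assuming an rlpyt-style distribution with a likelihood_ratio method; argument names are illustrative:

    import torch

    def clipped_policy_loss(distribution, dist_info, old_dist_info, action,
                            advantage, valid, ratio_clip=0.1):
        ratio = distribution.likelihood_ratio(
            action, old_dist_info=old_dist_info, new_dist_info=dist_info)
        surr_1 = ratio * advantage                                    # unclipped surrogate
        surr_2 = torch.clamp(ratio, 1. - ratio_clip, 1. + ratio_clip) * advantage
        surrogate = torch.min(surr_1, surr_2)                         # pessimistic bound
        return -(surrogate * valid).sum() / valid.sum()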

Agents

Continuous Actions

class rlpyt.agents.pg.gaussian.GaussianPgAgent(ModelCls=None, model_kwargs=None, initial_model_state_dict=None)

Bases: rlpyt.agents.base.BaseAgent

Agent for policy gradient algorithm using Gaussian action distribution.

__call__(observation, prev_action, prev_reward)

Performs a forward pass on training data, for the algorithm.

initialize(env_spaces, share_memory=False, global_B=1, env_ranks=None)

Extends base method to build Gaussian distribution.

step(observation, prev_action, prev_reward)

Compute policy’s action distribution from inputs, and sample an action. Calls the model to produce mean, log_std, and value estimate. Moves inputs to device and returns outputs back to CPU, for the sampler. (no grad)
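
Schematically, with device handling simplified and the rlpyt DistInfo/AgentStep namedtuples replaced by a plain dict for illustration:

    import torch

    @torch.no_grad()
    def gaussian_step(model, observation, prev_action, prev_reward, device):
        obs, act, rew = (x.to(device) for x in (observation, prev_action, prev_reward))
        mu, log_std, value = model(obs, act, rew)                 # model forward pass
        action = torch.distributions.Normal(mu, log_std.exp()).sample()
        # Outputs go back to CPU for the sampler.
        return action.cpu(), dict(mean=mu.cpu(), log_std=log_std.cpu(), value=value.cpu())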

value(observation, prev_action, prev_reward)

Compute the value estimate for the environment state, e.g. for the bootstrap value, V(s_{T+1}), in the sampler. (no grad)

class rlpyt.agents.pg.gaussian.RecurrentGaussianPgAgentBase(ModelCls=None, model_kwargs=None, initial_model_state_dict=None)

Bases: rlpyt.agents.base.BaseAgent

__call__(observation, prev_action, prev_reward, init_rnn_state)

Performs a forward pass on training data, for the algorithm (requires the recurrent state input).

step(observation, prev_action, prev_reward)

Compute policy’s action distribution from inputs, and sample an action. Calls the model to produce mean, log_std, value estimate, and next recurrent state. Moves inputs to device and returns outputs back to CPU, for the sampler. Advances the recurrent state of the agent. (no grad)

value(observation, prev_action, prev_reward)

Compute the value estimate for the environment state using the currently held recurrent state, without advancing the recurrent state, e.g. for the bootstrap value V(s_{T+1}), in the sampler. (no grad)

class rlpyt.agents.pg.gaussian.RecurrentGaussianPgAgent(*args, **kwargs)

Bases: rlpyt.agents.base.RecurrentAgentMixin, rlpyt.agents.pg.gaussian.RecurrentGaussianPgAgentBase

class rlpyt.agents.pg.gaussian.AlternatingRecurrentGaussianPgAgent(*args, **kwargs)

Bases: rlpyt.agents.base.AlternatingRecurrentAgentMixin, rlpyt.agents.pg.gaussian.RecurrentGaussianPgAgentBase

class rlpyt.agents.pg.mujoco.MujocoMixin

Mixin class defining which environment interface properties are given to the model. Now supports observation normalization, including multi-GPU.

make_env_to_model_kwargs(env_spaces)

Extract observation_shape and action_size.
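
Roughly, assuming a Box observation space and a 1-D continuous action space as in the Mujoco tasks:

    def make_env_to_model_kwargs(env_spaces):
        # Map the env spaces onto the model constructor arguments.
        return dict(observation_shape=env_spaces.observation.shape,
                    action_size=env_spaces.action.shape[0])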

class rlpyt.agents.pg.mujoco.MujocoFfAgent(ModelCls=<class 'rlpyt.models.pg.mujoco_ff_model.MujocoFfModel'>, **kwargs)

Bases: rlpyt.agents.pg.mujoco.MujocoMixin, rlpyt.agents.pg.gaussian.GaussianPgAgent

__init__(ModelCls=<class 'rlpyt.models.pg.mujoco_ff_model.MujocoFfModel'>, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

class rlpyt.agents.pg.mujoco.MujocoLstmAgent(ModelCls=<class 'rlpyt.models.pg.mujoco_lstm_model.MujocoLstmModel'>, **kwargs)

Bases: rlpyt.agents.pg.mujoco.MujocoMixin, rlpyt.agents.pg.gaussian.RecurrentGaussianPgAgent

__init__(ModelCls=<class 'rlpyt.models.pg.mujoco_lstm_model.MujocoLstmModel'>, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

Discrete Actions

class rlpyt.agents.pg.categorical.CategoricalPgAgent(ModelCls=None, model_kwargs=None, initial_model_state_dict=None)

Bases: rlpyt.agents.base.BaseAgent

Agent for policy gradient algorithm using categorical action distribution. Same as GaussianPgAgent and related classes, except it uses the Categorical distribution and has a different interface to the model (the model here outputs discrete action probabilities in place of means and log_stds, while both output the value estimate).
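
That interface difference amounts to roughly the following sketch (simplified; the real agent also handles devices and namedtuple buffers):

    import torch

    @torch.no_grad()
    def categorical_step(model, observation, prev_action, prev_reward):
        pi, value = model(observation, prev_action, prev_reward)   # pi: [B, n_actions]
        action = torch.distributions.Categorical(probs=pi).sample()
        return action, dict(prob=pi, value=value)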

class rlpyt.agents.pg.atari.AtariMixin

Mixin class defining which environment interface properties are given to the model.

make_env_to_model_kwargs(env_spaces)

Extract image shape and action size.

class rlpyt.agents.pg.atari.AtariFfAgent(ModelCls=<class 'rlpyt.models.pg.atari_ff_model.AtariFfModel'>, **kwargs)

Bases: rlpyt.agents.pg.atari.AtariMixin, rlpyt.agents.pg.categorical.CategoricalPgAgent

__init__(ModelCls=<class 'rlpyt.models.pg.atari_ff_model.AtariFfModel'>, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

class rlpyt.agents.pg.atari.AtariLstmAgent(ModelCls=<class 'rlpyt.models.pg.atari_lstm_model.AtariLstmModel'>, **kwargs)

Bases: rlpyt.agents.pg.atari.AtariMixin, rlpyt.agents.pg.categorical.RecurrentCategoricalPgAgent

__init__(ModelCls=<class 'rlpyt.models.pg.atari_lstm_model.AtariLstmModel'>, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

Models

class rlpyt.models.pg.mujoco_ff_model.MujocoFfModel(observation_shape, action_size, hidden_sizes=None, hidden_nonlinearity=torch.nn.Tanh, mu_nonlinearity=torch.nn.Tanh, init_log_std=0.0, normalize_observation=False, norm_obs_clip=10, norm_obs_var_clip=1e-06)

Bases: torch.nn.Module

Model commonly used in Mujoco locomotion agents: an MLP which outputs distribution means, separate parameter for learned log_std, and separate MLP for state-value estimate.

__init__(observation_shape, action_size, hidden_sizes=None, hidden_nonlinearity=torch.nn.Tanh, mu_nonlinearity=torch.nn.Tanh, init_log_std=0.0, normalize_observation=False, norm_obs_clip=10, norm_obs_var_clip=1e-06)

Instantiate neural net modules according to inputs.

forward(observation, prev_action, prev_reward)

Compute mean, log_std, and value estimate from input state. Infers leading dimensions of input: can be [T,B], [B], or []; provides returns with same leading dims. Intermediate feedforward layers process as [T*B,H], with T=1,B=1 when not given. Used both in sampler and in algorithm (both via the agent).
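
A simplified standalone illustration of that leading-dimension handling for an MLP (rlpyt has dedicated helpers for this; names here are not the library's):

    import torch

    def mlp_with_leading_dims(mlp, observation, obs_ndim=1):
        lead_dim = observation.dim() - obs_ndim            # 2 for [T,B], 1 for [B], 0 for []
        T, B = ((observation.shape[0], observation.shape[1]) if lead_dim == 2 else
                (1, observation.shape[0]) if lead_dim == 1 else (1, 1))
        out = mlp(observation.reshape(T * B, -1))          # feedforward layers see [T*B, H]
        shape = {2: (T, B, -1), 1: (B, -1), 0: (-1,)}[lead_dim]
        return out.reshape(shape)                          # restore the original leading dims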

class rlpyt.models.pg.mujoco_lstm_model.MujocoLstmModel(observation_shape, action_size, hidden_sizes=None, lstm_size=256, nonlinearity=torch.nn.ReLU, normalize_observation=False, norm_obs_clip=10, norm_obs_var_clip=1e-06)

Bases: torch.nn.Module

Recurrent model for Mujoco locomotion agents: an MLP into an LSTM which outputs distribution means, log_std, and state-value estimate.

__init__(observation_shape, action_size, hidden_sizes=None, lstm_size=256, nonlinearity=torch.nn.ReLU, normalize_observation=False, norm_obs_clip=10, norm_obs_var_clip=1e-06)

Initialize self. See help(type(self)) for accurate signature.

forward(observation, prev_action, prev_reward, init_rnn_state)

Compute mean, log_std, and value estimate from input state. Infers leading dimensions of input: can be [T,B], [B], or []; provides returns with same leading dims. Intermediate feedforward layers process as [T*B,H], and recurrent layers as [T,B,H], with T=1,B=1 when not given. Used both in sampler and in algorithm (both via the agent). Also returns the next RNN state.
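
The recurrent portion can be sketched as below, with the MLP front end omitted and a standard torch.nn.LSTM assumed; names are illustrative:

    import torch

    def lstm_section(lstm, features, T, B, init_rnn_state=None):
        """features: [T*B, H] output of the feedforward layers."""
        lstm_input = features.view(T, B, -1)           # recurrent layers process [T, B, H]
        lstm_out, (hn, cn) = lstm(lstm_input, init_rnn_state)
        return lstm_out.view(T * B, -1), (hn, cn)      # also return the next RNN state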

class rlpyt.models.pg.atari_ff_model.AtariFfModel(image_shape, output_size, fc_sizes=512, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None)

Bases: torch.nn.Module

Feedforward model for Atari agents: a convolutional network feeding an MLP with outputs for action probabilities and state-value estimate.

__init__(image_shape, output_size, fc_sizes=512, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None)

Instantiate neural net module according to inputs.

forward(image, prev_action, prev_reward)

Compute action probabilities and value estimate from input state. Infers leading dimensions of input: can be [T,B], [B], or []; provides returns with same leading dims. Convolution layers process as [T*B, *image_shape], with T=1,B=1 when not given. Expects uint8 images in [0,255] and converts them to float32 in [0,1] (to minimize image data storage and transfer). Used both in the sampler and in the algorithm (both via the agent).
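
The image conversion step amounts to something like the following sketch (not the exact rlpyt code):

    import torch

    def prepare_images(image):
        """image: uint8 tensor in [0, 255], e.g. shaped [T*B, C, H, W]."""
        return image.to(dtype=torch.float32).mul_(1. / 255)   # float32 in [0, 1]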

class rlpyt.models.pg.atari_lstm_model.AtariLstmModel(image_shape, output_size, fc_sizes=512, lstm_size=512, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None)

Bases: torch.nn.Module

Recurrent model for Atari agents: a convolutional network into an FC layer into an LSTM which outputs action probabilities and state-value estimate.

__init__(image_shape, output_size, fc_sizes=512, lstm_size=512, use_maxpool=False, channels=None, kernel_sizes=None, strides=None, paddings=None)

Instantiate neural net module according to inputs.

forward(image, prev_action, prev_reward, init_rnn_state)

Compute action probabilities and value estimate from input state. Infers leading dimensions of input: can be [T,B], [B], or []; provides returns with same leading dims. Convolution layers process as [T*B, *image_shape], with T=1,B=1 when not given. Expects uint8 images in [0,255] and converts them to float32 in [0,1] (to minimize image data storage and transfer). Recurrent layers process as [T,B,H]. Used both in the sampler and in the algorithm (both via the agent). Also returns the next RNN state.