Base Classes and Interfaces

This page describes the base classes for three main components: algorithm, agent, and environment. These are the most likely to need modification for a new project. Intended interfaces to the infrastructure code (i.e. runner and sampler) are specified here. More details on specific instances of these components appear in the following pages.

Commonly, these classes will simply store their keyword arguments when instantiated, and actual initialization occurs in methods to be called later by the runner or sampler.
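
The two-stage pattern described above can be sketched as follows. This is a minimal illustration, not rlpyt code: the class name TwoStageComponent and its arguments (learning_rate, batch_size) are hypothetical, and the list-based buffer only stands in for whatever a real component would allocate.

```python
class TwoStageComponent:
    """Hypothetical component whose constructor only records its settings."""

    def __init__(self, learning_rate=1e-3, batch_size=32):
        # Stage 1: store keyword arguments; nothing is allocated yet.
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.n_itr = None
        self.buffer = None

    def initialize(self, n_itr):
        # Stage 2: called later by the infrastructure code (runner or
        # sampler), once run-time information such as n_itr is known.
        self.n_itr = n_itr
        self.buffer = [0.0] * self.batch_size


component = TwoStageComponent(batch_size=4)  # cheap to build in a launch script
state_before = component.buffer              # still None at this point
component.initialize(n_itr=100)              # actual setup happens here
```

One plausible benefit of this split is that a configured object can be created in a launch script and handed to the runner before any expensive resources exist.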


class rlpyt.algos.base.RlAlgorithm

Trains the agent using gathered samples, for example by constructing TD-errors and performing gradient descent on the agent’s model parameters. Includes pre-processing of samples, e.g. discounting returns.
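
As an illustration of the sample pre-processing mentioned above, discounted returns for one environment's time series could be computed as in the sketch below. The function name discount_returns and its arguments are assumptions made for this example; plain Python lists are used in place of tensors to keep it dependency-free.

```python
def discount_returns(rewards, dones, discount, bootstrap_value=0.0):
    """Discounted return at each time step of a single trajectory segment.

    Hypothetical helper: `bootstrap_value` is a value estimate for the state
    following the last step (use 0.0 when none is available).
    """
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        # A done flag at step t stops reward from later steps flowing back.
        not_done = 0.0 if dones[t] else 1.0
        running = rewards[t] + discount * not_done * running
        returns[t] = running
    return returns


returns = discount_returns(
    rewards=[1.0, 1.0, 1.0],
    dones=[False, True, False],
    discount=0.5,
)
# returns == [1.5, 1.0, 1.0]
```

The episode boundary at the second step keeps the third step's reward out of the first two returns.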

initialize(agent, n_itr, batch_spec, mid_batch_reset, examples, world_size=1, rank=0)

Typically called in the runner during startup.

  • agent – The learning agent instance.
  • n_itr (int) – Number of training loop iterations which will be run (i.e. one per call of optimize_agent()).
  • batch_spec – Holds sampler batch dimensions.
  • mid_batch_reset (bool) – Whether the sampler resets environments during a sampling batch (True) or only between batches (False). Affects whether some samples are invalid for training.
  • examples – Structure of example RL quantities (e.g. observation, action, agent_info, env_info), for instance in case they are needed to allocate a replay buffer.
  • world_size (int) – Number of separate optimizing processes (e.g. multi-GPU).
  • rank (int) – Unique index for each optimizing process.
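
To illustrate why mid_batch_reset can affect sample validity: if environments are only reset between batches, the time steps recorded after an episode ends inside a batch do not hold real transitions. A minimal sketch of masking them out follows; the function name valid_mask is hypothetical, and treating the step on which done occurs as still valid is an assumption of this example.

```python
def valid_mask(dones):
    """Mark which time steps of one environment's batch are usable.

    Hypothetical helper for the mid_batch_reset=False case: every step up to
    and including the first done is valid; later steps are not.
    """
    valid = []
    seen_done = False
    for done in dones:
        valid.append(not seen_done)
        seen_done = seen_done or done
    return valid


mask = valid_mask([False, True, False, False])
# mask == [True, True, False, False]
```

With mid_batch_reset=True no such mask would be needed under this reading, since the environment starts a new episode right away.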
async_initialize(agent, sampler_n_itr, batch_spec, mid_batch_reset, examples, world_size=1)

Called instead of initialize() in the async runner (not needed otherwise). Should return an async replay buffer that uses shared memory.


Called in async runner which requires two stages of initialization; might also be used in initialize() to avoid redundant code.

optimize_agent(itr, samples=None, sampler_itr=None)

Train the agent for some number of parameter updates, e.g. either using new samples or a replay buffer.

Typically called in the runner’s training loop.

  • itr (int) – Iteration of the training loop.
  • samples – New samples from the sampler (for None case, see async runner).
  • sampler_itr – For case other than None, see async runner.
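
A rough sketch of how a runner's training loop might drive these two methods is given below. CountingAlgorithm and obtain_samples are hypothetical stand-ins, not rlpyt classes, and the dictionary returned from optimize_agent() is an assumption made only so the example has something to log.

```python
class CountingAlgorithm:
    """Hypothetical stand-in for an RlAlgorithm subclass."""

    def initialize(self, agent, n_itr, batch_spec, mid_batch_reset, examples,
                   world_size=1, rank=0):
        # Called once at startup; keep what the training loop will need.
        self.agent = agent
        self.n_itr = n_itr
        self.updates = 0

    def optimize_agent(self, itr, samples=None, sampler_itr=None):
        # A real algorithm would build a loss from `samples` and step an
        # optimizer here; this sketch only counts the call.
        self.updates += 1
        return {"itr": itr, "n_samples": len(samples)}


def obtain_samples(itr):
    # Hypothetical sampler call returning a batch of fake samples.
    return [itr] * 4


algo = CountingAlgorithm()
n_itr = 3
algo.initialize(agent=None, n_itr=n_itr, batch_spec=None,
                mid_batch_reset=False, examples=None)

log = []
for itr in range(n_itr):
    samples = obtain_samples(itr)
    log.append(algo.optimize_agent(itr, samples))
```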

optim_state_dict()

Return the optimizer state dict (e.g. Adam); override if using multiple optimizers.


Load an optimizer state dict; should expect the format returned from optim_state_dict().


Environments are expected to input/output numpy arrays.

class rlpyt.envs.base.Env

The learning task, e.g. an MDP containing a transition function T(state, action)–>state’. Has a defined observation space and action space.


Run one timestep of the environment’s dynamics using the input action, advancing the internal state; T(state,action)–>state’.

Parameters: action – An element of this environment’s action space.
Returns:
  • observation – An element of this environment’s observation space corresponding to the next state.
  • reward (float) – A scalar reward resulting from the state transition.
  • done (bool) – Indicates whether the episode has ended.
  • info (namedtuple) – Additional custom information.

Resets the state of the environment.

Returns: observation – The initial observation of the new episode.
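
The step/reset contract described above can be sketched with a toy task. LineWalkEnv and EnvInfo are hypothetical names, and the class does not subclass rlpyt's Env; observations are plain integers here only to keep the sketch dependency-free, whereas an actual environment would return numpy arrays as noted earlier.

```python
from collections import namedtuple

# Hypothetical info structure; a real environment defines its own fields.
EnvInfo = namedtuple("EnvInfo", ["steps"])


class LineWalkEnv:
    """Toy task: walk along a line until the goal position is reached."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0
        self.steps = 0

    def reset(self):
        # Start a new episode and return its initial observation.
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # T(state, action) -> state': action 1 moves right, anything else left.
        self.position += 1 if action == 1 else -1
        self.steps += 1
        done = self.position == self.goal
        reward = 1.0 if done else 0.0
        return self.position, reward, done, EnvInfo(steps=self.steps)


env = LineWalkEnv(goal=2)
observation = env.reset()
trajectory = []
done = False
while not done:
    observation, reward, done, info = env.step(1)
    trajectory.append((observation, reward, done))
```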


Agents are expected to input/output torch tensors.

class rlpyt.agents.base.BaseAgent(ModelCls=None, model_kwargs=None, initial_model_state_dict=None)

The agent performs many functions, including: action-selection during sampling, returning policy-related values to use in training (e.g. action probabilities), storing recurrent state during sampling, managing model device, and performing model parameter communication between processes. The agent serves as both interfaces: sampler<–>neural network<–>algorithm. Typically, each algorithm and environment combination will require at least some of its own agent functionality.

The base agent automatically carries out some of these roles. It assumes there is one neural network model. Agents using multiple models might need to extend certain functionality to include those models, depending on how they are used.

__init__(ModelCls=None, model_kwargs=None, initial_model_state_dict=None)

Arguments are saved but no model initialization occurs.

  • ModelCls – The model class to be used.
  • model_kwargs (optional) – Any keyword arguments to pass when instantiating the model.
  • initial_model_state_dict (optional) – Initial model parameter values.
__call__(observation, prev_action, prev_reward)

Returns values from model forward pass on training data (i.e. used in algorithm).

initialize(env_spaces, share_memory=False, **kwargs)

Instantiates the neural net model(s) according to the environment interfaces.

Uses shared memory as needed; e.g. in CpuSampler, workers have a copy of the agent for action-selection. The workers automatically hold up-to-date parameters in model, because they exist in shared memory, constructed here before worker processes fork. Agents with additional model components (beyond self.model) for action-selection should extend this method to share those, as well.

Typically called in the sampler during startup.

  • env_spaces – passed to make_env_to_model_kwargs(), typically namedtuple of ‘observation’ and ‘action’.
  • share_memory (bool) – whether to use shared memory for model parameters.

make_env_to_model_kwargs(env_spaces)

Generate any keyword args to the model which depend on environment interfaces.


Moves the model to the specified cuda device, if not None. If sharing memory, instantiates a new model to preserve the shared (CPU) model. Agents with additional model components (beyond self.model) for action-selection or for use during training should extend this method to move those to the device, as well.

Typically called in the runner during startup.


Wraps the model with PyTorch’s DistributedDataParallel. The intention is for rlpyt to create a separate Python process to drive each GPU (or CPU-group for CPU-only, MPI-like configuration). Agents with additional model components (beyond self.model) which will have gradients computed through them should extend this method to wrap those, as well.

Typically called in the runner during startup.


Used in async runner only; creates a new model instance to be used in the sampler, separate from the model shared with the optimizer process. The sampler can operate asynchronously, and choose when to copy the optimizer’s (shared) model parameters into its model (under read-write lock). The sampler model may be stored in shared memory, as well, to instantly share values with sampler workers. Agents with additional model components (beyond self.model) should extend this method to do the same with those, if using in asynchronous mode.

Typically called in the runner during startup.


step(observation, prev_action, prev_reward)

Returns selected actions for environment instances in sampler.


state_dict()

Returns model parameters for saving.


Load model parameters; should expect the format returned from state_dict().


Go into training mode (e.g. see PyTorch’s Module.train()).


Go into sampling mode.


Go into evaluation mode. Example use could be to adjust epsilon-greedy.
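
As a sketch of how mode switching might adjust epsilon-greedy exploration, consider the toy agent below. The class EpsilonGreedyModes, the method names train_mode, sample_mode and eval_mode, their itr argument, and the epsilon values are all assumptions for illustration; the class does not derive from BaseAgent.

```python
import random


class EpsilonGreedyModes:
    """Hypothetical agent fragment: exploration rate depends on the mode."""

    def __init__(self, sample_epsilon=0.1, eval_epsilon=0.0):
        self.sample_epsilon = sample_epsilon
        self.eval_epsilon = eval_epsilon
        self.epsilon = sample_epsilon
        self.mode = "sample"

    def train_mode(self, itr):
        self.mode = "train"

    def sample_mode(self, itr):
        self.mode = "sample"
        self.epsilon = self.sample_epsilon

    def eval_mode(self, itr):
        self.mode = "eval"
        self.epsilon = self.eval_epsilon

    def select_action(self, q_values, rng):
        # With probability epsilon pick a random action, otherwise the best.
        if rng.random() < self.epsilon:
            return rng.randrange(len(q_values))
        return max(range(len(q_values)), key=q_values.__getitem__)


agent = EpsilonGreedyModes()
agent.eval_mode(itr=0)
rng = random.Random(0)
# With eval_epsilon == 0.0 the random branch is never taken.
actions = [agent.select_action([0.1, 0.9, 0.3], rng) for _ in range(5)]
```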


Copies model parameters into shared_model, e.g. to make new values available to sampler workers. If running CPU-only, these will be the same object, so no copy is necessary. If the model is on GPU, a copy to CPU is performed. Requires initialize(share_memory=True) to have been called previously. Not used in async mode.



Used in async mode only, in optimizer process; copies parameters from trained model (maybe GPU) to shared model, which the sampler can access. Does so under write-lock, and increments send-count which sampler can check.



Used in async mode, in sampler process; copies parameters from model shared with optimizer into local model, if shared model has been updated. Does so under read-lock. (Local model might also be shared with sampler workers).
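
The send/receive handshake described in the two preceding entries could look roughly like the sketch below. All names (SharedParams, send, SamplerSide, recv) are hypothetical; a single threading.Lock stands in for the read-write lock, and plain lists stand in for model parameters held in shared memory across processes.

```python
import threading


class SharedParams:
    """Hypothetical shared container written by the optimizer side."""

    def __init__(self, values):
        self.values = list(values)
        self.send_count = 0
        self.lock = threading.Lock()  # stand-in for a read-write lock


def send(trained_values, shared):
    # Optimizer side: publish new parameters and bump the counter.
    with shared.lock:
        shared.values[:] = trained_values
        shared.send_count += 1


class SamplerSide:
    """Hypothetical sampler-side holder of a local parameter copy."""

    def __init__(self, shared):
        self.shared = shared
        self.local_values = list(shared.values)
        self.seen_count = shared.send_count

    def recv(self):
        # Copy only if the optimizer has sent something new since last time.
        with self.shared.lock:
            if self.seen_count == self.shared.send_count:
                return False
            self.local_values[:] = self.shared.values
            self.seen_count = self.shared.send_count
            return True


shared = SharedParams([0.0, 0.0])
sampler = SamplerSide(shared)
first = sampler.recv()        # nothing sent yet
send([1.0, 2.0], shared)
second = sampler.recv()       # picks up the new values
third = sampler.recv()        # already up to date
```

The counter lets the sampler skip the copy when nothing has changed, which is the point of checking the send-count before copying.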


Recurrent Agents

class rlpyt.agents.base.RecurrentAgentMixin(*args, **kwargs)

Mixin class to manage recurrent state during sampling (so the sampler remains agnostic). To be used like: class MyRecurrentAgent(RecurrentAgentMixin, MyAgent).


Sets the recurrent state to None, which built-in PyTorch modules convert to zeros.


Sets the recurrent state corresponding to one environment instance to zero. Assumes rnn state is in cudnn-compatible shape: [N,B,H], where B corresponds to environment index.


Sets the recurrent state to the newly computed one (i.e. recurrent agents should call this at the end of their step()).
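
The mixin composition and the calls described above can be sketched with stand-in classes. RecurrentStateMixin below is a hypothetical simplification, not rlpyt's RecurrentAgentMixin; the attribute _prev_rnn_state, the parameter name new_rnn_state, and the scalar "recurrence" in step() are assumptions chosen to keep the example free of dependencies.

```python
class RecurrentStateMixin:
    """Hypothetical stand-in for a recurrent-state mixin."""

    def __init__(self, *args, **kwargs):
        # Pass all arguments on to the agent class that follows in the MRO.
        super().__init__(*args, **kwargs)
        self._prev_rnn_state = None

    def reset(self):
        self._prev_rnn_state = None

    def advance_rnn_state(self, new_rnn_state):
        self._prev_rnn_state = new_rnn_state


class MyAgent:
    def __init__(self, decay=0.5):
        self.decay = decay


class MyRecurrentAgent(RecurrentStateMixin, MyAgent):
    def step(self, observation):
        # Treat a missing state as zero, as a recurrent module would.
        prev = 0.0 if self._prev_rnn_state is None else self._prev_rnn_state
        new_state = self.decay * prev + observation  # toy recurrence
        self.advance_rnn_state(new_state)  # store state at the end of step()
        return new_state


agent = MyRecurrentAgent(decay=0.5)
outputs = [agent.step(1.0), agent.step(1.0)]
agent.reset()
after_reset = agent.step(1.0)
```

Listing the mixin first in the class bases lets its methods take precedence while the sampler keeps calling step() without knowing about the recurrent state.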


If coming from sample mode, store the rnn state elsewhere and clear it.


If coming from non-sample modes, restore the last sample-mode rnn state.


If coming from sample mode, store the rnn state elsewhere and clear it.
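
A minimal sketch of the save-and-restore behaviour described in the three mode entries above is shown next. The class ModeAwareRecurrentState, its attribute names, and the argument-free method signatures are hypothetical; strings stand in for actual recurrent state.

```python
class ModeAwareRecurrentState:
    """Hypothetical holder that protects the sampling rnn state."""

    def __init__(self):
        self.mode = "sample"
        self.rnn_state = None
        self._saved_sample_state = None

    def _leave_sample_mode(self):
        # Only stash the state when actually coming from sample mode.
        if self.mode == "sample":
            self._saved_sample_state = self.rnn_state
            self.rnn_state = None

    def train_mode(self):
        self._leave_sample_mode()
        self.mode = "train"

    def eval_mode(self):
        self._leave_sample_mode()
        self.mode = "eval"

    def sample_mode(self):
        # Coming back from train or eval: restore the last sampling state.
        if self.mode != "sample":
            self.rnn_state = self._saved_sample_state
        self.mode = "sample"


holder = ModeAwareRecurrentState()
holder.rnn_state = "sampling-state"
holder.eval_mode()
state_in_eval = holder.rnn_state      # cleared on entry to eval mode
holder.rnn_state = "eval-state"       # evaluation overwrites it freely
holder.train_mode()                   # not from sample mode: nothing stashed
holder.sample_mode()
state_after = holder.rnn_state        # sampling state is back
```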

class rlpyt.agents.base.AlternatingRecurrentAgentMixin(*args, **kwargs)

Maintains an alternating pair of recurrent states to use when stepping in the sampler. Automatically swaps them when advance_rnn_state() is called, so it otherwise behaves like a regular recurrent agent. Use only in alternating samplers, where two sets of environment instances take turns stepping (no special class is needed for feedforward agents). Use in place of RecurrentAgentMixin.
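
The swapping behaviour can be sketched as follows. AlternatingStateMixin and CountingAgent are hypothetical stand-ins, not rlpyt classes; the attribute names and the integer "state" are assumptions, and the two environment groups are represented only by the order of the calls.

```python
class AlternatingStateMixin:
    """Hypothetical stand-in keeping one recurrent state per env group."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._prev_rnn_state = None  # state for the group stepping now
        self._alt_rnn_state = None   # state for the group waiting its turn

    def advance_rnn_state(self, new_rnn_state):
        # Store the new state for the group that just stepped, and make the
        # other group's state current for the next call to step().
        self._prev_rnn_state = self._alt_rnn_state
        self._alt_rnn_state = new_rnn_state


class CountingAgent(AlternatingStateMixin):
    def step(self, increment):
        prev = 0 if self._prev_rnn_state is None else self._prev_rnn_state
        new_state = prev + increment  # toy recurrence: a running sum
        self.advance_rnn_state(new_state)
        return new_state


agent = CountingAgent()
# Group A always adds 1 and group B always adds 10; they take turns.
history = [agent.step(1), agent.step(10), agent.step(1), agent.step(10)]
# history == [1, 10, 2, 20]
```

Each group's running sum is unaffected by the other's steps, which is the behaviour expected when two sets of environments share one agent.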