class rlpyt.runners.base.BaseRunner

Orchestrates sampler and algorithm to run the training loop. The runner should also manage logging to record agent performance during training. Different runner classes may be used depending on the overall RL procedure and the hardware configuration (e.g. multi-GPU).


Entry point to conduct an entire RL training run, to be called in a launch script after instantiating all components: algo, agent, sampler.

All of the existing runners implement loops which collect minibatches of samples and provide them to the algorithm. The distinguishing features of the following classes are: a) online vs offline performance logging, b) single- vs multi-GPU training, and c) synchronous vs asynchronous operation of sampling and training. Most RL workflows should be able to use the desired class without modification.

Single-GPU Runners

class rlpyt.runners.minibatch_rl.MinibatchRlBase(algo, agent, sampler, n_steps, seed=None, affinity=None, log_interval_steps=100000.0)

Bases: rlpyt.runners.base.BaseRunner

Implements startup, logging, and agent checkpointing functionality, to be called in the train() method of the subclassed runner. Subclasses will modify/extend many of the methods here.

  • algo – The algorithm instance.
  • agent – The learning agent instance.
  • sampler – The sampler instance.
  • n_steps (int) – Total number of environment steps to run in training loop.
  • seed (int) – Random seed to use; if None, one is generated randomly.
  • affinity (dict) – Hardware component assignments for sampler and algorithm.
  • log_interval_steps (int) – Number of environment steps between logging to csv.

Sets hardware affinities, initializes the following: 1) sampler (which should initialize the agent), 2) agent device and data-parallel wrapper (if applicable), 3) algorithm, 4) logger.


Pre-defines any TrajInfo attributes that are needed from elsewhere, e.g. the algorithm's discount factor.


Determines the number of training-loop iterations to run. Converts the logging interval from units of environment steps to iterations.
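The conversion from environment steps to iterations can be sketched as follows. This is a minimal illustration, not the library's code: the function name, the `itr_batch_size` argument (environment steps collected per iteration), and the choice to round the iteration count up to a whole number of logging intervals are all assumptions made for the example.

```python
def get_n_itr(n_steps, itr_batch_size, log_interval_steps):
    """Return (n_itr, log_interval_itrs) for a run of about n_steps env steps.

    Hypothetical helper: assumes every iteration collects exactly
    itr_batch_size environment steps.
    """
    # Logging interval expressed in iterations, never less than 1.
    log_interval_itrs = max(int(log_interval_steps) // itr_batch_size, 1)
    n_itr = int(n_steps) // itr_batch_size
    # Assumption: round up so the last iteration is also a logging iteration.
    remainder = n_itr % log_interval_itrs
    if remainder > 0:
        n_itr += log_interval_itrs - remainder
    return n_itr, log_interval_itrs


n_itr, log_interval_itrs = get_n_itr(1_000_000, 256, 1e5)
```

With these inputs the logging interval is 390 iterations and the run is extended from 3906 to 4290 iterations, so that it ends on a logging boundary.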


Returns all state needed for full checkpoint/snapshot of training run, including agent parameters and optimizer parameters.


Calls the logger to save training checkpoint/snapshot (logger itself may or may not save, depending on mode selected).

store_diagnostics(itr, traj_infos, opt_info)

Store any diagnostic information from a training iteration that should be kept for the next logging iteration.

log_diagnostics(itr, traj_infos=None, eval_time=0, prefix='Diagnostics/')

Write diagnostics (including stored ones) to csv via the logger.


Writes trajectory info and optimizer info into csv via the logger. Resets stored optimizer info.

class rlpyt.runners.minibatch_rl.MinibatchRl(log_traj_window=100, **kwargs)

Bases: rlpyt.runners.minibatch_rl.MinibatchRlBase

Runs RL on minibatches; tracks performance online using learning trajectories.

__init__(log_traj_window=100, **kwargs)
Parameters:log_traj_window (int) – How many trajectories to hold in deque for computing performance statistics.
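A bounded deque gives this sliding-window behavior directly: once the window is full, appending a new trajectory discards the oldest one. The sketch below uses only the standard library; the function name and the use of plain floats in place of trajectory-info objects are assumptions for illustration.

```python
from collections import deque


def windowed_mean(returns, log_traj_window):
    """Mean of the most recent log_traj_window returns (hypothetical helper)."""
    window = deque(maxlen=log_traj_window)  # old entries fall off automatically
    for ret in returns:
        window.append(ret)
    return sum(window) / len(window) if window else float("nan")


recent_mean = windowed_mean([1.0, 2.0, 3.0, 4.0, 5.0], log_traj_window=3)
```

Here only the last three returns (3.0, 4.0, 5.0) contribute to the statistic.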

Performs startup, then loops by alternating between sampler.obtain_samples() and algo.optimize_agent(), logging diagnostics at the specified interval.
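The alternation described above can be sketched with stand-in objects. `StubSampler`, `StubAlgo`, the return values, and the `train` function here are hypothetical and only mimic the shape of the loop; they are not the library's classes, and the real methods may take or return different things.

```python
class StubSampler:
    """Hypothetical sampler: returns a fake batch and one trajectory record."""

    def __init__(self, batch_size):
        self.batch_size = batch_size

    def obtain_samples(self, itr):
        samples = [itr] * self.batch_size
        traj_infos = [{"Return": float(itr)}]
        return samples, traj_infos


class StubAlgo:
    """Hypothetical algorithm: counts updates and reports a fake loss."""

    def __init__(self):
        self.updates = 0

    def optimize_agent(self, itr, samples):
        self.updates += 1
        return {"loss": 1.0 / (itr + 1)}


def train(sampler, algo, n_itr, log_interval_itrs):
    """Alternate sampling and optimization; log every log_interval_itrs."""
    logged = []
    for itr in range(n_itr):
        samples, traj_infos = sampler.obtain_samples(itr)
        opt_info = algo.optimize_agent(itr, samples)
        if (itr + 1) % log_interval_itrs == 0:
            logged.append((itr, traj_infos[-1]["Return"], opt_info["loss"]))
    return logged


algo = StubAlgo()
logged = train(StubSampler(batch_size=4), algo, n_itr=6, log_interval_itrs=3)
```

The sampler and algorithm never run at the same time, which is the sense in which this loop is synchronous.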

class rlpyt.runners.minibatch_rl.MinibatchRlEval(algo, agent, sampler, n_steps, seed=None, affinity=None, log_interval_steps=100000.0)

Bases: rlpyt.runners.minibatch_rl.MinibatchRlBase

Runs RL on minibatches; tracks performance offline using evaluation trajectories.


Performs startup, evaluates the initial agent, then loops by alternating between sampler.obtain_samples() and algo.optimize_agent(). Pauses to evaluate the agent at the specified log interval.


Records an offline evaluation of agent performance by calling sampler.evaluate_agent().

Multi-GPU Runners

class rlpyt.runners.sync_rl.SyncRlMixin

Mixin class to extend runner functionality to the multi-GPU case. Creates a full replica of the sampler-algorithm-agent stack in a separate Python process for each GPU. Initializes torch.distributed to support data-parallel training of the agent. The main communication among processes is the all-reduce of gradients during backpropagation, which is handled implicitly within PyTorch. There is one agent, with the same parameters copied in all processes. No data samples are communicated in the implemented runners.

On GPU, uses the NCCL backend to communicate directly among GPUs. Can also be used without GPU, as multi-CPU (MPI-like, but using the gloo backend).

The parallelism in the sampler is independent of the parallelism here: each process initializes its own sampler, and any sampler type can be used (serial, cpu-parallel, gpu-parallel).

The name “Sync” refers to the fact that the sampler and algorithm still operate synchronously within each process (i.e. they alternate, running one at a time).


Weak scaling is implemented for batch sizes. The batch size arguments given to the sampler and algorithm classes are used in each process, so the actual batch sizes are (world_size * batch_size). The world size is readily available from torch.distributed, so this behavior can be changed if desired.
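The arithmetic of weak scaling is simple enough to show directly. Both helper functions below are hypothetical and written for this example; the second shows how a fixed total batch size could be divided among processes instead, which is one way the default behavior might be changed.

```python
def effective_batch_size(per_process_batch_size, world_size):
    """Total batch size under weak scaling: each process uses the full argument."""
    return per_process_batch_size * world_size


def per_process_for_fixed_total(total_batch_size, world_size):
    """Per-process batch size needed to keep the total fixed (strong scaling)."""
    if total_batch_size % world_size != 0:
        raise ValueError("total_batch_size must be divisible by world_size")
    return total_batch_size // world_size


total = effective_batch_size(per_process_batch_size=256, world_size=4)
share = per_process_for_fixed_total(total_batch_size=256, world_size=4)
```

With four processes, a batch size argument of 256 yields 1024 samples per iteration overall, while holding the total at 256 would require passing 64 to each process.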


The affinities input is expected to be a list, with a separate affinity dict for each process. The number of processes is taken from the length of the affinities list.


As part of startup, forks a separate Python process for each additional GPU; the master process runs on the first GPU. Initializes torch.distributed so that the DistributedDataParallel wrapper can work, which also makes torch.distributed available for other communication.

class rlpyt.runners.sync_rl.SyncRl(log_traj_window=100, **kwargs)

Bases: rlpyt.runners.sync_rl.SyncRlMixin, rlpyt.runners.minibatch_rl.MinibatchRl

Multi-process RL with online agent performance tracking. Trajectory info is collected from all processes and is included in logging.

class rlpyt.runners.sync_rl.SyncRlEval(algo, agent, sampler, n_steps, seed=None, affinity=None, log_interval_steps=100000.0)

Bases: rlpyt.runners.sync_rl.SyncRlMixin, rlpyt.runners.minibatch_rl.MinibatchRlEval

Multi-process RL with offline agent performance evaluation. Only the master process runs agent evaluation.

Asynchronous Runners

class rlpyt.runners.async_rl.AsyncRlBase(algo, agent, sampler, n_steps, affinity, seed=None, log_interval_steps=100000.0)

Bases: rlpyt.runners.base.BaseRunner

Runs sampling and optimization asynchronously in separate Python processes. May be useful to achieve higher hardware utilization, e.g. CPUs fully busy simulating the environment while the GPU is fully busy training the agent (there is no reason to use this in a CPU-only setup). This setup is significantly more complicated than the synchronous (single- or multi-GPU) runners, requires use of the asynchronous sampler, and may require special methods in the algorithm.

Further parallelization within the sampler and within the optimizer is independent. The asynchronous sampler can be serial, cpu-parallel, gpu-parallel, or multi-gpu-parallel. The optimizer can be single- or multi-gpu.

The algorithm must initialize a replay buffer on OS shared memory. The asynchronous sampler will allocate minibatch buffers on OS shared memory, and yet another Python process is run to copy the completed minibatches over to the algorithm’s replay buffer. While that memory copy is underway, the sampler immediately begins gathering the next minibatch.

Care should be taken to balance the rate at which the algorithm runs against the rate of the sampler, as this can affect learning performance. In the existing implementations, the sampler runs at full speed, and the algorithm may be throttled so as not to exceed the specified relative rate. This is set by the algorithm’s replay_ratio, which becomes the upper bound on the number of training samples used relative to the number of samples generated. (In synchronous mode, the replay ratio is enforced more precisely by running a fixed batch size and number of updates per iteration.)
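One way to picture the throttle is as a budget check performed before each optimizer update. The function below is a hypothetical sketch of that idea, not the library's accounting: its name and arguments are assumptions, and the actual implementation may count samples differently.

```python
def may_run_update(updates_done, train_batch_size, samples_generated, replay_ratio):
    """Return True if one more update keeps trained/generated <= replay_ratio.

    Hypothetical helper: training samples consumed are counted as
    (number of updates) * (training batch size).
    """
    trained_after_update = (updates_done + 1) * train_batch_size
    return trained_after_update <= replay_ratio * samples_generated


# With 64 samples generated and replay_ratio=4, the budget is 256 trained samples,
# i.e. 8 updates of batch size 32.
allowed = may_run_update(7, 32, 64, replay_ratio=4)
blocked = may_run_update(8, 32, 64, replay_ratio=4)
```

When the check fails, the optimizer would wait for the sampler to produce more data before continuing, so the ratio acts as an upper bound and not an exact target.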

The master process runs the (first) training GPU and performs all logging.

Within the optimizer, one agent exists. If multi-GPU, the same parameter values are copied across all GPUs, and PyTorch’s DistributedDataParallel is used to all-reduce gradients (as in the synchronous multi-GPU runners). Within the sampler, one agent exists. If new agent parameters are available from the optimizer between sampler minibatches, then those values are copied into the sampler before gathering the next minibatch.


The affinity argument should be a structure with sampler and optimizer attributes holding the respective hardware allocations. Optimizer and sampler parallelization is determined from this.


Run the optimizer in a loop. Check whether enough new samples have been generated, and throttle down if necessary at each iteration. Log at an interval in the number of sampler iterations, not optimizer iterations.


Calls sampler.async_initialize() to get a double buffer for minibatches, followed by algo.async_initialize() to get a replay buffer on shared memory, then launches all workers (sampler, optimizer, memory copier).


Sets the hardware affinity, moves the agent’s model parameters onto the device, and initializes the data-parallel agent, if applicable. Computes optimizer throttling settings.


Builds several parallel communication mechanisms for controlling the workflow across processes.


If multi-GPU optimization, launches an optimizer worker for each GPU and initializes torch.distributed.

launch_memcpy(sample_buffers, replay_buffer)

Forks a Python process for each of the sampler double buffers. (Using two separate processes may be overkill; it may be possible to simplify to one and still get good performance.)

class rlpyt.runners.async_rl.AsyncRl(*args, log_traj_window=100, **kwargs)

Bases: rlpyt.runners.async_rl.AsyncRlBase

Asynchronous RL with online agent performance tracking.

class rlpyt.runners.async_rl.AsyncRlEval(algo, agent, sampler, n_steps, affinity, seed=None, log_interval_steps=100000.0)

Bases: rlpyt.runners.async_rl.AsyncRlBase

Asynchronous RL with offline agent performance evaluation.

Asynchronous Worker Processes

rlpyt.runners.async_rl.run_async_sampler(sampler, affinity, ctrl, traj_infos_queue, n_itr)

Target function for the process which will run the sampler, in the case of online performance logging. Toggles the sampler’s double-buffer for each iteration, waits for the memory copier to finish before writing into that buffer, and signals the memory copier when the sampler is done writing a minibatch.
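The double-buffer handshake can be illustrated with a small stand-alone sketch. It uses threads and semaphores from the standard library in place of separate processes and shared memory, and every name in it (`run_double_buffer`, `ready`, `free`, `STOP`) is hypothetical; the real control structure may differ.

```python
import threading


def run_double_buffer(n_itr):
    """Simulate a sampler writing into two buffers, each drained by one copier."""
    buffers = [None, None]
    # ready[j]: sampler -> copier, "buffer j holds a finished minibatch".
    ready = [threading.Semaphore(0), threading.Semaphore(0)]
    # free[j]: copier -> sampler, "buffer j has been copied and may be reused".
    free = [threading.Semaphore(1), threading.Semaphore(1)]
    seen = [[], []]  # what each copier copied, in order
    STOP = object()  # sentinel telling a copier to exit

    def copier(j):
        while True:
            ready[j].acquire()
            item = buffers[j]
            if item is STOP:
                return
            seen[j].append(item)  # stands in for the copy into the replay buffer
            free[j].release()

    def sampler():
        for itr in range(n_itr):
            j = itr % 2           # toggle between the two buffers
            free[j].acquire()     # wait until the copier is done with buffer j
            buffers[j] = itr      # stands in for writing a minibatch
            ready[j].release()    # signal the copier
        for j in range(2):        # shut both copiers down
            free[j].acquire()
            buffers[j] = STOP
            ready[j].release()

    threads = [threading.Thread(target=copier, args=(j,)) for j in range(2)]
    threads.append(threading.Thread(target=sampler))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


seen = run_double_buffer(6)
```

Because the sampler only waits on the buffer it is about to overwrite, it can start filling one buffer while the other is still being copied. Each copier in this sketch handles only every other minibatch.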

rlpyt.runners.async_rl.run_async_sampler_eval(sampler, affinity, ctrl, traj_infos_queue, n_itr, eval_itrs)

Target function running the sampler with offline performance evaluation.

rlpyt.runners.async_rl.memory_copier(sample_buffer, samples_to_buffer, replay_buffer, ctrl)

Target function for the process which will copy the sampler’s minibatch buffer into the algorithm’s main replay buffer.

  • sample_buffer – The (single) minibatch buffer from the sampler, on shared memory.
  • samples_to_buffer – A function/method from the algorithm to process samples from the minibatch buffer into the replay buffer (e.g. select which fields, compute some prioritization).
  • replay_buffer – Algorithm’s main replay buffer, on shared memory.
  • ctrl – Structure for communicating when the minibatch is ready to copy/done copying.


Although this function may use the algorithm’s samples_to_buffer() method, here it is running in a separate process, so it will not be aware of changes made to the algorithm in the optimizer process. Furthermore, it may not be able to store state across iterations: in the implemented setup, two separate memory copier processes are used, so each one only sees every other minibatch. (This could easily be changed to a single copier if desired, probably without performance loss.)