Distributions are used to select randomized actions during sampling and, for some algorithms, to compute likelihoods and related values for training. Typically, the distribution is owned by the agent. This page documents the implemented distributions and some of their methods; see the code for details.

class rlpyt.distributions.base.Distribution

Base distribution class. Not all subclasses will implement all methods.

sample(dist_info)

Generate random sample(s) from the distribution information.

kl(old_dist_info, new_dist_info)

Compute the KL divergence of two distributions at each datum; should maintain leading dimensions (e.g. [T,B]).
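As an illustration of "one KL value per datum", here is a minimal pure-Python sketch for categorical probabilities. It is not rlpyt's torch implementation: the helper name categorical_kl, the eps stabilizer, and the nested-list layout standing in for a [T, B, A] tensor are all assumptions made for this example.

```python
import math

def categorical_kl(old_prob, new_prob, eps=1e-8):
    """KL(old || new) for one pair of probability vectors (hypothetical helper)."""
    return sum(p * (math.log(p + eps) - math.log(q + eps))
               for p, q in zip(old_prob, new_prob))

# Hypothetical batch with leading dimensions [T=2, B=1] and 2 actions.
old = [[[0.5, 0.5]], [[0.9, 0.1]]]
new = [[[0.5, 0.5]], [[0.5, 0.5]]]

# One KL value per datum, so the result keeps the [T, B] layout.
kl = [[categorical_kl(p, q) for p, q in zip(old_t, new_t)]
      for old_t, new_t in zip(old, new)]
```

The action dimension is summed out, while the two leading dimensions survive unchanged.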

mean_kl(old_dist_info, new_dist_info, valid)

Compute the mean KL divergence over a data batch, possibly ignoring data marked as invalid.
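The idea of a mean that skips invalid data can be sketched as a masked average. The function name masked_mean and the flat-list inputs are assumptions for illustration only; the library operates on tensors.

```python
def masked_mean(values, valid=None):
    """Mean of values, counting only entries whose valid flag is nonzero.

    Hypothetical helper: valid is a list of 0/1 flags the same length as
    values, or None to average over everything.
    """
    if valid is None:
        return sum(values) / len(values)
    total = sum(v * m for v, m in zip(values, valid))
    return total / sum(valid)

# The third KL value is marked invalid and does not affect the mean.
mean_kl = masked_mean([1.0, 3.0, 100.0], valid=[1, 1, 0])
```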

log_likelihood(x, dist_info)

Compute the log-likelihood of samples x under the distributions described in dist_info (x and dist_info can share the same leading dimensions, e.g. [T, B]).

likelihood_ratio(x, old_dist_info, new_dist_info)

Compute likelihood ratio of samples x at new distributions over old distributions (usually new_dist_info is variable for differentiation); should maintain leading dimensions.
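A likelihood ratio can be written as the exponential of a difference of log-likelihoods. The sketch below shows this for a single categorical sample; the function name and the plain-list inputs are assumptions, and a real implementation would keep new_prob differentiable.

```python
import math

def categorical_likelihood_ratio(x, old_prob, new_prob):
    """Ratio p_new(x) / p_old(x) for one sampled action index x (hypothetical helper)."""
    log_p_old = math.log(old_prob[x])
    log_p_new = math.log(new_prob[x])
    return math.exp(log_p_new - log_p_old)

# Action 1 became twice as likely under the new distribution.
ratio = categorical_likelihood_ratio(1, old_prob=[0.7, 0.3], new_prob=[0.4, 0.6])
```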

entropy(dist_info)

Compute entropy of distributions contained in dist_info; should maintain any leading dimensions.

perplexity(dist_info)

Exponential of the entropy, maybe useful for logging.
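To make the relation between entropy and perplexity concrete, here is a small sketch for a categorical distribution. The helper names are hypothetical and the code works on a single probability vector rather than a batch.

```python
import math

def categorical_entropy(prob):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in prob if p > 0)

def categorical_perplexity(prob):
    """Exponential of the entropy."""
    return math.exp(categorical_entropy(prob))

# A uniform distribution over 4 actions has perplexity 4, which is why
# perplexity reads as an "effective number of choices" in logs.
perp = categorical_perplexity([0.25, 0.25, 0.25, 0.25])
```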

mean_entropy(dist_info, valid=None)

Mean entropy over a data batch. Can be overridden in case a more sophisticated mean is needed (e.g. internally ignoring select parts of the action space).

mean_perplexity(dist_info, valid=None)

Mean over a data batch of the perplexity (the exponential of the entropy), maybe useful for logging.

class rlpyt.distributions.discrete.DiscreteMixin(dim, dtype=torch.long, onehot_dtype=torch.float)

Conversions to and from one-hot.

to_onehot(indexes, dtype=None)

Convert from integer indexes to one-hot, preserving leading dimensions.

from_onehot(onehot, dtype=None)

Convert from one-hot to integer indexes, preserving leading dimensions.
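The two conversions are inverses of each other. The following sketch uses nested lists to stand in for tensors with leading dimensions [T, B]; the helper names and the explicit dim argument are assumptions for this example.

```python
def index_to_onehot(index, dim):
    """One-hot vector of length dim with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(dim)]

def onehot_to_index(onehot):
    """Position of the largest entry, i.e. the encoded integer index."""
    return max(range(len(onehot)), key=lambda i: onehot[i])

# Integer actions with leading dimensions [T=2, B=2] and 3 possible actions.
indexes = [[0, 2], [1, 1]]

# Converting adds a trailing dimension of size 3; [T, B] is preserved.
onehot = [[index_to_onehot(i, 3) for i in row] for row in indexes]

# Converting back removes the trailing dimension again.
recovered = [[onehot_to_index(v) for v in row] for row in onehot]
```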

class rlpyt.distributions.categorical.Categorical(dim, dtype=torch.long, onehot_dtype=torch.float)

Bases: rlpyt.distributions.discrete.DiscreteMixin, rlpyt.distributions.base.Distribution

Multinomial distribution over a discrete domain.


Sample from torch.multinomial over trailing dimension of dist_info.prob.
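Sampling over the trailing dimension means drawing one action index per probability vector. A standard-library sketch of that idea follows; it uses random.choices rather than torch.multinomial, and the function name is hypothetical.

```python
import random

def sample_categorical(prob, rng=random):
    """Draw one action index with probability proportional to prob."""
    return rng.choices(range(len(prob)), weights=prob, k=1)[0]

# Probabilities with leading dimension [B=2]; one index is drawn per row.
probs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
actions = [sample_categorical(p) for p in probs]
```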

class rlpyt.distributions.epsilon_greedy.EpsilonGreedy(epsilon=1, **kwargs)

Bases: rlpyt.distributions.discrete.DiscreteMixin, rlpyt.distributions.base.Distribution

For epsilon-greedy exploration from state-action Q-values.


Input can be shaped [T,B,Q] or [B,Q], and vector epsilon of length B will apply across the Batch dimension (same epsilon for all T).
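The effect of a vector epsilon can be sketched as follows: each batch entry picks its greedy action, then replaces it with a uniformly random action with its own probability epsilon. This is an illustrative pure-Python version for input shaped [B, Q]; the function name and list-based inputs are assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Select one action per batch entry.

    q_values: list of B rows of Q-values.
    epsilon: a single float, or a list of B floats (one per batch entry).
    """
    batch = len(q_values)
    eps = list(epsilon) if isinstance(epsilon, (list, tuple)) else [epsilon] * batch
    actions = []
    for q_row, e in zip(q_values, eps):
        greedy = max(range(len(q_row)), key=lambda i: q_row[i])
        if rng.random() < e:
            # Explore: uniformly random action.
            actions.append(rng.randrange(len(q_row)))
        else:
            # Exploit: action with the highest Q-value.
            actions.append(greedy)
    return actions

q = [[0.1, 0.9], [0.8, 0.2]]
greedy_actions = epsilon_greedy(q, epsilon=[0.0, 0.0])
```

For input shaped [T, B, Q], the same per-entry epsilon would be reused at every time step.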


Assign value for epsilon (can be vector).

class rlpyt.distributions.epsilon_greedy.CategoricalEpsilonGreedy(z=None, **kwargs)

Bases: rlpyt.distributions.epsilon_greedy.EpsilonGreedy

For epsilon-greedy exploration from distributional (categorical) representation of state-action Q-values.

sample(p, z=None)

Input p to be shaped [T,B,A,P] or [B,A,P], A: number of actions, P: number of atoms. Optional input z is domain of atom-values, shaped [P]. Vector epsilon of length B will apply across Batch dimension.
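One way to read the shapes above: for each action, the atom probabilities p are weighted by the atom values z to give an expected Q-value, and the greedy action is the one with the highest expectation. The sketch below shows that reduction for a single state, as an assumption about the intended computation; the function name is hypothetical.

```python
def expected_q(p, z):
    """Expected Q-value per action.

    p: [A][P] atom probabilities for one state.
    z: [P] atom values (the distributional domain).
    """
    return [sum(pi * zi for pi, zi in zip(p_a, z)) for p_a in p]

z = [-1.0, 0.0, 1.0]      # P = 3 atoms
p = [[0.1, 0.1, 0.8],     # action 0 puts most mass on the high atom
     [0.8, 0.1, 0.1]]     # action 1 puts most mass on the low atom

q = expected_q(p, z)
greedy = max(range(len(q)), key=lambda a: q[a])
```

Epsilon-greedy selection then proceeds on q as in the non-distributional case.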


Assign vector of bin locations, distributional domain.

class rlpyt.distributions.gaussian.Gaussian(dim, std=None, clip=None, noise_clip=None, min_std=None, max_std=None, squash=None)

Multivariate Gaussian with independent variables (diagonal covariance). Standard deviation can be provided, as a scalar or a value per dimension, or it will be drawn from the dist_info (possibly learnable), where it is expected to have one value per dimension. Noise clipping or sample clipping is optional during sampling, but is not accounted for in formulas (e.g. entropy). Clipping of the standard deviation is optional and is accounted for in formulas. Squashing of samples to squash * tanh(sample) is optional and is accounted for in the log_likelihood formula but not in the entropy.


Uses self.std unless that is None, then will get log_std from dist_info. Not implemented for squashing.

log_likelihood(x, dist_info)

Uses self.std unless that is None, then uses log_std from dist_info. When squashing: instead of numerically risky arctanh, assume param ‘x’ is pre-squash action, see sample_loglikelihood() below.
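The reason for passing the pre-squash action can be seen in a sketch of the formula: the diagonal-Gaussian log-density is evaluated at x, and the tanh change-of-variables term is then subtracted using tanh(x) directly, so no arctanh is needed. This is an illustrative single-sample version; the function name, the eps stabilizer, and the list inputs are assumptions.

```python
import math

def gaussian_log_likelihood(x, mean, log_std, squash=None, eps=1e-8):
    """Log-likelihood of a pre-squash sample x under a diagonal Gaussian.

    If squash is given, the result is the log-likelihood of the squashed
    action squash * tanh(x).
    """
    logp = -0.5 * len(x) * math.log(2 * math.pi)
    for xi, mi, lsi in zip(x, mean, log_std):
        z = (xi - mi) / math.exp(lsi)
        logp -= 0.5 * z * z + lsi
    if squash is not None:
        # Change of variables: subtract log |d(squash * tanh(x)) / dx|.
        for xi in x:
            logp -= math.log(squash * (1 - math.tanh(xi) ** 2) + eps)
    return logp

plain = gaussian_log_likelihood([1.0], [0.0], [0.0])
squashed = gaussian_log_likelihood([1.0], [0.0], [0.0], squash=1.0)
```

Squashing compresses the action range, so the density of the squashed action is higher than that of the raw sample.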


Special method for use with SAC algorithm, which returns a new sampled action and its log-likelihood for training use. Temporarily turns OFF squashing, so that log_likelihood can be computed on non-squashed sample, and then restores squashing and applies it to the sample before output.


Generate random samples using torch.normal, from dist_info.mean. Uses self.std unless it is None, then uses dist_info.log_std.
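The optional clipping and squashing steps can be sketched as a short pipeline around the basic draw. This version uses random.gauss instead of torch.normal, and the order of operations shown (noise clip, then sample clip, then squash) is an assumption for illustration; check the code for the actual behavior.

```python
import math
import random

def gaussian_sample(mean, log_std, noise_clip=None, clip=None, squash=None, rng=random):
    """Draw one sample from a diagonal Gaussian (hypothetical helper)."""
    sample = []
    for m, ls in zip(mean, log_std):
        noise = math.exp(ls) * rng.gauss(0.0, 1.0)
        if noise_clip is not None:
            # Limit how far the sample can stray from the mean.
            noise = max(-noise_clip, min(noise_clip, noise))
        s = m + noise
        if clip is not None:
            # Limit the absolute value of the sample itself.
            s = max(-clip, min(clip, s))
        if squash is not None:
            s = squash * math.tanh(s)
        sample.append(s)
    return sample

# With noise_clip=0.0 the noise is removed entirely and the mean is returned.
deterministic = gaussian_sample([0.5, -0.5], [0.0, 0.0], noise_clip=0.0)
bounded = gaussian_sample([5.0], [0.0], clip=1.0)
```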


Input value or None to turn OFF.


Input multiplicative factor for squash * tanh(sample) (usually will be 1), or None to turn OFF.


Input value or None to turn OFF.


Input value, which can be same shape as action space, or else broadcastable up to that shape, or None to turn OFF and use dist_info.log_std in other methods.