Creating and Launching Experiments
Some utilities are included for creating and launching experiments composed of multiple individual learning runs, e.g. for hyperparameter sweeps. To date, these include functions for launching locally on a machine; launching into the cloud may require different tooling. Many experiments can be queued on a given hardware resource, and they will be cycled through to run in sequence (e.g. on a desktop with 4 GPUs and each run getting exclusive use of 2 GPUs, two experiments run at a time).
Launching
-
rlpyt.utils.launching.exp_launcher.run_experiments(script, affinity_code, experiment_title, runs_per_setting, variants, log_dirs, common_args=None, runs_args=None, set_egl_device=False)

Call in a script to run a set of experiments locally on a machine. Uses the launch_experiment() function for each individual run, which is a call to the script file. The number of experiments to run at the same time is determined from the affinity_code, which expresses the hardware resources of the machine and how much resource each run gets (e.g. a 4-GPU machine with 2 GPUs per run). Experiments are queued and run in sequence, with the intention of avoiding hardware overlap. Inputs variants and log_dirs should be lists of the same length, containing each experiment configuration and where to save its log files (the log files have the same names, so they cannot exist in the same folder).

Hint: To monitor progress, view the num_launched.txt and experiments_tree.txt files in the experiment root directory, and check the length of each progress.csv file, e.g. wc -l experiment-directory/.../run_*/progress.csv.
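The queue-and-cycle behavior can be sketched in plain Python. This is a simplified stand-in for illustration, not the library's implementation; `run_queued` is a hypothetical helper that keeps at most `n_slots` subprocesses alive at once and starts the next queued run whenever a slot frees up.

```python
import subprocess
import sys
import time

def run_queued(commands, n_slots):
    """Run each argv list in `commands` with at most `n_slots` running
    concurrently; return the list of exit codes (illustrative sketch)."""
    pending = list(commands)
    running = []
    codes = []
    while pending or running:
        # Reap finished processes, freeing their slots.
        still_running = []
        for p in running:
            if p.poll() is None:
                still_running.append(p)
            else:
                codes.append(p.returncode)
        running = still_running
        # Fill any free slots from the queue.
        while pending and len(running) < n_slots:
            running.append(subprocess.Popen(pending.pop(0)))
        time.sleep(0.01)
    return codes
```

A real launcher would also redirect each run's output to its log directory and hand each run a distinct hardware slot; this sketch only shows the slot-limited queueing.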
-
rlpyt.utils.launching.exp_launcher.launch_experiment(script, run_slot, affinity_code, log_dir, variant, run_ID, args, python_executable=None, set_egl_device=False)

Launches one learning run using subprocess.Popen() to call the python script. Calls the script as: python {script} {slot_affinity_code} {log_dir} {run_ID} {*args}

If affinity_code["all_cpus"] is provided, then the call is prepended with taskset -c and the listed CPUs (this is the most reliable way to keep the run limited to these CPU cores). Also saves the variant file. Returns the process handle, which can be monitored.

Use set_egl_device=True to set an environment variable EGL_DEVICE_ID equal to the same value as the CUDA index for the algorithm. For example, this can be used with a DMControl environment modified to look for this environment variable when selecting a GPU for headless rendering.
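The command construction described above can be sketched as follows. `build_call` is a hypothetical helper written for illustration, following the documented call format; it is not the library's code.

```python
def build_call(script, slot_affinity_code, log_dir, run_ID, args,
               python_executable="python", all_cpus=None):
    """Build the argv list for one run, following the documented format:
    python {script} {slot_affinity_code} {log_dir} {run_ID} {*args}
    (illustrative sketch, not the library's implementation)."""
    call = [python_executable, script, slot_affinity_code, log_dir, str(run_ID)]
    call += [str(a) for a in args]
    if all_cpus is not None:
        # Pin the run to the listed CPU cores via taskset.
        cpu_list = ",".join(str(c) for c in all_cpus)
        call = ["taskset", "-c", cpu_list] + call
    return call
```

The resulting list would be passed directly to subprocess.Popen().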
Variants
Some simple tools are provided for creating hyperparameter value variants.
-
class rlpyt.utils.launching.variant.VariantLevel(keys, values, dir_names)

A namedtuple which describes a set of hyperparameter settings.

Input keys should be a list of tuples, where each tuple is the sequence of keys to navigate down the configuration dictionary to the value.

Input values should be a list of lists, where each element of the outer list is a complete set of values, and position in the inner list corresponds to the key at that position in the keys list, i.e. each combination must be written explicitly.

Input dir_names should have the same length as values, and include unique paths for logging results from each value combination.
-
rlpyt.utils.launching.variant.make_variants(*variant_levels)

Takes in any number of VariantLevel objects and crosses them in order. Returns the resulting lists of full variants and log directories. Every set of values in one level is paired with every set of values in the next level, e.g. if two combinations are specified in one level and three combinations in the next, then six total configurations will result. Use in the launch script to create the set of learning runs.
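The crossing behavior can be illustrated with a self-contained sketch that mimics (rather than imports) the documented API. The names match the documented signatures, but this implementation is an assumption written for illustration only.

```python
import copy
import os
from collections import namedtuple

# Mirrors the documented namedtuple fields (sketch, not the library's class).
VariantLevel = namedtuple("VariantLevel", ["keys", "values", "dir_names"])

def _set_nested(d, key_path, value):
    # Walk down the config dict along the key tuple, creating sub-dicts.
    for k in key_path[:-1]:
        d = d.setdefault(k, {})
    d[key_path[-1]] = value

def make_variants(*variant_levels):
    """Cross the levels in order; return (variants, log_dirs) lists."""
    variants, log_dirs = [{}], [""]
    for level in variant_levels:
        next_variants, next_log_dirs = [], []
        for variant, log_dir in zip(variants, log_dirs):
            for vals, dir_name in zip(level.values, level.dir_names):
                new = copy.deepcopy(variant)
                for key_path, value in zip(level.keys, vals):
                    _set_nested(new, key_path, value)
                next_variants.append(new)
                next_log_dirs.append(os.path.join(log_dir, dir_name))
        variants, log_dirs = next_variants, next_log_dirs
    return variants, log_dirs
```

Crossing a 2-combination level with a 3-combination level yields the six configurations mentioned above, each with a unique nested log path.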
-
rlpyt.utils.launching.variant._cross_variants(prev_variants, prev_log_dirs, variant_level)

For every previous variant, make all combinations with the new values.
-
rlpyt.utils.launching.variant.load_variant(log_dir)

Loads the variant.json file from the directory.
-
rlpyt.utils.launching.variant.save_variant(variant, log_dir)

Saves a variant.json file to the directory.
-
rlpyt.utils.launching.variant.update_config(default, variant)

Performs a deep update on all dict structures from variant, updating only individual fields. Any field in variant must be present in default, else raises KeyError (helps prevent mistakes). Operates recursively and returns a new dictionary.
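A minimal sketch of this deep-update behavior, written as an illustrative reimplementation rather than the library's code:

```python
import copy

def update_config(default, variant):
    """Return a new dict: `default` deep-copied, with leaf fields from
    `variant` written over it. A key in `variant` that is missing from
    `default` raises KeyError (sketch of the documented behavior)."""
    new = copy.deepcopy(default)
    for key, value in variant.items():
        if key not in new:
            raise KeyError(f"Variant key {key!r} not found in default config.")
        if isinstance(new[key], dict) and isinstance(value, dict):
            # Recurse into nested dicts so only individual fields change.
            new[key] = update_config(new[key], value)
        else:
            new[key] = value
    return new
```

Because a new dictionary is returned, the default configuration is left untouched and can be reused across variants.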
Affinity
The hardware affinity is used for several purposes: 1) the experiment launcher uses it to determine how many concurrent experiments to run, 2) runners use it to determine GPU device selection, 3) parallel samplers use it to determine the number of worker processes, and 4) multi-GPU and asynchronous runners use it to determine the number of parallel processes. The main intent of the implemented utilities is to take as input the total amount of hardware resources in the computer (CPU & GPU) and the amount of resources to be dedicated to each job, and then to divide resources evenly.
Example
An 8-GPU, 40-CPU machine would have 5 CPUs assigned to each GPU. With 1 GPU per run, 8 concurrent experiments would be set up, each sampler using its 5 CPUs. With 2 GPUs per run and the synchronous runner, 4 concurrent experiments would be set up.
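The even division in this example amounts to simple integer arithmetic, sketched here with a hypothetical helper (not a library function):

```python
def divide_resources(n_cpu_core, n_gpu, gpu_per_run):
    """Number of concurrent experiment slots, and CPU cores per slot,
    when machine resources are divided evenly across runs (sketch)."""
    n_slots = n_gpu // gpu_per_run
    cpu_per_run = n_cpu_core // n_slots
    return n_slots, cpu_per_run
```

For the 8-GPU, 40-CPU machine above, 1 GPU per run gives 8 slots with 5 CPUs each, and 2 GPUs per run gives 4 slots with 10 CPUs each.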
-
rlpyt.utils.launching.affinity.encode_affinity(n_cpu_core=1, n_gpu=0, contexts_per_gpu=1, gpu_per_run=1, cpu_per_run=1, cpu_per_worker=1, cpu_reserved=0, hyperthread_offset=None, n_socket=None, run_slot=None, async_sample=False, sample_gpu_per_run=0, optim_sample_share_gpu=False, alternating=False, set_affinity=True)

Encodes the hardware configuration into a string (with meanings defined in this file) which can be passed as a command line argument to call the training script. Use in the overall experiments setup script to specify computer and experiment resources into run_experiments(). We refer to an "experiment" as an individual learning run, i.e. one set of hyperparameters, which does not interact with other runs.

Parameters: - n_cpu_core (int) – Total number of physical cores to use on the machine (not virtual)
- n_gpu (int) – Total number of GPUs to use on the machine
- contexts_per_gpu (int) – How many experiments share each GPU
- gpu_per_run (int) – How many GPUs to use per experiment (for multi-GPU optimization)
- cpu_per_run (int) – If not using a GPU, how many cores per experiment
- cpu_per_worker (int) – CPU cores per sampler worker; 1 unless environment is multi-threaded
- cpu_reserved (int) – Number of CPUs to reserve per GPU, not available to the sampler
- hyperthread_offset (int) – Typically the number of physical cores, since they are labeled 0-x and hyperthreads (x+1)-2x; use 0 to disable hyperthreads, None to auto-detect
- n_socket (int) – Number of CPU sockets in the machine; tries to keep CPUs grouped on the same socket and to match socket-to-GPU affinity
- run_slot (int) – Which hardware slot to use; leave None in run_experiments(), but specify for an individual train script
- async_sample (bool) – True if using asynchronous sampling/optimization mode; a different affinity structure is needed
- sample_gpu_per_run (int) – In asynchronous mode only, number of action-server GPUs per experiment
- optim_sample_share_gpu (bool) – In asynchronous mode only, whether to use the same GPU(s) for both training and sampling
- alternating (bool) – True if using the alternating sampler (will make more worker assignments)
- set_affinity (bool) – False to disable the runner and sampler from setting CPU affinity via psutil; may be inappropriate on cloud machines
-
rlpyt.utils.launching.affinity.make_affinity(run_slot=0, **kwargs)

Input the same kwargs as encode_affinity(); returns the AttrDict form.
-
rlpyt.utils.launching.affinity.affinity_from_code(run_slot_affinity_code)

Use in an individual experiment script; pass the output to the Runner.