Fast Efficient Hyperparameter Tuning for Policy Gradients

Paper | Code

The performance of Reinforcement Learning (RL) algorithms has been shown to be critically dependent on the choice of hyperparameters. In fact, Mahmood et al. showed that while working with real-world robots, hyperparameter settings can have a much greater impact on performance than the choice of the RL algorithm.

While a lot of work has been done on automatically finding good hyperparameters in supervised learning, the applicability of these methods to RL algorithms is limited. The main problem is sample efficiency. Unlike in supervised learning, where the same dataset can typically be used to train multiple models with different hyperparameters, training in RL typically requires fresh interactions with the environment, leading to high sample costs. This is especially problematic when applying RL to real-world problems where interactions with the environment can be expensive or time consuming (for example, learning on real robots).

Surprisingly, grid search is still the most popular way to find good hyperparameters in RL, probably because it is easy to implement, and most current benchmark environments (for example, those from the OpenAI Gym) run fast enough that sample efficiency is a secondary concern. Bayesian Optimisation (BO) has been used in situations where training runs take a long time. However, while BO is more sample efficient than grid search, it still requires multiple training runs, implying prohibitively high sample costs for most real-world settings.

In this work, our goal is to develop a method to automatically find good hyperparameters that requires no more samples than what the underlying RL algorithm would have collected anyway for one training run. To achieve this, the method must also be robust to its own hyperparameters so that we needn’t tune these for every new problem. Simplicity, ease of implementation, and low computational cost are other desirable characteristics for a practical alternative to grid search, even in problems where sample efficiency is not paramount.

With the above criteria in mind, we introduce Hyperparameter Optimisation On the Fly, which can be applied to automatically tune the hyperparameters of policy gradient methods.

Hyperparameter Optimisation On the Fly (HOOF)

The main idea behind HOOF is to automatically adapt the hyperparameters during each iteration of the underlying policy gradient (PG) method by greedily maximising the value of the updated policy.

Concretely, at the n^{th} iteration of the PG algorithm, we collect a batch of samples with the policy \pi_n. Next, we generate the candidate hyperparameters \psi_1, \psi_2, ..., \psi_K, and compute the corresponding candidate policies \pi_{n+1}^{\psi_k} for each of them. This does not sacrifice sample efficiency since the batch of samples collected depends only on \pi_n and not on the hyperparameters under consideration (for example, the learning rate, or GAE(\gamma,\lambda)). Finally, we set \pi_{n+1} as the candidate policy that has the maximum value, i.e., \pi_{n+1} = \pi_{n+1}^{\psi_{k^*}}, where k^* = \text{argmax}_{k} J(\pi_{n+1}^{\psi_k}).

Now we have a problem: solving for \pi_{n+1} requires an estimate of the value of each candidate policy. Naively, this would require collecting a fresh batch of samples with each candidate policy, which would make HOOF just as sample inefficient as random search. We address this by using weighted importance sampling (WIS) to construct off-policy estimates of the value of each candidate policy and then using those to solve the optimisation problem above. WIS uses only the batch of samples collected with \pi_n to estimate the value of the candidate policies. This makes HOOF extremely sample efficient, requiring only one training run's worth of samples to learn the optimal policy.

HOOF with random search on hyperparameters

WIS estimates are known to have high variance and can become quickly unreliable as the candidate policies diverge from the behaviour policy \pi_n. A key insight behind HOOF is that while the WIS estimates tend to be high variance, the relative ordering of candidates based on these WIS estimates is far more stable. And to solve our optimisation problem, we only need the relative ordering of the candidate policies.
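To make the selection step concrete, here is a minimal sketch in plain Python. It is not the HOOF implementation: the function names, the toy returns and the candidate learning rates are illustrative assumptions, and it presumes that the log-probability of each sampled trajectory's actions has already been computed under \pi_n and under every candidate policy.

```python
import math

def wis_value(returns, logp_behaviour, logp_candidate):
    """Weighted importance sampling estimate of a candidate policy's value.

    returns[i]        : return of trajectory i, collected with the behaviour policy
    logp_behaviour[i] : log-probability of trajectory i's actions under the behaviour policy
    logp_candidate[i] : log-probability of the same actions under the candidate policy
    """
    log_w = [c - b for c, b in zip(logp_candidate, logp_behaviour)]
    shift = max(log_w)  # subtracting the max leaves the normalised weights unchanged
    w = [math.exp(lw - shift) for lw in log_w]
    return sum(wi * r for wi, r in zip(w, returns)) / sum(w)

def hoof_select(returns, logp_behaviour, candidates):
    """Pick the hyperparameter whose candidate policy has the highest WIS estimate.

    candidates maps a hyperparameter value to the per-trajectory log-probabilities
    under the policy obtained with that value.
    """
    return max(candidates,
               key=lambda psi: wis_value(returns, logp_behaviour, candidates[psi]))

# Made-up numbers: three trajectories and three candidate learning rates.
returns = [1.0, 3.0, 2.0]
logp_behaviour = [-1.0, -1.0, -1.0]
candidates = {
    0.001: [-1.0, -1.0, -1.0],  # barely moves away from the behaviour policy
    0.01: [-1.5, -0.5, -1.0],   # shifts probability towards the best trajectory
    0.1: [-0.2, -3.0, -2.0],    # shifts probability towards the worst trajectory
}
best = hoof_select(returns, logp_behaviour, candidates)
```

Only the ranking of the three estimates matters for the choice of `best`, which is why noisy estimates can still lead to a sensible selection.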

HOOF satisfies our key requirement because it does not require more than one run to automatically tune the hyperparameters. It is also easy to implement and is far more computationally efficient than grid search or random search when tuning certain hyperparameters like the learning rate.

Experimental Results

We evaluated HOOF on four simulated continuous control tasks from MuJoCo OpenAI Gym: HalfCheetah, Hopper, Ant, and Walker. First, we start with A2C as the underlying policy gradient algorithm, and use HOOF to automatically learn the learning rate. We compare this to two baselines: the learning rate set to the OpenAI Baselines default, and using meta-gradients to learn the learning rate. The results below show that HOOF is competitive with both baselines, even though these would have taken multiple training runs to achieve their performance.



A2C Experiment Results

We also evaluated HOOF with TRPO, a popular trust region-based policy gradient method that has been shown to outperform Truncated Natural Policy Gradients (TNPG) in continuous control tasks. While this result has been attributed to TRPO’s stricter enforcement of the KL constraint, we show that such enforcement becomes unnecessary once we properly adapt TNPG’s KL constraint. To do so, we applied HOOF to learn the KL constraint and GAE(\gamma, \lambda) of TNPG (‘HOOF-TNPG’), and compared it against TRPO with the OpenAI Baselines default settings (‘Baseline TRPO’). The results presented below show that HOOF-TNPG learns much faster and outperforms Baseline TRPO in all environments except for Walker, where there is no significant difference.



TNPG Experiment Results


In this post, we presented HOOF, a highly sample efficient way of learning the hyperparameters of policy gradient methods. The simplicity of HOOF also makes it quite easy to implement, which makes it a viable alternative to grid search even in problems where sample efficiency is not the primary consideration. If you are interested, you can find more details in the paper or our implementation.


Blog post: Supratik Paul, Vitaly Kurin, Shimon Whiteson.

CAVIA: Fast Context Adaptation via Meta-Learning

Adapting to previously unseen tasks is a long-standing problem in machine learning. Ideally, we want to do this fast and with as little data as possible.

Consider the following example: You want to train an object classifier which can detect whether an image contains a meerkat or a cigar. However, you only have four training images per class (this is also called 2-way 4-shot classification):

Training a (deep) neural network from scratch on this dataset would not work at all: the model would overfit to the training data, and would not be able to generalise to an unseen image like the one on the right.

However, we might have access to a large collection of labelled images of different object categories:

We can use these to build 2-way 4-shot mini-datasets like the meerkat-cigar one, and learn how to learn quickly on such types of datasets.
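As a rough illustration of how such mini-datasets can be built, the sketch below samples one N-way K-shot task from a labelled collection. The function name, the `n_query` argument and the toy string "images" are assumptions made for this example and are not taken from the CAVIA code.

```python
import random

def sample_episode(images_by_class, n_way=2, k_shot=4, n_query=1, rng=random):
    """Build one N-way K-shot task from a labelled collection.

    images_by_class maps a class name to a list of examples.
    Returns (support, query), each a list of (example, label_index) pairs.
    """
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw distinct examples so support and query never overlap.
        examples = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Toy "dataset": strings stand in for images.
dataset = {name: [f"{name}_{i}" for i in range(10)]
           for name in ["meerkat", "cigar", "owl", "teapot"]}
support, query = sample_episode(dataset, rng=random.Random(0))
```

The support set plays the role of the eight training images in the meerkat-cigar example, and the query set plays the role of the unseen image.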

One particular approach to these types of problems is meta-learning. For a fantastic overview of meta-learning settings and different approaches we recommend this blog post by Lilian Weng. In our work, we build on a method which solves this problem by learning a network initialisation as follows.

Background: MAML

Model-Agnostic Meta-Learning (MAML) is a powerful gradient-based approach to the problem of fast adaptation. MAML tries to learn a parameter initialisation \theta such that adapting to new tasks can be done within several gradient updates. This approach is model and task agnostic: it can be used with any gradient-based algorithm, and can be applied to regression, classification, and reinforcement learning tasks. After meta-training, the model is evaluated on a new task: given a small set of labelled data points (in supervised learning) or trajectories (in reinforcement learning), the learned initial parameters are adapted using just a few gradient steps.

As such, MAML adapts the entire model when learning a new task. However, this is (a) often not necessary since many tasks and existing benchmarks do not require generalisation beyond task identification, and (b) can in fact be detrimental to performance, since it can lead to overfitting.

We propose an extension to MAML which addresses these points, and has the additional benefit of being interpretable and easier to parallelise. We call our algorithm Fast Context Adaptation via Meta-Learning (CAVIA), and show empirically that this results in equal or better performance compared to MAML on a range of tasks.


So, how does our CAVIA work? Let’s formalise the problem setting first. We describe the supervised learning setting here. However, it is easy to transfer it to the reinforcement learning setup (check our paper for more details).

We are given a distribution over training tasks p_{train}(\mathcal{T}) and test tasks p_{test}(\mathcal{T}). The goal of the supervised learning algorithm is to learn a model f: x \rightarrow \hat{y} mapping input features x to a label y.

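To understand CAVIA, it is easier to start with MAML. In the inner loop, MAML adapts all parameters \theta to a task \mathcal{T}_i with a gradient step on that task's training data: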

\theta_i = \theta - \alpha \nabla_{\theta}\frac{1}{M^i_{train}}\sum_{(x,y) \in \mathcal{D}^{train}_i}{\mathcal{L}_{\mathcal{T}_i}(f_{\theta}(x), y)},

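where M^i_{train} is the size of the training set \mathcal{D}^{train}_i and \alpha is the inner loop learning rate. In the outer loop, MAML then updates the initialisation \theta using the test loss of the adapted parameters \theta_i, averaged over a batch \mathbf{T} of N tasks: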

\theta = \theta - \beta \nabla_{\theta}\frac{1}{N}\sum_{\mathcal{T}_i \in \mathbf{T}}{\frac{1}{M^i_{test}}\sum_{(x,y) \in \mathcal{D}^{test}_i}{\mathcal{L}_{\mathcal{T}_i}(f_{\theta_i}(x), y)}},

where \beta is the outer loop learning rate. As we can see, in both cases, we update \theta, all the parameters of the network.

CAVIA does a similar update. However, we split all the network parameters into two disjoint subsets: global parameters \theta and context parameters \phi.

Like MAML, CAVIA consists of an inner and an outer loop update, with the difference that we update only the context parameters \phi in the inner loop, and only the shared network parameters \theta in the outer loop.

In the inner update loop, we update context parameters \phi.

\phi_i = \phi_0 - \alpha \nabla_{\phi}\frac{1}{M^i_{train}}\sum_{(x,y) \in \mathcal{D}^{train}_i}{\mathcal{L}_{\mathcal{T}_i}(f_{\phi_0, \theta}(x), y)}

In the outer update loop, we update the global parameters \theta.

\theta = \theta - \beta \nabla_{\theta}\frac{1}{N}\sum_{\mathcal{T}_i \in \mathbf{T}}{\frac{1}{M^i_{test}}\sum_{(x,y) \in \mathcal{D}^{test}_i}{\mathcal{L}_{\mathcal{T}_i}(f_{\phi_i, \theta}(x), y)}}

Keeping a separate set of context parameters has two advantages. First, we can vary its size based on the task at hand, incorporating prior knowledge about the task into the network structure. Second, it is much easier to parallelise than MAML.
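The two updates can be sketched on a deliberately tiny problem. The code below is an illustration under strong assumptions rather than the CAVIA implementation: the model is f(x) = \theta x + \phi with a single shared parameter and a single context parameter, the loss is the mean squared error, \phi_0 is fixed to zero, gradients are written out by hand instead of using automatic differentiation, and the tasks (lines with a common slope and task-specific offsets) are made up.

```python
def mean(values):
    return sum(values) / len(values)

def inner_update(theta, phi0, xs, ys, alpha):
    """One inner-loop step on the context parameter phi only (theta is untouched)."""
    grad_phi = 2.0 * mean([theta * x + phi0 - y for x, y in zip(xs, ys)])
    return phi0 - alpha * grad_phi

def outer_gradient(theta, phi0, task, alpha):
    """Gradient of the post-adaptation test loss w.r.t. the shared parameter theta."""
    (x_tr, y_tr), (x_te, y_te) = task
    phi_i = inner_update(theta, phi0, x_tr, y_tr, alpha)
    # phi_i depends on theta through the inner step, so we differentiate through it.
    dphi_dtheta = -2.0 * alpha * mean(x_tr)
    return 2.0 * mean([(theta * x + phi_i - y) * (x + dphi_dtheta)
                       for x, y in zip(x_te, y_te)])

def make_task(offset, slope=3.0):
    """Hypothetical task: points on the line y = slope * x + offset."""
    x_tr, x_te = [-1.0, 0.0, 1.0, 2.0], [-2.0, 0.5, 1.5, 3.0]
    return ((x_tr, [slope * x + offset for x in x_tr]),
            (x_te, [slope * x + offset for x in x_te]))

alpha, beta, phi0 = 0.5, 0.1, 0.0
tasks = [make_task(b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
theta = 0.0
for _ in range(50):  # outer loop: update only the shared parameter
    theta -= beta * mean([outer_gradient(theta, phi0, t, alpha) for t in tasks])

# Adapt to an unseen task with a single inner step on its training split.
(new_x, new_y), _ = make_task(5.0)
phi_new = inner_update(theta, phi0, new_x, new_y, alpha)
```

In this toy setting the shared parameter ends up encoding what all tasks have in common (the slope), while the single context parameter identifies the task (its offset).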

Experimental Results

We evaluated CAVIA on a range of popular meta-learning benchmarks for regression, classification and reinforcement learning tasks. One of the motivations of CAVIA is that many tasks do not require generalisation beyond task identification – and this is also true for many current benchmarks.

To illustrate this, the below figure shows the number of parameters we update on the benchmarks we tested, for MAML versus CAVIA (note the log-scale on the y-axis):

This figure shows that the amount of adaptation on these benchmarks is relatively small. In the following, we look at those benchmarks in more detail.


Fitting sine curves

Let us start with a regression task, in which we want to fit sine curves, as done in the Model-Agnostic Meta-Learning (MAML) paper. Amplitude and phase fully specify a task. We sample the amplitude from the range [0.1,0.5] and the phase from [0,\pi].

The figures below show the curve fitting before and after the gradient update. While MAML and CAVIA both succeed in the task, we would like to point out that CAVIA adjusts just 2 context parameters, whereas MAML adjusts approximately 1500 weights, which makes it prone to overfitting.

Before the update
After the update

For this example, we can easily visualise what the context parameters learn. Below is a visualisation for the case of only two context parameters:

The x-axis shows the resulting value of context parameter 1 after the update, and the y-axis shows the resulting value of context parameter 2 after the update (each dot is a single task and its position reflects the value of the context parameters). The colour shows the true task variable (amplitude on the left, and phase on the right). As we can see, CAVIA learns an embedding which can be smoothly interpolated. The circular shape is probably due to the phase being periodic.

Image Completion

Next, we decided to test CAVIA on a more challenging task: CelebA image completion, which was suggested by Marta Garnelo et al. (2018). The table below shows CAVIA's superiority in terms of the pixel-wise MSE.

Random Pixels | Ordered Pixels

As the next figure shows, CAVIA is able to learn to restore a picture of a face from only ten pixels. In this particular experiment, we used 128 context parameters and five gradient steps for adaptation.

Few-Shot Classification

We also tested CAVIA for few-shot classification on the challenging Mini-Imagenet benchmark. This task requires large convolutional networks, which have the risk of overfitting when updated on only a small number of datapoints. The question for us was whether CAVIA can scale to large networks without overfitting. In our experiments, we used 100 context parameters for CAVIA, and increased the size of \theta by increasing the number of filters (numbers in brackets in the table). The table below shows that as the network size increases, the performance of MAML goes down, whereas the performance of CAVIA increases.

Method                      5-way accuracy
                            1-shot, %     5-shot, %
Matching Nets               46.6          60.0
Meta LSTM                   43.44±0.77    60.60±0.71
Prototypical Networks       46.61±0.78    65.77±0.70
MAML (32)                   48.07±1.75    63.15±0.91
MAML (64)                   44.70±1.69    61.87±0.93
CAVIA (32)                  47.24±0.65    59.05±0.54
CAVIA (128)                 49.84±0.68    64.63±0.54
CAVIA (512)                 51.82±0.65    65.85±0.55
CAVIA (512, first order)    49.92±0.68    63.59±0.57

Reinforcement Learning

Our final experiment used MuJoCo, a high-dimensional reinforcement learning benchmark. In the first task, the reward incentivises HalfCheetah to run in a particular direction. In the second, it encourages the agent to run at a particular speed. All of the methods were trained for up to 500 meta-iterations. CAVIA used 50 context parameters. As we can see in the figures below, in both tasks CAVIA outperformed MAML after one gradient step. MAML catches up on the second task, achieving similar performance after three gradient updates.


This shows that CAVIA can match MAML's performance while adapting only a context parameter vector of size 50 at test time.

We also wanted to have a look at the learned policy, which is shown in the video below for the Forward/Backward task. To get a feeling for how good the task embedding is, we learned a binary classifier using logistic regression which predicts the task (forward/backward) from the context parameters (we learn this after meta-training). As the video shows, this predicts roughly a 50/50 chance before doing any updates, and predicts the correct task (backwards) just from the context parameters after three updates.

Below is another video, this time for the forward task. Notice that in this rollout, even though the task is inferred with high certainty after only one update, the forward policy is not as elegant as the backward one.


CAVIA highlights that

  • Many interesting few-shot learning tasks require generalisation in the form of task identification, and we can use this to develop more robust algorithms.
  • Current benchmarks often require only this type of generalisation. We believe that more benchmarks are necessary to push beyond this (such as the Meta-Dataset).

We believe that CAVIA opens up the possibilities for exciting new research directions, including

  • More interpretable algorithms that allow an analysis of what the context parameters learn.
  • The reuse of the context parameters for downstream tasks, auxiliary tasks, or distributed machine learning systems.

Finally, we believe that for tasks that require adaptation beyond task identification, methods that do adapt more than just context parameters are necessary. A combination of CAVIA-like methods for task identification, and MAML-like adaptation on all parameters is a promising future direction.


In this blog post we covered CAVIA, a meta-learning method for fast adaptation to previously unseen tasks. We didn’t go into a lot of details and implementation peculiarities (e.g. how to condition the network on the parameters, how to initialise context parameters etc) and connections to related work in meta-learning. If you want to know more, check our paper, play with the code or write us an email.

TACO: Learning Task Decomposition via Temporal Alignment for Control

Humans can do magnificently complex stuff. However, they often can’t formalise or even explain how they do it. Learning from Demonstration (LfD) aims to mimic human behaviour given demonstrations of the behaviour, e.g. state-action pairs.

Composite and complex tasks.

It is obvious from our everyday experience that longer, more complex tasks, such as learning to play Moonlight Sonata on the piano, are achieved by composing together a series of simpler skills, such as playing a single note. Attempting to learn a complex piano piece without any prior experience is much more likely to end in failure and frustration than learning fundamental skills first.

Now picture the above piano example in an LfD setting. If the demonstrations included the motor actions for the entire Moonlight Sonata, it is very unlikely that learning a single flat policy via LfD would be successful. A more sensible strategy would be to break down the demonstrations into basic, reusable and easy-to-learn sub-tasks, and compose them in order to play the full piece.

Modular LfD.

This leaves us with the question: how do we break up a demonstration into these handy sub-tasks? The answer lies in the field of modular LfD. In an ideal world, every time-step in the demonstration would be labeled as belonging to a certain sub-task. Given this annotation, we could simply cut up the demonstration into separate datasets and use our favourite LfD method (e.g., behavioural cloning), to learn the individual policies, along with a high-level controller that switches from one policy to the other.
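In this idealised, fully labelled setting, cutting up a demonstration is a simple grouping operation. The sketch below shows one way to do it; the function name and the toy state, action and label values are hypothetical.

```python
from collections import defaultdict

def split_by_subtask(demonstration, labels):
    """Group (state, action) pairs by the sub-task label of their time-step."""
    if len(demonstration) != len(labels):
        raise ValueError("need exactly one label per time-step")
    datasets = defaultdict(list)
    for pair, label in zip(demonstration, labels):
        datasets[label].append(pair)
    return dict(datasets)

# Placeholder states and actions; one label per time-step.
demo = [("s1", "a1"), ("s2", "a2"), ("s3", "a3"), ("s4", "a4")]
labels = ["press_4", "press_4", "press_2", "press_2"]
per_subtask = split_by_subtask(demo, labels)
```

Each entry of `per_subtask` could then be handed to an LfD method such as behavioural cloning to learn the corresponding sub-policy.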

However, hand labelling a demonstration of potentially thousands of time-steps is incredibly tedious. Another option is to employ unsupervised learning and model sub-policies as latent variables that can be used to either split the data or condition the learned policy. Unsupervised methods require no additional data, though they are not guaranteed to find meaningful latent variables and can result in switching policies that are unreliable [1].


In our paper, we consider a setting that lies in between fully supervised and unsupervised learning. We assume that, along with a demonstration, we are provided with a sketch that describes it at a high level. For example, if the whole task is a piano piece, then the task sketch contains the sequence of notes played during the piece. More concretely, we assume that each demonstration \rho=((s_1,a_1),(s_2,a_2),...,(s_{T},a_{T})) of length T is a sequence of states s and actions a. This is accompanied by a sketch \tau = (b_1,b_2,\ldots, b_{L}) of a much shorter length L, consisting of sub-task symbols b that simply tell us which sub-policies are active in a demonstration and the order in which they occur.

This setting leaves us with a potentially big problem, however. Since the sketch is of much shorter length, we are lacking information about the alignment of \rho and \tau, i.e., we are not told how long each element in \tau is active within \rho.

Luckily, this is a well-known problem in the field of speech recognition, where an utterance can span several time-steps in an audio sample, but the label to be recognised (the actual words uttered) is much shorter. A state-of-the-art method to address this problem is Connectionist Temporal Classification (CTC) [2]. Applied to our setting, CTC would minimise the negative log likelihood of the sketch under the observed demonstration.

(1)   \begin{equation*} \mathcal{L}_{CTC} = -\mathbb{E}_{(\rho,\tau)}[\log(p(\tau|\rho))] \end{equation*}
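The likelihood p(\tau|\rho) sums over every way of aligning the short sketch with the long demonstration, which can be done with a forward recursion. The sketch below is a simplified variant written for illustration, not CTC as published: it has no blank symbol, it requires every sketch element to cover at least one time-step, and it assumes that some model has already produced per-time-step probabilities `symbol_probs` for each sub-task symbol. A practical version would also work in log space to avoid underflow on long demonstrations.

```python
import math

def sketch_log_likelihood(symbol_probs, sketch):
    """log p(sketch | demonstration), summed over all monotone alignments.

    symbol_probs[t][b] is the model's probability that sub-task b is active at step t.
    Every sketch element must cover at least one time-step, in order.
    """
    T, L = len(symbol_probs), len(sketch)
    if L == 0 or L > T:
        raise ValueError("sketch must be non-empty and no longer than the demonstration")
    # alpha[l]: total probability of partial alignments that sit on sketch[l] at the current step
    alpha = [0.0] * L
    alpha[0] = symbol_probs[0][sketch[0]]
    for t in range(1, T):
        prev = alpha
        alpha = [0.0] * L
        for l in range(L):
            stay = prev[l]                            # remain in the same sub-task
            advance = prev[l - 1] if l > 0 else 0.0   # move on from the previous sub-task
            alpha[l] = symbol_probs[t][sketch[l]] * (stay + advance)
    return math.log(alpha[-1])  # alignments must end on the last sketch element

# Three time-steps, two sub-task symbols (0 and 1), sketch "0 then 1".
probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
log_lik = sketch_log_likelihood(probs, [0, 1])
```

Here two alignments are possible, (0, 0, 1) and (0, 1, 1), and the recursion adds their probabilities without enumerating them explicitly.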

It seems like all our problems have been solved! Since we have a method to perform the alignment between \tau and \rho we can use it to label each part of the demonstration with its respective subtask.  We can then train one policy (\pi_{\theta*_b}) per subtask using BC, i.e., by minimising:

(2)   \begin{eqnarray*} \mathcal{L}_{BC}= -\mathbb{E}_{\rho}[\sum_{t=1}^{T}\log \pi_\theta(a_t|s_t)].  \end{eqnarray*}

i.e., the negative log-likelihood of actions given states in the segmented demonstration. From now on we refer to this approach as CTC-BC (i.e., CTC followed by BC).
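For a discrete action space, the behavioural cloning loss of one segment can be computed as in the sketch below. The policy interface (a function returning a dictionary of action probabilities) and the toy policy are assumptions made for this example.

```python
import math

def bc_loss(policy, segment):
    """Negative log-likelihood of the demonstrated actions under a policy.

    policy(state) returns a dict mapping each discrete action to its probability.
    segment is a list of (state, action) pairs assigned to one sub-task.
    """
    return -sum(math.log(policy(state)[action]) for state, action in segment)

def toy_policy(state):
    # Hypothetical fixed policy: prefers "left" in state 0 and "right" otherwise.
    return {"left": 0.8, "right": 0.2} if state == 0 else {"left": 0.3, "right": 0.7}

segment = [(0, "left"), (1, "right"), (0, "right")]
loss = bc_loss(toy_policy, segment)
```

The last pair is unlikely under the toy policy and therefore dominates the loss, which hints at how wrongly assigned time-steps can distort the sub-policy trained on a segment.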

CTC, however, is a method for recognition. This has important implications that render CTC-BC inappropriate for this application.

  1. CTC can result in highly inaccurate alignments since these are only a byproduct of the recognition process.
  2. CTC and BC optimise equations (1) and (2) independently, i.e., CTC does not know that we will be using the resulting alignment for another optimisation procedure. If this alignment is even slightly wrong, then BC would be optimising each sub-policy with the wrong data!

It is clear then that instead of optimising equations (1) and (2) separately, we should be optimising these jointly. This is exactly what TACO [3] does:

(3)   \begin{equation*} \mathcal{L}_{TACO} = - \log(p(\tau, \mathbf{a}_{\rho} | \mathbf{s}_{\rho})) \end{equation*}

At the heart of TACO is a dynamic programming procedure that combines ideas from CTC and policy sketches [4]. TACO aligns the two sequences and learns one policy per sub-task in the demonstration as well as a high-level controller that switches between policies. For more details as to how exactly this is done, see the paper or the TensorFlow implementation (PyTorch coming soon!).


To investigate TACO’s properties and performance we turn to the Dial domain, which is simpler but similar to the piano example used at the beginning of the post.

The Dial domain.

We consider a JACO arm situated next to a dial pad. The states in the demonstrations describe the robot's position with respect to the different numbers, and the actions are the torques applied to each of the joints. Each demonstration includes the motor actions required to press down a certain combination of keys (e.g., 42, 1492, or <your credit card pin>) and can be as long as 400 time-steps. The combination of keys pressed in each demonstration is (you guessed it) the task sketch. The aim is then to learn one policy per possible keypress, and a high-level controller that knows when each keypress is done and moves on to the next one.

At test time, we provide our policies with a new sketch of an unseen sequence of numbers. The task is considered successful if all the keys in the sketch are pressed in the right sequence. Again this is done by composing simple policies, one for each key.

So how does TACO do? Let’s take a look at some results:

On the y-axis we have the percentage of tasks completed for each method at test time. On the x-axis is the number of demonstrations required to achieve that performance. GT-BC in the graph stands for ‘ground-truth Behavioural Cloning’, i.e., the performance we would get by manually aligning the demonstration and sketch sequences. The three messages to take away from this plot are:

  1. TACO reaches the performance of GT-BC as more demonstrations are provided, without the need for tedious manual labelling.
  2. CTC-BC completely fails to complete any tasks even when provided with many demonstrations. This is because mistakes in the alignment found by CTC result in wrong policies for each key.
  3. The test tasks themselves have not been seen at training time. This means that TACO is capable of performing 0-shot imitation.

Here are some videos of the learned policies during this experiment:

Another useful property of the policies trained using TACO is that we can execute much longer and more complex tasks, in this case a longer number sequence. To see how far we can push this idea, we sampled test tasks of length 3 to 20 and saw how many times we were able to fully execute them.

Accuracy against test task length for TACO and GT-BC.


Again on the y-axis here is the task accuracy, while on the x-axis is the test task length. Clearly the chances of success decrease with increasing task length; however, surprisingly, the performance of TACO-trained policies degrades more slowly than those trained using GT-BC! This suggests that the TACO training procedure results in less overfitted policies than those trained with GT-BC.


TACO is a general, domain-agnostic and reliable method that allows learning from demonstration to be scaled to longer and more complex tasks. It also exhibits several interesting properties, such as the ability for 0-shot imitation of longer tasks than the ones it was trained on. For more maths, experiments and results, have a look at the paper or play with the implementation!

Blog post: Kyriacos Shiarlis, Vitaly Kurin, Shimon Whiteson.

We would like to thank Markus Wulfmeier for his comments on this post.


DiCE: The Infinitely Differentiable Monte Carlo Estimator

If you’ve stumbled upon this blog post, you’ve probably used policy gradient methods in Reinforcement Learning (RL). Or you might have maximised the likelihood in probabilistic models. In both cases, we need to estimate the gradient of the loss, which is an expectation over random variables.

The problem is that you cannot just differentiate the objective. Usually, you will apply the score function trick (aka log likelihood trick) here. We can view this trick as providing a differentiable function, whose gradient is an estimate of the gradient of the original objective. We can then apply any deep learning toolbox to do automatic differentiation. However, sometimes we need higher-order gradients, e.g., in meta-learning or multi-agent RL when we need to differentiate through other agents’ learning steps. This makes life much harder.

Infinitely Differentiable Monte Carlo Estimator (DiCE) [1] to the rescue! You can differentiate the magic \magic objective arbitrarily many times to get the correct higher-order gradients under the Stochastic Computation Graph (SCG) formalism [2]. This lets automatic differentiation software do the job instead of us manipulating the graph manually. We illustrate the benefits of our approach by applying “Learning with Opponent Learning Awareness” (LOLA) [3] to the iterated prisoner’s dilemma.


As we mention above, in the surrogate loss (SL) approach, we choose an objective, whose gradient equals the true gradient of the objective and use this function to do the optimisation.

Sadly, constructing a surrogate loss that uses the first-order gradient as an objective leads to wrong second-order gradient estimates. Simply put, applying SL twice and estimating the gradient is not the same as the second-order gradient of the true objective.

The wrong estimation happens because, in the SL approach, we treat part of the objective as a sampled cost. This causes the corresponding terms to lose a functional dependency on the sampling distribution.

We illustrate our reasoning graphically in the figure below using Stochastic Computation Graphs (SCGs) (Schulman et al. 2015) formalism.

Stochastic nodes are in orange, costs in grey, surrogate losses in blue, DiCE in purple, and gradient estimators in red.

We introduce the magic \magic operator, which allows us to compute the gradient to any order we like: \Expect[\nabla_{\theta}^n\calL_{\magic}] \rightarrowtail \nabla_{\theta}^{n}\calL, \forall n \in \{0, 1, 2, ...\}.

DiCE is easy to implement:

(1)   \begin{equation*} \magic(\calW) = \exp{(\tau - \perp(\tau))}, \tau=\sum_{w \in \calW}\log(p(w;\theta)), \end{equation*}

where \perp is an operator which sets the gradient of its operand to zero (detach() in PyTorch and stop_gradient() in TensorFlow).

Alternatively, we can rewrite DiCE in the following way:

(2)   \begin{equation*} \magic(\calW) = \frac{\prod_{w \in \calW}p(w;\theta)}{\prod_{w \in \calW} \perp p(w;\theta)}. \end{equation*}
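To see the operator at work without a deep learning framework, the sketch below swaps in a tiny hand-written differentiation scheme: a `Jet` class that tracks a value together with its first and second derivative with respect to one scalar \theta. Everything except the `magic_box` formula itself is an assumption made for illustration, including the Bernoulli example with success probability sigmoid(\theta) and the made-up costs. The expectation is computed exactly by enumerating both outcomes, so the comparison with the analytic derivatives is free of sampling noise.

```python
import math

class Jet:
    """A value together with its first and second derivative w.r.t. one scalar theta."""

    def __init__(self, v, d1=0.0, d2=0.0):
        self.v, self.d1, self.d2 = v, d1, d2

    @staticmethod
    def lift(x):
        return x if isinstance(x, Jet) else Jet(float(x))

    def __add__(self, other):
        o = Jet.lift(other)
        return Jet(self.v + o.v, self.d1 + o.d1, self.d2 + o.d2)

    __radd__ = __add__

    def __neg__(self):
        return Jet(-self.v, -self.d1, -self.d2)

    def __sub__(self, other):
        return self + (-Jet.lift(other))

    def __mul__(self, other):
        o = Jet.lift(other)
        return Jet(self.v * o.v,
                   self.d1 * o.v + self.v * o.d1,
                   self.d2 * o.v + 2.0 * self.d1 * o.d1 + self.v * o.d2)

    __rmul__ = __mul__

def exp(x):
    e = math.exp(x.v)
    return Jet(e, e * x.d1, e * (x.d2 + x.d1 ** 2))

def log(x):
    return Jet(math.log(x.v), x.d1 / x.v, x.d2 / x.v - (x.d1 / x.v) ** 2)

def stop_gradient(x):
    """The \\perp operator: keep the value, drop all derivatives."""
    return Jet(x.v)

def magic_box(tau):
    """Evaluates to 1, but its derivatives carry those of tau."""
    return exp(tau - stop_gradient(tau))

theta = Jet(0.3, 1.0)                # d theta / d theta = 1
log_p = {1: -log(1 + exp(-theta))}   # log sigmoid(theta)
log_p[0] = log_p[1] - theta          # log(1 - sigmoid(theta))
cost = {1: 1.0, 0: -2.0}             # made-up costs for the two outcomes

# Exact expectation over x in {0, 1}. The weights are plain floats because
# samples, once drawn, carry no functional dependence on theta.
dice, surrogate = Jet(0.0), Jet(0.0)
for x in (0, 1):
    weight = math.exp(log_p[x].v)
    dice = dice + weight * magic_box(log_p[x]) * cost[x]
    surrogate = surrogate + weight * log_p[x] * cost[x]

# Analytic derivatives of L(theta) = p * cost[1] + (1 - p) * cost[0].
p = math.exp(log_p[1].v)
true_grad = p * (1 - p) * (cost[1] - cost[0])
true_hess = p * (1 - p) * (1 - 2 * p) * (cost[1] - cost[0])
```

In this example `dice.d1` and `dice.d2` agree with `true_grad` and `true_hess`, whereas the plain score function surrogate matches only the first derivative.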

The figure below shows an example of DiCE applied to an RL problem:

DiCE applied to a reinforcement learning problem. A stochastic policy conditioned on s_t and \theta produces actions, a_t, which lead to rewards r_t and next states, s_{t+1}. Associated with each reward is a DiCE objective that takes as input the set of all causal dependencies that are functions of \theta, i.e, the actions. Arrows from \theta,a_i and r_i to gradient estimators omitted for clarity.

Variance Reduction

Variance reduction is an integral part of Monte Carlo estimation.
Though DiCE is not limited to the RL case, we are most interested in policy gradients that use the score function trick.

DiCE inherently reduces variance by taking causality into account. The cost node c is multiplied by the sum of the gradients of the log probabilities only for those nodes that influence c.

Now we propose another variance reduction mechanism by adding the following term to the DiCE objective:

(3)   \begin{align*} \calB_{\magic}^{(1)} &= \sum_{w \in \calS}{(1-\magic({w}))b_w},\nonumber \end{align*}

where b_w is any function of nodes not influenced by w. The baseline keeps the gradient estimation unbiased and does not influence the evaluation of the original objective \calL_{\magic}.

The flaw of \calB_{\magic}^{(1)} becomes apparent when we calculate second-order gradients. In short, some of the terms do not have control variates, which keeps the variance high.

To fix the problem, we can subtract the following term from the objective to reduce the second-order gradient variance:

(5)   \begin{align*} \calB_{\magic}^{(2)} &= \sum_{w \in \calS'}{\big(1-\magic({w})\big) \big(1-\magic({\calS_w})\big)b_w}, \nonumber \end{align*}

where \calS' is the set of stochastic nodes that depend on \theta and at least one other stochastic node.

Code example

To show DiCE in action, we apply it to the iterated prisoner’s dilemma (IPD). In IPD, two agents iteratively play matrix games where they can either (C)ooperate or (D)efect. The first agent’s payoffs are the following: -2 (DD), 0 (DC), -3 (CD), -1 (CC).
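The payoffs above can be written down as a small lookup table. The sketch below computes the first agent's expected one-step payoff when both agents choose independently. It assumes that the first letter of each pair is the first agent's action, and the function name is hypothetical.

```python
# First agent's payoffs, keyed by (first agent's action, second agent's action).
PAYOFF_1 = {("D", "D"): -2.0, ("D", "C"): 0.0, ("C", "D"): -3.0, ("C", "C"): -1.0}

def expected_payoff(p1_cooperate, p2_cooperate):
    """Expected one-step payoff of the first agent for independent action choices."""
    prob1 = {"C": p1_cooperate, "D": 1.0 - p1_cooperate}
    prob2 = {"C": p2_cooperate, "D": 1.0 - p2_cooperate}
    return sum(prob1[a1] * prob2[a2] * r for (a1, a2), r in PAYOFF_1.items())

mutual_cooperation = expected_payoff(1.0, 1.0)
mutual_defection = expected_payoff(0.0, 0.0)
```

Defecting is the better reply to either action of the opponent, yet mutual cooperation pays more than mutual defection, which is what makes the game a dilemma.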

Let’s build policies for both agents first:

Now, let’s build the DiCE objective:

Computing the gradient or hessian of the parameters is just calling tf.gradients() or tf.hessians() on the parameters:

You can find the complete working example here.

Empirical Results

Let’s now see the empirical verification of DiCE. From the figure below we can see that the second-order baseline \calL_{\magic}^{b_2} helps us to match the analytically derived Hessian, whereas the first-order one fails to do that.

The following figure shows that, although the quality of the gradient estimation increases with the sample size, \calL_{\magic}^{b_1} does not reach the performance of \calL_{\magic}^{b_2}. The results including the second-order baseline are in orange, the ones for first-order only are in blue.

Finally, we show how DiCE helps us get better performance on IPD using LOLA [3]. Comparing LOLA-DiCE agents with the original formulation, we find that LOLA-DiCE agents discover strategies of high social welfare, replicating the results of the original LOLA paper in a way that is both more direct and more efficient.

Joint average per step returns for different training methods. Shaded areas represent the 95% confidence intervals based on five runs. All agents used batches of size 64, which is more than 60 times smaller than the size required in the original LOLA paper.

As we can see in the figure below, the second-order baseline dramatically improves LOLA performance on the IPD problem:

LOLA performance with \calL_{\magic}^{b_1} (red) and \calL_{\magic}^{b_2} (blue).


In this post, we have described DiCE, a general method for computing gradient estimators of any order for stochastic computation graphs. DiCE is easy to implement, yet it allows us to use the whole power of auto-differentiation software without manually constructing the graph for each order of the gradient. We believe DiCE will be a stepping stone for further exploration of higher-order learning methods in meta-learning, reinforcement learning and other applications of stochastic computation graphs.

Whether you want to build upon DiCE or are just interested to find out more, you can find our implementation here. For PyTorch lovers there is also an implementation by Alexis David Jacq.


Blog post: Vitaly Kurin, Jakob Foerster, Shimon Whiteson.