DiCE: The Infinitely Differentiable Monte Carlo Estimator

If you’ve stumbled upon this blog post, you’ve probably used policy gradient methods in Reinforcement Learning (RL). Or you might have maximised the likelihood in probabilistic models. In both cases, we need to estimate the gradient of the loss, which is an expectation over random variables.

The problem is that you cannot just differentiate the objective. Usually, you will apply the score function trick (aka log likelihood trick) here. We can view this trick as providing a differentiable function, whose gradient is an estimate of the gradient of the original objective. We can then apply any deep learning toolbox to do automatic differentiation. However, sometimes we need higher-order gradients, e.g., in meta-learning or multi-agent RL when we need to differentiate through other agents’ learning steps. This makes life much harder.

Infinitely Differentiable Monte Carlo Estimator (DiCE) [1] to the rescue! You can apply the magic \magic objective repeatedly infinitely many times to get the correct higher order gradients under Stochastic Computation Graph (SCG) formalism [2]. This lets automatic differentiation software do the job instead of us manipulating the graph manually. We illustrate the benefits of our approach applying “Learning with Opponent Learning Awareness” (LOLA) [3] to the iterated prisoner’s dilemma.


As we mention above, in the surrogate loss (SL) approach, we choose an objective, whose gradient equals the true gradient of the objective and use this function to do the optimisation.

Sadly, constructing surrogate loss using the first-order gradient as an objective leads to wrong second-order gradient estimation. Simply put, applying SL twice and estimating the gradient is not the same as the second-order gradient of the true objective.

The wrong estimation happens because, in the SL approach, we treat part of the objective as a sampled cost. This causes the corresponding terms to lose a functional dependency on the sampling distribution.

We illustrate our reasoning graphically in the figure below using Stochastic Computation Graphs (SCGs) (Schulman et al. 2015) formalism.

Stochastic nodes are in orange, costs in grey, surrogate losses in blue, DiCE in purple, and gradient estimators in red.

We introduce the magic \magic operator, which allows us to compute the gradient to any order we like: \Expect[\nabla_{\theta}^n\calL_{\magic}] \rightarrowtail \nabla_{\theta}^{n}\calL, \forall n \in \{0, 1, 2, ...\}.

DiCE is easy to implement:

(1)   \begin{equation*} \magic(\calW) = \exp{(\tau - \perp(\tau))}, \tau=\sum_{w \in \calW}\log(p(w;\theta)), \end{equation*}

where \perp is an operator which sets the gradient of its operand to zero (detach in Pytorch and stop_gradient() in Tensorflow:

Alternatively, we can rewrite DiCE in the following way:

(2)   \begin{equation*} \magic(\calW) = \frac{\prod_{w \in \calW}p(w;\theta)}{\prod_{w \in \calW} \perp p(w;\theta)}. \end{equation*}

The figure below shows an example of DiCE applied to an RL problem:

DiCE applied to a reinforcement learning problem. A stochastic policy conditioned on s_t and \theta produces actions, a_t, which lead to rewards r_t and next states, s_{t+1}. Associated with each reward is a DiCE objective that takes as input the set of all causal dependencies that are functions of \theta, i.e, the actions. Arrows from \theta,a_i and r_i to gradient estimators omitted for clarity.

Variance Reduction

Variance reduction is an integral part of Monte Carlo estimation.
Though DiCE is not limited to the RL case, we are most interested in policy gradients that use the score function trick.

DiCE inherently reduces variance by taking causality into account. The cost node c is multiplied by the sum of the gradients of the log probabilities only for those nodes that influence c.

Now we propose another variance reduction mechanism by adding the following term to the DiCE objective:

(3)   \begin{align*} \calB_{\magic}^{(1)} &= \sum_{w \in \calS}{(1-\magic({w}))b_w},\nonumber \end{align*}

where b_w is any function of nodes not influenced by w. The baseline keeps the gradient estimation unbiased and does not influence the evaluation of the original objective \calL_{\magic}.

The flaw of \calB_{\magic}^{(1)} becomes apparent when we calculate second-order gradients. In two words, some the terms do not have control variates keeping variance high.

To fix the problem, we can subtract the following term from the objective to reduce the second-order gradient variance:

(5)   \begin{align*} \calB_{\magic}^{(2)} &= \sum_{w \in \calS'}{\big((1-\magic({w})\big) \big(1-\magic({\calS_w})\big)b_w}, \nonumber \end{align*}

where \calS' is the set of stochastic nodes that depend on \theta and at least one other stochastic node.

Code example

To show DiCE in action, we apply it to the iterated prisoner’s dilemma (IPD). In IPD, two agents iteratively play matrix games where they can either (C)ooperate or (D)efect. The first agent’s payoffs are the following: -2 (DD), 0 (DC), -3 (CD), -1 (CC).

Let’s build policies for both agents first:

Now, let’s build the DiCE objective:

Computing the gradient or hessian of the parameters is just calling tf.gradients() or tf.hessians() on the parameters:

You can find the complete working example here.

Empirical Results

Let’s now see the empirical verification of DiCE. From the figure below we can see that the second-order baseline \calL_{\magic}^{b_2} helps us to match the analytically derived Hessian, whereas the first-order one fails to do that.

The following figure shows that however, the quality of the gradient estimation increases with the sample size, \calL_{\magic}^{b_1} does not achieve that performance as \calL_{\magic}^{b_2} does. The results including the second-order baseline are in orange, the ones for first-order only are in blue.

Finally, we will show how DiCE helps us get better performance on IPD using LOLA [3]. Comparing LOLA-DICE agents and the original formulation LOLA-DICE agents discover strategies of high social welfare, replicating the results of the original LOLA paper in a way that is both more direct and efficient.

Joint average per step returns for different training methods. Shaded areas represent the 95% confidence intervals based on five runs. All agents used batches of size 64, which is more than 60 times smaller than the size required in the original LOLA paper.

As we can see in the figure below, the second-order baseline dramatically improves LOLA performance on the IPD problem:

LOLA performance with \calL_{\magic}^{b_1} (red) and \calL_{\magic}^{b_2} (blue).


In this post, we have described DiCE, a general method for computing any order gradient estimators for stochastic computation graphs. DiCE is easy to implement, however, at the same time it allows us to use the whole power of auto-differentiation software without manually constructing the graph for each order of the gradient. We believe DiCE will be a stepping stone for further exploration of higher order learning methods in meta-learning, reinforcement learning other applications of stochastic computation graphs.

Whether you want to build upon DiCE or are just interested to find out more, you can find our implementation here. For PyTorch lovers there is also an implementation by Alexis David Jacq.


Blogpost: Vitaly Kurin, Jakob Foerster, Shimon Whiteson.


Tim Rocktäschel, Jakob Foerster and Greg Farquhar

Every year we get contacted by students who wish to work on short-term machine learning research projects with us. By now, we have supervised a good number of them and we noticed that some of the advice that we gave followed a few recurring principles. In this post, we share what we believe is good advice for a master’s thesis project or a summer research internship in machine learning. This post is by no means comprehensive but instead emphasizes those pitfalls that we saw over and over again. For instance, we will not talk about how to pick a good project or how to generally approach a machine learning research project. Some of our advice is generally applicable for working on machine learning and specifically deep and/or reinforcement research projects. However, some of it is only important when faced with the time constraints of a three-month project and are considerably less important when you just started the journey of a three to five year Ph.D. degree.



Machine learning and specifically deep and reinforcement learning models are notoriously hard to debug. To give you a sense of the myriad of ways of making mistakes, have a look at Andrej Karpathy’s Twitter thread. All of us, even the more senior researchers, make such mistakes all the time. What makes these so hard to detect is that even buggy models often still learn and produce meaningful outputs. Bugs might introduce subtle changes to your model and most of them only show up at runtime. Having this in mind, the worst thing you can do is to assume your code does not contain any mistakes. What often distinguishes a productive from an unproductive machine learning researcher is their attitude towards their own code. If your default assumption is that there is likely something wrong with your code, you will search for bugs more carefully. Step through your code line-by-line, carefully inspecting intermediate outputs. Visualise them if possible. Do tensors have the right shapes? Have they been properly initialized, cloned or detached? Monitor gradients during training and look out for NaNs. It could be helpful to write unit-tests and to make your experiments reproducible by setting seeds of random number generators. For more tips on neural network debugging, have a look at Section 11.5 in Goodfellow et al.’s Deep Learning book.


While one aim of your project might be to achieve improvements on some evaluation metric, you should, more importantly, develop a good understanding of how and why your model works. Especially early in a project, final evaluation metrics contain little information that is useful for iterating and developing your algorithm or model. Instead, ask deeper questions and develop informative diagnostics. If you have introduced a gating or attention mechanism, does your model in fact make use of it? Which of the model innovations that you propose actually contribute to the overall performance gain? Did you carry out an ablation study? How many training examples/epochs did it take your model to achieve reasonable performance and does that differ to the baseline you are using? Are there any systematic differences between the test instances on which your model does well or terribly? How robust are your results with respect to changes of hyper-parameters? Can important features be predicted from the model’s hidden state? Keep in mind that your research and your project report are not really about informing the research community about some (marginal) improvement over the previous state-of-the-art, but instead about contributing to our understanding of the subject. Others in the field will want to know what works and what does not, and which of your findings could be applied to their problems.


With current deep learning libraries it is easy to make a model more complex by adding more components, layers and optimization tricks. However, when you make a change to the code or model, you should have at least an intuition for why this change should help. Likewise, when you run an experiment, you should have a clear expectation of its outcome. What do you expect the plotted results to look like, and what will they tell you? This is even more important when you find yourself in a situation where your model is not doing what it is supposed to do. Then it is more likely that you are currently seeing the symptoms of a bug, so extending your model will not help you find that bug and might even make it harder to isolate the problem. Before making your model more complex, get to the bottom of what might be wrong with it. Moreover, keep in mind that in your report you will have to justify what you did. An assessor of your report is interested in understanding your thought process. If you cannot formulate a research hypothesis and explain to yourself why what you are doing should work, then chances are good that neither can anyone else.


We have often seen highly-motivated students jumping on hard problems and trying complex solutions right away. This makes it hard to analyze in case something goes wrong. Instead, ask yourself: what is the minimal thing that should work? Can the model learn to memorize a small data set? What does it learn when using only a few parameters? Does the code work when training on a single training instance instead of a batch? What is the simplest form of generalization that we expect to see? What is a simple baseline that we expect to fail? What is a minimal extension of the baseline that should make it work?


Experiments can take a long time. Deep learning and reinforcement learning, in particular, can be extremely time consuming when amassing statistically significant numbers of random seeds. It is therefore critical to not fall into a slow iteration cycle too early in the course of a short-term project. Debug your models using simple environments and implement a proof-of-concept of your idea that can be run on your personal computer. Sometimes a simple matrix game or a grid world experiment can provide useful validation of ideas. Sometimes you can also use the exact value functions of MDPs to test algorithmic ideas without having to mess around with gradient estimation, actor-critic training etc. When moving to larger-scale experiments, streamline your process for launching experiments and checking their results. Check those results before experiments have run their full course to see if performance is flatlining. Investing in the infrastructure can be time consuming in the beginning, but it will pay off towards the end of your project. When analyzing results, be hungry for useful information.



We usually hand out projects months before the official start date. One reason for this is that three months is a really short time for i) learning about the background and related work, ii) carrying out the implementation and experiments, and iii) writing a good report. Another reason is that we generally propose research projects that, if successful, could lead to a publication in a machine learning venue. While we know that students have a lot of things going on during the term, we generally encourage you to at least start reading about relevant literature ahead of time. Ideally, by the time you start working full-time on the project, you should know what to do, how it relates to existing approaches, and have some idea of how to do it. This might also be a good time for getting familiar with your machine learning framework of choice (we recommend PyTorch!).


You really should use version control for your research code and project report. There is nothing worse than losing all your hard work days before the deadline. If you do not have one already, open a GitHub account. As a student you get free private repositories. If you do not know about version control, learn it now and thank yourself later.


In academia it is unlikely that you have access to more than a handful of GPUs during your project. However, particularly in deep reinforcement learning it is important to not draw premature conclusions from a single or few experiments. Ideally, you want to repeat experiments multiple times and, as mentioned, get a sense of the robustness to different starting conditions and hyper-parameters.


If you are doing a master’s project, your work will be assessed based on your written report, not based on the outstanding work that you did but did not have enough time to write clearly about. Start writing early and do not underestimate the effort of disseminating your research. State your aims, hypothesis and contributions clearly and allow the reader to follow your thought process. Explain your design choices and discuss your findings clearly. Ideally, you should write your report consistently during the course of the project. That way, you force yourself to think your next steps through and it is less likely that you forget about any important information when the deadline gets close.


Your supervisors are busy people, but they are here to help you. Instead of running into problems and then getting stuck until the next scheduled meeting, reach out to your supervisors when you need it. Be proactive about arranging meetings and prepare the results, code, or write-up that you want to discuss in advance. Make good use of your supervisors! Lastly, do not panic! We all have been through this and we know that it can be a daunting experience, particularly if your job prospects or the success of your Ph.D. applications depend on this research project. We really want you to succeed.