TACO: Learning Task Decomposition via Temporal Alignment for Control

Humans can do magnificently complex things. However, they often can’t formalise, or even explain, how they do them. Learning from Demonstration (LfD) aims to mimic human behaviour given demonstrations of that behaviour, e.g., sequences of state-action pairs.

Composite and complex tasks.

It is obvious from our everyday experience that longer, more complex tasks, such as learning to play the Moonlight Sonata on the piano, are achieved by composing together a series of simpler skills, such as playing a single note. Attempting to learn a complex piano piece without any prior experience is far more likely to end in failure and frustration than learning the fundamental skills first.

Now picture the above piano example in an LfD setting. If the demonstrations included the motor actions for the entire Moonlight Sonata, it is very unlikely that learning a single flat policy via LfD would be successful. A more sensible strategy would be to break down the demonstrations into basic, reusable and easy-to-learn sub-tasks, and compose them in order to play the full piece.

Modular LfD.

This leaves us with the question: how do we break up a demonstration into these handy sub-tasks? The answer lies in the field of modular LfD. In an ideal world, every time-step in the demonstration would be labelled as belonging to a certain sub-task. Given this annotation, we could simply cut up the demonstration into separate datasets and use our favourite LfD method (e.g., behavioural cloning) to learn the individual policies, along with a high-level controller that switches from one policy to the other.

However, hand labelling a demonstration of potentially thousands of time-steps is incredibly tedious. Another option is to employ unsupervised learning and model sub-policies as latent variables that can be used to either split the data or condition the learned policy. Unsupervised methods require no additional data, though they are not guaranteed to find meaningful latent variables and can result in switching policies that are unreliable.1

TACO.

In our paper, we consider a setting that lies in between fully supervised and unsupervised learning. We assume that along with each demonstration we are provided with a task sketch that describes the demonstration at a high level. For example, if the whole task is a piano piece, then the task sketch contains the sequence of notes played during the piece. More concretely, we assume that each demonstration \rho=((s_1,a_1),(s_2,a_2),...,(s_{T},a_{T})) of length T is a sequence of states s and actions a. It is accompanied by a sketch \tau = (b_1,b_2,\ldots, b_{L}) of a much shorter length L, consisting of sub-task symbols b that simply tell us which sub-policies are active in the demonstration and the order in which they occur.

This setting leaves us with a potentially big problem, however. Since the sketch is of much shorter length, we are lacking information about the alignment of \rho and \tau, i.e., we are not told how long each element in \tau is active within \rho.

Luckily, this is a well-known problem in the field of speech recognition, where an utterance can span several time-steps in an audio sample, but the label to be recognised (the actual words uttered) is much shorter. A state-of-the-art method to address this problem is Connectionist Temporal Classification (CTC).2 Applied to our setting, CTC would minimise the negative log likelihood of the sketch given the observed demonstration:

(1)   \begin{equation*} \mathcal{L}_{CTC} = -\mathbb{E}_{(\rho,\tau)}[\log(p(\tau|\rho))] \end{equation*}

It seems like all our problems have been solved! Since we have a method to perform the alignment between \tau and \rho, we can use it to label each part of the demonstration with its respective sub-task. We can then train one policy \pi_{\theta_b} per sub-task using BC, i.e., by minimising:

(2)   \begin{eqnarray*} \mathcal{L}_{BC}= -\mathbb{E}_{\rho}[\sum_{t=1}^{T}\log \pi_\theta(a_t|s_t)].  \end{eqnarray*}

i.e., the negative log likelihood of actions given states in the segmented demonstration. From now on, we refer to this approach as CTC-BC (i.e., CTC followed by BC).
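As a concrete reference point, a minimal behavioural-cloning loss takes only a few lines of PyTorch (an illustrative sketch, not the paper’s code; `policy` is assumed to map a batch of states to a torch.distributions object over actions):

import torch

def bc_loss(policy, states, actions):
    # Negative log-likelihood of the demonstrated actions, as in equation (2)
    # (averaged rather than summed over time-steps).
    return -policy(states).log_prob(actions).mean()

Given the alignment produced by CTC, each segment of (state, action) pairs would be fed to this loss to train the corresponding sub-policy.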

CTC, however, is a method for recognition. This has important implications that render CTC-BC inappropriate for this application.

  1. CTC can result in highly inaccurate alignments since these are only a byproduct of the recognition process.
  2. The two procedures optimise equation (1) and equation (2) independently, i.e., CTC does not know that the resulting alignment will be used in another optimisation procedure. If the alignment is even slightly wrong, then BC will be optimising each sub-policy with the wrong data!

It is clear then that instead of optimising equations (1) and (2) separately we should be optimising these jointly. This is exactly what TACO3 does:

(3)   \begin{equation*} \mathcal{L}_{TACO} = - \log(p(\tau, \mathbf{a}_{\rho} | \mathbf{s}_{\rho})) \end{equation*}

At the heart of TACO is a dynamic programming procedure that combines ideas from CTC and policy sketches.4 TACO aligns the two sequences and learns one policy per sub-task in the demonstration, as well as a high-level controller that switches between policies. For more details on how exactly this is done, see the paper or the TensorFlow implementation (PyTorch coming soon!).
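To give a flavour of the alignment machinery, here is a much-simplified sketch (in PyTorch, illustrative only and not the authors’ implementation) of a CTC-style forward recursion that marginalises over all alignments between a demonstration and a sketch. It assumes `action_logp[t, l]` holds \log \pi_{b_l}(a_t|s_t), `stop_logp[t, l]` the log probability that sub-task b_l terminates at time t, and `cont_logp` the corresponding log probability of continuing:

import torch

def sketch_log_likelihood(action_logp, stop_logp, cont_logp):
    # action_logp, stop_logp, cont_logp: tensors of shape [T, L].
    T, L = action_logp.shape
    # alpha[l]: log probability of the actions seen so far while currently
    # executing sketch entry l; only the first entry is active at t = 0.
    alpha = torch.cat([action_logp[0, 0].view(1),
                       torch.full((L - 1,), float("-inf"))])
    for t in range(1, T):
        stay = alpha + cont_logp[t - 1]                    # keep the same sub-task
        advance = torch.cat([torch.full((1,), float("-inf")),
                             alpha[:-1] + stop_logp[t - 1, :-1]])  # move to the next entry
        alpha = torch.logaddexp(stay, advance) + action_logp[t]
    # The demonstration must end with the final sketch entry terminating.
    return alpha[-1] + stop_logp[T - 1, -1]

Training then maximises this likelihood of the actions jointly with the alignment, rather than first committing to a single alignment as CTC-BC does.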

Results.

To investigate TACO’s properties and performance we turn to the Dial domain, which is simpler but similar to the piano example used at the beginning of the post.

The Dial domain.

We consider a JACO arm situated next to a dial pad. The states in the demonstrations describe the robot’s position with respect to the different numbers, and the actions are the torques applied to each of its joints. Each demonstration includes the motor actions required to press down a certain combination of keys (42, 1492, <your credit card pin>) and can be as long as 400 time-steps. The combination of keys pressed in each demonstration is (you guessed it) the task sketch. The aim is then to learn one policy per possible keypress, along with a high-level controller that knows when each keypress is complete and moves on to the next one.

At test time, we provide our policies with a new sketch of an unseen sequence of numbers. The task is considered successful if all the keys in the sketch are pressed in the right sequence. Again this is done by composing simple policies, one for each key.

So how does TACO do? Let’s take a look at some results:

On the y-axis we have the percentage of tasks completed for each method at test time. On the x-axis is the number of demonstrations required to achieve that performance. GT-BC in the graph stands for ‘ground-truth Behavioural Cloning’, i.e., the performance we would get by manually aligning the demonstration and sketch sequences. The three messages to take away from this plot are:

  1. TACO reaches the performance of GT-BC as more demonstrations are provided, without the need for tedious manual labelling.
  2. CTC-BC completely fails to complete any tasks even when provided with many demonstrations. This is because mistakes in the alignment found by CTC result in wrong policies for each key.
  3. The test tasks themselves have not been seen at training time. This means that TACO is capable of performing 0-shot imitation.

Here are some videos of the learned policies during this experiment:

Another useful property of policies trained using TACO is that we can use them to execute much longer and more complex tasks, in this case longer number sequences. To see how far we can push this idea, we sampled test tasks of length 3 to 20 and measured how often we were able to execute them fully.

Accuracy against test task length for TACO and GT-BC.

 

Again on the y-axis here is the task accuracy, while on the x-axis is the test task length. Clearly the chances of success decrease with increasing task length; however, surprisingly, the performance of TACO-trained policies degrades more slowly than those trained using GT-BC! This suggests that the TACO training procedure results in less overfitted policies than those trained with GT-BC.

Conclusion.

TACO is a general, domain-agnostic and reliable method that allows learning from demonstration to be scaled to longer and more complex tasks. It also exhibits several interesting properties, such as the ability for 0-shot imitation of longer tasks than the ones it was trained on. For more maths, experiments and results, have a look at the paper or play with the implementation!

Blog post: Kyriacos Shiarlis, Vitaly Kurin, Shimon Whiteson.

We would like to thank Markus Wulfmeier for his comments on this post.

References

SMAC: The StarCraft Multi-Agent Challenge

Multi-agent reinforcement learning (MARL) is an exciting and growing field. A particularly interesting and widely applicable class of problems is partially observable, cooperative, multi-agent learning, in which a team of agents must learn to coordinate their behaviour while conditioning only on their private observations. These problems bring unique challenges, such as the nonstationarity of learning, multi-agent credit assignment, and the difficulty of representing the value of joint actions.

However, a lack of standardised benchmark tasks has limited progress by making it hard to test new ideas and compare algorithms. In contrast, widely-adopted environments such as the Arcade Learning Environment and MuJoCo have played a major role in the recent advances of single-agent deep RL.

To fill this gap, we are introducing the StarCraft Multi-Agent Challenge (SMAC), a benchmark that provides elements of partial observability, challenging dynamics, and high-dimensional observation spaces. SMAC is built using the StarCraft II game engine, creating a testbed for research in cooperative MARL where each game unit is an independent RL agent.

The full games of StarCraft: BroodWar and StarCraft II have been used as RL environments for some time, due to the many interesting challenges inherent to the games. DeepMind’s AlphaStar has recently shown a very impressive level of play on one StarCraft II matchup using a centralised controller. In contrast, SMAC is not intended as an environment to train agents for use in full StarCraft II gameplay. Instead, by introducing strict decentralisation and local partial observability, we use the StarCraft II game engine to build a new set of rich multi-agent problems.

StarCraft

In a regular full game of StarCraft, one or more humans compete against each other or against a built-in game AI. Each player needs to gather resources, construct buildings, and build armies of units to defeat their opponent by conquering its territories and destroying its bases.


Similar to most real-time strategy (RTS) games, StarCraft has two main gameplay components:

  • Macromanagement (macro) refers to high-level strategic considerations, such as the economy and resource management.
  • Micromanagement (micro) refers to fine-grained control of individual units.

Generally, the player with the better macro will have a larger and stronger army, as well as a stronger defensive scheme. Micro is also a vital aspect of StarCraft gameplay with a high skill ceiling, and is practiced in isolation by professional players.

StarCraft has already been used as a research platform for AI, and more recently, RL. Typically, the game is framed as a competitive task: an agent takes the role of a human player, making macro decisions and performing micro as a puppeteer that issues orders to individual units from a centralised controller.

StarCraft II, the second version of the game, was recently introduced to the research community through the release of Blizzard’s StarCraft II API, an interface that provides full external control of the game, and DeepMind’s PySC2, an open source toolset that exposes the former as an environment for RL. PySC2 was used to train AlphaStar, the first AI to defeat a professional human player in the full game of StarCraft II.

SMAC

We introduce the StarCraft Multi-Agent Challenge (SMAC) as a benchmark for research in cooperative MARL.

In order to build a rich multi-agent testbed, SMAC focuses solely on unit micromanagement. We leverage the natural multi-agent structure of micro by proposing a modified version of the problem designed specifically for decentralised control. In particular, we require that each unit be controlled by an independent RL agent that conditions only on local observations restricted to a limited field of view centred on that unit. Groups of these agents must be trained to solve challenging combat scenarios, battling an opposing army under the centralised control of the game’s built-in scripted AI.

Here’s a video of our best agents for several SMAC scenarios:

Proper micromanagement of units during battles maximises the damage dealt to enemy units while minimising the damage received, and requires a range of skills. For example, one important technique is focus fire, i.e., ordering units to jointly attack and kill enemy units one after another. Other common micromanagement techniques include: assembling units into formations based on their armour types; making enemy units give chase while maintaining enough distance so that little or no damage is incurred (kiting); coordinating the positioning of units to attack from different directions; and taking advantage of the terrain to defeat the enemy. Learning these rich cooperative behaviours under partial observability is a challenging task, which can be used to evaluate the effectiveness of MARL algorithms.

SMAC makes use of PySC2 and the raw interface of StarCraft II API to communicate with the game. However, SMAC is conceptually different from the RL environment of PySC2. The goal of PySC2 is to learn to play the full game of StarCraft II. This is a competitive task where a centralised RL agent receives RGB pixels as input and performs both macro and micro with the player-level control similar to human players. SMAC, on the other hand, represents a set of cooperative multi-agent micro challenges where each learning agent controls a single military unit.

Differences between PySC2 and SMAC

                    PySC2                                   SMAC
Setting             competitive                             collaborative
Control             player-level                            unit-level
Gameplay            macro & micro                           micro
Goal                master the full game of StarCraft II    benchmark cooperative MARL methods
Observations        RGB pixels                              feature vectors
Replays available   yes                                     no

SMAC is composed of 22 combat scenarios designed to exercise different aspects of unit micromanagement. The number of agents in these scenarios ranges between 2 and 27. For the full list of SMAC scenarios, please refer to the SMAC documentation.

State and Observations

The cyan and red circles respectively border the sight and shooting range of the agent.

The local observations of agents include the following information about both allied and enemy units which are within the sight range:

  • distance
  • relative x
  • relative y
  • health
  • shield
  • unit type
  • last action (only for allied units)

Agents can also observe the terrain features surrounding them; particularly, the walkability and terrain height.

Additional state information about all units on the map is also available during training, which allows decentralised policies to be trained in a centralised fashion. This includes the unit features present in the observations, as well as the following attributes:

  • coordinates of all agents relative to the map centre
  • cooldown / energy
  • the last actions of all agents

Note that the global state should only be used during training and must not be used during decentralised execution.

Actions

Agents can take the following discrete actions:

  • move[direction] (four directions: north, south, east, or west)
  • attack[enemy id]
  • heal[agent id] (only for Medivacs)
  • stop

An agent is permitted to perform an attack/heal action only towards enemies/allies that are within the shooting range.

In StarCraft II, whenever an idle unit is under attack, it automatically retaliates against the attacking enemy units without being explicitly ordered to. We limit such game-driven behaviour by disabling both the automatic retaliation against attacking enemies and the automatic targeting of nearby enemy units.

The maximum number of actions an agent can take ranges between 7 and 70, depending on the scenario.
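To make the interface concrete, here is a minimal random-agent loop sketched against the environment API described in the SMAC repository (method names are taken from its README; treat them as assumptions if your version differs):

from smac.env import StarCraft2Env
import numpy as np

env = StarCraft2Env(map_name="3m")
env_info = env.get_env_info()
n_agents = env_info["n_agents"]

env.reset()
terminated = False
episode_reward = 0

while not terminated:
    obs = env.get_obs()      # per-agent local observations (decentralised execution)
    state = env.get_state()  # global state (centralised training only)

    actions = []
    for agent_id in range(n_agents):
        # Only a subset of the discrete actions is available, e.g. attack
        # requires a target within shooting range.
        avail_actions = env.get_avail_agent_actions(agent_id)
        avail_ids = np.nonzero(avail_actions)[0]
        actions.append(np.random.choice(avail_ids))

    reward, terminated, info = env.step(actions)
    episode_reward += reward

print("Episode reward:", episode_reward)
env.close()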

Rewards

Agents only receive a shared team reward and need to deduce their own contribution to the team’s success. SMAC provides a default reward scheme that can be configured using a set of flags (see the example after the list below). Specifically, rewards can be sparse (+1/-1 for winning/losing an episode) or dense (intermediate rewards after the following events):

  • dealing health/shield damage
  • receiving health/shield damage
  • killing an enemy unit
  • having an allied unit killed
  • winning the episode
  • losing the episode
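As an example of such a flag, switching to the sparse win/lose reward is a single constructor argument (the flag name below is taken from the SMAC repository and may differ across versions):

from smac.env import StarCraft2Env

# Sparse +1/-1 reward instead of the dense default.
env = StarCraft2Env(map_name="3m", reward_sparse=True)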

Nonetheless, we strongly discourage disingenuous engineering of the reward function (e.g. tuning different reward functions for different scenarios).

Results

The following graphs illustrate the learning curves of three state-of-the-art methods, QMIX, COMA, and Independent Q-Learning, on two SMAC scenarios, namely MMM and 2s3z. They show the median win rate of each method across 12 random runs over 10 million environment steps of training (the 25%-75% percentile is shaded).

SMAC scenarios come in variants of increasing difficulty. For example, the 3s_vs_3z, 3s_vs_4z, and 3s_vs_5z scenarios challenge 3 allied Stalkers to combat 3, 4, and 5 enemy Zealots, respectively. Stalkers are very vulnerable to Zealot attacks, so they need to kite in order to stand a chance. As the following graph highlights, kiting becomes increasingly difficult as the number of enemy Zealots grows, requiring finer-grained control and additional training time.

Further results of our initial experiments using the SMAC benchmark can be found in the accompanying paper.

PyMARL

To make it easier to develop algorithms for SMAC, we are also open-sourcing our software engineering framework PyMARL. PyMARL has been designed explicitly with deep MARL in mind and allows for out-of-the-box experimentation and development.

Written in PyTorch, PyMARL features implementations of several state-of-the-art methods, such as QMIX, COMA, and Independent Q-Learning.

In collaboration with BAIR, some of the above algorithms have also been successfully ported to the scalable RLlib framework.

We greatly appreciate contributions from the community; please feel free to fork the PyMARL GitHub repository!

Conclusion

In this post, we presented SMAC – a set of benchmark challenges for cooperative MARL. Based on the game StarCraft II, SMAC focuses on decentralised micromanagement tasks and features 22 diverse combat scenarios which challenge MARL methods to handle partial observability and high-dimensional inputs.

We are looking forward to accepting contributions from the community and hope that SMAC will become a standard benchmark environment for years to come.

Links

Blogpost: Mikayel Samvelyan, Gregory Farquhar, Christian Schroeder de Witt, Tabish Rashid, Nantas Nardelli, Jakob Foerster, Shimon Whiteson

DiCE: The Infinitely Differentiable Monte Carlo Estimator

If you’ve stumbled upon this blog post, you’ve probably used policy gradient methods in Reinforcement Learning (RL). Or you might have maximised the likelihood in probabilistic models. In both cases, we need to estimate the gradient of the loss, which is an expectation over random variables.

The problem is that you cannot just differentiate the objective. Usually, you will apply the score function trick (aka log likelihood trick) here. We can view this trick as providing a differentiable function, whose gradient is an estimate of the gradient of the original objective. We can then apply any deep learning toolbox to do automatic differentiation. However, sometimes we need higher-order gradients, e.g., in meta-learning or multi-agent RL when we need to differentiate through other agents’ learning steps. This makes life much harder.

Infinitely Differentiable Monte Carlo Estimator (DiCE) [1] to the rescue! You can differentiate the magic \magic objective repeatedly, infinitely many times, and obtain correct higher-order gradients under the Stochastic Computation Graph (SCG) formalism [2]. This lets automatic differentiation software do the job instead of us manipulating the graph manually. We illustrate the benefits of our approach by applying “Learning with Opponent Learning Awareness” (LOLA) [3] to the iterated prisoner’s dilemma.

DiCE

As we mention above, in the surrogate loss (SL) approach we construct a surrogate objective whose gradient equals the true gradient of the original objective, and we use this surrogate to do the optimisation.

Sadly, constructing a surrogate loss that treats the first-order gradient as the objective leads to wrong second-order gradient estimates. Simply put, applying the SL approach twice and estimating the gradient is not the same as computing the second-order gradient of the true objective.

The wrong estimation happens because, in the SL approach, we treat part of the objective as a sampled cost. This causes the corresponding terms to lose a functional dependency on the sampling distribution.

We illustrate our reasoning graphically in the figure below, using the Stochastic Computation Graph (SCG) formalism (Schulman et al., 2015).

Stochastic nodes are in orange, costs in grey, surrogate losses in blue, DiCE in purple, and gradient estimators in red.

We introduce the magic \magic operator, which allows us to compute the gradient to any order we like: \Expect[\nabla_{\theta}^n\calL_{\magic}] \rightarrowtail \nabla_{\theta}^{n}\calL, \forall n \in \{0, 1, 2, ...\}.

DiCE is easy to implement:

(1)   \begin{equation*} \magic(\calW) = \exp{(\tau - \perp(\tau))}, \tau=\sum_{w \in \calW}\log(p(w;\theta)), \end{equation*}

where \perp is an operator that sets the gradient of its operand to zero (detach in PyTorch and stop_gradient() in TensorFlow).
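In code, the operator is only a couple of lines. Below is a minimal sketch in TensorFlow (illustrative, not the released implementation); `logp` is assumed to be a tensor holding \log p(w;\theta) for the stochastic nodes in \calW:

import tensorflow as tf

def magic_box(logp):
    # Evaluates to 1 in the forward pass, but retains the full functional
    # dependency on theta, so repeated differentiation stays correct.
    tau = tf.reduce_sum(logp)
    return tf.exp(tau - tf.stop_gradient(tau))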

Alternatively, we can rewrite DiCE in the following way:

(2)   \begin{equation*} \magic(\calW) = \frac{\prod_{w \in \calW}p(w;\theta)}{\prod_{w \in \calW} \perp p(w;\theta)}. \end{equation*}

The figure below shows an example of DiCE applied to an RL problem:

DiCE applied to a reinforcement learning problem. A stochastic policy conditioned on s_t and \theta produces actions, a_t, which lead to rewards r_t and next states, s_{t+1}. Associated with each reward is a DiCE objective that takes as input the set of all causal dependencies that are functions of \theta, i.e., the actions. Arrows from \theta, a_i and r_i to gradient estimators omitted for clarity.

Variance Reduction

Variance reduction is an integral part of Monte Carlo estimation.
Though DiCE is not limited to the RL case, we are most interested in policy gradients that use the score function trick.

DiCE inherently reduces variance by taking causality into account: each cost node c is multiplied by the magic box of only those stochastic nodes that influence c.

Now we propose another variance reduction mechanism by adding the following term to the DiCE objective:

(3)   \begin{align*} \calB_{\magic}^{(1)} &= \sum_{w \in \calS}{(1-\magic({w}))b_w},\nonumber \end{align*}

where b_w is any function of nodes not influenced by w. The baseline keeps the gradient estimation unbiased and does not influence the evaluation of the original objective \calL_{\magic}.

The flaw of \calB_{\magic}^{(1)} becomes apparent when we calculate second-order gradients: in short, some of the terms have no control variates, which keeps the variance high.

To fix the problem, we can subtract the following term from the objective to reduce the second-order gradient variance:

(5)   \begin{align*} \calB_{\magic}^{(2)} &= \sum_{w \in \calS'}{\big(1-\magic({w})\big) \big(1-\magic({\calS_w})\big)b_w}, \nonumber \end{align*}

where \calS' is the set of stochastic nodes that depend on \theta and at least one other stochastic node.

Code example

To show DiCE in action, we apply it to the iterated prisoner’s dilemma (IPD). In IPD, two agents iteratively play matrix games where they can either (C)ooperate or (D)efect. The first agent’s payoffs are the following: -2 (DD), 0 (DC), -3 (CD), -1 (CC).

Let’s build policies for both agents first:
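A hedged sketch of what the policy parameters might look like (tabular policies as in the LOLA IPD setup: one cooperation probability for the initial state and for each of the four previous joint actions; names are illustrative, not the released code):

import tensorflow as tf

n_states = 5  # initial state plus CC, CD, DC, DD
theta_1 = tf.Variable(tf.zeros([n_states]), name="theta_1")  # agent 1 logits
theta_2 = tf.Variable(tf.zeros([n_states]), name="theta_2")  # agent 2 logits

# Per-state probability of cooperating for each agent.
p_coop_1 = tf.sigmoid(theta_1)
p_coop_2 = tf.sigmoid(theta_2)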

Now, let’s build the DiCE objective:
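A hedged sketch of the DiCE objective for a sampled batch of episodes (again illustrative): `log_probs` is assumed to hold \log\pi(a_t|s_t) summed over both agents, and `rewards` the per-step rewards of the agent whose objective we are building, each of shape [batch, time]; discounting is omitted for brevity:

def elementwise_magic_box(x):
    # Elementwise magic box; each entry of x is already the summed
    # log-probability of the relevant set of stochastic nodes.
    return tf.exp(x - tf.stop_gradient(x))

def dice_objective(log_probs, rewards):
    # Causality: reward r_t depends on all actions up to and including time t,
    # so each reward is weighted by the magic box of the cumulative log-probs.
    cum_logp = tf.cumsum(log_probs, axis=1)
    per_step = elementwise_magic_box(cum_logp) * rewards
    return tf.reduce_mean(tf.reduce_sum(per_step, axis=1))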

Computing the gradient or hessian of the parameters is just calling tf.gradients() or tf.hessians() on the parameters:
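Assuming `dice_loss` is the scalar objective built above for agent 1 (the input tensors below are hypothetical placeholders), TF1-style automatic differentiation gives the higher-order derivatives directly:

dice_loss = dice_objective(log_probs_1, rewards_1)  # hypothetical input tensors

grad_1 = tf.gradients(dice_loss, [theta_1])[0]  # first-order gradient
hess_1 = tf.hessians(dice_loss, [theta_1])[0]   # second-order (Hessian)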

You can find the complete working example here.

Empirical Results

Let’s now see the empirical verification of DiCE. From the figure below we can see that the second-order baseline \calL_{\magic}^{b_2} helps us to match the analytically derived Hessian, whereas the first-order one fails to do that.



The following figure shows that, although the quality of the gradient estimate improves with sample size, \calL_{\magic}^{b_1} never reaches the estimation quality of \calL_{\magic}^{b_2}. Results with the second-order baseline are in orange; those with only the first-order baseline are in blue.

Finally, we show how DiCE helps us get better performance on the IPD using LOLA [3]. Compared to the original formulation, LOLA-DiCE agents discover strategies of high social welfare, replicating the results of the original LOLA paper in a way that is both more direct and more efficient.

Joint average per step returns for different training methods. Shaded areas represent the 95% confidence intervals based on five runs. All agents used batches of size 64, which is more than 60 times smaller than the size required in the original LOLA paper.

As we can see in the figure below, the second-order baseline dramatically improves LOLA performance on the IPD problem:

LOLA performance with \calL_{\magic}^{b_1} (red) and \calL_{\magic}^{b_2} (blue).

Conclusion

In this post, we have described DiCE, a general method for computing gradient estimators of any order for stochastic computation graphs. DiCE is easy to implement, yet it allows us to use the full power of auto-differentiation software without manually constructing the graph for each order of the gradient. We believe DiCE will be a stepping stone for further exploration of higher-order learning methods in meta-learning, reinforcement learning, and other applications of stochastic computation graphs.

Whether you want to build upon DiCE or are just interested to find out more, you can find our implementation here. For PyTorch lovers there is also an implementation by Alexis David Jacq.

References

Blogpost: Vitaly Kurin, Jakob Foerster, Shimon Whiteson.

ADVICE FOR SHORT-TERM MACHINE LEARNING RESEARCH PROJECTS

Tim Rocktäschel, Jakob Foerster and Greg Farquhar

Every year we get contacted by students who wish to work on short-term machine learning research projects with us. By now, we have supervised a good number of them and we noticed that some of the advice we gave followed a few recurring principles. In this post, we share what we believe is good advice for a master’s thesis project or a summer research internship in machine learning. This post is by no means comprehensive but instead emphasizes those pitfalls that we saw over and over again. For instance, we will not talk about how to pick a good project or how to generally approach a machine learning research project. Some of our advice is generally applicable to working on machine learning, and specifically deep and/or reinforcement learning, research projects. However, some of it is only important when faced with the time constraints of a three-month project and is considerably less important when you have just started the journey of a three-to-five-year Ph.D. degree.

1. MAJOR PITFALLS

1.1 ASSUMING YOUR CODE IS BUG-FREE

Machine learning and specifically deep and reinforcement learning models are notoriously hard to debug. To give you a sense of the myriad of ways of making mistakes, have a look at Andrej Karpathy’s Twitter thread. All of us, even the more senior researchers, make such mistakes all the time. What makes these so hard to detect is that even buggy models often still learn and produce meaningful outputs. Bugs might introduce subtle changes to your model and most of them only show up at runtime. Having this in mind, the worst thing you can do is to assume your code does not contain any mistakes. What often distinguishes a productive from an unproductive machine learning researcher is their attitude towards their own code. If your default assumption is that there is likely something wrong with your code, you will search for bugs more carefully. Step through your code line-by-line, carefully inspecting intermediate outputs. Visualise them if possible. Do tensors have the right shapes? Have they been properly initialized, cloned or detached? Monitor gradients during training and look out for NaNs. It could be helpful to write unit-tests and to make your experiments reproducible by setting seeds of random number generators. For more tips on neural network debugging, have a look at Section 11.5 in Goodfellow et al.’s Deep Learning book.
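As a small illustrative helper (not from the post), fixing the random seeds and checking for non-finite gradients in PyTorch takes only a few lines:

import random

import numpy as np
import torch

def set_seed(seed: int = 0):
    # Seed every generator your code touches so runs are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def check_gradients(model: torch.nn.Module):
    # Call after loss.backward() to catch NaN/Inf gradients early.
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            raise RuntimeError(f"Non-finite gradient in {name}")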

1.2 ONLY LOOKING AT FINAL EVALUATION METRICS

While one aim of your project might be to achieve improvements on some evaluation metric, you should, more importantly, develop a good understanding of how and why your model works. Especially early in a project, final evaluation metrics contain little information that is useful for iterating and developing your algorithm or model. Instead, ask deeper questions and develop informative diagnostics. If you have introduced a gating or attention mechanism, does your model in fact make use of it? Which of the model innovations that you propose actually contribute to the overall performance gain? Did you carry out an ablation study? How many training examples/epochs did it take your model to achieve reasonable performance and does that differ to the baseline you are using? Are there any systematic differences between the test instances on which your model does well or terribly? How robust are your results with respect to changes of hyper-parameters? Can important features be predicted from the model’s hidden state? Keep in mind that your research and your project report are not really about informing the research community about some (marginal) improvement over the previous state-of-the-art, but instead about contributing to our understanding of the subject. Others in the field will want to know what works and what does not, and which of your findings could be applied to their problems.

1.3 TRYING RANDOM CHANGES AND HAVING NO CLEAR EXPECTATIONS

With current deep learning libraries it is easy to make a model more complex by adding more components, layers and optimization tricks. However, when you make a change to the code or model, you should have at least an intuition for why this change should help. Likewise, when you run an experiment, you should have a clear expectation of its outcome. What do you expect the plotted results to look like, and what will they tell you? This is even more important when you find yourself in a situation where your model is not doing what it is supposed to do. Then it is more likely that you are currently seeing the symptoms of a bug, so extending your model will not help you find that bug and might even make it harder to isolate the problem. Before making your model more complex, get to the bottom of what might be wrong with it. Moreover, keep in mind that in your report you will have to justify what you did. An assessor of your report is interested in understanding your thought process. If you cannot formulate a research hypothesis and explain to yourself why what you are doing should work, then chances are good that neither can anyone else.

1.4 OVERCOMPLICATING

We have often seen highly-motivated students jumping on hard problems and trying complex solutions right away. This makes it hard to analyze in case something goes wrong. Instead, ask yourself: what is the minimal thing that should work? Can the model learn to memorize a small data set? What does it learn when using only a few parameters? Does the code work when training on a single training instance instead of a batch? What is the simplest form of generalization that we expect to see? What is a simple baseline that we expect to fail? What is a minimal extension of the baseline that should make it work?
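One such minimal check, sketched in PyTorch with illustrative names, is to verify that the model can drive the loss towards zero on a single batch before attempting anything more ambitious:

import torch

def overfit_single_batch(model, batch, loss_fn, steps=500, lr=1e-3):
    # If the loss does not approach ~0 here, look for a bug before tuning anything.
    x, y = batch
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()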

1.5 ITERATING TOO SLOWLY

Experiments can take a long time. Deep learning and reinforcement learning, in particular, can be extremely time consuming when amassing statistically significant numbers of random seeds. It is therefore critical to not fall into a slow iteration cycle too early in the course of a short-term project. Debug your models using simple environments and implement a proof-of-concept of your idea that can be run on your personal computer. Sometimes a simple matrix game or a grid world experiment can provide useful validation of ideas. Sometimes you can also use the exact value functions of MDPs to test algorithmic ideas without having to mess around with gradient estimation, actor-critic training etc. When moving to larger-scale experiments, streamline your process for launching experiments and checking their results. Check those results before experiments have run their full course to see if performance is flatlining. Investing in the infrastructure can be time consuming in the beginning, but it will pay off towards the end of your project. When analyzing results, be hungry for useful information.

2 SOME ADVICE

2.1 START READING UP ON THE BACKGROUND AND RELATED WORK BEFORE THE START OF YOUR PROJECT

We usually hand out projects months before the official start date. One reason for this is that three months is a really short time for i) learning about the background and related work, ii) carrying out the implementation and experiments, and iii) writing a good report. Another reason is that we generally propose research projects that, if successful, could lead to a publication in a machine learning venue. While we know that students have a lot of things going on during the term, we generally encourage you to at least start reading about relevant literature ahead of time. Ideally, by the time you start working full-time on the project, you should know what to do, how it relates to existing approaches, and have some idea of how to do it. This might also be a good time for getting familiar with your machine learning framework of choice (we recommend PyTorch!).

2.2 USE VERSION CONTROL

You really should use version control for your research code and project report. There is nothing worse than losing all your hard work days before the deadline. If you do not have one already, open a GitHub account. As a student you get free private repositories. If you do not know about version control, learn it now and thank yourself later.

2.3 EVALUATE USING RANDOM REPEATS

In academia it is unlikely that you have access to more than a handful of GPUs during your project. However, particularly in deep reinforcement learning it is important to not draw premature conclusions from a single or few experiments. Ideally, you want to repeat experiments multiple times and, as mentioned, get a sense of the robustness to different starting conditions and hyper-parameters.

2.4 START WRITING EARLY AND CONSISTENTLY THROUGHOUT THE PROJECT

If you are doing a master’s project, your work will be assessed based on your written report, not based on the outstanding work that you did but did not have enough time to write clearly about. Start writing early and do not underestimate the effort of disseminating your research. State your aims, hypothesis and contributions clearly and allow the reader to follow your thought process. Explain your design choices and discuss your findings clearly. Ideally, you should write your report consistently during the course of the project. That way, you force yourself to think your next steps through and it is less likely that you forget about any important information when the deadline gets close.

2.5 PROACTIVELY SEEK HELP WHEN YOU NEED IT

Your supervisors are busy people, but they are here to help you. Instead of running into problems and then getting stuck until the next scheduled meeting, reach out to your supervisors when you need it. Be proactive about arranging meetings and prepare the results, code, or write-up that you want to discuss in advance. Make good use of your supervisors! Lastly, do not panic! We all have been through this and we know that it can be a daunting experience, particularly if your job prospects or the success of your Ph.D. applications depend on this research project. We really want you to succeed.