Humans can do magnificently complex stuff. However, they often can’t formalise or even explain how they do it. Learning from Demonstration (LfD) aims to mimic human behaviour given demonstrations of the behaviour, e.g. state-action pairs.
Composite and complex tasks.
It is obvious from our everyday experience that longer, more complex tasks, such as learning to play Moonlight Sonata on the piano, are achieved by composing a series of simpler skills, such as playing a single note. Attempting to learn a complex piano piece without any prior experience is much more likely to end in failure and frustration than learning fundamental skills first.
Now picture the above piano example in an LfD setting. If the demonstrations included the motor actions for the entire Moonlight Sonata, it is very unlikely that learning a single flat policy via LfD would be successful. A more sensible strategy would be to break down the demonstrations into basic, reusable and easy-to-learn sub-tasks, and compose them in order to play the full piece.
This leaves us with the question: how do we break up a demonstration into these handy sub-tasks? The answer lies in the field of modular LfD. In an ideal world, every time-step in the demonstration would be labelled as belonging to a certain sub-task. Given this annotation, we could simply cut up the demonstration into separate datasets and use our favourite LfD method (e.g., behavioural cloning, BC) to learn the individual policies, along with a high-level controller that switches from one policy to the other.
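The fully labelled case is easy to sketch in code. The snippet below is a minimal illustration, assuming the demonstration is stored as three equally long lists; the function name `split_by_subtask` and the toy data are hypothetical and not taken from the paper or its implementation.

```python
from collections import defaultdict


def split_by_subtask(states, actions, labels):
    """Group (state, action) pairs by their per-time-step sub-task label."""
    if not (len(states) == len(actions) == len(labels)):
        raise ValueError("states, actions and labels must have equal length")
    datasets = defaultdict(list)
    for state, action, label in zip(states, actions, labels):
        datasets[label].append((state, action))
    return dict(datasets)


# Toy demonstration (made up): six time-steps, sub-task "C" followed by "E".
states = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
actions = [1, 1, 1, -1, -1, -1]
labels = ["C", "C", "C", "E", "E", "E"]

datasets = split_by_subtask(states, actions, labels)
```

Each entry of `datasets` could then be handed to any standard LfD method to fit one sub-policy.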
However, hand labelling a demonstration of potentially thousands of time-steps is incredibly tedious. Another option is to employ unsupervised learning and model sub-policies as latent variables that can be used to either split the data or condition the learned policy. Unsupervised methods require no additional data, though they are not guaranteed to find meaningful latent variables and can result in switching policies that are unreliable [1].
In our paper, we consider a setting that lies between fully supervised and unsupervised learning. We assume that each demonstration comes with a sketch that describes it at a high level. For example, if the whole task is a piano piece, then the task sketch contains the sequence of notes played during the piece. More concretely, we assume that each demonstration $\rho$ of length $T$ is a sequence of states and actions $((s_1, a_1), \ldots, (s_T, a_T))$. It is accompanied by a sketch $\tau = (b_1, \ldots, b_L)$ of much shorter length $L \ll T$, consisting of sub-task symbols that simply tell us which sub-policies are active in the demonstration and the order in which they occur.
This setting leaves us with a potentially big problem, however. Since the sketch is much shorter than the demonstration, we lack information about the alignment of the demonstration $\rho$ and the sketch $\tau$, i.e., we are not told how long each element of $\tau$ is active within $\rho$.
Luckily, this is a well-known problem in the field of speech recognition, where an utterance can span several time-steps in an audio sample, but the label to be recognised (the actual words uttered) is much shorter. A state-of-the-art method to address this problem is Connectionist Temporal Classification (CTC) [2]. Applied to our setting, CTC would minimise the negative log-likelihood of the sketch given the observed demonstration.
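Written out, the CTC objective takes roughly the following form. The notation is an assumption made for illustration: $\rho$ is the demonstration of length $T$, $\tau$ is the sketch, $\zeta = (\zeta_1, \ldots, \zeta_T)$ is a per-time-step labelling, and $\mathcal{Z}(\tau)$ is the set of labellings that collapse to $\tau$.

```latex
\mathcal{L}_{\mathrm{CTC}}(\theta)
  = -\log p_\theta(\tau \mid \rho)
  = -\log \sum_{\zeta \in \mathcal{Z}(\tau)} \prod_{t=1}^{T} p_\theta(\zeta_t \mid \rho)
```

The sum runs over every way of stretching the short sketch to the length of the demonstration, which is why no explicit alignment is needed as supervision.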
It seems like all our problems have been solved! Since we have a method to perform the alignment between the demonstration and the sketch, we can use it to label each part of the demonstration with its respective sub-task. We can then train one policy per sub-task using BC, i.e., by minimising the negative log-likelihood of actions given states in the segmented demonstration. From now on we refer to this approach as CTC-BC (i.e., CTC followed by BC).
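As a sketch in assumed notation (not necessarily the paper's), let $\zeta_t$ be the sub-task label that the alignment assigns to time-step $t$ and $\pi_{\theta_b}$ the policy for sub-task $b$. The BC loss for that sub-task is then:

```latex
\mathcal{L}_{\mathrm{BC}}(\theta_b)
  = -\sum_{t \,:\, \zeta_t = b} \log \pi_{\theta_b}(a_t \mid s_t)
```

Only the time-steps assigned to $b$ contribute, so any error in $\zeta$ feeds the wrong state-action pairs directly into this sum.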
CTC, however, is a method for recognition. This has important implications that render CTC-BC inappropriate for this application.
- CTC can result in highly inaccurate alignments since these are only a byproduct of the recognition process.
- The two procedures optimise the CTC objective and the BC objective (equations (1) and (2)) independently, i.e., CTC does not know that we will be using the resulting alignment for another optimisation procedure. If this alignment is even slightly wrong, then BC will optimise each sub-policy with the wrong data!
It is clear, then, that instead of optimising equations (1) and (2) separately, we should be optimising them jointly. This is exactly what TACO [3] does.
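In symbols, a joint objective of this kind can be sketched as below. The notation is assumed for illustration ($a_{1:T}$ and $s_{1:T}$ are the demonstrated actions and states, $\tau$ is the sketch); the paper gives the exact definition.

```latex
\mathcal{L}_{\mathrm{joint}}(\theta)
  = -\log p_\theta(\tau, a_{1:T} \mid s_{1:T})
```

Because the actions appear inside the same likelihood as the sketch, an alignment that explains the sketch but assigns actions to the wrong sub-policy is penalised.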
At the heart of TACO is a dynamic programming procedure that combines ideas from CTC and policy sketches [4]. TACO aligns the two sequences and learns one policy per sub-task in the demonstration, as well as a high-level controller that switches between policies. For more details on exactly how this is done, see the paper or the TensorFlow implementation (PyTorch coming soon!).
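To give a flavour of such a dynamic programme, here is a simplified forward recursion in plain Python. It is a sketch under stated assumptions, not the paper's algorithm or the released implementation: the names `sketch_nll`, `action_lik` and `stop_prob` are hypothetical, each sub-policy is assumed to emit a stop probability at every time-step, and the per-step probabilities are given as fixed numbers rather than produced by a learned network.

```python
import math


def sketch_nll(sketch, action_lik, stop_prob):
    """Negative log-likelihood of the actions and the sketch, summed over
    all monotone alignments.

    action_lik[t][b]: probability that sub-policy b assigns to the
                      demonstrated action at time-step t.
    stop_prob[t][b]:  probability that sub-policy b hands over control
                      after time-step t.
    """
    T, L = len(action_lik), len(sketch)
    if L == 0 or T < L:
        raise ValueError("need at least one time-step per sketch element")

    # alpha[l]: probability of the actions so far with element l active now.
    alpha = [0.0] * L
    alpha[0] = action_lik[0][sketch[0]]
    for t in range(1, T):
        new_alpha = [0.0] * L
        for l, b in enumerate(sketch):
            # Either element l was already active and did not stop ...
            stay = alpha[l] * (1.0 - stop_prob[t - 1][b])
            # ... or the previous element stopped at the last time-step.
            move = alpha[l - 1] * stop_prob[t - 1][sketch[l - 1]] if l > 0 else 0.0
            new_alpha[l] = action_lik[t][b] * (stay + move)
        alpha = new_alpha

    # The last sketch element must be active and stop at the final step.
    likelihood = alpha[L - 1] * stop_prob[T - 1][sketch[L - 1]]
    return math.inf if likelihood == 0.0 else -math.log(likelihood)


# Toy numbers (made up): three time-steps, sketch "a" then "b".
sketch = ["a", "b"]
action_lik = [{"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}, {"a": 0.2, "b": 0.8}]
stop_prob = [{"a": 0.3, "b": 0.1}, {"a": 0.6, "b": 0.2}, {"a": 0.9, "b": 0.7}]
nll = sketch_nll(sketch, action_lik, stop_prob)
```

In a learned setting the probabilities would come from the sub-policies and the loss would be differentiated with respect to their parameters; a practical version would also work in log space to avoid underflow on long demonstrations.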
To investigate TACO’s properties and performance we turn to the Dial domain, which is simpler but similar to the piano example used at the beginning of the post.
We consider a JACO arm situated next to a dial pad. The states in the demonstrations describe the robot's position with respect to the different numbers, and the actions are the torques applied to each of the joints. Each demonstration includes the motor actions required to press down a certain combination of keys, e.g. 42, 1492 or <your credit card pin>, and can be as long as 400 time-steps. The combination of keys pressed in each demonstration is (you guessed it) the task sketch. The aim is then to learn one policy per possible keypress, and a high-level controller that knows when each key press is done and moves on to the next one.
At test time, we provide our policies with a new sketch of an unseen sequence of numbers. The task is considered successful if all the keys in the sketch are pressed in the right sequence. Again, this is done by composing simple policies, one for each key.
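The test-time composition can be sketched as a simple loop. Everything below is a toy stand-in rather than the actual Dial setup: the one-dimensional "dial pad", the hand-coded policies and the names `run_sketch`, `make_key_policy` and `env_step` are hypothetical, and each policy is assumed to return an action together with a stop flag.

```python
def run_sketch(sketch, policies, env_step, state, max_steps=400):
    """Run one sub-policy per sketch element, switching when it signals stop."""
    completed = []
    steps = 0
    for key in sketch:
        while steps < max_steps:
            action, stop = policies[key](state)
            state = env_step(state, action)
            steps += 1
            if stop:
                completed.append(key)
                break
        else:
            break  # step budget exhausted before this sub-task finished
    return state, completed


def make_key_policy(key):
    """Hand-coded toy policy: move towards the key, press it, then stop."""
    def policy(state):
        position, _ = state
        if position == key:
            return "press", True
        return (1 if key > position else -1), False
    return policy


def env_step(state, action):
    """Toy environment: state is (position, tuple of keys pressed so far)."""
    position, pressed = state
    if action == "press":
        return position, pressed + (position,)
    return position + action, pressed


policies = {key: make_key_policy(key) for key in range(10)}
sketch = [4, 2]
final_state, completed = run_sketch(sketch, policies, env_step, state=(0, ()))
success = final_state[1] == tuple(sketch)
```

The success check mirrors the criterion in the text: the task counts as solved only if the keys in the sketch were pressed in the right order.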
So how does TACO do? Let’s take a look at some results:
On the y-axis we have the percentage of tasks completed by each method at test time. On the x-axis is the number of demonstrations required to achieve that performance. GT-BC in the graph stands for ‘ground-truth Behavioural Cloning’, i.e., the performance we would get by manually aligning the demonstration and sketch sequences. The three messages to take away from this plot are:
- TACO reaches the performance of GT-BC as more demonstrations are provided, without the need for tedious manual labelling.
- CTC-BC fails to complete any tasks, even when provided with many demonstrations. This is because mistakes in the alignment found by CTC result in wrong policies for each key.
- The test tasks themselves have not been seen at training time. This means that TACO is capable of performing 0-shot imitation.
Here are some videos of the learned policies during this experiment:
Another useful property of the policies trained using TACO is that we can execute much longer and more complex tasks, in this case a longer number sequence. To see how far we can push this idea, we sampled test tasks of length 3 to 20 and counted how many times we were able to fully execute them.
Again, on the y-axis is the task accuracy, while on the x-axis is the test task length. Clearly the chances of success decrease with increasing task length; however, surprisingly, the performance of TACO-trained policies degrades more slowly than that of policies trained using GT-BC! This suggests that the TACO training procedure results in less overfitted policies than GT-BC.
TACO is a general, domain-agnostic and reliable method that allows learning from demonstration to be scaled to longer and more complex tasks. It also exhibits several interesting properties, such as the ability for 0-shot imitation of longer tasks than the ones it was trained on. For more maths, experiments and results, have a look at the paper or play with the implementation!
We would like to thank Markus Wulfmeier for his comments on this post.