The performance of Reinforcement Learning (RL) algorithms has been shown to be critically dependent on the choice of hyperparameters. In fact, *Mahmood et al.* showed that when working with real-world robots, hyperparameter settings can have a much greater impact on performance than the choice of RL algorithm.

While a lot of work has been done on automatically finding good hyperparameters in supervised learning, their applicability to RL algorithms is limited. The main problem is sample efficiency. Unlike in supervised learning, where the same dataset can typically be used to train multiple models with different hyperparameters, training in RL typically requires fresh interactions with the environment, leading to high sample costs. This is especially problematic when applying RL to real-world problems where interactions with the environment can be expensive or time consuming (for example, learning on real robots).

Surprisingly, grid search is still the most popular way to find good hyperparameters in RL, probably because it is easy to implement, and most current benchmark environments (for example, those from the OpenAI Gym) run fast enough that sample efficiency is a secondary concern. Bayesian Optimisation (BO) has been used in situations where training runs take a long time, but a problem remains: while BO is more sample efficient than grid search, it still requires multiple training runs, implying prohibitively high sample costs in most real-world settings.

In this work our goal is to develop a method to automatically find good hyperparameters that requires *no more samples than what the underlying RL algorithm would have collected anyway for one training run*. To achieve this, the method must also be robust to its own hyperparameters so that we needn't tune these for every new problem. Simplicity, ease of implementation, and low computational cost are other desirable characteristics for a practical alternative to grid search, even in problems where sample efficiency is not paramount.

With the above criteria in mind, we introduce Hyperparameter Optimisation On the Fly (HOOF), which can be applied to automatically tune the hyperparameters of policy gradient methods.

## Hyperparameter Optimisation On the Fly (HOOF)

The main idea behind HOOF is to automatically adapt the hyperparameters during each iteration of the underlying policy gradient (PG) method by greedily maximising the value of the updated policy.

Concretely, at the $n$-th iteration of the PG algorithm, we collect a batch of samples with the current policy $\pi_n$. Next, we generate candidate hyperparameters $\psi^1, \dots, \psi^K$, and compute the corresponding candidate policies $\pi^1_{n+1}, \dots, \pi^K_{n+1}$ for each of them. This does not sacrifice sample efficiency since the batch of samples collected depends only on $\pi_n$ and not on the hyperparameters under consideration (for example, the learning rate, or the GAE parameter $\lambda$). Finally, we set $\pi_{n+1}$ to the candidate policy that has the maximum value, i.e., $\pi_{n+1} = \arg\max_k \hat{V}(\pi^k_{n+1})$.
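The per-iteration procedure can be sketched as follows. This is a minimal illustration, not the paper's implementation: `policy_update` (which applies one PG update under a given hyperparameter setting) and `estimate_value` (which scores a candidate policy using only the current batch) are hypothetical placeholders.

```python
import numpy as np

def hoof_step(theta, batch, candidate_hypers, policy_update, estimate_value):
    """One HOOF iteration: greedily pick the hyperparameter setting
    whose resulting policy update looks best on the current batch.

    policy_update(theta, batch, hyper) -> candidate policy parameters
    estimate_value(candidate, batch)   -> off-policy value estimate
    (both callables are illustrative placeholders)
    """
    # One candidate update per hyperparameter setting; no new samples needed.
    candidates = [policy_update(theta, batch, h) for h in candidate_hypers]
    # Score every candidate with the same batch of samples.
    values = [estimate_value(c, batch) for c in candidates]
    best = int(np.argmax(values))
    return candidates[best], candidate_hypers[best]
```

Note that only the relative ordering of the candidate scores matters here, which is what makes off-policy value estimates usable.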

Now we have a problem: solving for $\pi_{n+1}$ requires an estimate of the value of each candidate policy. Naively, this would require collecting a fresh batch of samples with each candidate policy, which would make HOOF just as sample inefficient as random search. We address this by using weighted importance sampling (WIS) to construct *off-policy* estimates of the value of each candidate policy, and then using those to solve the optimisation problem above. WIS uses only the batch of samples collected with $\pi_n$ to estimate the values of the candidate policies. This makes HOOF extremely sample efficient, requiring only one training run's worth of samples to learn the optimal policy.

WIS estimates are known to have high variance and can quickly become unreliable as the candidate policies diverge from the behaviour policy $\pi_n$. A key insight behind HOOF is that while the WIS estimates themselves tend to have high variance, the relative ordering of candidates based on these estimates is far more stable. And to solve our optimisation problem, we only need the relative ordering of the candidate policies.

HOOF satisfies our key requirement because it does not require more than one run to automatically tune the hyperparameters. It is also easy to implement and is far more computationally efficient than grid search or random search when tuning certain hyperparameters like the learning rate.

## Experimental Results

We evaluated HOOF on four simulated continuous control tasks from the MuJoCo environments in OpenAI Gym: HalfCheetah, Hopper, Ant, and Walker. We start with A2C as the underlying policy gradient algorithm, and use HOOF to automatically learn the learning rate. We compare this to two baselines: the learning rate set to the OpenAI Baselines default, and using meta-gradients to learn the learning rate. The results below show that HOOF is competitive with both baselines, even though these would have taken multiple training runs to achieve their performance.

*A2C Experiment Results*

We also evaluated HOOF with TRPO, a popular trust region-based policy gradient method that has been shown to outperform Truncated Natural Policy Gradients (TNPG) in continuous control tasks. While this result has been attributed to TRPO's stricter enforcement of the KL constraint, we show that such enforcement becomes unnecessary once we properly adapt TNPG's KL constraint. To do so, we applied HOOF to learn the KL constraint and GAE($\lambda$) of TNPG ('HOOF-TNPG'), and compared it against TRPO with the OpenAI Baselines default settings ('Baseline TRPO'). The results presented below show that HOOF-TNPG learns much faster, and outperforms Baseline TRPO in all environments except Walker, where there is no significant difference.

*TNPG Experiment Results*

## Conclusion

In this post, we presented HOOF, a highly sample efficient way of learning the hyperparameters of policy gradient methods. The simplicity of HOOF also makes it quite easy to implement, which makes it a viable alternative to grid search even in problems where sample efficiency is not the primary consideration. If you are interested, you can find more details in the paper or our implementation.

Blog post: Supratik Paul, Vitaly Kurin, Shimon Whiteson.