Towards an Understanding of Default Policies in Multitask Policy Optimization

Moskovitz, Ted,
Arbel, Michael,
Parker-Holder, Jack,
and Pacchiano, Aldo

International Conference on Artificial Intelligence and Statistics (AISTATS)
2022

Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms, with strong performance across multiple domains. In this family of methods, agents are trained to maximize cumulative reward while penalizing deviation in behavior from some reference, or default policy. In addition to empirical success, there is a strong theoretical foundation for understanding RPO methods applied to single tasks, with connections to natural gradient, trust region, and variational approaches. However, there is limited formal understanding of desirable properties for default policies in the multitask setting, an increasingly important domain as the field shifts towards training more generally capable agents. Here, we take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization. Using these results, we then derive a principled RPO algorithm for multitask learning with strong performance guarantees.

Amortized Implicit Differentiation for Stochastic Bilevel Optimization

Arbel, Michael,
and Mairal, Julien

International Conference on Learning Representations (ICLR)
2022

We study a class of algorithms for solving bilevel optimization problems in both stochastic and deterministic settings when the inner-level objective is strongly convex. Specifically, we consider algorithms based on inexact implicit differentiation and we exploit a warm-start strategy to amortize the estimation of the exact gradient. We then introduce a unified theoretical framework inspired by the study of singularly perturbed systems (Habets, 1974) to analyze such amortized algorithms. By using this framework, our analysis shows these algorithms to match the computational complexity of oracle methods that have access to an unbiased estimate of the gradient, thus outperforming many existing results for bilevel optimization. We illustrate these findings on synthetic experiments and demonstrate the efficiency of these algorithms on hyper-parameter optimization experiments involving several thousands of variables.

Continual Repeated Annealed Flow Transport Monte Carlo

Matthews, Alexander GDG,
Arbel, Michael,
Rezende, Danilo J,
and Doucet, Arnaud

arXiv preprint arXiv:2201.13117
2022

We propose Continual Repeated Annealed Flow Transport Monte Carlo (CRAFT), a method that combines a sequential Monte Carlo (SMC) sampler (itself a generalization of Annealed Importance Sampling) with variational inference using normalizing flows. The normalizing flows are directly trained to transport between annealing temperatures using a KL divergence for each transition. This optimization objective is itself estimated using the normalizing flow/SMC approximation. We show conceptually and using multiple empirical examples that CRAFT improves on Annealed Flow Transport Monte Carlo (Arbel et al., 2021), on which it builds and also on Markov chain Monte Carlo (MCMC) based Stochastic Normalizing Flows (Wu et al., 2020). By incorporating CRAFT within particle MCMC, we show that such learnt samplers can achieve impressively accurate results on a challenging lattice field theory example.

Learning Unnormalized Likelihood Models for Simulation-Based Inference

Glaser, Pierre,
Arbel, Michael,
Doucet, Arnaud,
and Gretton, Arthur

work in progress
2022

We introduce an algorithm for Simulation-Based Inference (SBI), to conduct inference from limited experimental observations when a high-fidelity simulator is available.
Our method, SBI-EBM, learns an energy-based model (EBM) of the likelihood using synthetic data generated by the simulator, conditioned on parameters drawn from a proposal distribution.
Our EBM is trained in a multi-round manner, meaning that at each round, new parameters are sampled which focus the simulator on outputs near the true observations.
Central to SBI-EBM is its use of Sequential Monte Carlo, allowing for efficient mode exploration.
The EBM used in SBI-EBM is trained directly by optimizing a KL loss: this is in contrast to simulation-based models which either rely on flow-based density estimators, or use noise-contrastive objectives: choices that come with known pitfalls.
We show that SBI-EBM attains competitive performance on a range of benchmarks.

Efficient wasserstein natural gradients for reinforcement learning

Moskovitz, Ted*,
Arbel, Michael*,
Huszar, Ferenc,
and Gretton, Arthur

International Conference on Learning Representations (ICLR)
2021

A novel optimization approach is proposed for application to policy gradient methods and evolution strategies for reinforcement learning (RL). The procedure uses a computationally efficient Wasserstein natural gradient (WNG) descent that takes advantage of the geometry induced by a Wasserstein penalty to speed optimization. This method follows the recent theme in RL of including a divergence penalty in the objective to establish a trust region. Experiments on challenging tasks demonstrate improvements in both computational cost and performance over advanced baselines.

The Unreasonable Effectiveness of Patches in Deep Convolutional Kernels Methods

Thiry, Louis,
Arbel, Michael,
Belilovsky, Eugene,
and Oyallon, Edouard

International Conference on Learning Representations (ICLR)
2021

A recent line of work showed that various forms of convolutional kernel methods can be competitive with standard supervised deep convolutional networks on datasets like CIFAR-10, obtaining accuracies in the range of 87-90% while being more amenable to theoretical analysis. In this work, we highlight the importance of a data-dependent feature extraction step that is key to the obtain good performance in convolutional kernel methods. This step typically corresponds to a whitened dictionary of patches, and gives rise to a data-driven convolutional kernel methods. We extensively study its effect, demonstrating it is the key ingredient for high performance of these methods. Specifically, we show that one of the simplest instances of such kernel methods, based on a single layer of image patches followed by a linear classifier is already obtaining classification accuracies on CIFAR-10 in the same range as previous more sophisticated convolutional kernel methods. We scale this method to the challenging ImageNet dataset, showing such a simple approach can exceed all existing non-learned representation methods. This is a new baseline for object recognition without representation learning methods, that initiates the investigation of convolutional kernel models on ImageNet. We conduct experiments to analyze the dictionary that we used, our ablations showing they exhibit low-dimensional properties.

International Conference on Learning Representations (ICLR)
2021

Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms, with strong performance across multiple domains. In this family of methods, agents are trained to maximize cumulative reward while penalizing deviation in behavior from some reference, or default policy. In addition to empirical success, there is a strong theoretical foundation for understanding RPO methods applied to single tasks, with connections to natural gradient, trust region, and variational approaches. However, there is limited formal understanding of desirable properties for default policies in the multitask setting, an increasingly important domain as the field shifts towards training more generally capable agents. Here, we take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization. Using these results, we then derive a principled RPO algorithm for multitask learning with strong performance guarantees.

Arbel, Michael*,
Matthews, Alexander GDG*,
and Doucet, Arnaud

International Conference on Machine Learning (ICML)
2021

Annealed Importance Sampling (AIS) and its Sequential Monte Carlo (SMC) extensions are state-of-the-art methods for estimating normalizing constants of probability distributions. We propose here a novel Monte Carlo algorithm, Annealed Flow Transport (AFT), that builds upon AIS and SMC and combines them with normalizing flows (NFs) for improved performance. This method transports a set of particles using not only importance sampling (IS), Markov chain Monte Carlo (MCMC) and resampling steps - as in SMC, but also relies on NFs which are learned sequentially to push particles towards the successive annealed targets. We provide limit theorems for the resulting Monte Carlo estimates of the normalizing constant and expectations with respect to the target distribution. Additionally, we show that a continuous-time scaling limit of the population version of AFT is given by a Feynman–Kac measure which simplifies to the law of a controlled diffusion for expressive NFs. We demonstrate experimentally the benefits and limitations of our methodology on a variety of applications.

Tactical Optimism and Pessimism for Deep Reinforcement Learning

Moskovitz, Ted,
Parker-Holder, Jack,
Pacchiano, Aldo,
Arbel, Michael,
and Jordan, Michael I

Adv. Neural Information Processing Systems (NeurIPS)
2021

In recent years, deep off-policy actor-critic algorithms have become a dominant approach to reinforcement learning for continuous control. One of the primary drivers of this improved performance is the use of pessimistic value updates to address function approximation errors, which previously led to disappointing performance. However, a direct consequence of pessimism is reduced exploration, running counter to theoretical support for the efficacy of optimism in the face of uncertainty. So which approach is best? In this work, we show that the most effective degree of optimism can vary both across tasks and over the course of learning. Inspired by this insight, we introduce a novel deep actor-critic framework, Tactical Optimistic and Pessimistic (TOP) estimation, which switches between optimistic and pessimistic value learning online. This is achieved by formulating the selection as a multi-arm bandit problem. We show in a series of continuous control tasks that TOP outperforms existing methods which rely on a fixed degree of optimism, setting a new state of the art in challenging pixel-based environments. Since our changes are simple to implement, we believe these insights can easily be incorporated into a multitude of off-policy algorithms.

KALE Flow: A Relaxed KL Gradient Flow for Probabilities with Disjoint Support

Glaser, Pierre,
Arbel, Michael,
and Gretton, Arthur

Adv. Neural Information Processing Systems (NeurIPS)
2021

We study the gradient flow for a relaxed approximation to the Kullback-Leibler (KL) divergence between a moving source and a fixed target distribution. This approximation, termed the KALE (KL approximate lower-bound estimator), solves a regularized version of the Fenchel dual problem defining the KL over a restricted class of functions. When using a Reproducing Kernel Hilbert Space (RKHS) to define the function class, we show that the KALE continuously interpolates between the KL and the Maximum Mean Discrepancy (MMD). Like the MMD and other Integral Probability Metrics, the KALE remains well defined for mutually singular distributions. Nonetheless, the KALE inherits from the limiting KL a greater sensitivity to mismatch in the support of the distributions, compared with the MMD. These two properties make the KALE gradient flow particularly well suited when the target distribution is supported on a low-dimensional manifold. Under an assumption of sufficient smoothness of the trajectories, we show the global convergence of the KALE flow. We propose a particle implementation of the flow given initial samples from the source and the target distribution, which we use to empirically confirm the KALE’s properties.

Arbel, Michael,
Gretton, Arthur,
Li, Wuchen,
and Montufar, Guido

International Conference on Learning Representations (ICLR)
2020

Barycentric averaging is a principled way of summarizing populations of measures. Existing algorithms for estimating barycenters typically parametrize them as weighted sums of Diracs and optimize their weights and/or locations. However, these approaches do not scale to high-dimensional settings due to the curse of dimensionality. In this paper, we propose a scalable and general algorithm for estimating barycenters of measures in high dimensions. The key idea is to turn the optimization over measures into an optimization over generative models, introducing inductive biases that allow the method to scale while still accurately estimating barycenters. We prove local convergence under mild assumptions on the discrepancy showing that the approach is well-posed. We demonstrate that our method is fast, achieves good performance on low-dimensional problems, and scales to high-dimensional settings. In particular, our approach is the first to be used to estimate barycenters in thousands of dimensions.

Synchronizing Probability Measures on Rotations via Optimal Transport

Birdal, Tolga,
Arbel, Michael,
Simsekli, Umut,
and Guibas, Leonidas

IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
2020

We introduce a new paradigm, measure synchronization, for synchronizing graphs with measure-valued edges. We formulate this problem as maximization of the cycle-consistency in the space of probability measures over relative rotations. In particular, we aim at estimating marginal distributions of absolute orientations by synchronizing the conditional ones, which are defined on the Riemannian manifold of quaternions. Such graph optimization on distributions-on-manifolds enables a natural treatment of multimodal hypotheses, ambiguities and uncertainties arising in many computer vision applications such as SLAM, SfM, and object pose estimation. We first formally define the problem as a generalization of the classical rotation graph synchronization, where in our case the vertices denote probability measures over rotations. We then measure the quality of the synchronization by using Sinkhorn divergences, which reduces to other popular metrics such as Wasserstein distance or the maximum mean discrepancy as limit cases. We propose a nonparametric Riemannian particle optimization approach to solve the problem. Even though the problem is non-convex, by drawing a connection to the recently proposed sparse optimization methods, we show that the proposed algorithm converges to the global optimum in a special case of the problem under certain conditions. Our qualitative and quantitative experiments show the validity of our approach and we bring in new perspectives to the study of synchronization.

A non-asymptotic analysis for Stein variational gradient descent

Korba, Anna,
Salim, Adil,
Arbel, Michael,
Luise, Giulia,
and Gretton, Arthur

Adv. Neural Information Processing Systems (NeurIPS)
2020

We study the Stein Variational Gradient Descent (SVGD) algorithm, which optimises a set of particles to approximate a target probability distribution π∝e−V on ℝd. In the population limit, SVGD performs gradient descent in the space of probability distributions on the KL divergence with respect to π, where the gradient is smoothed through a kernel integral operator. In this paper, we provide a novel finite time analysis for the SVGD algorithm. We provide a descent lemma establishing that the algorithm decreases the objective at each iteration, and rates of convergence for the average Stein Fisher divergence (also referred to as Kernel Stein Discrepancy). We also provide a convergence result of the finite particle system corresponding to the practical implementation of SVGD to its population version.

Estimating barycenters of measures in high dimensions

Cohen, Samuel,
Arbel, Michael,
and Deisenroth, Marc Peter

arXiv preprint arXiv:2007.07105
2020

Barycentric averaging is a principled way of summarizing populations of measures. Existing algorithms for estimating barycenters typically parametrize them as weighted sums of Diracs and optimize their weights and/or locations. However, these approaches do not scale to high-dimensional settings due to the curse of dimensionality. In this paper, we propose a scalable and general algorithm for estimating barycenters of measures in high dimensions. The key idea is to turn the optimization over measures into an optimization over generative models, introducing inductive biases that allow the method to scale while still accurately estimating barycenters. We prove local convergence under mild assumptions on the discrepancy showing that the approach is well-posed. We demonstrate that our method is fast, achieves good performance on low-dimensional problems, and scales to high-dimensional settings. In particular, our approach is the first to be used to estimate barycenters in thousands of dimensions.

Arbel, Michael,
Korba, Anna,
Salim, Adil,
and Gretton, Arthur

Adv. Neural Information Processing Systems (NeurIPS)
2019

We construct a Wasserstein gradient flow of the maximum mean discrepancy (MMD) and study its convergence properties.
The MMD is an integral probability metric defined for a reproducing kernel Hilbert space (RKHS), and serves as a metric on probability measures for a sufficiently rich RKHS. We obtain conditions for convergence of the gradient flow towards a global optimum, that can be related to particle transport when optimizing neural networks.
We also propose a way to regularize this MMD flow, based on an injection of noise in the gradient. This algorithmic fix comes with theoretical and empirical evidence. The practical implementation of the flow is straightforward, since both the MMD and its gradient have simple closed-form expressions, which can be easily estimated with samples.

International Conference on Artificial Intelligence and Statistics (AISTATS)
2018

A nonparametric family of conditional distributions is introduced, which generalizes conditional exponential families using functional parameters in a suitable RKHS. An algorithm is provided for learning the generalized natural parameter, and consistency of the estimator is established in the well specified case. In experiments, the new method generally outperforms a competing approach with consistency guarantees, and is competitive with a deep conditional density model on datasets that exhibit abrupt transitions and heteroscedasticity.

Efficient and principled score estimation with Nystrom kernel exponential families

Sutherland, Danica J.*,
Strathmann, Heiko*,
Arbel, Michael,
and Gretton, Arthur

International Conference on Artificial Intelligence and Statistics (AISTATS)
2018

We propose a fast method with statistical guarantees for learning an exponential family density model where the natural parameter is in a reproducing kernel Hilbert space, and may be infinite-dimensional. The model is learned by fitting the derivative of the log density, the score, thus avoiding the need to compute a normalization constant. Our approach improves the computational efficiency of an earlier solution by using a low-rank, Nyström-like solution. The new solution retains the consistency and convergence rates of the full-rank solution (exactly in Fisher distance, and nearly in other distances), with guarantees on the degree of cost and storage reduction. We evaluate the method in experiments on density estimation and in the construction of an adaptive Hamiltonian Monte Carlo sampler. Compared to an existing score learning approach using a denoising autoencoder, our estimator is empirically more data-efficient when estimating the score, runs faster, and has fewer parameters (which can be tuned in a principled and interpretable way), in addition to providing statistical guarantees.

Bińkowski, Mikołaj*,
Sutherland, Danica J.*,
Arbel, Michael,
and Gretton, Arthur

International Conference on Learning Representations (ICLR)
2018

We investigate the training and performance of generative adversarial networks using the Maximum Mean Discrepancy (MMD) as critic, termed MMD GANs. As our main theoretical contribution, we clarify the situation with bias in GAN loss functions raised by recent work: we show that gradient estimators used in the optimization process for both MMD GANs and Wasserstein GANs are unbiased, but learning a discriminator based on samples leads to biased gradients for the generator parameters. We also discuss the issue of kernel choice for the MMD critic, and characterize the kernel corresponding to the energy distance used for the Cramer GAN critic. Being an integral probability metric, the MMD benefits from training strategies recently developed for Wasserstein GANs. In experiments, the MMD GAN is able to employ a smaller critic network than the Wasserstein GAN, resulting in a simpler and faster-training algorithm with matching performance. We also propose an improved measure of GAN convergence, the Kernel Inception Distance, and show how to use it to dynamically adapt learning rates during GAN training.

Arbel, Michael*,
Sutherland, Danica J.*,
Bińkowski, Mikołaj,
and Gretton, Arthur

Adv. Neural Information Processing Systems (NeurIPS)
2018

We propose a principled method for gradient-based regularization of the critic of GAN-like models trained by adversarially optimizing the kernel of a Maximum Mean Discrepancy (MMD). We show that controlling the gradient of the critic is vital to having a sensible loss function, and devise a method to enforce exact, analytical gradient constraints at no additional cost compared to existing approximate techniques based on additive regularizers. The new loss function is provably continuous, and experiments show that it stabilizes and accelerates training, giving image generation models that outperform state-of-the art methods on 160×160 CelebA and 64×64 unconditional ImageNet.