[1902.06720] Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
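As a rough illustration of the linearized model the abstract describes, the sketch below builds the first-order Taylor expansion f_lin(x; θ) = f(x; θ0) + ∇θ f(x; θ0) · (θ − θ0) for a small network and compares it with the network itself. The two-layer tanh architecture, the 1/sqrt(width) output scaling, the width of 512, and all function names are assumptions chosen for the example, not taken from the paper.

```python
import numpy as np

def f(params, x):
    """Two-layer tanh network with 1/sqrt(width) output scaling (toy choice)."""
    W, a = params
    return a @ np.tanh(W @ x) / np.sqrt(a.size)

def grad_f(params, x):
    """Hand-derived gradient of the scalar output with respect to (W, a)."""
    W, a = params
    h = np.tanh(W @ x)
    scale = 1.0 / np.sqrt(a.size)
    return scale * np.outer(a * (1.0 - h**2), x), scale * h

def f_lin(params, params0, x):
    """First-order Taylor expansion of f around the initial parameters params0."""
    (W, a), (W0, a0) = params, params0
    dW, da = grad_f(params0, x)
    return f(params0, x) + np.sum(dW * (W - W0)) + da @ (a - a0)

rng = np.random.default_rng(0)
width, dim = 512, 3
params0 = (rng.normal(size=(width, dim)), rng.normal(size=width))
direction = (rng.normal(size=(width, dim)), rng.normal(size=width))
x = rng.normal(size=dim)

def linearization_error(eps):
    """Gap between the network and its linearization after a step of size eps."""
    params = (params0[0] + eps * direction[0], params0[1] + eps * direction[1])
    return abs(f(params, x) - f_lin(params, params0, x))

for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps={eps:g}  |f - f_lin| = {linearization_error(eps):.3e}")
```

This only checks the Taylor expansion itself: the gap should shrink roughly quadratically as the parameters stay closer to initialization. The paper's stronger claim, that gradient descent keeps a wide network inside this regime throughout training, is not tested here.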
[1812.11592] A Geometric Theory of Higher-Order Automatic Differentiation
First-order automatic differentiation is a ubiquitous tool across statistics, machine learning, and computer science. Higher-order implementations of automatic differentiation, however, have yet to realize the same utility. In this paper I derive a comprehensive, differential geometric treatment of automatic differentiation that naturally identifies the higher-order differential operators amenable to automatic differentiation as well as explicit procedures that provide a scaffolding for high-performance implementations.
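One common way to get higher-order derivatives in a single forward pass is to propagate truncated Taylor series (jets) through each operation. The sketch below is a minimal example of that idea, not the procedure derived in the paper; the `Jet` class, `jet_exp`, and `derivatives` are hypothetical names, and only addition, multiplication, and exp are supported.

```python
import math

class Jet:
    """Truncated Taylor series c[0] + c[1] t + ... + c[n] t^n around a point."""

    def __init__(self, coeffs):
        self.c = list(coeffs)

    def __add__(self, other):
        return Jet(a + b for a, b in zip(self.c, other.c))

    def __mul__(self, other):
        # Cauchy product, truncated at the same order as the inputs.
        n = len(self.c)
        return Jet(
            sum(self.c[j] * other.c[k - j] for j in range(k + 1))
            for k in range(n)
        )

def jet_exp(u):
    """exp of a jet, using the recurrence k*e_k = sum_{j=1..k} j*u_j*e_{k-j}."""
    n = len(u.c)
    e = [math.exp(u.c[0])] + [0.0] * (n - 1)
    for k in range(1, n):
        e[k] = sum(j * u.c[j] * e[k - j] for j in range(1, k + 1)) / k
    return Jet(e)

def derivatives(fn, x0, order):
    """Return [f(x0), f'(x0), ..., f^(order)(x0)] from one forward pass."""
    coeffs = [0.0] * (order + 1)
    coeffs[0] = x0
    if order >= 1:
        coeffs[1] = 1.0  # seed: the input is x0 + t
    out = fn(Jet(coeffs))
    # Taylor coefficient k equals f^(k)(x0) / k!, so rescale by k!.
    return [math.factorial(k) * ck for k, ck in enumerate(out.c)]

# f(x) = x * exp(x), whose k-th derivative is (x + k) * exp(x).
values = derivatives(lambda x: x * jet_exp(x), 1.0, 3)
print(values)
```

For f(x) = x * exp(x) at x = 1 the k-th derivative is (1 + k) * e, which gives a simple check on the propagated coefficients.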
Nevergrad: An open source tool for derivative-free optimization - Facebook Code
We are open-sourcing Nevergrad, a Python3 library that makes it easier to perform gradient-free optimizations used in many machine learning tasks.
[1811.03804] Gradient Descent Finds Global Minima of Deep Neural Networks
Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves that gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show that the Gram matrix is stable throughout the training process, and this stability implies the global optimality of the gradient descent algorithm. Our bounds also shed light on the advantage of using ResNet over the fully connected feedforward architecture: for feedforward networks our bound requires the number of neurons per layer to scale exponentially with depth, whereas for ResNet it only requires the number of neurons per layer to scale polynomially with depth. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
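The Gram matrix argument can be made concrete on a toy problem. The sketch below assumes a two-layer tanh network rather than the deep ResNet analysed in the paper, takes the Gram matrix to be J Jᵀ with J the Jacobian of the outputs with respect to all parameters, and uses hypothetical names and hyperparameters (width 2000, five unit-norm inputs, step size 0.1). It tracks the training loss and how far the Gram matrix drifts from its value at initialization.

```python
import numpy as np

def forward(W, a, X):
    """Outputs of a two-layer tanh network on every row of X."""
    return np.tanh(X @ W.T) @ a / np.sqrt(a.size)

def param_grads(W, a, X):
    """Per-example gradients of the output: dW is (n, m, d), da is (n, m)."""
    H = np.tanh(X @ W.T)
    scale = 1.0 / np.sqrt(a.size)
    dW = scale * (a * (1.0 - H**2))[:, :, None] * X[:, None, :]
    return dW, scale * H

def gram(W, a, X):
    """Gram matrix J @ J.T, where row i of J is the gradient for example i."""
    dW, da = param_grads(W, a, X)
    J = np.concatenate([dW.reshape(len(X), -1), da], axis=1)
    return J @ J.T

rng = np.random.default_rng(0)
n, d, m, lr, steps = 5, 3, 2000, 0.1, 200
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
y = rng.normal(size=n)
W, a = rng.normal(size=(m, d)), rng.normal(size=m)

G0 = gram(W, a, X)
losses = []
for _ in range(steps):
    r = forward(W, a, X) - y              # residuals on the training set
    losses.append(0.5 * np.sum(r**2))
    dW, da = param_grads(W, a, X)
    W = W - lr * np.einsum("i,imd->md", r, dW)
    a = a - lr * (r @ da)

G1 = gram(W, a, X)
drift = np.linalg.norm(G1 - G0) / np.linalg.norm(G0)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print(f"min eigenvalue at init: {np.linalg.eigvalsh(G0).min():.4f}")
print(f"relative Gram drift: {drift:.4f}")
```

In this setting the interesting quantities are the smallest eigenvalue of the initial Gram matrix, which controls how quickly the residual can shrink, and the relative drift, which the stability argument says should stay small when the network is wide enough.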
