gradient-descent (136 bookmarks)

[1902.06720] Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
neural-net  linear-models  gradient-descent 
4 weeks ago by arsyed
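
A minimal sketch of the linearization the abstract refers to, written in JAX; the toy MLP, the parameter shapes, and the `linearize` helper are illustrative assumptions of mine, not the authors' code:

```python
# Illustrative sketch: approximate a network f(params, x) by its first-order
# Taylor expansion around the initial parameters params0 (the "linearized model").
import jax
import jax.numpy as jnp

def mlp(params, x):
    # toy two-layer network; params = (W1, b1, W2, b2) -- purely illustrative
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)
    return h @ W2 + b2

def linearize(f, params0):
    """Return f_lin(params, x) = f(params0, x) + J_theta f(params0, x) . (params - params0)."""
    def f_lin(params, x):
        delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        y0, dy = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
        return y0 + dy
    return f_lin

# example: build the linearized model at a random initialization (shapes are arbitrary)
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params0 = (jax.random.normal(k1, (3, 512)), jnp.zeros(512),
           jax.random.normal(k2, (512, 1)) / jnp.sqrt(512.0), jnp.zeros(1))
mlp_lin = linearize(mlp, params0)
```

The paper's claim is that, as width grows, gradient-descent training of `mlp` and of `linearize(mlp, params0)` produce increasingly similar predictions.
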
[1812.11592] A Geometric Theory of Higher-Order Automatic Differentiation
First-order automatic differentiation is a ubiquitous tool across statistics, machine learning, and computer science. Higher-order implementations of automatic differentiation, however, have yet to realize the same utility. In this paper I derive a comprehensive, differential geometric treatment of automatic differentiation that naturally identifies the higher-order differential operators amenable to automatic differentiation as well as explicit procedures that provide a scaffolding for high-performance implementations.
gradient-descent  automatic-differentiation  algorithms  topology  algebraic-topology  the-shape-of-data  machine-learning  rather-interesting  to-understand 
8 weeks ago by Vaguery
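
The paper gives a geometric account of which higher-order differential operators admit automatic differentiation; as a much narrower, concrete illustration of higher-order AD itself, nested first-order AD already yields exact higher derivatives. The example function below is an assumption of mine, not taken from the paper:

```python
# Higher-order derivatives by composing first-order AD (illustrative sketch only).
import jax
import jax.numpy as jnp

def f(x):
    return x * jnp.sin(x)          # f(x)   = x sin x

df  = jax.grad(f)                  # f'(x)  = sin x + x cos x
d2f = jax.grad(df)                 # f''(x) = 2 cos x - x sin x
d3f = jax.grad(d2f)                # f'''(x) = -3 sin x - x cos x

x = 1.0
print(df(x), d2f(x), d3f(x))       # exact derivatives, no finite differences
```

The paper's concern is the step beyond this mechanical nesting: characterizing, in differential-geometric terms, which higher-order operators can be propagated efficiently and correctly.
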
Nevergrad: An open source tool for derivative-free optimization - Facebook Code
We are open-sourcing Nevergrad, a Python 3 library that makes it easier to perform the gradient-free optimization used in many machine learning tasks.
optimization  machine-learning  gradient-descent 
december 2018 by pmigdal
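
Nevergrad ships many derivative-free optimizers; the sketch below is not Nevergrad code or its API, just a hand-rolled (1+1) evolution strategy showing what "gradient-free" means in practice: the objective is queried as a black box and no derivatives are ever computed. The `sphere` objective is an arbitrary example:

```python
# Illustrative (1+1) evolution strategy: keep one candidate, accept a Gaussian
# perturbation whenever it improves the black-box loss. No gradients involved.
import random

def one_plus_one_es(loss, x0, sigma=0.1, budget=1000):
    best_x, best_f = list(x0), loss(x0)
    for _ in range(budget):
        cand = [xi + random.gauss(0.0, sigma) for xi in best_x]
        f = loss(cand)
        if f < best_f:
            best_x, best_f = cand, f
    return best_x, best_f

# example black-box objective (the optimizer never sees its structure)
sphere = lambda x: sum(xi * xi for xi in x)
x_star, f_star = one_plus_one_es(sphere, x0=[2.0, -1.5])
```
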
[1811.03804] Gradient Descent Finds Global Minima of Deep Neural Networks
Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. Our bounds also shed light on the advantage of using ResNet over the fully connected feedforward architecture; our bound requires the number of neurons per layer scaling exponentially with depth for feedforward networks whereas for ResNet the bound only requires the number of neurons per layer scaling polynomially with depth. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
dnn  neural-net  analysis  gradient-descent  optimization 
november 2018 by arsyed
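
A toy experiment in the spirit of the result, not the paper's ResNet setting: with a heavily over-parameterized one-hidden-layer ReLU network, full-batch gradient descent typically drives the squared training loss toward zero despite non-convexity. The width, step size, and random labels below are illustrative assumptions:

```python
# Illustrative sketch: over-parameterized net, full-batch gradient descent,
# squared training loss shrinking toward zero even though the loss is non-convex.
import jax
import jax.numpy as jnp

n, d, m = 10, 10, 1024                     # 10 samples, hidden width 1024 >> n
kx, ky, kw, ka = jax.random.split(jax.random.PRNGKey(0), 4)
X = jax.random.normal(kx, (n, d))
y = jax.random.normal(ky, (n,))            # arbitrary labels -- we only ask GD to fit them

params = (jax.random.normal(kw, (d, m)),   # hidden weights
          jax.random.normal(ka, (m,)))     # output weights (1/sqrt(m) scaling in the forward pass)

def loss(params, X, y):
    W, a = params
    preds = jax.nn.relu(X @ W) @ a / jnp.sqrt(m)
    return 0.5 * jnp.mean((preds - y) ** 2)

grad_fn = jax.jit(jax.grad(loss))
lr = 0.1
for step in range(2001):
    if step % 500 == 0:
        print(step, float(loss(params, X, y)))   # training loss should head toward ~0
    g = grad_fn(params, X, y)
    params = tuple(p - lr * gp for p, gp in zip(params, g))
```
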

