gradient-descent   89

Gradient Descent only Converges to Minimizers on Vimeo
"We show that gradient descent generically does not converge to saddle points. This is proved using the Stable Manifold theorem from dynamical systems theory."
talks  optimization  gradient-descent  saddle-point 
11 weeks ago by arsyed
The Gradient: A Visual Descent
In this post I aim to visually, mathematically and programmatically explain the gradient, and why understanding it is crucial for gradient descent.
gradient-descent 
11 weeks ago by Tafkas
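The loop the post builds up to is short; a minimal numpy sketch (the test function, step size, and stopping rule below are illustrative, not taken from the post):

    import numpy as np

    def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
        """Minimize a differentiable function by stepping against its gradient."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:   # gradient near zero: (approximately) stationary
                break
            x = x - lr * g                # move downhill along the negative gradient
        return x

    # Example: f(x, y) = (x - 3)^2 + 2y^2 has gradient (2(x - 3), 4y), minimum at (3, 0).
    print(gradient_descent(lambda v: np.array([2 * (v[0] - 3), 4 * v[1]]), [0.0, 1.0]))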
Decoding the Enigma with Recurrent Neural Networks
I am blown away by this -- since Recurrent Neural Networks are Turing-complete, they can, given sufficient resources, actually automate cryptanalysis, at least to the degree of simulating the internal workings of the Enigma algorithm from plaintext, ciphertext and key:
The model needed to be very large to capture all the Enigma’s transformations. I had success with a single-celled LSTM model with 3000 hidden units. Training involved about a million steps of batched gradient descent: after a few days on a k40 GPU, I was getting 96-97% accuracy!
machine-learning  deep-learning  rnns  enigma  crypto  cryptanalysis  turing  history  gpus  gradient-descent 
12 weeks ago by jm
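The blog's trained model isn't reproduced here, but the shape of the experiment is easy to sketch. A hedged PyTorch toy (assumptions: a trivial position-shifting substitution cipher stands in for Enigma, the network is far smaller than the post's 3000-unit LSTM, and the batch size, learning rate, and step count are illustrative, so expect only partial accuracy this quickly):

    import torch
    import torch.nn as nn

    V, T = 26, 16    # alphabet size, message length

    class KeyedLSTM(nn.Module):
        def __init__(self, hidden=128):
            super().__init__()
            self.emb = nn.Embedding(V, 32)    # ciphertext characters
            self.key = nn.Embedding(V, 32)    # key character, repeated at every step
            self.lstm = nn.LSTM(64, hidden, batch_first=True)
            self.out = nn.Linear(hidden, V)

        def forward(self, ct, k):
            kv = self.key(k).unsqueeze(1).expand(-1, ct.size(1), -1)
            h, _ = self.lstm(torch.cat([self.emb(ct), kv], dim=-1))
            return self.out(h)

    net = KeyedLSTM()
    opt = torch.optim.SGD(net.parameters(), lr=0.1)   # batched gradient descent
    for step in range(2000):
        p = torch.randint(0, V, (64, T))              # random "plaintext"
        k = torch.randint(0, V, (64,))                # per-sequence key
        ct = (p + k.unsqueeze(1) + torch.arange(T)) % V   # toy rotor-like cipher
        loss = nn.functional.cross_entropy(net(ct, k).reshape(-1, V), p.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 500 == 0:
            print(step, loss.item())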
How to Escape Saddle Points Efficiently – Off the convex path
A core, emerging problem in nonconvex optimization involves the escape of saddle points. While recent research has shown that gradient descent (GD) generically escapes saddle points asymptotically (see Rong Ge’s and Ben Recht’s blog posts), the critical open problem is one of efficiency — is GD able to move past saddle points quickly, or can it be slowed down significantly? How does the rate of escape scale with the ambient dimensionality? In this post, we describe our recent work with Rong Ge, Praneeth Netrapalli and Sham Kakade, which provides the first provable positive answer to the efficiency question, showing that, rather surprisingly, GD augmented with suitable perturbations escapes saddle points efficiently; indeed, in terms of rate and dimension dependence it is almost as if the saddle points aren’t there!
acmtariat  org:bleg  nibble  liner-notes  machine-learning  acm  optimization  gradient-descent  local-global  off-convex  time-complexity  random  perturbation  michael-jordan  iterative-methods  research  learning-theory  math.DS  iteration-recursion 
july 2017 by nhaliday
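The perturbation idea is compact in code. A hedged numpy sketch (the threshold, radius, and schedule here are ad hoc; the paper's actual PGD algorithm derives them from smoothness constants and perturbs at most once per escape window):

    import numpy as np

    def perturbed_gd(grad, x0, lr=0.01, g_thresh=1e-3, radius=1e-2,
                     steps=5_000, rng=np.random.default_rng(0)):
        """Gradient descent plus random kicks whenever the gradient is small."""
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            g = grad(x)
            if np.linalg.norm(g) < g_thresh:
                # Near-stationary: could be a saddle. Sample a small random
                # direction so the iterate can fall off the unstable manifold.
                d = rng.normal(size=x.shape)
                x = x + radius * d / np.linalg.norm(d)
            else:
                x = x - lr * g
        return x

    # f(x, y) = x^2 - y^2 + y^4/2: saddle at the origin, minima at (0, +/-1).
    # Plain GD from (1, 0) converges to the saddle; perturbed GD escapes it.
    grad = lambda v: np.array([2 * v[0], 2 * v[1] * (v[1]**2 - 1)])
    print(perturbed_gd(grad, [1.0, 0.0]))   # ~ (0, +/-1)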
[1704.04289] Stochastic Gradient Descent as Approximate Bayesian Inference
"Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler."
papers  optimization  gradient-descent  sgd  bayesian 
july 2017 by arsyed
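Result (1) is easy to see on a conjugate toy problem. A hedged numpy sketch (Gaussian mean estimation invented for illustration; batch size and learning rate are untuned, whereas the paper derives the step size that best matches the posterior):

    import numpy as np

    rng = np.random.default_rng(0)
    n, theta_true = 1000, 2.0
    y = theta_true + rng.normal(size=n)

    # Exact posterior for theta with N(0, 1) prior and N(theta, 1) likelihood.
    post_mean, post_var = y.sum() / (n + 1), 1.0 / (n + 1)

    # Constant-step-size SGD on the minibatch negative log posterior.
    B, lr = 32, 1e-4
    theta, samples = 0.0, []
    for t in range(20_000):
        batch = y[rng.integers(0, n, size=B)]
        grad = theta + (n / B) * np.sum(theta - batch)   # prior term + scaled likelihood
        theta -= lr * grad
        if t > 5_000:              # discard burn-in, keep stationary iterates
            samples.append(theta)

    print(post_mean, post_var)                  # exact posterior
    print(np.mean(samples), np.var(samples))    # SGD's stationary distribution

The iterates center on the posterior mean; getting the variance to match as well is exactly what the paper's step-size tuning is for.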
[1609.05191] Gradient Descent Learns Linear Dynamical Systems
"We prove that gradient descent efficiently converges to the global optimizer of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system. Even though the objective function is non-convex, we provide polynomial running time and sample complexity bounds under strong but natural assumptions. Linear systems identification has been studied for many decades, yet, to the best of our knowledge, these are the first polynomial guarantees for the problem we consider."
papers  gradient-descent  dynamical 
june 2017 by arsyed
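A scalar toy version of the setup (my own simplification: a first-order SISO system with forward-sensitivity gradients; the noise level and learning rate are invented) shows GD recovering the system parameters from noisy observations:

    import numpy as np

    rng = np.random.default_rng(0)
    T, a_true, b_true = 500, 0.8, 0.5
    u = rng.normal(size=T)                      # known inputs

    def simulate(a, b):
        # x_{t+1} = a x_t + b u_t, plus forward sensitivities dx/da, dx/db
        x, sa, sb = np.zeros(T + 1), np.zeros(T + 1), np.zeros(T + 1)
        for t in range(T):
            x[t + 1]  = a * x[t] + b * u[t]
            sa[t + 1] = x[t] + a * sa[t]
            sb[t + 1] = u[t] + a * sb[t]
        return x, sa, sb

    y = simulate(a_true, b_true)[0][1:] + 0.05 * rng.normal(size=T)  # noisy observations

    a, b, lr = 0.0, 0.0, 0.05
    for _ in range(2_000):
        x, sa, sb = simulate(a, b)
        err = x[1:] - y                         # prediction error
        a -= lr * 2 * np.mean(err * sa[1:])     # chain rule through the rollout
        b -= lr * 2 * np.mean(err * sb[1:])
    print(a, b)                                 # close to (0.8, 0.5)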
Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented *without any locking*. We present an update scheme called Hogwild which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild outperforms alternative schemes that use locking by an order of magnitude.
papers  optimization  parallel  ben-recht  sgd  gradient-descent 
june 2017 by arsyed
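The scheme is almost embarrassingly simple to write down. A hedged toy sketch with Python threads (illustration only: CPython's GIL serializes most of the work, so this shows the lock-free update pattern and the sparsity condition rather than real speedup; the problem and constants are invented):

    import numpy as np
    import threading

    rng = np.random.default_rng(0)
    d, n, k = 1000, 5000, 10                    # dims, examples, nonzeros per example
    idx = rng.integers(0, d, size=(n, k))       # sparse support of each example
    val = rng.normal(size=(n, k))
    w_true = rng.normal(size=d)
    labels = np.sign([val[i] @ w_true[idx[i]] for i in range(n)])

    w = np.zeros(d)                             # shared weights, never locked

    def worker(seed, steps=20_000, lr=0.05):
        r = np.random.default_rng(seed)
        for _ in range(steps):
            i = r.integers(0, n)
            ii, vv = idx[i], val[i]
            margin = labels[i] * (vv @ w[ii])
            g = -labels[i] * vv / (1.0 + np.exp(margin))   # logistic-loss gradient
            w[ii] -= lr * g                     # touches only k of d coordinates

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()

    acc = np.mean([np.sign(val[i] @ w[idx[i]]) == labels[i] for i in range(n)])
    print(f"training accuracy: {acc:.3f}")

Because each update writes only a few coordinates, concurrent overwrites are rare on sparse problems; that sparsity is precisely the condition under which the paper proves near-optimal convergence.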
[1509.01240] Train faster, generalize better: Stability of stochastic gradient descent
"We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions.
Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit."
papers  optimization  generalization  gradient-descent  sgd  ben-recht 
june 2017 by arsyed
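The stability notion can be probed directly. A hedged numpy sketch (my own construction, not the paper's experiments): run SGD with an identical sampling order on two training sets that differ in a single example, and watch the parameter distance, which controls generalization error in the Bousquet-Elisseeff sense, grow with training time:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 20
    X = rng.normal(size=(n, d))
    y = np.sign(X @ rng.normal(size=d))
    X2, y2 = X.copy(), y.copy()
    X2[0], y2[0] = rng.normal(size=d), 1.0      # perturb one training example

    def sgd(X, y, epochs, lr=0.1, seed=1):
        w = np.zeros(d)
        order = np.random.default_rng(seed)     # identical sampling on both runs
        for _ in range(epochs):
            for i in order.permutation(n):      # SGD step on the logistic loss
                w += lr * y[i] * X[i] / (1 + np.exp(y[i] * (X[i] @ w)))
        return w

    for epochs in (1, 5, 25, 125):
        gap = np.linalg.norm(sgd(X, y, epochs) - sgd(X2, y2, epochs))
        print(f"{epochs:4d} epochs: parameter distance {gap:.4f}")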
A Dive into Stack Overflow Jobs Search – Aurélien Gasser – Medium
Finding the optimal combination of weights for the matching algorithm is an optimization problem: our goal is to minimize the median rank of the jobs applied to. The lower the median rank, the better we are at predicting the jobs the visitors are going to apply to. There are many techniques to tackle optimization problems. What worked best is the optim function offered by the R language. It uses a technique called gradient descent, which performs much better than genetic algorithms in our case. The weights it finds are slightly better, and more importantly, it is about 70x faster. Concretely, finding the best combination of weights now takes about 10 minutes, versus about 12 hours with a genetic algorithm!
search  search-engine  details  jobs  r  gradient-descent 
may 2017 by hellsten
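Their code uses R's optim; a rough Python analog of the described setup (everything below is invented for illustration: a linear scoring model over match features, and a differentiable softmax surrogate standing in for the median rank, which is not itself differentiable):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n_visitors, n_jobs, n_feats = 200, 50, 5
    F = rng.normal(size=(n_visitors, n_jobs, n_feats))   # match features
    applied = rng.integers(0, n_jobs, size=n_visitors)   # job each visitor chose

    def soft_rank_loss(w):
        scores = F @ w                                   # (visitors, jobs)
        # Softmax cross-entropy pushes the applied job's score above the rest,
        # a smooth proxy for "minimize the rank of the applied job".
        z = scores - scores.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(n_visitors), applied].mean()

    res = minimize(soft_rank_loss, np.zeros(n_feats), method="BFGS")
    print(res.x)    # learned feature weights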
[1705.10412] Gradient Descent Can Take Exponential Time to Escape Saddle Points
"Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points - it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings."
papers  optimization  gradient-descent  saddle-point 
may 2017 by arsyed
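A minimal illustration of the stalling mechanism (not the paper's full construction, which chains many saddles together to get exponential time): f(x, y) = x^2 - y^2 + y^4/2 has a saddle at the origin, and plain GD started near the stable manifold y = 0 lingers there, longer the closer the start:

    import numpy as np

    lr = 0.01
    grad = lambda v: np.array([2 * v[0], 2 * v[1] * (v[1]**2 - 1)])

    for y0 in (1e-2, 1e-5, 1e-8, 1e-11):
        v, steps = np.array([1.0, y0]), 0
        while abs(v[1]) < 0.5:          # still stuck on the saddle plateau
            v = v - lr * grad(v)
            steps += 1
        print(f"y0 = {y0:.0e}: escaped after {steps} steps")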
[1705.08292] The Marginal Value of Adaptive Gradient Methods in Machine Learning
"Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks."
papers  optimization  adaptive  gradient-descent  sgd 
may 2017 by arsyed
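A small numpy instance of the theme (a toy underdetermined least-squares problem, not the paper's separable-classification construction): started from zero, GD converges to the minimum-norm interpolant, while AdaGrad's per-coordinate scaling steers it to a different solution with comparable training error:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 20, 100                       # fewer equations than unknowns
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

    def fit(adaptive, steps=20_000, lr=0.1):
        w, G = np.zeros(d), np.zeros(d)
        for _ in range(steps):
            g = X.T @ (X @ w - y) / n
            if adaptive:                 # AdaGrad: per-coordinate step sizes
                G += g * g
                w -= lr * g / (np.sqrt(G) + 1e-8)
            else:                        # plain gradient descent
                w -= lr * g
        return w

    for name, w in (("GD", fit(False)), ("AdaGrad", fit(True))):
        print(f"{name:8s} train_err={np.linalg.norm(X @ w - y):.2e} "
              f"dist_to_min_norm={np.linalg.norm(w - w_min_norm):.3f}")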
