**gradient-descent**89

3.17.3 Gradient Descent only Converges to Minimizers on Vimeo

11 weeks ago by arsyed

"We show that gradient descent generically does not converge to saddle points. This is proved using the Stable Manifold theorem from dynamical systems theory."

talks
optimization
gradient-descent
saddle-point
11 weeks ago by arsyed

The Gradient: A Visual Descent

11 weeks ago by Tafkas

In this post I aim to visually, mathematically and programatically explain the gradient, and how its understanding is crucial for gradient descent.

gradient-descent
11 weeks ago by Tafkas

Decoding the Enigma with Recurrent Neural Networks

12 weeks ago by jm

I am blown away by this -- given that Recurrent Neural Networks are Turing-complete, they can actually automate cryptanalysis given sufficient resources, at least to the degree of simulating the internal workings of the Enigma algorithm given plaintext, ciphertext and key:

machine-learning
deep-learning
rnns
enigma
crypto
cryptanalysis
turing
history
gpus
gradient-descent
The model needed to be very large to capture all the Enigma’s transformations. I had success with a single-celled LSTM model with 3000 hidden units. Training involved about a million steps of batched gradient descent: after a few days on a k40 GPU, I was getting 96-97% accuracy!

12 weeks ago by jm

How to Escape Saddle Points Efficiently – Off the convex path

july 2017 by nhaliday

A core, emerging problem in nonconvex optimization involves the escape of saddle points. While recent research has shown that gradient descent (GD) generically escapes saddle points asymptotically (see Rong Ge’s and Ben Recht’s blog posts), the critical open problem is one of efficiency — is GD able to move past saddle points quickly, or can it be slowed down significantly? How does the rate of escape scale with the ambient dimensionality? In this post, we describe our recent work with Rong Ge, Praneeth Netrapalli and Sham Kakade, that provides the first provable positive answer to the efficiency question, showing that, rather surprisingly, GD augmented with suitable perturbations escapes saddle points efficiently; indeed, in terms of rate and dimension dependence it is almost as if the saddle points aren’t there!

acmtariat
org:bleg
nibble
liner-notes
machine-learning
acm
optimization
gradient-descent
local-global
off-convex
time-complexity
random
perturbation
michael-jordan
iterative-methods
research
learning-theory
math.DS
iteration-recursion
july 2017 by nhaliday

[1704.04289] Stochastic Gradient Descent as Approximate Bayesian Inference

july 2017 by arsyed

"Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler."

papers
optimization
gradient-descent
sgd
bayesian
july 2017 by arsyed

[1609.05191] Gradient Descent Learns Linear Dynamical Systems

june 2017 by arsyed

"We prove that gradient descent efficiently converges to the global optimizer of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system. Even though the objective function is non-convex, we provide polynomial running time and sample complexity bounds under strong but natural assumptions. Linear systems identification has been studied for many decades, yet, to the best of our knowledge, these are the first polynomial guarantees for the problem we consider."

papers
gradient-descent
dynamical
june 2017 by arsyed

Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

june 2017 by arsyed

Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented *without any locking*. We present an update scheme called Hogwild which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild outperforms alternative schemes that use locking by an order of magnitude.

papers
optimization
parallel
ben-recht
sgd
gradient-descent
june 2017 by arsyed

[1509.01240] Train faster, generalize better: Stability of stochastic gradient descent

june 2017 by arsyed

"We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions.

Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit."

papers
optimization
generalization
gradient-descent
sgd
ben-recht
Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit."

june 2017 by arsyed

A Dive into Stack Overflow Jobs Search – Aurélien Gasser – Medium

may 2017 by hellsten

Finding the optimal combination of weights for the matching algorithm is an optimization problem: our goal is to minimize the median rank of the jobs applied to. The lower the median rank, the better we are at predicting the jobs the visitors are going to apply to. There are many techniques to tackle optimization problems. What worked best is the optim function offered by the R language. It uses a technique called gradient descent, which performs much better than genetic algorithms in our case. The weights it finds are slightly better, and more importantly, it is about 70x faster. Concretely, finding the best combination of weights now takes about 10 minutes, versus about 12 hours with a genetic algorithm!

search
search-engine
details
jobs
r
gradient-descent
may 2017 by hellsten

[1705.10412] Gradient Descent Can Take Exponential Time to Escape Saddle Points

may 2017 by arsyed

"Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points - it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings."

papers
optimization
gradient-descent
saddle-point
may 2017 by arsyed

[1705.08292v1] The Marginal Value of Adaptive Gradient Methods in Machine Learning

may 2017 by arsyed

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.

papers
optimization
adaptive
gradient-descent
sgd
may 2017 by arsyed

[1705.08292] The Marginal Value of Adaptive Gradient Methods in Machine Learning

may 2017 by arsyed

"Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks."

papers
optimization
adaptive
gradient-descent
sgd
may 2017 by arsyed

**related tags**

Copy this bookmark: