**arsyed + learning-rate**
12

Léonard Blier on Twitter: "Introducing the Alrao (All learning rates at once) optimization method for neural networks. We randomly sample one learning rate for each feature from a distribution spanning several orders of magnitude. It performs close to op

9 weeks ago by arsyed

More seriously, adaptive gradient methods tend to overfit more, so just need more regularization. Seems like randomizing the lrs could be one too!
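
A rough sketch of the sampling idea described in the tweet, not the authors' implementation: each feature (here, one row of a weight matrix) gets its own learning rate, drawn once from a range spanning several orders of magnitude. The log-uniform distribution, the bounds `lr_min` and `lr_max`, and the helper names `sample_feature_lrs` and `sgd_step` are all assumptions made for illustration.

```python
import math
import random


def sample_feature_lrs(n_features, lr_min=1e-5, lr_max=10.0, seed=None):
    """Draw one learning rate per feature, log-uniformly in [lr_min, lr_max].

    The log-uniform choice and the default bounds are illustrative assumptions.
    """
    rng = random.Random(seed)
    lo, hi = math.log(lr_min), math.log(lr_max)
    return [math.exp(rng.uniform(lo, hi)) for _ in range(n_features)]


def sgd_step(weights, grads, feature_lrs):
    """Plain SGD step where row i of `weights` uses feature_lrs[i]."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row, lr in zip(weights, grads, feature_lrs)
    ]


# Two features with three inputs each; rates are sampled once and then kept fixed.
weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]
grads = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
lrs = sample_feature_lrs(len(weights), seed=0)
new_weights = sgd_step(weights, grads, lrs)
```

Sampling in log space is what spreads the rates across orders of magnitude; a plain uniform draw over the same interval would put almost all rates near the upper bound.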

optimization
sgd
adaptive
learning-rate

Eric Jang: Aesthetically Pleasing Learning Rates

july 2018 by arsyed

"[...] so if ever you see or think about writing a paper with a constant learning rate, just use literally any schedule instead. Even a silly one. And then cite this blog post."

humor
learning-rate
neural-net
optimization

[1803.02021] Understanding Short-Horizon Bias in Stochastic Meta-Optimization

may 2018 by arsyed

Careful tuning of the learning rate, or even schedules thereof, can be crucial to effective neural net training. There has been much recent interest in gradient-based meta-optimization, where one tunes hyperparameters, or even learns an optimizer, in order to minimize the expected loss when the training procedure is unrolled. But because the training procedure must be unrolled thousands of times, the meta-objective must be defined with an orders-of-magnitude shorter time horizon than is typical for neural net training. We show that such short-horizon meta-objectives cause a serious bias towards small step sizes, an effect we term short-horizon bias. We introduce a toy problem, a noisy quadratic cost function, on which we analyze short-horizon bias by deriving and comparing the optimal schedules for short and long time horizons. We then run meta-optimization experiments (both offline and online) on standard benchmark datasets, showing that meta-optimization chooses too small a learning rate by multiple orders of magnitude, even when run with a moderately long time horizon (100 steps) typical of work in the area. We believe short-horizon bias is a fundamental problem that needs to be addressed if meta-optimization is to scale to practical neural net training regimes.

neural-net
optimization
meta-learning
hyperparameter
tuning
learning-rate

[1708.07120] Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates

april 2018 by arsyed

In this paper, we show a phenomenon, which we named "super-convergence", where residual networks can be trained using an order of magnitude fewer iterations than is used with standard training methods. The existence of super-convergence is relevant to understanding why deep networks generalize well. One of the key elements of super-convergence is training with cyclical learning rates and a large maximum learning rate. Furthermore, we present evidence that training with large learning rates improves performance by regularizing the network. In addition, we show that super-convergence provides a greater boost in performance relative to standard training when the amount of labeled training data is limited. We also derive a simplification of the Hessian Free optimization method to compute an estimate of the optimal learning rate. The architectures and code to replicate the figures in this paper are available at github.com/lnsmith54/super-convergence.

neural-net
sgd
resnet
training
performance
leslie-smith
learning-rate

[1506.01186] Cyclical Learning Rates for Training Neural Networks

april 2018 by arsyed

It is known that the learning rate is the most important hyper-parameter to tune for training deep neural networks. This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates. Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations. This paper also describes a simple way to estimate "reasonable bounds" -- linearly increasing the learning rate of the network for a few epochs. In addition, cyclical learning rates are demonstrated on the CIFAR-10 and CIFAR-100 datasets with ResNets, Stochastic Depth networks, and DenseNets, and the ImageNet dataset with the AlexNet and GoogLeNet architectures. These are practical tools for everyone who trains neural networks.
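
As a minimal sketch of the triangular policy the abstract describes, the function below lets the learning rate climb linearly from a lower bound to an upper bound and back, once per cycle. The function name `triangular_lr` and the default values for `base_lr`, `max_lr`, and `step_size` are illustrative assumptions; in practice the bounds would come from the range test mentioned above.

```python
import math


def triangular_lr(iteration, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Triangular cyclical learning rate.

    step_size is the number of iterations in half a cycle, so the rate
    goes base_lr -> max_lr -> base_lr every 2 * step_size iterations.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x is the distance from the peak of the current cycle, scaled to [0, 1].
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Sample the schedule at a few points across the first cycle.
schedule = [triangular_lr(i) for i in (0, 1000, 2000, 3000, 4000)]
```

With these defaults the rate starts at `base_lr`, peaks at `max_lr` at iteration 2000, and returns to `base_lr` at iteration 4000.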

neural-net
learning-rate
training
leslie-smith

[1803.09820] A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay

april 2018 by arsyed

Although deep learning has produced dazzling successes for applications of image, speech, and video processing in the past few years, most trainings are with suboptimal hyper-parameters, requiring unnecessarily long training times. Setting the hyper-parameters remains a black art that requires years of experience to acquire. This report proposes several efficient ways to set the hyper-parameters that significantly reduce training time and improve performance. Specifically, this report shows how to examine the training validation/test loss function for subtle clues of underfitting and overfitting and suggests guidelines for moving toward the optimal balance point. Then it discusses how to increase/decrease the learning rate/momentum to speed up training. Our experiments show that it is crucial to balance every manner of regularization for each dataset and architecture. Weight decay is used as a sample regularizer to show how its optimal value is tightly coupled with the learning rates and momentums. Files to help replicate the results reported here are available.

neural-net
hyperparameter
batch-size
learning-rate
momentum
training
tips
leslie-smith

bckenstler/CLR

march 2017 by arsyed

"This repository includes a Keras callback to be used in training that allows implementation of cyclical learning rate policies, as detailed in Leslie Smith's paper Cyclical Learning Rates for Training Neural Networks arXiv:1506.01186v4."

keras
libs
learning-rate
cycling

Loss functions in tensorflow (with an if - else) - Stack Overflow

june 2016 by arsyed

    step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
        0.2,   # Base learning rate.
        step,  # Current index into the dataset.
        1,     # Decay step.
        0.9    # Decay rate.
    )
    opt = tf.train.GradientDescentOptimizer(learning_rate)
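
To make the schedule in the snippet concrete, here is a plain-Python sketch of the formula that `tf.train.exponential_decay` documents: `base_lr * decay_rate ** (step / decay_steps)`, with integer division of the step when `staircase` is set. The function below is a stand-in written for illustration and does not call TensorFlow.

```python
def exponential_decay(base_lr, step, decay_steps, decay_rate, staircase=False):
    """Decayed learning rate: base_lr * decay_rate ** (step / decay_steps).

    With staircase=True the exponent is truncated to an integer, so the
    rate drops in discrete jumps every decay_steps steps.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_rate ** exponent


# Same arguments as the snippet: base rate 0.2, decay step 1, decay rate 0.9.
for step in range(4):
    print(step, exponential_decay(0.2, step, 1, 0.9))
```

With a decay step of 1, the rate is multiplied by 0.9 on every step, so it only changes if the step variable is actually incremented during training.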

tensorflow
optimizer
gradient-descent
learning-rate
decay
