**learning-rate** (31)

Léonard Blier on Twitter: "Introducing the Alrao (All learning rates at once) optimization method for neural networks. We randomly sample one learning rate for each feature from a distribution spanning several orders of magnitude. It performs close to op

october 2018 by arsyed

More seriously, adaptive gradient methods tend to overfit more, so they just need more regularization. Randomizing the learning rates seems like it could be one such regularizer too!
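As a rough illustration of the idea in this bookmark (one learning rate per feature, drawn from a distribution spanning several orders of magnitude), here is a minimal sketch. The log-uniform distribution, the bounds `lr_min` and `lr_max`, and the function name are assumptions made for the example; the excerpt above does not specify them.

```python
import math
import random


def sample_learning_rates(n_features, lr_min=1e-5, lr_max=10.0, seed=0):
    """Draw one learning rate per feature, log-uniformly in [lr_min, lr_max].

    Assumption: the log-uniform choice and the default bounds are
    illustrative only; they are not taken from the bookmarked tweet.
    """
    rng = random.Random(seed)
    log_lo, log_hi = math.log(lr_min), math.log(lr_max)
    # Sampling uniformly in log space spreads the rates evenly across
    # orders of magnitude rather than clustering them near lr_max.
    return [math.exp(rng.uniform(log_lo, log_hi)) for _ in range(n_features)]


rates = sample_learning_rates(8)
```

Each feature would then be updated with its own entry of `rates` instead of a single global value.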

optimization
sgd
adaptive
learning-rate

Eric Jang: Aesthetically Pleasing Learning Rates

july 2018 by arsyed

"[...] so if ever you see or think about writing a paper with a constant learning rate, just use literally any schedule instead. Even a silly one. And then cite this blog post."

humor
learning-rate
neural-net
optimization

[1711.00489] Don't Decay the Learning Rate, Increase the Batch Size

july 2018 by foodbaby

It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate ϵ and scaling the batch size B∝ϵ. Finally, one can increase the momentum coefficient m and scale B∝1/(1−m), although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train ResNet-50 on ImageNet to 76.1% validation accuracy in under 30 minutes.
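The scaling rules quoted in this abstract (B∝ϵ and B∝1/(1−m)) can be sketched as simple arithmetic. The function names, the step-style milestone schedule, and the example numbers below are hypothetical and only illustrate the proportionalities; they are not the authors' training setup.

```python
def batch_size_at_epoch(base_batch, factor, milestones, epoch):
    """Batch size when each scheduled LR decay by `factor` is replaced
    by a batch-size increase by the same `factor`.

    Assumption: a step schedule with integer `factor` and epoch
    `milestones`, chosen for illustration.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_batch * factor ** passed


def rescale_batch_size(base_batch, base_lr, new_lr, base_m=0.9, new_m=0.9):
    """Scale the batch size in proportion to the learning rate and to
    1 / (1 - momentum), following the proportionalities in the abstract."""
    scale = (new_lr / base_lr) * (1.0 - base_m) / (1.0 - new_m)
    return round(base_batch * scale)
```

For example, `batch_size_at_epoch(128, 5, [60, 120, 160], 130)` gives 3200 while the learning rate stays fixed, and `rescale_batch_size(128, 0.1, 0.5)` gives 640.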

learning-rate

[1803.02021] Understanding Short-Horizon Bias in Stochastic Meta-Optimization

may 2018 by arsyed

Careful tuning of the learning rate, or even schedules thereof, can be crucial to effective neural net training. There has been much recent interest in gradient-based meta-optimization, where one tunes hyperparameters, or even learns an optimizer, in order to minimize the expected loss when the training procedure is unrolled. But because the training procedure must be unrolled thousands of times, the meta-objective must be defined with an orders-of-magnitude shorter time horizon than is typical for neural net training. We show that such short-horizon meta-objectives cause a serious bias towards small step sizes, an effect we term short-horizon bias. We introduce a toy problem, a noisy quadratic cost function, on which we analyze short-horizon bias by deriving and comparing the optimal schedules for short and long time horizons. We then run meta-optimization experiments (both offline and online) on standard benchmark datasets, showing that meta-optimization chooses too small a learning rate by multiple orders of magnitude, even when run with a moderately long time horizon (100 steps) typical of work in the area. We believe short-horizon bias is a fundamental problem that needs to be addressed if meta-optimization is to scale to practical neural net training regimes.

neural-net
optimization
meta-learning
hyperparameter
tuning
learning-rate

[1708.07120] Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates

april 2018 by arsyed

In this paper, we show a phenomenon, which we named "super-convergence", where residual networks can be trained using an order of magnitude fewer iterations than is used with standard training methods. The existence of super-convergence is relevant to understanding why deep networks generalize well. One of the key elements of super-convergence is training with cyclical learning rates and a large maximum learning rate. Furthermore, we present evidence that training with large learning rates improves performance by regularizing the network. In addition, we show that super-convergence provides a greater boost in performance relative to standard training when the amount of labeled training data is limited. We also derive a simplification of the Hessian Free optimization method to compute an estimate of the optimal learning rate. The architectures and code to replicate the figures in this paper are available at github.com/lnsmith54/super-convergence.

neural-net
sgd
resnet
training
performance
leslie-smith
learning-rate

[1506.01186] Cyclical Learning Rates for Training Neural Networks

april 2018 by arsyed

It is known that the learning rate is the most important hyper-parameter to tune for training deep neural networks. This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates. Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations. This paper also describes a simple way to estimate "reasonable bounds" -- linearly increasing the learning rate of the network for a few epochs. In addition, cyclical learning rates are demonstrated on the CIFAR-10 and CIFAR-100 datasets with ResNets, Stochastic Depth networks, and DenseNets, and the ImageNet dataset with the AlexNet and GoogLeNet architectures. These are practical tools for everyone who trains neural networks.
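A minimal sketch of the two ideas in this abstract: a learning rate that varies cyclically between two bounds, and a linearly increasing learning rate used to estimate those bounds. The triangular shape, the parameter names (`base_lr`, `max_lr`, `step_size`), and the example values are assumptions for illustration; the abstract itself only says the rate varies cyclically between boundary values.

```python
import math


def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Learning rate that rises linearly from base_lr to max_lr over
    `step_size` iterations, then falls back, and repeats."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


def range_test_lr(iteration, start_lr, end_lr, total_iterations):
    """Linearly increasing learning rate for a short bounds-finding run."""
    frac = iteration / total_iterations
    return start_lr + (end_lr - start_lr) * frac
```

With `base_lr=0.001`, `max_lr=0.006`, and `step_size=2000`, the rate starts at 0.001, peaks at 0.006 at iteration 2000, and returns to 0.001 at iteration 4000. In a range test, one would typically log the loss or accuracy against `range_test_lr` and pick bounds from where training starts to improve and where it starts to degrade.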

neural-net
learning-rate
training
leslie-smith

[1803.09820] A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay

april 2018 by arsyed

Although deep learning has produced dazzling successes for applications of image, speech, and video processing in the past few years, most training runs use suboptimal hyper-parameters, requiring unnecessarily long training times. Setting the hyper-parameters remains a black art that requires years of experience to acquire. This report proposes several efficient ways to set the hyper-parameters that significantly reduce training time and improve performance. Specifically, this report shows how to examine the training validation/test loss function for subtle clues of underfitting and overfitting and suggests guidelines for moving toward the optimal balance point. Then it discusses how to increase/decrease the learning rate/momentum to speed up training. Our experiments show that it is crucial to balance every manner of regularization for each dataset and architecture. Weight decay is used as a sample regularizer to show how its optimal value is tightly coupled with the learning rates and momentums. Files to help replicate the results reported here are available.

neural-net
hyperparameter
batch-size
learning-rate
momentum
training
tips
leslie-smith

Visualizing Learning rate vs Batch size

february 2018 by rishaanp

Neural Nets basics using Fast.ai tools

fast.ai
learning-rate
batch-size
learning-rate-finder

Case Study: A world class image classifier for dogs and cats (err.., anything)

february 2018 by rishaanp

It is amazing how far computer vision has come in the last couple of years. Problems that are insanely intractable for classical machine learning methods are a piece of cake for the emerging field of…

deep-learning
convolutions
convolutional-neural-networks
neural-networks
differential-learning-rates
learning-rate
kaggle
fast.ai
transfer-learning
from pocket

Improving the way we work with learning rate. – techburst

february 2018 by rishaanp

Most optimization algorithms (such as SGD, RMSprop, Adam) require setting the learning rate, the most important hyper-parameter for training deep neural networks. Naive method for choosing learning…

deep-learning
learning-rate
cyclical-learning-rate
fast.ai
learning-rate-annealing
from pocket

The Cyclical Learning Rate technique // teleported.in

february 2018 by rishaanp

Learning rate (LR) is one of the most important hyperparameters to tune and holds the key to faster and more effective training of neural networks. Simply put, LR decides how much of the loss gradient is applied to the current weights to move them in the direction of lower loss.
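The role of the learning rate described here can be shown with a plain gradient-descent step on a toy problem. The quadratic loss f(w) = (w − 3)² and the helper names are made up for this sketch and do not come from the linked article.

```python
def sgd_step(w, grad, lr):
    """Move the weight against the gradient, scaled by the learning rate."""
    return w - lr * grad


def grad_f(w):
    """Gradient of the toy loss f(w) = (w - 3)^2 (an illustrative assumption)."""
    return 2.0 * (w - 3.0)


def run(lr, steps=10, w=0.0):
    """Apply `steps` gradient-descent updates starting from `w`."""
    for _ in range(steps):
        w = sgd_step(w, grad_f(w), lr)
    return w
```

On this toy loss, a very small rate such as `run(0.01)` is still far from the minimum at 3 after ten steps, `run(0.1)` gets much closer, and a rate above 1 such as `run(1.1)` overshoots further on every step and diverges.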

Cyclical-Learning-Rate
Learning-Rate
fast.ai
SGDR
from pocket
