learning-rate   31

Eric Jang: Aesthetically Pleasing Learning Rates
"[...] so if ever you see or think about writing a paper with a constant learning rate, just use literally any schedule instead. Even a silly one. And then cite this blog post."
humor  learning-rate  neural-net  optimization 
july 2018 by arsyed
[1711.00489] Don't Decay the Learning Rate, Increase the Batch Size
It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate ϵ and scaling the batch size B∝ϵ. Finally, one can increase the momentum coefficient m and scale B∝1/(1−m), although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train ResNet-50 on ImageNet to 76.1% validation accuracy in under 30 minutes.
july 2018 by foodbaby
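A minimal sketch of the idea in the abstract above: instead of dividing the learning rate at each milestone, multiply the batch size by the same factor. The function names, the milestone epochs, and the factor of 5 are illustrative assumptions, not values taken from the paper. Combining the two stated proportionalities into B ∝ ϵ/(1−m) is also an assumption made here for convenience.

```python
def batch_size_schedule(base_batch_size, factor, milestones, epoch, max_batch_size=None):
    """Batch size for `epoch`, grown by `factor` at every milestone already reached.

    This mirrors a step learning-rate decay: where the decay schedule would
    divide the learning rate by `factor`, the batch size is multiplied instead.
    """
    reached = sum(1 for m in milestones if epoch >= m)
    batch_size = base_batch_size * factor ** reached
    if max_batch_size is not None:
        # Hypothetical cap, e.g. for memory limits; once hit, one would
        # have to fall back to decaying the learning rate.
        batch_size = min(batch_size, max_batch_size)
    return int(batch_size)


def scaled_batch_size(base_batch_size, base_lr, new_lr, base_momentum=0.9, new_momentum=0.9):
    """Rescale the batch size assuming B is proportional to lr / (1 - momentum)."""
    scale = (new_lr / base_lr) * (1.0 - base_momentum) / (1.0 - new_momentum)
    return int(round(base_batch_size * scale))


if __name__ == "__main__":
    for epoch in (0, 60, 130):
        print(epoch, batch_size_schedule(128, 5, [60, 120, 160], epoch))
    print(scaled_batch_size(128, base_lr=0.1, new_lr=0.5))
```

The learning rate stays fixed while the batch size grows, so each epoch needs fewer parameter updates.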
[1803.02021] Understanding Short-Horizon Bias in Stochastic Meta-Optimization
Careful tuning of the learning rate, or even schedules thereof, can be crucial to effective neural net training. There has been much recent interest in gradient-based meta-optimization, where one tunes hyperparameters, or even learns an optimizer, in order to minimize the expected loss when the training procedure is unrolled. But because the training procedure must be unrolled thousands of times, the meta-objective must be defined with an orders-of-magnitude shorter time horizon than is typical for neural net training. We show that such short-horizon meta-objectives cause a serious bias towards small step sizes, an effect we term short-horizon bias. We introduce a toy problem, a noisy quadratic cost function, on which we analyze short-horizon bias by deriving and comparing the optimal schedules for short and long time horizons. We then run meta-optimization experiments (both offline and online) on standard benchmark datasets, showing that meta-optimization chooses too small a learning rate by multiple orders of magnitude, even when run with a moderately long time horizon (100 steps) typical of work in the area. We believe short-horizon bias is a fundamental problem that needs to be addressed if meta-optimization is to scale to practical neural net training regimes.
neural-net  optimization  meta-learning  hyperparameter  tuning  learning-rate 
may 2018 by arsyed
[1708.07120] Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates
In this paper, we show a phenomenon, which we named "super-convergence", where residual networks can be trained using an order of magnitude fewer iterations than is used with standard training methods. The existence of super-convergence is relevant to understanding why deep networks generalize well. One of the key elements of super-convergence is training with cyclical learning rates and a large maximum learning rate. Furthermore, we present evidence that training with large learning rates improves performance by regularizing the network. In addition, we show that super-convergence provides a greater boost in performance relative to standard training when the amount of labeled training data is limited. We also derive a simplification of the Hessian Free optimization method to compute an estimate of the optimal learning rate. The architectures and code to replicate the figures in this paper are available at github.com/lnsmith54/super-convergence.
neural-net  sgd  resnet  training  performance  leslie-smith  learning-rate 
april 2018 by arsyed
[1506.01186] Cyclical Learning Rates for Training Neural Networks
It is known that the learning rate is the most important hyper-parameter to tune for training deep neural networks. This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates. Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations. This paper also describes a simple way to estimate "reasonable bounds" -- linearly increasing the learning rate of the network for a few epochs. In addition, cyclical learning rates are demonstrated on the CIFAR-10 and CIFAR-100 datasets with ResNets, Stochastic Depth networks, and DenseNets, and the ImageNet dataset with the AlexNet and GoogLeNet architectures. These are practical tools for everyone who trains neural networks.
neural-net  learning-rate  training  leslie-smith 
april 2018 by arsyed
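A minimal sketch of a triangular cyclical schedule of the kind the abstract above describes, where the learning rate moves linearly between a lower and an upper bound. The bound values and `step_size` (iterations per half cycle) used here are placeholder assumptions. In practice the bounds would come from the range test the abstract mentions.

```python
import math


def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Learning rate at `iteration` for a triangular cyclical schedule.

    The rate rises linearly from base_lr to max_lr over `step_size`
    iterations, then falls back to base_lr over the next `step_size`.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x is the distance from the peak of the current cycle, in [0, 1].
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


if __name__ == "__main__":
    for it in (0, 1000, 2000, 3000, 4000):
        print(it, triangular_lr(it, base_lr=0.001, max_lr=0.006, step_size=2000))
```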
[1803.09820] A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay
Although deep learning has produced dazzling successes for applications of image, speech, and video processing in the past few years, most trainings are with suboptimal hyper-parameters, requiring unnecessarily long training times. Setting the hyper-parameters remains a black art that requires years of experience to acquire. This report proposes several efficient ways to set the hyper-parameters that significantly reduce training time and improve performance. Specifically, this report shows how to examine the training validation/test loss function for subtle clues of underfitting and overfitting and suggests guidelines for moving toward the optimal balance point. Then it discusses how to increase/decrease the learning rate/momentum to speed up training. Our experiments show that it is crucial to balance every manner of regularization for each dataset and architecture. Weight decay is used as a sample regularizer to show how its optimal value is tightly coupled with the learning rates and momentums. Files to help replicate the results reported here are available.
neural-net  hyperparameter  batch-size  learning-rate  momentum  training  tips  leslie-smith 
april 2018 by arsyed
Case Study: A world class image classifier for dogs and cats (err.., anything)
It is amazing how far computer vision has come in the last couple of years. Problems that are insanely intractable for classical machine learning methods are a piece of cake for the emerging field of…
deep-learning  convolutions  convolutional-neural-networks  neural-networks  differential-learning-rates  learning-rate  kaggle  fast.ai  transfer-learning  from pocket
february 2018 by rishaanp
Improving the way we work with learning rate. – techburst
Most optimization algorithms (such as SGD, RMSprop, Adam) require setting the learning rate, the most important hyper-parameter for training deep neural networks. Naive method for choosing learning…
deep-learning  learning-rate  cyclical-learning-rate  fast.ai  learning-rate-annealing  from pocket
february 2018 by rishaanp
The Cyclical Learning Rate technique // teleported.in
Learning rate (LR) is one of the most important hyperparameters to tune and holds the key to faster, more effective training of neural networks. Simply put, LR decides how much of the loss gradient is applied to the current weights to move them in the direction of lower loss.
Cyclical-Learning-Rate  Learning-Rate  fast.ai  SGDR  from pocket
february 2018 by rishaanp
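A minimal sketch of the role the excerpt above gives the learning rate: it scales how much of the loss gradient is applied to the current weight. The quadratic loss and all names here are illustrative assumptions, not taken from the linked post.

```python
def sgd_step(weight, grad, lr):
    """One plain gradient-descent update: move against the gradient, scaled by lr."""
    return weight - lr * grad


def minimize_quadratic(w0, lr, steps, target=3.0):
    """Minimize the toy loss (w - target)**2 with a constant learning rate."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - target)  # derivative of (w - target)**2
        w = sgd_step(w, grad, lr)
    return w


if __name__ == "__main__":
    print(minimize_quadratic(0.0, lr=0.1, steps=200))  # small steps, converges toward 3
    print(minimize_quadratic(0.0, lr=1.1, steps=50))   # too large, overshoots and diverges
```

On this toy loss each step multiplies the distance to the minimum by (1 − 2·lr), which is why a rate above 1.0 diverges.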
