arsyed + sgd   123

[1808.01204] Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
Neural networks have many successful applications, but much less theoretical understanding of them has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data comes from mixtures of well-separated distributions, we prove that SGD learns a network with small generalization error, even though the network has enough capacity to fit arbitrary labels. Furthermore, the analysis provides interesting insights into several aspects of learning neural networks, which we verify with empirical studies on synthetic data and on the MNIST dataset.
neural-net  sgd  generalization 
yesterday by arsyed
[1702.08159] McKernel: A Library for Approximate Kernel Expansions in Log-linear Time
Kernel Methods Next Generation (KMNG) introduces a framework for using kernel approximations in the mini-batch setting with an SGD optimizer as an alternative to deep learning. McKernel is a C++ library for large-scale machine learning within KMNG. It contains a CPU-optimized implementation of the Fastfood algorithm, which computes approximate kernel expansions in log-linear time. The algorithm requires computing products with Walsh-Hadamard Transform (WHT) matrices; a cache-friendly SIMD Fast Walsh-Hadamard Transform (FWHT) that achieves compelling speed and outperforms current state-of-the-art methods has been developed. McKernel obtains non-linear classification by combining Fastfood with a linear classifier.
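
A minimal numpy sketch of the log-linear structured product the abstract refers to: an in-place Fast Walsh-Hadamard Transform plus a single Fastfood stack (H G Pi H B) applied to an input. This is not McKernel's SIMD implementation; the Fastfood scaling matrix S and the 1/(sigma*sqrt(d)) normalization are omitted, and the sizes are placeholders.

```python
import numpy as np

def fwht(x):
    """Iterative in-place Fast Walsh-Hadamard Transform of a length-2^k vector, O(n log n)."""
    x = np.asarray(x, dtype=float).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
d = 8                                   # input dimension, must be a power of two
B = rng.choice([-1.0, 1.0], size=d)     # random sign flips
G = rng.standard_normal(d)              # random Gaussian scaling
Pi = rng.permutation(d)                 # random permutation
x = rng.standard_normal(d)

v = fwht(G * fwht(B * x)[Pi])           # H G Pi H B x: no dense d x d matrix is ever formed
phi = np.concatenate([np.cos(v), np.sin(v)])   # random-Fourier-style feature map on top
```
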
kernel-methods  sgd  minibatch 
5 weeks ago by arsyed
[1710.11029] Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks
Stochastic gradient descent (SGD) is widely believed to perform implicit regularization when used to train deep neural networks, but the precise manner in which this occurs has thus far been elusive. We prove that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term. This potential is however not the original loss function in general. So SGD does perform variational inference, but for a different loss than the one used to compute the gradients. Even more surprisingly, SGD does not even converge in the classical sense: we show that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points. Instead, they resemble closed loops with deterministic components. We prove that such "out-of-equilibrium" behavior is a consequence of highly non-isotropic gradient noise in SGD; the covariance matrix of mini-batch gradients for deep networks has a rank as small as 1% of its dimension. We provide extensive empirical validation of these claims, proven in the appendix.
optimization  deep-learning  sgd  variational-inference 
september 2018 by arsyed
ICML 2018 notes – Yaroslav Bulatov – Medium
"quantifying generalization capacity [...] How can 1M parameter neural net generalize after getting trained to convergence on 100k examples?

[...] able to bound capacity of neural nets to that below of number of parameters based on “noise stability” of neural net. Basically trained neural net will “reject” Gaussian noise injected at intermediate layers — a few layers down the road, injected noise is attenuated and original activations are mostly unchanged.

[...] use this “noise stability” to prove that neural network is compressible — throwing out a lot of the weights is possible without changing numerics much. Compressibility implies generalization since there are few compressible networks.

[...] while noise stability in trained network implies generalization, you want the opposite of noise stability in your untrained network. All perturbations should propagate without attenuation. Initializing neural nets in this way, using orthogonal matrices for fully connected layers, and “delta-orthogonal” for conv layers, enabled training 10k layer tanh network [...]"
neural-net  analysis  generalization  capacity  compression  orthogonal-initialization  initialization  sgd  adam  optimization 
august 2018 by arsyed
[1806.07353] Faster SGD training by minibatch persistency
It is well known that, for most datasets, the use of large-size minibatches for Stochastic Gradient Descent (SGD) typically leads to slow convergence and poor generalization. On the other hand, large minibatches are of great practical interest as they allow for better exploitation of modern GPUs. Previous literature on the subject concentrated on how to adjust the main SGD parameters (in particular, the learning rate) when using large minibatches. In this work we introduce an additional feature, which we call minibatch persistency, that consists of reusing the same minibatch for K consecutive SGD iterations. The computational conjecture here is that a large minibatch contains a significant sample of the training set, so one can afford to slightly overfit it without worsening generalization too much. The approach is intended to speed up SGD convergence, and also has the advantage of reducing the overhead related to data loading into internal GPU memory. We present computational results on CIFAR-10 with an AlexNet architecture, showing that even small persistency values (K=2 or 5) already lead to significantly faster convergence and to comparable (or even better) generalization than the standard "disposable minibatch" approach (K=1), in particular when large minibatches are used. The lesson learned is that minibatch persistency can be a simple yet effective way to deal with large minibatches.
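
A sketch of the loop change described above (reuse each sampled minibatch for K consecutive SGD updates), shown on a toy logistic-regression problem; the data, model, and hyperparameter values are placeholders, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = (X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(10_000) > 0).astype(float)

w = np.zeros(20)
lr, batch_size, K = 0.1, 1024, 4                 # K = persistency: reuse each minibatch K times

for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    for _ in range(K):                           # K consecutive updates on the same minibatch
        p = 1.0 / (1.0 + np.exp(-xb @ w))        # logistic predictions
        grad = xb.T @ (p - yb) / batch_size      # cross-entropy gradient
        w -= lr * grad
```
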
neural-net  sgd  minibatch  performance 
july 2018 by arsyed
AdamW and Super-convergence is now the fastest way to train neural nets · fast.ai
"When you hear people saying that Adam doesn’t generalize as well as SGD+Momentum, you’ll nearly always find that they’re choosing poor hyper-parameters for their model. Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to Adam."
sgd  adam  optimization  neural-net  adamw 
july 2018 by arsyed
[1805.11604] How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)
Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs). Despite its pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly understood. The popular belief is that this effectiveness stems from controlling the change of the layers' input distributions during training to reduce the so-called "internal covariate shift". In this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training. These findings bring us closer to a true understanding of our DNN training toolkit.
neural-net  optimization  batch-norm  sgd 
july 2018 by arsyed
[1708.07120] Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates
In this paper, we show a phenomenon, which we named "super-convergence", where residual networks can be trained using an order of magnitude fewer iterations than is used with standard training methods. The existence of super-convergence is relevant to understanding why deep networks generalize well. One of the key elements of super-convergence is training with cyclical learning rates and a large maximum learning rate. Furthermore, we present evidence that training with large learning rates improves performance by regularizing the network. In addition, we show that super-convergence provides a greater boost in performance relative to standard training when the amount of labeled training data is limited. We also derive a simplification of the Hessian Free optimization method to compute an estimate of the optimal learning rate. The architectures and code to replicate the figures in this paper are available at github.com/lnsmith54/super-convergence.
neural-net  sgd  resnet  training  performance  leslie-smith  learning-rate 
april 2018 by arsyed
[1706.02677] Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. This system enables us to train visual recognition models on internet-scale data with high efficiency.
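
A sketch of the two rules from the abstract: the linear scaling rule for the learning rate and a gradual warmup from the small-batch rate. The warmup length used here (5000 steps) is a placeholder; the paper warms up over the first 5 epochs.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: multiply the learning rate by the minibatch-size ratio."""
    return base_lr * batch_size / base_batch

def warmup_lr(step, warmup_steps, start_lr, target_lr):
    """Gradual warmup: ramp linearly from start_lr to target_lr, then hold target_lr."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (step / warmup_steps) * (target_lr - start_lr)

# batch 8192: lr ramps from 0.1 up to 0.1 * 8192 / 256 = 3.2 during warmup, then stays there
target = scaled_lr(8192)
lrs = [warmup_lr(s, warmup_steps=5000, start_lr=0.1, target_lr=target) for s in range(20_000)]
```
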
neural-net  sgd  minibatch  batch-size  training 
april 2018 by arsyed
[1804.07612] Revisiting Small Batch Training for Deep Neural Networks
The collected experimental results for the CIFAR-10, CIFAR-100 and ImageNet datasets show that increasing the mini-batch size progressively reduces the range of learning rates that provide stable convergence and acceptable test performance. On the other hand, small mini-batch sizes provide more up-to-date gradient calculations, which yields more stable and reliable training. The best performance has been consistently obtained for mini-batch sizes between m=2 and m=32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.
neural-net  batch-size  training  tips  sgd  minibatch 
april 2018 by arsyed
Evaluated expressions of variables differ sometimes when using the GPU · Issue #2226 · tensorflow/tensorflow
"The GPU kernel for reduce_sum is known to be non-deterministic since it
uses the atomic operations. When the number of elements is large, the
difference could be quite large.

If your model is trained with dropout in it, it tends to be less likely to
be affected by the noise."

"It sounds like your 97.7% vs. 9.8% difference occurs for a model where different CPU systems can also produce 10%, and thus that you may have found an impressively unreliable set of hyperparameters. If using a smaller learning rate helps, it may just mean that the GPU version blows up slightly earlier than the CPU version."

"the problem could be attributed to small differences introduced in the non-deterministic behavior on GPU. Or some numerical instability due to the precision difference between CPU and GPU."
tensorflow  cpu  gpu  sgd  brittle  stability  numeric 
april 2018 by arsyed
[1804.04235] Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network weight matrices, we propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline. Secondly, we show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. Combining these methods and dropping momentum, we achieve comparable results to the published Adam regime in training the Transformer model on the WMT 2014 English-German machine translation task, while using very little auxiliary storage in the optimizer. Finally, we propose scaling the parameter updates based on the scale of the parameters themselves.
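
A sketch of the paper's core memory trick for a weight matrix: keep only per-row and per-column moving averages of the squared gradients and reconstruct a per-parameter second-moment estimate from their outer product. Update clipping and the increasing decay schedule from the paper are omitted, and the shapes are placeholders.

```python
import numpy as np

def factored_second_moment(R, C, grad, beta2=0.999, eps=1e-30):
    """Maintain row/column moving averages of grad**2 (memory O(n + m) instead of O(n * m))
    and rebuild a full-size estimate V from their outer product."""
    g2 = grad * grad + eps
    R = beta2 * R + (1 - beta2) * g2.mean(axis=1)    # one value per row
    C = beta2 * C + (1 - beta2) * g2.mean(axis=0)    # one value per column
    V = np.outer(R, C) / R.mean()                    # rank-1 reconstruction of the second moment
    return R, C, V

n, m = 512, 256
R, C = np.zeros(n), np.zeros(m)
grad = np.random.default_rng(0).standard_normal((n, m))
R, C, V = factored_second_moment(R, C, grad)
update = grad / (np.sqrt(V) + 1e-8)                  # RMSProp/Adam-style scaled update
```
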
optimization  sgd 
april 2018 by arsyed
[1804.02772] Active Mini-Batch Sampling using Repulsive Point Processes
The convergence speed of stochastic gradient descent (SGD) can be improved by actively selecting mini-batches. We explore sampling schemes where similar data points are less likely to be selected in the same mini-batch. In particular, we prove that such repulsive sampling schemes lower the variance of the gradient estimator. This generalizes recent work on using Determinantal Point Processes (DPPs) for mini-batch diversification (Zhang et al., 2017) to the broader class of repulsive point processes. We first show that the phenomenon of variance reduction by diversified sampling generalizes in particular to non-stationary point processes. We then show that other point processes may be computationally much more efficient than DPPs. In particular, we propose and investigate Poisson Disk sampling---frequently encountered in the computer graphics community---for this task. We show empirically that our approach improves over standard SGD both in terms of convergence speed and final model performance.
neural-net  sgd  minibatch  dpp  sampling  active-learning 
april 2018 by arsyed
[1707.09835] Meta-SGD: Learning to Learn Quickly for Few-Shot Learning
Few-shot learning is challenging for learning algorithms that learn each task in isolation and from scratch. In contrast, meta-learning learns from many related tasks a meta-learner that can learn a new task more accurately and faster with fewer examples, where the choice of meta-learners is crucial. In this paper, we develop Meta-SGD, an SGD-like, easily trainable meta-learner that can initialize and adapt any differentiable learner in just one step, on both supervised learning and reinforcement learning. Compared to the popular meta-learner LSTM, Meta-SGD is conceptually simpler, easier to implement, and can be learned more efficiently. Compared to the latest meta-learner MAML, Meta-SGD has a much higher capacity by learning to learn not just the learner initialization, but also the learner update direction and learning rate, all in a single meta-learning process. Meta-SGD shows highly competitive performance for few-shot learning on regression, classification, and reinforcement learning.
meta-learning  sgd  few-shot 
march 2018 by arsyed
[1712.07628] Improving Generalization Performance by Switching from Adam to SGD
"Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to Stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. We investigate a hybrid strategy that begins training with an adaptive method and switches to SGD when appropriate. Concretely, we propose SWATS, a simple strategy which switches from Adam to SGD when a triggering condition is satisfied. The condition we propose relates to the projection of Adam steps on the gradient subspace. By design, the monitoring process for this condition adds very little overhead and does not increase the number of hyperparameters in the optimizer. We report experiments on several standard benchmarks such as: ResNet, SENet, DenseNet and PyramidNet for the CIFAR-10 and CIFAR-100 data sets, ResNet on the tiny-ImageNet data set and language modeling with recurrent networks on the PTB and WT2 data sets. The results show that our strategy is capable of closing the generalization gap between SGD and Adam on a majority of the tasks."
papers  optimization  sgd  adam 
december 2017 by arsyed
On the inductive bias of stochastic gradient descent | OpenReview
Stochastic gradient descent (SGD) is widely believed to perform implicit regularization when used to train deep neural networks, but the precise manner in which this occurs has thus far been elusive. We prove that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term. This potential is however not the original loss function in general. So SGD does perform variational inference, but for a different loss than the one used to compute the gradients. Even more surprisingly, SGD does not even converge in the classical sense: we show that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points. Instead, they resemble closed loops with deterministic components. We prove that such "out-of-equilibrium" behavior is a consequence of the fact that the gradient noise in SGD is highly non-isotropic; the covariance matrix of mini-batch gradients has a rank as small as 1% of its dimension. We provide extensive empirical validation of these claims, proven in the appendix.
neural-net  sgd  inductive-bias  variational-inference 
december 2017 by arsyed
Why should the data be shuffled for machine learning tasks - Data Science Stack Exchange
In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased with whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change on the model, without being biased by the same points before them.
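
What the answer describes, as a minimal sketch: draw a fresh permutation of the indices at the start of every epoch so that no example always follows the same predecessor. The data and the update step are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 2)), rng.integers(0, 2, size=50)

for epoch in range(3):
    perm = rng.permutation(len(X))      # fresh shuffle every epoch
    for i in perm:                      # batch size 1: plain SGD visiting order
        xi, yi = X[i], y[i]
        # ... compute the gradient on (xi, yi) and update the model here ...
```
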
sgd  shuffling 
november 2017 by arsyed
[1711.00489v1] Don't Decay the Learning Rate, Increase the Batch Size
"It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate ϵ and scaling the batch size B∝ϵ. Finally, one can increase the momentum coefficient m and scale B∝1/(1−m), although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train Inception-ResNet-V2 on ImageNet to 77% validation accuracy in under 2500 parameter updates, efficiently utilizing training batches of 65536 images."
papers  neural-net  optimization  sgd  batch-size  via:abiola 
november 2017 by arsyed
[1710.10345] The Implicit Bias of Gradient Descent on Separable Data
"We show that gradient descent on an unregularized logistic regression problem, for linearly separable datasets, converges to the direction of the max-margin (hard margin SVM) solution. The result generalizes also to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization in more complex models and with other optimization methods."
sgd  neural-net  optimization  analysis
october 2017 by arsyed
[1710.06451] Understanding Generalization and Stochastic Gradient Descent
"This paper tackles two related questions at the heart of machine learning; how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work is inspired by Zhang et al. (2017), who showed deep networks can easily memorize randomly labeled training data, despite generalizing well when shown real labels of the same inputs. We show here that the same phenomenon occurs in small linear models. These observations are explained by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima. Next, we explore the "generalization gap" between small and large batch training, identifying an optimum batch size which maximizes the test set accuracy. Noise in the gradient updates is beneficial, driving the dynamics towards robust minima for which the evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, we predict the optimum batch size is proportional to both the learning rate and the size of the training set, and verify these predictions empirically."
papers  generalization  sgd  deep-learning 
october 2017 by arsyed
Overview of Optimization Algorithms : MachineLearning
This paper explains why SGD and other techniques that add noise during training are successful: Training Recurrent Neural Networks by Diffusion
Dropout: Improving neural networks by preventing co-adaptation of feature detectors
Adding gradient noise: Adding Gradient Noise Improves Learning for Very Deep Networks
Layerwise pretraining: http://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf
optimization  sgd 
september 2017 by arsyed
DSpace@MIT: Theory of Deep Learning III: Generalization Properties of SGD
In Theory III we characterize with a mix of theory and experiments the generalization properties of Stochastic Gradient Descent in overparametrized deep convolutional networks. We show that Stochastic Gradient Descent (SGD) selects with high probability solutions that 1) have zero (or small) empirical error, 2) are degenerate as shown in Theory II and 3) have maximum generalization.
papers  deep-learning  sgd  theory 
august 2017 by arsyed
paper-notes/Understanding deep learning requires rethinking generalization.md at master · ethancaballero/paper-notes
"tl;dr, vc-dimension (expressivity) does not necessarily correlate with generalization error

the fact that we use gradient descent to train might be what's causing this

why does it generalize to test data even though we don't have theoretical guarantees like we do with other models?

If we get better generalization theories than vc-dimension or rademacher complexity, we can probably design better regularizers than dropout, batchnorm, l2, etc."
neural-net  generalization  gradient-descent  sgd 
august 2017 by arsyed
[1707.06386] Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains
"We consider the minimization of an objective function given access to unbiased estimates of its gradient through stochastic gradient descent (SGD) with constant step-size. While the detailed analysis was only performed for quadratic functions, we provide an explicit asymptotic expansion of the moments of the averaged SGD iterates that outlines the dependence on initial conditions, the effect of noise and the step-size, as well as the lack of convergence in the general (non-quadratic) case. For this analysis, we bring tools from Markov chain theory into the analysis of stochastic gradient and create new ones (similar but different from stochastic MCMC methods). We then show that Richardson-Romberg extrapolation may be used to get closer to the global optimum and we show empirical improvements of the new extrapolation scheme."
papers  optimization  markov-chain  probability  sgd  via:mraginsky 
july 2017 by arsyed
[1704.04289] Stochastic Gradient Descent as Approximate Bayesian Inference
"Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler."
papers  optimization  gradient-descent  sgd  bayesian 
july 2017 by arsyed
When not to use deep learning
"One of my favorites is the interpretation of the methods as part of performing Bayesian inference. In essence, every time that you do some form of numerical optimization, you’re performing some Bayesian inference with particular assumptions and priors. Indeed, there’s a whole field, called probabilistic numerics, that has emerged from taking this view. Stochastic gradient descent is no different, and recent work suggests that the procedure is really a Markov chain that, under certain assumptions, has a stationary distribution that can be seen as a sort of variational approximation to the posterior. So when you stop your SGD and take the final parameters, you’re basically sampling from this approximate distribution. I found this idea to be illuminating, because the optimizer’s parameters (in this case, the learning rate) make so much more sense that way. As an example, as you increase the learning parameter of SGD the Markov chain becomes unstable until it finds wide local minima that samples a large area; that is, you increase the variance of procedure. On the other hand, if you decrease the learning parameter, the Markov chain slowly approximates narrower minima until it converges in a tight region; that is, you increase the bias for a certain region. Another parameter, the batch size in SGD, also controls what type of region the algorithm converges two: wider regions for small batches and sharper regions with larger batches."
deep-learning  neural-net  bayesian  sgd  optimization 
july 2017 by arsyed
[1706.03471] YellowFin and the Art of Momentum Tuning
Hyperparameter tuning is one of the big costs of deep learning. State-of-the-art optimizers, such as Adagrad, RMSProp and Adam, make things easier by adaptively tuning an individual learning rate for each variable. This level of fine adaptation is understood to yield a more powerful method. However, our experiments, as well as recent theory by Wilson et al., show that hand-tuned stochastic gradient descent (SGD) achieves better results, at the same rate or faster. The hypothesis put forth is that adaptive methods converge to different minima (Wilson et al.). Here we point out another factor: none of these methods tune their momentum parameter, known to be very important for deep learning applications (Sutskever et al.). Tuning the momentum parameter becomes even more important in asynchronous-parallel systems: recent theory (Mitliagkas et al.) shows that asynchrony introduces momentum-like dynamics, and that tuning down algorithmic momentum is important for efficient parallelization.
We revisit the simple momentum SGD algorithm and show that hand-tuning a single learning rate and momentum value makes it competitive with Adam. We then analyze its robustness in learning rate misspecification and objective curvature variation. Based on these insights, we design YellowFin, an automatic tuner for both momentum and learning rate in SGD. YellowFin optionally uses a novel momentum-sensing component along with a negative-feedback loop mechanism to compensate for the added dynamics of asynchrony on the fly. We empirically show YellowFin converges in fewer iterations than Adam on large ResNet and LSTM models, with a speedup up to 2.8x in synchronous and 2.7x in asynchronous settings.
papers  optimization  sgd  momentum  yellowfin  adaptive 
july 2017 by arsyed
Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented *without any locking*. We present an update scheme called Hogwild which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild outperforms alternative schemes that use locking by an order of magnitude.
papers  optimization  parallel  ben-recht  sgd  gradient-descent 
june 2017 by arsyed
[1509.01240] Train faster, generalize better: Stability of stochastic gradient descent
"We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions.
Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit."
papers  optimization  generalization  gradient-descent  sgd  ben-recht 
june 2017 by arsyed
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour – Facebook Research
Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ∼90% scaling efficiency when moving from 8 to 256 GPUs. This system enables us to train visual recognition models on internet-scale data with high efficiency.
papers  neural-net  sgd  training  minibatch  optimization 
june 2017 by arsyed
Everything that Works Works Because it's Bayesian: Why Deep Nets Generalize?
"The reason deep networks work so well (and generalize at all) is not just because they are some brilliant model, but because of the specific details of how we optimize them. Stochastic gradient descent does more than just converge to a local optimum, it is biased to favour local optima with certain desirable properties, resulting in better generalization.
[...]
So SGD tends to find flat minima, minima where the Hessian - and consequently the inverse Fisher information matrix - has small eigenvalues. Why would flat minima be interesting from a Bayesian perspective?
[...]
If you are in a flat minimum, there is a relatively large region of parameter space where many parameters are almost equivalent inasmuch as they result in almost equally low error. Therefore, given an error tolerance level, one can describe the parameters at the flat minimum with limited precision, using fewer bits while keeping the error within tolerance. In a sharp minimum, you have to describe the location of your minimum very precisely, otherwise your error may increase by a lot."
papers  deep-learning  bayesian  generalization  sgd  optimization 
may 2017 by arsyed
[1611.01838] Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage upon this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and show improved generalization over SGD using uniform stability, under certain assumptions. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
papers  neural-net  optimization  sgd  entropy 
may 2017 by arsyed
[1705.08292] The Marginal Value of Adaptive Gradient Methods in Machine Learning
"Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks."
papers  optimization  adaptive  gradient-descent  sgd  adam 
may 2017 by arsyed
[1603.05953] Katyusha: The First Direct Acceleration of Stochastic Gradient Methods
"The Katyusha momentum paper gives the best summary of SGD-based optimization methods to date. See pp 2-3" [-- Delip Rao‏]
papers  sgd  momentum  optimization 
february 2017 by arsyed
neuralnetwork - Benefits of stochastic gradient descent besides speed/overhead and their optimization - Data Science Stack Exchange
On large datasets, SGD can converge faster than batch training because it performs updates more frequently. We can get away with this because the data often contains redundant information, so the gradient can be reasonably approximated without using the full dataset. Minibatch training can be faster than training on single data points because it can take advantage of vectorized operations to process the entire minibatch at once. The stochastic nature of online/minibatch training can also make it possible to hop out of local minima that might otherwise trap batch training.

One reason to use batch training is cases where the gradient can't be approximated using individual points/minibatches (e.g. where the loss function can't be decomposed as a sum of errors for each data point).
gradient-descent  sgd  minibatch 
january 2017 by arsyed
[1609.04836] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
"The stochastic gradient descent method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, usually 32--512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. There have been some attempts to investigate the cause for this generalization drop in the large-batch regime, however the precise answer for this phenomenon is, hitherto unknown. In this paper, we present ample numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions -- and that sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We also discuss several empirical strategies that help large-batch methods eliminate the generalization gap and conclude with a set of future research ideas and open questions."
papers  neural-net  sgd  deep-learning  training  batch  minibatch  batch-size 
november 2016 by arsyed
In deep learning, why don't we use the whole training set to compute the gradient? - Quora
When you put m examples in a minibatch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(sqrt(m)). In other words, there are diminishing marginal returns to putting more examples in the minibatch. You can read more about this in Chapter 8 of the deep learning textbook, on optimization algorithms for deep learning: http://www.deeplearningbook.org/...
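
A small numpy experiment illustrating the diminishing-returns claim: the deviation of a minibatch gradient from the full-batch gradient shrinks roughly like 1/sqrt(m), while the cost of computing it grows like m. Linear regression is used as a stand-in problem.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100_000, 10
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(N)

w = np.zeros(d)
full_grad = X.T @ (X @ w - y) / N                    # gradient of 0.5 * mean squared error

for m in (1, 10, 100, 1_000, 10_000):
    errs = []
    for _ in range(200):
        idx = rng.choice(N, size=m, replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / m     # minibatch gradient estimate
        errs.append(np.linalg.norm(g - full_grad))
    print(m, np.mean(errs))                          # shrinks roughly like 1 / sqrt(m)
```
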
complexity  neural-net  optimization  gradient-descent  sgd  minibatch 
november 2016 by arsyed
Hogwild Stochastic Gradient Descent
"Repeat until convergence: 1. pick a random element in the training set 2. update the weights with the example θ←(θ−α∇L(f(xi),yi))θ←(θ−α∇L(f(xi),yi))

Because of the fast updates, SGD has the following properties: 1. Unlike GD, each update in SGD can improve or enlarge the loss 2. SGD converges faster than GD because we update the weights much more frequently 3. SGD takes much longer to get to the most “optimal” solution, making it less likely to overfit
[...]

When multithreading, a single thread generally takes only a few microsecond to calculate the new weight, but may need to wait several miliseconds to obtain permission (the lock) to update the weight. [...] Because although weight updates would inevitably overwrite each other, the absense of locks enables SGD to perform many times more updates overall. Going back to the 3 properties of SGD listed earlier, (2) and (3) allow the asynchronous SGD to yield good results in a fraction of the time."
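
A toy sketch of the lock-free pattern described above: several threads update a shared weight vector with no lock, touching only the coordinates their example uses (the sparsity condition Hogwild relies on). The linear-SVM objective and all constants are placeholders, and CPython's GIL serializes these updates, so this only illustrates the structure; the real Hogwild setting is native threads over shared memory.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
N, d = 20_000, 50
X = (rng.random((N, d)) < 0.05) * rng.standard_normal((N, d))   # sparse features
y = np.sign(X @ rng.standard_normal(d) + 1e-9)

w = np.zeros(d)          # shared weights, updated with no lock
lr = 0.05

def worker(seed, n_steps):
    local_rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = local_rng.integers(N)
        if y[i] * (X[i] @ w) < 1:                    # hinge-loss subgradient step
            nz = np.nonzero(X[i])[0]
            w[nz] += lr * y[i] * X[i, nz]            # racy sparse update, Hogwild-style

threads = [threading.Thread(target=worker, args=(s, 5_000)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
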
sgd  parallel  hogwild 
october 2016 by arsyed
How can i handle the last mini-batch size? - Google Groups
I just skip the last minibatch, shuffle the samples and start again. The MNIST CNN example from the tutorial does the same thing. 
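
A sketch of the "skip the last partial minibatch" pattern from the thread: shuffle the indices, then yield only full batches so every update sees the same batch size.

```python
import numpy as np

def minibatches(X, y, batch_size, rng, drop_last=True):
    """Yield shuffled minibatches; with drop_last=True the trailing partial batch is skipped."""
    perm = rng.permutation(len(X))
    end = (len(X) // batch_size) * batch_size if drop_last else len(X)
    for start in range(0, end, batch_size):
        idx = perm[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X, y = np.zeros((1000, 3)), np.zeros(1000)
batches = list(minibatches(X, y, batch_size=128, rng=rng))   # 7 full batches; the last 104 examples are dropped
```
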
machine-learning  tips  minibatch  sgd  shuffling 
july 2016 by arsyed
[1502.03167] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
"Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters."
papers  neural-net  regularization  normalization  dnn  sgd  tricks  batch-norm 
july 2016 by arsyed
Make your Stochastic Gradient Descent more Stochastic | Delip Rao
"Stop. Stare at that for a while. Enjoy this magic. As with such things, the authors give no theoretical justification other than showing it to work on a variety of networks (kudos for that), and a hand wavy connection to simulated annealing, but examining the expression for \sigma_tσ
​t
​​ should tell that the additive noise is highest at the beginning and has little to no effect during later stages of training. Like other methods for careful initialization, this should be effective in breaking symmetries and getting the training started on the right foot.

This is not entirely weird if you think more about it. Any dataset is after all has a sampling bias. So, in a way, even exact gradients are exact only with respect to this sample, and the empirical gradients are only an approximation of the gradient of manifold of the underlying physical process. So real question: 1) why bother computing exact gradients? 2) Are there computationally inexpensive/sloppy approaches to computing approximate gradients that will make the training process faster? (Remember sampling from the Gaussian will take additional time.)"
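
A minimal sketch of the annealed additive gradient noise the post discusses, using the decaying-variance schedule from the gradient-noise paper it links (sigma_t^2 = eta / (1 + t)^gamma); the eta and gamma values come from that paper's suggested ranges, and the surrounding training loop is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(grad, t, eta=0.3, gamma=0.55):
    """Add annealed Gaussian noise to a gradient: largest at the start of training, fading as t grows."""
    sigma = np.sqrt(eta / (1.0 + t) ** gamma)
    return grad + sigma * rng.standard_normal(grad.shape)

# inside a training loop, for step t:
#     w -= lr * noisy_grad(g, t)
```
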
neural-net  training  optimization  sgd  noise  tricks  injected-noise 
june 2016 by arsyed
Dimensional analysis of gradient ascent
"In machine learning, we've become fond of online methods, which adapt the step size as they go. The general idea is to estimate a step size matrix that passes the unit check (for each dimension of xx). Furthermore, we want do as little extra work as possible to get this estimate (e.g., we want to avoid computing a Hessian because that would be extra work). So, the step size should be based only iterates and gradients up to time tt.

AdaGrad doesn't doesn't pass the unit check. This motivated AdaDelta.

AdaDelta uses the ratio of (running estimates of) the root-mean-squares of ΔxΔx and ∂f/∂x∂f/∂x. The mean is taken using an exponentially weighted moving average. See paper for actual implementation.

Adam came later and made some tweaks to remove (unintended) bias in the AdaDelta estimates of the numerator and denominator.

In summary, it's important/useful to analyze the units of numerical algorithms in order to get a sanity check (i.e., catch mistakes) as well as to develop an understanding of why certain parameters exist and how properties of a problem affect the values we should use for them."
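
A sketch of the AdaDelta update being described; the RMS(Δx) / RMS(g) ratio is precisely what makes the step carry the right units. See the ADADELTA paper (also bookmarked on this page) for the full method; the constants here are the paper's defaults.

```python
import numpy as np

def adadelta_step(w, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One AdaDelta update: the step is RMS(previous updates) / RMS(gradients) times the gradient."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2                  # running average of squared gradients
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad    # unit-consistent step
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2                  # running average of squared updates
    return w + dx, Eg2, Edx2

w, Eg2, Edx2 = np.zeros(10), np.zeros(10), np.zeros(10)
# inside a training loop:  w, Eg2, Edx2 = adadelta_step(w, grad, Eg2, Edx2)
```
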
algorithms  machine-learning  optimization  gradient-descent  sgd  dimensional-analysis 
may 2016 by arsyed
Large-Scale Machine Learning with Stochastic Gradient Descent - Springer
"During the last decade, the data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods is limited by the computing time rather than the sample size. A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set."
papers  algorithms  optimization  sgd 
may 2016 by arsyed
Online Algorithms and Stochastic Approximations
"The convergence of online learning algorithms is analyzed using the tools of the stochastic approximation theory, and proved under very weak conditions. A general framework for online learning algorithms is first presented. This framework encompasses the most common online learning algorithms in use today, as illustrated by several examples. The stochastic approximation theory then provides general results describing the convergence of all these learning algorithms at once."
papers  optimization  algorithms  sgd  stochastic-approximation 
may 2016 by arsyed
[1212.5701] ADADELTA: An Adaptive Learning Rate Method
"We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment."
papers  optimization  adadelta  algorithms  sgd 
may 2016 by arsyed
[1509.01240] Train faster, generalize better: Stability of stochastic gradient descent
"We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions.
Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit."
papers  optimization  machine-learning  sgd 
may 2016 by arsyed
Trading representability for scalability
Support Vector Machines (SVMs) are among the most popular and successful classification algorithms. Kernel SVMs often reach state-of-the-art accuracies, but suffer from the curse of kernelization due to linear model growth with data size on noisy data. Linear SVMs have the ability to efficiently learn from truly large data, but they are applicable to a limited number of domains due to low representational power. To fill the representability and scalability gap between linear and nonlinear SVMs, we propose the Adaptive Multi-hyperplane Machine (AMM) algorithm that accomplishes fast training and prediction and has the capability to solve nonlinear classification problems. The AMM model consists of a set of hyperplanes (weights), each assigned to one of the multiple classes, and predicts based on the associated class of the weight that provides the largest prediction. The number of weights is automatically determined through an iterative algorithm based on the stochastic gradient descent algorithm, which is guaranteed to converge to a local optimum. Since the generalization bound decreases with the number of weights, a weight pruning mechanism is proposed and analyzed. The experiments on several large data sets show that AMM is nearly as fast during training and prediction as the state-of-the-art linear SVM solver and that it can be orders of magnitude faster than kernel SVM. In accuracy, AMM is somewhere between linear and kernel SVMs. For example, on an OCR task with 8 million high-dimensional training examples, AMM trained in 300 seconds on a single-core processor had a 0.54% error rate, which was significantly lower than the 2.03% error rate of a linear SVM trained in the same time, and comparable to the 0.43% error rate of a kernel SVM trained in 2 days on 512 processors. The results indicate that AMM could be an attractive option when solving large-scale classification problems. The software is available at www.dabi.temple.edu/~vucetic/AMM.html.
papers  machine-learning  scaling  classification  ensemble  svm  hyperplane  sgd 
may 2016 by arsyed
[1506.02617] Path-SGD: Path-Normalized Optimization in Deep Neural Networks
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
papers  neural-net  dnn  optimization  sgd 
january 2016 by arsyed
Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms
Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.
papers  sgd  hogwild  parallel  algorithms  optimization 
december 2015 by arsyed
Machined Learnings: Learning is easier than optimization
"The gist is that training error is a proxy for what we really care about, generalization error, and therefore it can be counterproductive to expend computational effort excessively optimizing training error. The counter-productivity arises in (at least) two different ways. The first is that excessive computational effort might cause you to use less data in order to compensate for computation time, and using less data means the proxy is less accurate. The second is that less aggressive optimization can act as a regularizer, which when corrected with a ``better'' optimization routine actually leads to worse generalization error."

"Today, neural networks are still trained with algorithms that hug close to SGD, and arguably the main differences we see today are in architectures and regularizers (and architecture is a kind of regularizer). The fact is, SGD is a so-so optimization algorithm, but a great learning algorithm."

"A second example is the hashing trick. [...] Everybody who first encounters this thinks, "aren't there collisions?". The answer is, "yes there are collisions" and "it doesn't seem to matter much in practice." One can talk about how it preserves dot products in expectation, or that how statistically stable conjunction features can be as informative in the presence of redundancy, but here's what's really neat: I've seen lots of examples where increasing the number of bits in the hash function degrades performance. In other words, less collisions, but worse generalization; this is because the hash collisions are providing a useful constraint on model complexity and learning is easier than optimization."
machine-learning  sgd  hashing-tricks  tricks  regularization  optimization 
october 2015 by arsyed
What is the Alternating Least Squares method in recommendation systems? - Quora
"In an SGD (Stochastic Gradient descent) approach, for each example in the dataset you compute the error (rui−pTuqi)(rui−puTqi) and then you update the parameters by a factor in the opposite direction of the gradient.

Alternating Least Squares (ALS) represents a different approach to optimizing the loss function. The key insight is that you can turn the non-convex optimization problem in Equation (2) into an "easy" quadratic problem if you fix either pupu or qiqi. ALS fixes each one of those alternatively. When one is fixed, the other one is computed, and vice versa.

There are two main benefits of this approach. First, this is very easy to parallelize. Second, whenever dealing with implicit datasets, which are usually not sparse, SGD is not practical (users times items can easily be in the order of billions). ALS is a much more efficient optimization technique in these cases."
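
A minimal numpy sketch of the alternation being described, for a fully observed toy ratings matrix with L2 regularization. Real recommender ALS restricts each solve to the observed (or confidence-weighted implicit) entries per user and item, which is omitted here; all sizes and constants are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, lam = 100, 80, 8, 0.1
R = rng.random((n_users, n_items))                  # toy dense ratings matrix
P = rng.standard_normal((n_users, k))               # user factors p_u
Q = rng.standard_normal((n_items, k))               # item factors q_i

for _ in range(10):                                 # alternate the two ridge-regression solves
    A = Q.T @ Q + lam * np.eye(k)                   # fix Q, solve for every user at once
    P = np.linalg.solve(A, Q.T @ R.T).T
    A = P.T @ P + lam * np.eye(k)                   # fix P, solve for every item at once
    Q = np.linalg.solve(A, P.T @ R).T

rmse = np.sqrt(np.mean((R - P @ Q.T) ** 2))         # reconstruction error should drop each sweep
```
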
algorithms  optimization  sgd  alternating-least-squares  recsys  matrix-factorization 
october 2015 by arsyed