gradient-descent   102


3.17.3 Gradient Descent only Converges to Minimizers on Vimeo
"We show that gradient descent generically does not converge to saddle points. This is proved using the Stable Manifold theorem from dynamical systems theory."
talks  optimization  gradient-descent  saddle-point 
august 2017 by arsyed
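A minimal sketch of the talk's claim (my toy example, not from the talk): on the saddle function f(x, y) = x² − y², gradient descent started exactly on the stable manifold (the x-axis) converges to the saddle at the origin, but any generic perturbed start escapes it.

```python
def grad(p):
    """Gradient of f(x, y) = x^2 - y^2."""
    x, y = p
    return (2 * x, -2 * y)

def descend(p, lr=0.1, steps=100):
    """Plain gradient descent with a fixed step size."""
    for _ in range(steps):
        gx, gy = grad(p)
        p = (p[0] - lr * gx, p[1] - lr * gy)
    return p

# On the stable manifold (y = 0): converges to the saddle at the origin.
exact = descend((0.5, 0.0))
# Generic start (tiny y-perturbation): |y| grows geometrically and escapes.
perturbed = descend((0.5, 1e-6))
```

The stable manifold here is measure zero, which is the dynamical-systems intuition behind "generically does not converge to saddle points".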
The Gradient: A Visual Descent
In this post I aim to visually, mathematically and programmatically explain the gradient, and how understanding it is crucial for gradient descent.
july 2017 by Tafkas
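The core idea the post explains can be sketched in a few lines (a minimal illustration of mine, not the post's own code): step repeatedly in the direction opposite the gradient, which is the direction of steepest descent.

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Iteratively step against the gradient with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3);
# the iterates converge to the minimizer x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```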
Decoding the Enigma with Recurrent Neural Networks
I am blown away by this -- given that Recurrent Neural Networks are Turing-complete, they can actually automate cryptanalysis given sufficient resources, at least to the degree of simulating the internal workings of the Enigma algorithm given plaintext, ciphertext and key:
The model needed to be very large to capture all the Enigma’s transformations. I had success with a single-celled LSTM model with 3000 hidden units. Training involved about a million steps of batched gradient descent: after a few days on a K40 GPU, I was getting 96-97% accuracy!
machine-learning  deep-learning  rnns  enigma  crypto  cryptanalysis  turing  history  gpus  gradient-descent 
july 2017 by jm
How to Escape Saddle Points Efficiently – Off the convex path
A core, emerging problem in nonconvex optimization involves the escape of saddle points. While recent research has shown that gradient descent (GD) generically escapes saddle points asymptotically (see Rong Ge’s and Ben Recht’s blog posts), the critical open problem is one of efficiency — is GD able to move past saddle points quickly, or can it be slowed down significantly? How does the rate of escape scale with the ambient dimensionality? In this post, we describe our recent work with Rong Ge, Praneeth Netrapalli and Sham Kakade, that provides the first provable positive answer to the efficiency question, showing that, rather surprisingly, GD augmented with suitable perturbations escapes saddle points efficiently; indeed, in terms of rate and dimension dependence it is almost as if the saddle points aren’t there!
acmtariat  org:bleg  nibble  liner-notes  machine-learning  acm  optimization  gradient-descent  local-global  off-convex  time-complexity  random  perturbation  michael-jordan  iterative-methods  research  learning-theory  math.DS  iteration-recursion 
july 2017 by nhaliday
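A hedged sketch of the paper's core idea (my simplification, not the authors' exact algorithm): run plain gradient descent, but whenever the gradient is nearly zero — a candidate saddle — inject a small random perturbation so the iterate can fall off the saddle's unstable direction.

```python
import random

random.seed(0)  # for reproducibility of this toy run

def perturbed_gd(grad, x, lr=0.1, steps=500, g_thresh=1e-3, radius=1e-2):
    """Gradient descent with random perturbations near stationary points."""
    for _ in range(steps):
        g = grad(x)
        if max(abs(gi) for gi in g) < g_thresh:
            # Near-zero gradient: perturb uniformly in a small box.
            x = [xi + random.uniform(-radius, radius) for xi in x]
        else:
            x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# f(x, y) = x^2 - y^2 has a saddle at the origin. Plain GD started there
# stalls forever; the perturbed variant escapes along the y-direction.
saddle_grad = lambda p: (2 * p[0], -2 * p[1])
result = perturbed_gd(saddle_grad, [0.0, 0.0])
```

The paper's actual result uses perturbations on a ball with carefully chosen radius and frequency to get nearly dimension-free escape rates; the sketch above only shows the mechanism.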
[1704.04289] Stochastic Gradient Descent as Approximate Bayesian Inference
"Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler."
papers  optimization  gradient-descent  sgd  bayesian 
july 2017 by arsyed
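The paper's starting observation can be seen in a toy run (my illustration, not the authors' code): SGD with a constant learning rate on a quadratic does not converge to a point. Its iterates form a Markov chain that fluctuates around the minimizer with a stationary variance set by the learning rate and the gradient noise.

```python
import random

random.seed(0)  # reproducible toy run

def constant_sgd(x0=5.0, lr=0.1, noise=1.0, steps=20000, burn_in=1000):
    """SGD with constant step size on f(x) = x^2 / 2 with Gaussian gradient noise."""
    x, samples = x0, []
    for t in range(steps):
        noisy_grad = x + random.gauss(0.0, noise)  # true grad f(x) = x, plus noise
        x -= lr * noisy_grad
        if t >= burn_in:  # discard transient, keep stationary samples
            samples.append(x)
    return samples

samples = constant_sgd()
mean = sum(samples) / len(samples)
# For this linear recursion the stationary variance is lr * noise^2 / (2 - lr).
var = sum(s * s for s in samples) / len(samples)
```

Matching this stationary distribution to a Bayesian posterior (by tuning the learning rate and preconditioning) is the paper's contribution (1).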


