generalization   200
[1710.05468] Generalization in Deep Learning
"This paper explains why deep learning can generalize well, despite large capacity and possible algorithmic instability, nonrobustness, and sharp minima, effectively addressing an open problem in the literature. Based on our theoretical insight, this paper also proposes a family of new regularization methods. Its simplest member was empirically shown to improve base models and achieve state-of-the-art performance on MNIST and CIFAR-10 benchmarks. Moreover, this paper presents both data-dependent and data-independent generalization guarantees with improved convergence rates. Our results suggest several new open areas of research."
papers  deep-learning  generalization 
2 days ago by arsyed
The Two Phases of Gradient Descent in Deep Learning
Good article that reviews recent papers on the theory behind SGD in deep learning. The links to other papers in this article are also very helpful.
deeplearning  ai  theory  sgd  compression  generalization  informationtheory 
21 days ago by drmeme
New Theory Cracks Open the Black Box of Deep Learning | Quanta Magazine
Good review of a paper presenting a new theory of how deep learning works. They describe SGD as having two distinct phases, a drift phase and a diffusion phase. SGD begins in the drift phase, essentially exploring the multidimensional space of solutions. When it begins converging, it enters the diffusion phase, where it becomes extremely chaotic and the convergence rate slows to a crawl. The original article and a video of a talk are also linked.
deeplearning  ai  theory  sgd  compression  generalization  informationtheory 
21 days ago by drmeme
What does ">" really mean?
This Snapshot is about the generalization of ">" from ordinary numbers to so-called fields. At the end, I will touch on some ideas in recent research.
mathematics  generalization  rather-interesting  summary 
23 days ago by Vaguery
[1703.09580] Early Stopping without a Validation Set
"Early stopping is a widely used technique to prevent poor generalization performance when training an over-expressive model by means of gradient-based optimization. To find a good point to halt the optimizer, a common practice is to split the dataset into a training and a smaller validation set to obtain an ongoing estimate of the generalization performance. We propose a novel early stopping criterion that is based on fast-to-compute local statistics of the computed gradients and entirely removes the need for a held-out validation set. Our experiments show that this is a viable approach in the setting of least-squares and logistic regression, as well as neural networks."
papers  machine-learning  early-stopping  generalization 
8 weeks ago by arsyed
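The core idea above — stop when local gradient statistics say the signal is gone — can be sketched roughly as follows. This is a hypothetical simplification, not the paper's actual criterion: it compares the squared mini-batch mean gradient to the variance of the per-sample gradients, and the function name `eb_stop` and the threshold of 1 are assumptions for illustration.

```python
import numpy as np

def eb_stop(per_sample_grads, eps=0.0):
    """Hypothetical sketch of a gradient-statistics stopping test:
    halt when the mini-batch gradient mean is dominated by its own
    sampling noise (signal indistinguishable from zero)."""
    g = np.asarray(per_sample_grads, dtype=float)   # shape (batch, dim)
    mean = g.mean(axis=0)
    var = g.var(axis=0, ddof=1)                     # per-coordinate sample variance
    # signal-to-noise ratio of the estimated gradient, per coordinate
    snr = mean ** 2 / (var / g.shape[0] + 1e-12)
    return bool(snr.mean() <= 1.0 + eps)            # True => stop training
```

A batch of near-identical gradients (strong signal) keeps training; a batch of gradients that cancel out triggers the stop.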
[1703.09833] Theory II: Landscape of the Empirical Risk in Deep Learning
Previous theoretical work on deep learning and neural network optimization tends to focus on avoiding saddle points and local minima. However, the practical observation is that, at least in the case of the most successful Deep Convolutional Neural Networks (DCNNs), practitioners can always increase the network size to fit the training data (an extreme example would be [1]). The most successful DCNNs such as VGG and ResNets are best used with a degree of "overparametrization". In this work, we characterize, with a mix of theory and experiments, the landscape of the empirical risk of overparametrized DCNNs. We first prove in the regression framework the existence of a large number of degenerate global minimizers with zero empirical error (modulo inconsistent equations). The argument, which relies on Bézout's theorem, is rigorous when the ReLUs are replaced by a polynomial nonlinearity (which empirically works as well). As described in our Theory III [2] paper, the same minimizers are degenerate and thus very likely to be found by SGD, which will furthermore select with higher probability the most robust zero-minimizer. We further experimentally explored and visualized the landscape of empirical risk of a DCNN on CIFAR-10 during the entire training process, and especially the global minima. Finally, based on our theoretical and experimental results, we propose an intuitive model of the landscape of the DCNN's empirical loss surface, which might not be as complicated as people commonly believe.
papers  deep-learning  neural-net  analysis  generalization 
10 weeks ago by arsyed
[1706.08498] Spectrally-normalized margin bounds for neural networks
"This paper presents a margin-based multiclass generalization bound for neural networks which scales with their margin-normalized "spectral complexity": their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network on the MNIST and CIFAR-10 datasets, with both original and random labels, where it tightly correlates with the observed excess risks."
papers  neural-net  analysis  generalization 
july 2017 by arsyed
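The Lipschitz part of the "spectral complexity" described above — the product of the spectral norms of the weight matrices — is simple to compute. A minimal sketch (ignoring the paper's margin normalization and correction factor; `spectral_complexity` is an illustrative name, not from the paper):

```python
import numpy as np

def spectral_complexity(weights):
    """Product of the spectral norms (largest singular values) of a
    network's weight matrices -- the Lipschitz-constant factor of the
    spectrally-normalized bound. Illustrative sketch only."""
    return float(np.prod([np.linalg.norm(np.asarray(W), 2) for W in weights]))
```

For example, two diagonal layers with top singular values 2 and 3 give a product of 6.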
[1706.08947] Exploring Generalization in Deep Learning
"With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena."
papers  neural-net  deep-learning  generalization 
july 2017 by arsyed
[1509.01240] Train faster, generalize better: Stability of stochastic gradient descent
"We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions.
Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit."
papers  optimization  generalization  gradient-descent  sgd  ben-recht 
june 2017 by arsyed
Everything that Works Works Because it's Bayesian: Why Deep Nets Generalize?
"The reason deep networks work so well (and generalize at all) is not just because they are some brilliant model, but because of the specific details of how we optimize them. Stochastic gradient descent does more than just converge to a local optimum, it is biased to favour local optima with certain desirable properties, resulting in better generalization.
So SGD tends to find flat minima, minima where the Hessian - and consequently the inverse Fisher information matrix - has small eigenvalues. Why would flat minima be interesting from a Bayesian perspective?
If you are in a flat minimum, there is a relatively large region of parameter space where many parameters are almost equivalent inasmuch as they result in almost equally low error. Therefore, given an error tolerance level, one can describe the parameters at the flat minimum with limited precision, using fewer bits while keeping the error within tolerance. In a sharp minimum, you have to describe the location of your minimum very precisely, otherwise your error may increase by a lot."
papers  deep-learning  bayesian  generalization  sgd  optimization 
may 2017 by arsyed
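The flat-vs-sharp argument above can be made concrete with a quadratic approximation around a minimum: moving a distance delta along the sharpest direction raises the loss by about (1/2) * lambda_max * delta^2, so the radius within an error tolerance is sqrt(2 * tol / lambda_max). A minimal sketch, assuming a locally quadratic loss (the function name `tolerance_radius` is hypothetical):

```python
import numpy as np

def tolerance_radius(hessian_eigs, err_tol):
    """Largest displacement from a quadratic minimum that keeps the
    loss increase within err_tol, along the sharpest direction:
    delta = sqrt(2 * err_tol / lambda_max). Flat minima (small
    eigenvalues) permit a larger radius, so their location can be
    described with fewer bits at the same error tolerance."""
    lam = max(hessian_eigs)
    return float(np.sqrt(2.0 * err_tol / lam))
```

With the same tolerance, a flat minimum (eigenvalues ~0.01) allows a far larger radius than a sharp one (eigenvalues ~100), which is exactly the minimum-description-length intuition in the excerpt.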