generalization   239


[1805.01445] The Fine Line between Linguistic Generalization and Failure in Seq2Seq-Attention Models
Seq2Seq-based neural architectures have become the go-to choice for sequence-to-sequence language tasks. Despite their excellent performance on these tasks, recent work has noted that these models usually do not fully capture the linguistic structure required to generalize beyond the dense sections of the data distribution [ettinger2017towards], and as such are likely to fail on samples from the tail of the distribution, such as inputs that are noisy [belkinovnmtbreak] or of different lengths [bentivoglinmtlength]. In this paper, we look at a model's ability to generalize on a simple symbol-rewriting task with a clearly defined structure. We find that the model's ability to generalize this structure beyond the training distribution depends greatly on the chosen random seed, even when performance on the standard test set remains the same. This suggests that a model's ability to capture generalizable structure is highly sensitive, and that this sensitivity may not be apparent when evaluating the model on standard test sets.
nlp  seq2seq  rnn  generalization 
6 weeks ago by arsyed
[1712.00409] Deep Learning Scaling is Predictable, Empirically
Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art.
This paper presents a large-scale empirical characterization of generalization error and model-size growth as training sets grow. We introduce a methodology for this measurement and test it in four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization-error scaling across a breadth of factors, with power-law exponents---the "steepness" of the learning curve---yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications for deep learning research, practice, and systems: they can assist model debugging, inform accuracy targets and decisions about data-set growth, guide computing-system design, and underscore the importance of continued computational scaling.
deep-learning  generalization 
10 weeks ago by arsyed
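The power-law claim above is easy to illustrate: if generalization error follows eps(N) ≈ a·N^(−b) for training-set size N, the exponent b is (minus) the slope of the learning curve in log-log space. A minimal sketch on synthetic data — the constants `a_true` and `b_true` are made up for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical power-law learning curve: eps(N) = a * N**(-b),
# where N is the training-set size. Constants are illustrative only.
a_true, b_true = 5.0, 0.35
N = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
eps = a_true * N ** (-b_true)        # synthetic "generalization error"

# log eps = log a - b * log N, so the log-log slope recovers -b.
slope, intercept = np.polyfit(np.log(N), np.log(eps), 1)
b_est = -slope
print(f"estimated exponent b = {b_est:.3f}")
```

On noiseless synthetic data the fit recovers the exponent exactly; on real learning curves the same log-log regression gives the "steepness" the paper measures per domain.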
My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm
"The main mantra of this paper is along the lines of results by Bartlett (1998) who observed that in neural networks, generalization is about the size of the weights, not the number of weights. This theory underlies the use of techniques such as weight decay and even early stopping, since both can be seen as ways to keep the neural network's weight vector small. Reasoning about a neural network's generalization ability in terms of the size, or norm, of its weight vector is called norm-based capacity control.

The main contribution of Liang et al (2017) is proposing the Fisher-Rao norm as a measure of how big the networks' weights are, and hence as an indicator of a trained network's generalization ability."
neural-net  generalization 
11 weeks ago by arsyed
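A tiny numpy sketch of the idea in the note above: capacity is tracked via the size (norm) of the weight vector, and weight decay is one way to keep that norm small. All names and constants here are illustrative, and the plain L2 norm stands in for the Fisher-Rao norm, which Liang et al. define differently:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))        # hypothetical weight matrix of one layer

# Norm-based capacity control: what matters is the size of the weights,
# not the number of weights.
l2_norm = np.linalg.norm(W)

# One step of (decoupled) weight decay shrinks the weights toward zero,
# and therefore shrinks the norm.
lam = 0.01                           # assumed decay coefficient
W = (1 - lam) * W
print(np.linalg.norm(W) < l2_norm)   # True: the norm got smaller
```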
Do smoother areas of the error surface lead to better generalization?
In the first lecture of the outstanding Deep Learning Course (linking to version 1, which is also superb, v2 to become available early 2018), we learned how to train a state of the art model using…
deep-learning  generalization  Neural-Networks 
february 2018 by rishaanp
Information Processing: Mathematical Theory of Deep Neural Networks (Princeton workshop)
"Recently, long-past-due theoretical results have begun to emerge. These results, and those that will follow in their wake, will begin to shed light on the properties of large, adaptive, distributed learning architectures, and stand to revolutionize how computer science and neuroscience understand these systems."
hsu  scitariat  commentary  links  research  research-program  workshop  events  princeton  sanjeev-arora  deep-learning  machine-learning  ai  generalization  explanans  off-convex  nibble  frontier  speedometer  state-of-art  big-surf  announcement 
january 2018 by nhaliday
[1711.00350] Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks
"Humans can understand and produce new utterances effortlessly, thanks to their systematic compositional skills. Once a person learns the meaning of a new verb "dax," he or she can immediately understand the meaning of "dax twice" or "sing and dax." In this paper, we introduce the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences. We then test the zero-shot generalization capabilities of a variety of recurrent neural networks (RNNs) trained on SCAN with sequence-to-sequence methods. We find that RNNs can generalize well when the differences between training and test commands are small, so that they can apply "mix-and-match" strategies to solve the task. However, when generalization requires systematic compositional skills (as in the "dax" example above), RNNs fail spectacularly. We conclude with a proof-of-concept experiment in neural machine translation, supporting the conjecture that lack of systematicity is an important factor explaining why neural networks need very large training sets."
papers  deep-learning  rnn  seq2seq  generalization  brenden-lake 
january 2018 by arsyed
[1711.11561] Measuring the tendency of CNNs to Learn Surface Statistical Regularities
"Deep CNNs are known to exhibit the following peculiarity: on the one hand they generalize extremely well to a test set, while on the other hand they are extremely sensitive to so-called adversarial perturbations. The extreme sensitivity of high performance CNNs to adversarial examples casts serious doubt that these networks are learning high level abstractions in the dataset. We are concerned with the following question: How can a deep CNN that does not learn any high level semantics of the dataset manage to generalize so well? The goal of this article is to measure the tendency of CNNs to learn surface statistical regularities of the dataset. To this end, we use Fourier filtering to construct datasets which share the exact same high level abstractions but exhibit qualitatively different surface statistical regularities. For the SVHN and CIFAR-10 datasets, we present two Fourier filtered variants: a low frequency variant and a randomly filtered variant. Each of the Fourier filtering schemes is tuned to preserve the recognizability of the objects. Our main finding is that CNNs exhibit a tendency to latch onto the Fourier image statistics of the training dataset, sometimes exhibiting up to a 28% generalization gap across the various test sets. Moreover, we observe that significantly increasing the depth of a network has a very marginal impact on closing the aforementioned generalization gap. Thus we provide quantitative evidence supporting the hypothesis that deep CNNs tend to learn surface statistical regularities in the dataset rather than higher-level abstract concepts."
papers  neural-net  convnet  generalization  via:csantos 
january 2018 by arsyed
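The Fourier-filtering construction described above can be sketched with numpy's FFT. This is a generic low-pass filter on a random array standing in for an SVHN/CIFAR-10 image, not the authors' exact tuning — `keep_frac` is an assumed parameter:

```python
import numpy as np

def low_pass(img, keep_frac=0.25):
    """Keep only the lowest spatial frequencies of a 2-D array,
    in the spirit of the paper's low-frequency dataset variant."""
    F = np.fft.fftshift(np.fft.fft2(img))      # move DC to the center
    h, w = img.shape
    mask = np.zeros_like(F)
    kh, kw = int(h * keep_frac / 2), int(w * keep_frac / 2)
    mask[h // 2 - kh:h // 2 + kh, w // 2 - kw:w // 2 + kw] = 1
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

img = np.random.default_rng(0).random((32, 32))  # stand-in for one image
filtered = low_pass(img)
# High-frequency detail is removed; low-frequency structure (and the mean,
# via the preserved DC component) remains.
```

The two test-set variants in the paper share high-level content but differ in exactly this kind of surface statistic, which is what exposes the generalization gap.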


