neural-net   1343

[1803.01814] Norm matters: efficient and accurate normalization schemes in deep networks
Over the past few years batch-normalization has been commonly used in deep networks, allowing faster training and high performance for a wide variety of applications. However, the reasons behind its merits remained unanswered, with several shortcomings that hindered its use for certain tasks. In this work we present a novel view on the purpose and function of normalization methods and weight-decay, as tools to decouple weights' norm from the underlying optimized objective. We also improve the use of weight-normalization and show the connection between practices such as normalization, weight decay and learning-rate adjustments. Finally, we suggest several alternatives to the widely used L2 batch-norm, using normalization in L1 and L∞ spaces that can substantially improve numerical stability in low-precision implementations as well as provide computational and memory benefits. We demonstrate that such methods enable the first batch-norm alternative to work for half-precision implementations.
neural-net  normalization 
7 days ago by arsyed
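The L1 alternative mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the batch is normalized by its mean absolute deviation, rescaled by sqrt(pi/2) so that the scale agrees with the standard deviation on Gaussian data. The function names and the constant are choices made for this sketch, not taken from the bookmark, and may differ from the paper's exact formulation.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def l2_scale(xs):
    """Standard deviation, the scale used by ordinary (L2) batch-norm."""
    mu = mean(xs)
    return math.sqrt(mean([(x - mu) ** 2 for x in xs]))

def l1_scale(xs):
    """Mean absolute deviation times sqrt(pi/2) (assumed constant).

    No squares or square roots of the data are needed, which is the
    property that helps in low-precision arithmetic.
    """
    mu = mean(xs)
    return math.sqrt(math.pi / 2) * mean([abs(x - mu) for x in xs])

def normalize(xs, scale_fn, eps=1e-5):
    """Center the batch and divide by the chosen scale estimate."""
    mu = mean(xs)
    scale = scale_fn(xs)
    return [(x - mu) / (scale + eps) for x in xs]

# Synthetic Gaussian batch (hypothetical data, fixed seed for repeatability).
random.seed(0)
batch = [random.gauss(3.0, 2.0) for _ in range(10000)]
print(l2_scale(batch), l1_scale(batch))  # the two scales should be close
```

On roughly Gaussian activations the two scale estimates are close, so swapping `l2_scale` for `l1_scale` in `normalize` changes the output only slightly.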
[1806.10909] ResNet with one-neuron hidden layers is a Universal Approximator
We demonstrate that a very deep ResNet with stacked modules with one neuron per hidden layer and ReLU activation functions can uniformly approximate any Lebesgue integrable function in d dimensions, i.e. ℓ1(ℝ^d). Because of the identity mapping inherent to ResNets, our network has alternating layers of dimension one and d. This stands in sharp contrast to fully connected networks, which are not universal approximators if their width is the input dimension d [Lu et al, 2017; Hanin and Sellke, 2017]. Hence, our result implies an increase in representational power for narrow deep networks by the ResNet architecture.
resnet  neural-net  universal-approximator 
7 days ago by arsyed
Modern Neural Networks Generalize on Small Data Sets
In this paper, we use a linear program to empirically decompose fitted neural networks into ensembles of low-bias sub-networks. We show that these sub-networks are relatively uncorrelated which leads to an internal regularization process, very much like a random forest, which can explain why a neural network is surprisingly resistant to overfitting. We then demonstrate this in practice by applying large neural networks, with hundreds of parameters per training observation, to a collection of 116 real-world data sets from the UCI Machine Learning Repository. This collection of data sets contains a much smaller number of training examples than the types of image classification tasks generally studied in the deep learning literature, as well as non-trivial label noise. We show that even in this setting deep neural nets are capable of achieving superior classification accuracy without overfitting.
neural-net  generalization  small-data  richard-berk 
7 days ago by arsyed
[1808.01204] Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
Neural networks have many successful applications, while much less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data comes from mixtures of well-separated distributions, we prove that SGD learns a network with a small generalization error, albeit the network has enough capacity to fit arbitrary labels. Furthermore, the analysis provides interesting insights into several aspects of learning neural networks and can be verified based on empirical studies on synthetic data and on the MNIST dataset.
neural-net  sgd  generalization 
7 days ago by arsyed
[1810.12281] Three Mechanisms of Weight Decay Regularization
Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of L2 regularization. Literal weight decay has been shown to outperform L2 regularization for optimizers for which they differ. We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks.
neural-net  optimization  regularization  weight-decay  l2  roger-grosse 
4 weeks ago by arsyed
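The remark above that literal weight decay and L2 regularization differ only for some optimizers can be checked with a small scalar sketch. It assumes a single parameter, a single update step, and the first bias-corrected Adam step (where the moment estimates reduce to the gradient and its square). All names and numbers are hypothetical and chosen for illustration.

```python
import math

def sgd_l2(w, g, lr, lam):
    """SGD where the L2 penalty (lam/2) * w**2 adds lam * w to the gradient."""
    return w - lr * (g + lam * w)

def sgd_decay(w, g, lr, lam):
    """SGD with literal weight decay: shrink w, then take the gradient step."""
    return (1 - lr * lam) * w - lr * g

def adam_first_step(g, lr, eps=1e-8):
    """First Adam update with bias correction: m_hat = g, v_hat = g**2."""
    return lr * g / (math.sqrt(g * g) + eps)

def adam_l2(w, g, lr, lam):
    """L2 penalty folded into the gradient before Adam rescales it."""
    return w - adam_first_step(g + lam * w, lr)

def adam_decay(w, g, lr, lam):
    """Decoupled weight decay: the decay term bypasses Adam's rescaling."""
    return w - adam_first_step(g, lr) - lr * lam * w

w, g, lr, lam = 1.0, 0.5, 0.1, 0.01
print(sgd_l2(w, g, lr, lam), sgd_decay(w, g, lr, lam))    # agree
print(adam_l2(w, g, lr, lam), adam_decay(w, g, lr, lam))  # differ
```

For plain SGD the two updates are algebraically identical. Under Adam the penalty gradient is rescaled along with the loss gradient, so the results diverge.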
TensorSpace.js – Present tensor in space, neural network 3D visualization framework_Github - jishuwen (tech articles)
TensorSpace is a neural network 3D visualization framework built with TensorFlow.js, Three.js and Tween.js. TensorSpace provides Keras-like APIs to build deep learning layers, load pre-trained models, and generate a 3D visualization in the browser.
neural-net  tensorflow  visualization 
4 weeks ago by arsyed
[1811.03666] On the Statistical and Information-theoretic Characteristics of Deep Network Representations
It has been common to argue or imply that a regularizer can be used to alter a statistical property of a hidden layer's representation and thus improve generalization or performance of deep networks. For instance, dropout has been known to improve performance by reducing co-adaptation, and representational sparsity has been argued as a good characteristic because many data-generation processes have a small number of factors that are independent. In this work, we analytically and empirically investigate the popular characteristics of learned representations, including correlation, sparsity, dead units, rank, and mutual information, and disprove much of the conventional wisdom. We first show that infinitely many Identical Output Networks (IONs) can be constructed for any deep network with a linear layer, where any invertible affine transformation can be applied to alter the layer's representation characteristics. The existence of IONs proves that the correlation characteristics of a representation are irrelevant to the performance. Extensions to ReLU layers are provided, too. Then, we consider sparsity, dead units, and rank to show that only loose relationships exist among the three characteristics. It is shown that a higher sparsity or additional dead units do not imply a better or worse performance when the rank of representation is fixed. We also develop a rank regularizer and show that neither representation sparsity nor lower rank is helpful for improving performance even when the data-generation process has a small number of independent factors. Mutual information I(z_l; x) and I(z_l; y) are investigated, and we show that regularizers can affect I(z_l; x) and thus indirectly influence the performance. Finally, we explain how a rich set of regularizers can be used as a powerful tool for performance tuning.
deep-learning  neural-net  anlaysis  information-theory  optimization 
4 weeks ago by arsyed
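The Identical Output Network construction described in the abstract above can be sketched for the simplest case, two stacked linear layers. This is an illustration under that assumption, not the paper's general construction. The first layer absorbs an invertible affine map z' = A z + c and the second layer absorbs its inverse. All matrices and helper names are hypothetical.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def inv2(m):
    """Inverse of a 2x2 matrix (assumes a nonzero determinant)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def add(u, v):
    return [x + y for x, y in zip(u, v)]

def sub(u, v):
    return [x - y for x, y in zip(u, v)]

def forward(x, Wa, ba, Wb, bb):
    """Return the hidden representation z and the output y."""
    z = add(matvec(Wa, x), ba)
    return z, add(matvec(Wb, z), bb)

# Original network: z = W1 x + b1, y = W2 z + b2.
W1, b1 = [[1.0, 2.0], [0.0, 1.0]], [0.5, -1.0]
W2, b2 = [[2.0, -1.0], [1.0, 3.0]], [0.0, 1.0]

# Invertible affine map applied to the hidden representation.
A, c = [[2.0, 1.0], [1.0, 1.0]], [3.0, -2.0]

# Transformed network: layer 1 absorbs (A, c), layer 2 absorbs the inverse.
W1p, b1p = matmul(A, W1), add(matvec(A, b1), c)
W2p = matmul(W2, inv2(A))
b2p = sub(b2, matvec(W2p, c))

x = [1.0, 2.0]
z, y = forward(x, W1, b1, W2, b2)
z_ion, y_ion = forward(x, W1p, b1p, W2p, b2p)
print(z, z_ion)  # hidden representations differ
print(y, y_ion)  # outputs are identical
```

Because A and c are arbitrary (A only needs to be invertible), statistics of z such as correlation can be changed freely without affecting the output.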
[1811.03804] Gradient Descent Finds Global Minima of Deep Neural Networks
Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. Our bounds also shed light on the advantage of using ResNet over the fully connected feedforward architecture; our bound requires the number of neurons per layer scaling exponentially with depth for feedforward networks whereas for ResNet the bound only requires the number of neurons per layer scaling polynomially with depth. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
dnn  neural-net  analysis  gradient-descent  optimization 
4 weeks ago by arsyed
[1811.01753] How deep is deep enough? - Optimizing deep neural network architecture
Deep neural networks use stacked layers of feature detectors to repeatedly transform the input data, so that structurally different classes of input become well separated in the final layer. While the method has turned out extremely powerful in many applications, its success depends critically on the correct choice of hyperparameters, in particular the number of network layers. Here, we introduce a new measure, called the generalized discrimination value (GDV), which quantifies how well different object classes separate in each layer. Due to its definition, the GDV is invariant to translation and scaling of the input data, independent of the number of features, as well as independent of the number and permutation of the neurons within a layer. We compute the GDV in each layer of a Deep Belief Network that was trained unsupervised on the MNIST data set. Strikingly, we find that the GDV first improves with each successive network layer, but then gets worse again beyond layer 30, thus indicating the optimal network depth for this data classification task. Our further investigations suggest that the GDV can serve as a universal tool to determine the optimal number of layers in deep neural networks for any type of input data.
deep-learning  neural-net  analysis  depth 
5 weeks ago by arsyed
