foodbaby + theory   56

[1711.05408] Recurrent Neural Networks as Weighted Language Recognizers
We investigate the computational complexity of various problems for simple recurrent neural networks (RNNs) as formal models for recognizing weighted languages. We focus on single-layer, ReLU-activation, rational-weight RNNs with softmax, which are commonly used in natural language processing applications. We show that most problems for such RNNs are undecidable, including consistency, equivalence, minimization, and finding the highest-weighted string. However, for consistent RNNs the last problem becomes decidable, although the solution can be exponentially long. If additionally the string is limited to polynomial length, the problem becomes NP-complete and APX-hard. In summary, this shows that approximations and heuristic algorithms are necessary in practical applications of those RNNs. We also consider RNNs as unweighted language recognizers and situate RNNs between Turing Machines and Random-Access Machines regarding their real-time recognition powers.
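A minimal sketch of the architecture the abstract names: a single-layer ReLU RNN with a softmax output assigning a weight (probability) to a string. All weight matrices, sizes, and the end-of-string convention here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 3            # toy vocabulary: symbols 0..2, with 2 as end-of-string
H = 4            # hidden size (illustrative)
W_h = rng.normal(scale=0.5, size=(H, H))   # recurrent weights
W_x = rng.normal(scale=0.5, size=(H, V))   # input weights
W_o = rng.normal(scale=0.5, size=(V, H))   # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def string_weight(symbols):
    """Weight the RNN assigns to `symbols` followed by end-of-string:
    the product of per-step softmax probabilities."""
    h = np.zeros(H)
    weight = 1.0
    for s in list(symbols) + [V - 1]:          # append end-of-string
        p = softmax(W_o @ h)                    # next-symbol distribution
        weight *= p[s]
        h = np.maximum(0.0, W_h @ h + W_x @ np.eye(V)[s])  # ReLU update
    return weight

print(string_weight([0, 1, 0]))  # a probability in (0, 1)
```

The undecidability results concern questions about such networks (e.g. whether these string weights sum to 1, or which string maximizes the weight), not about evaluating a single string, which is easy as shown.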
RNN  DNN  theory 
february 2018 by foodbaby
Intriguing Properties of Randomly Weighted Networks: Generalizing while Learning Next to Nothing | OpenReview
Training deep neural networks results in strong learned representations that show good generalization capabilities. In most cases, training involves iterative modification of all weights inside the network via back-propagation. In this paper, we propose to take an extreme approach and fix \emph{almost all weights} of a deep convolutional neural network at their randomly initialized values, allowing only a small portion to be learned. As our experiments show, this often results in performance on par with that of learning all weights. The implications of this intriguing property of deep neural networks are discussed, and we suggest ways to harness it to create more robust representations.
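A minimal random-features sketch of the idea: freeze a randomly initialized hidden layer and train only the output weights. The data, sizes, and the least-squares training step are all made up for illustration; the paper's experiments use full convolutional networks and back-propagation.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
true_w = rng.normal(size=10)
y = (X @ true_w > 0).astype(float)        # synthetic binary labels

W_frozen = rng.normal(size=(10, 64))      # random weights, never updated
H = np.maximum(0.0, X @ W_frozen)         # fixed random ReLU features

# Learn only the small output layer (here by least squares).
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
acc = (((H @ w_out) > 0.5) == (y > 0.5)).mean()
print(f"train accuracy with frozen hidden layer: {acc:.2f}")
```

Even with the hidden layer left entirely random, the trained output layer typically classifies most of the training points correctly, which is the flavor of the paper's observation.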
CNN  theory 
january 2018 by foodbaby
[1712.07897] Non-convex Optimization for Machine Learning
A vast majority of machine learning algorithms train their models and perform inference by solving optimization problems. In order to capture the learning and prediction problems accurately, structural constraints such as sparsity or low rank are frequently imposed or else the objective itself is designed to be a non-convex function. This is especially true of algorithms that operate in high-dimensional spaces or that train non-linear models such as tensor models and deep networks.
The freedom to express the learning problem as a non-convex optimization problem gives immense modeling power to the algorithm designer, but often such problems are NP-hard to solve. A popular workaround to this has been to relax non-convex problems to convex ones and use traditional methods to solve the (convex) relaxed optimization problems. However this approach may be lossy and nevertheless presents significant challenges for large scale optimization.
On the other hand, direct approaches to non-convex optimization have met with resounding success in several domains and remain the methods of choice for the practitioner, as they frequently outperform relaxation-based techniques - popular heuristics include projected gradient descent and alternating minimization. However, these are often poorly understood in terms of their convergence and other properties.
This monograph presents a selection of recent advances that bridge a long-standing gap in our understanding of these heuristics. It leads the reader through several widely used non-convex optimization techniques, as well as applications thereof. The goal of this monograph is both to introduce the rich literature in this area and to equip the reader with the tools and techniques needed to analyze these simple procedures for non-convex problems.
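One of the heuristics named above, projected gradient descent, can be sketched for sparse regression: gradient steps on a least-squares objective, followed by projection onto the non-convex set of k-sparse vectors (iterative hard thresholding). Problem sizes, the sparsity level, and the noiseless setup are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 100, 50, 5
A = rng.normal(size=(n, d)) / np.sqrt(n)   # well-conditioned design
x_true = np.zeros(d)
x_true[:k] = rng.normal(size=k)
b = A @ x_true                             # noiseless observations

def project_sparse(x, k):
    """Projection onto the (non-convex) set of k-sparse vectors:
    keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant
x = np.zeros(d)
for _ in range(500):
    grad = A.T @ (A @ x - b)               # gradient of 0.5*||Ax - b||^2
    x = project_sparse(x - step * grad, k)

print("recovery error:", np.linalg.norm(x - x_true))
```

Despite the projection set being non-convex, this direct approach recovers the sparse signal here, illustrating why such heuristics are the methods of choice in practice.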
non-convex  optimization  theory 
january 2018 by foodbaby
[1706.01350] Emergence of Invariance and Disentangling in Deep Representations
Using established principles from Information Theory and Statistics, we show that in a deep neural network invariance to nuisance factors is equivalent to information minimality of the learned representation, and that stacking layers and injecting noise during training naturally bias the network towards learning invariant representations. We then show that, in order to avoid memorization, we need to limit the quantity of information stored in the weights, which leads to a novel usage of the Information Bottleneck Lagrangian on the weights as a learning criterion. This also has an alternative interpretation as minimizing a PAC-Bayesian bound on the test error. Finally, we exploit a duality between weights and activations induced by the architecture, to show that the information in the weights bounds the minimality and Total Correlation of the layers, therefore showing that regularizing the weights explicitly or implicitly, using SGD, not only helps avoid overfitting, but also fosters invariance and disentangling of the learned representation. The theory also enables predicting sharp phase transitions between underfitting and overfitting random labels at precise information values, and sheds light on the relation between the geometry of the loss function, in particular so-called "flat minima," and generalization.
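The learning criterion described above can be written, roughly and up to the paper's notation, as an Information Bottleneck Lagrangian on the weights: the usual cross-entropy loss plus a penalty on the information the weights store about the training set,

\[
\mathcal{L} \;=\; H_{p,q}(\mathcal{D} \mid w) \;+\; \beta\, I(w;\, \mathcal{D}),
\]

where \(H_{p,q}(\mathcal{D} \mid w)\) is the cross-entropy of the data given the weights, \(I(w;\mathcal{D})\) is the mutual information between weights and dataset, and \(\beta\) trades off fit against memorization. The claimed phase transitions between underfitting and overfitting occur at particular values of this information term.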
DL  theory 
january 2018 by foodbaby
Using Text Embeddings for Information Retrieval
theory and intuition behind word embeddings plus applications
word  embeddings  IR  slides  2016  session2vec  theory  word2vec 
december 2017 by foodbaby
[1611.03530] Understanding deep learning requires rethinking generalization
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth-two neural networks already have perfect finite-sample expressivity as soon as the number of parameters exceeds the number of data points, as it usually does in practice.
We interpret our experimental findings by comparison with traditional models.
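The finite-sample expressivity claim can be sketched directly: a depth-two ReLU network whose parameter count exceeds the number of data points can fit completely random labels exactly. Here the first layer is random and the output layer is solved in closed form; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)   # completely random labels

width = 200                                    # width > n data points
W1 = rng.normal(size=(d, width))
H = np.maximum(0.0, X @ W1)                    # hidden ReLU activations

w2, *_ = np.linalg.lstsq(H, y, rcond=None)     # solve the output layer
train_err = np.abs(H @ w2 - y).max()
print("worst-case training error:", train_err)
```

The hidden activations span the training targets once the width exceeds the sample count, so the random labels are fit to numerical precision, i.e. zero training error on pure noise.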
DNN  theory 
december 2017 by foodbaby
[1612.04010] An empirical analysis of the optimization of deep network loss surfaces
The success of deep neural networks hinges on our ability to accurately and efficiently optimize high-dimensional, non-convex functions. In this paper, we empirically investigate the loss functions of state-of-the-art networks, and how commonly used stochastic gradient descent variants optimize these loss functions. To do this, we visualize the loss function by projecting it down to low-dimensional spaces chosen based on the convergence points of different optimization algorithms. Our observations suggest that optimization algorithms encounter and choose different descent directions at many saddle points to find different final weights. Based on the consistency we observe across re-runs of the same stochastic optimization algorithm, we hypothesize that each optimization algorithm makes characteristic choices at these saddle points.
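The projection technique can be sketched as follows: given the final weights reached by (say) three optimizers, evaluate the loss on the 2-D plane spanned by their differences. The loss function and the three weight vectors here are toy stand-ins, not the paper's networks.

```python
import numpy as np

def loss(w):
    # toy non-convex loss standing in for a network's training loss
    return np.sum(w ** 2) + np.sum(np.sin(3 * w))

rng = np.random.default_rng(4)
# pretend these are convergence points of three different optimizers
w0, w1, w2 = (rng.normal(size=20) for _ in range(3))
u, v = w1 - w0, w2 - w0                    # directions spanning the plane

grid = np.linspace(-0.5, 1.5, 5)
surface = np.array([[loss(w0 + a * u + b * v) for b in grid]
                    for a in grid])
print(surface.shape)                       # a 2-D slice of the loss surface
```

The resulting grid is what gets contour-plotted in such papers; the interesting structure is how the different optimizers' endpoints sit in it.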
optimization  SGD  papers  neural  networks  theory 
december 2017 by foodbaby
[1412.6544] Qualitatively characterizing neural network optimization problems
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
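The paper's analysis technique is simply to evaluate the loss along the straight line from the initial weights to the final weights, L(α) = loss((1−α)·w_init + α·w_final). A toy version, with an illustrative loss and a pretend converged solution:

```python
import numpy as np

def loss(w):
    return np.sum((w - 1.0) ** 2)      # toy loss with minimum at w = 1

rng = np.random.default_rng(5)
w_init = rng.normal(size=10)           # random initialization
w_final = np.ones(10)                  # pretend SGD converged here

alphas = np.linspace(0.0, 1.0, 11)
path = [loss((1 - a) * w_init + a * w_final) for a in alphas]
print(path[0], path[-1])
```

For real networks the paper's finding is that this 1-D slice typically shows no significant barriers between initialization and solution; here, with a quadratic toy loss, the path is exactly monotone.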
SGD  optimization  neural  networks  theory 
december 2017 by foodbaby
[1712.01208] The Case for Learned Index Structures
Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate whether a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that, by using neural nets, we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order of magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system with learned models has far-reaching implications for future systems designs, and that this work provides just a glimpse of what might be possible.
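The key idea can be sketched in a few lines: fit a model (here plain linear regression rather than the paper's neural nets) from key to position in a sorted array, record the model's worst-case error, and correct each prediction with a bounded local search. The synthetic keys and the linear model are illustrative assumptions.

```python
import bisect

import numpy as np

rng = np.random.default_rng(6)
keys = np.sort(rng.uniform(0, 1000, size=10_000))
pos = np.arange(len(keys))

# "Model": a least-squares line approximating the keys' CDF.
slope, intercept = np.polyfit(keys, pos, 1)
# Worst-case prediction error bounds the local search window.
max_err = int(np.ceil(np.max(np.abs(slope * keys + intercept - pos))))

def lookup(key):
    """Predict the position, then binary-search only the error window."""
    guess = int(slope * key + intercept)
    lo = max(0, guess - max_err)
    hi = min(len(keys), guess + max_err + 1)
    i = lo + bisect.bisect_left(keys[lo:hi].tolist(), key)
    return i if i < len(keys) and keys[i] == key else -1

print(lookup(keys[1234]))  # 1234
```

Because the search is confined to a window of width about `2 * max_err` instead of the whole array, a model that fits the key distribution well makes lookups cheaper than a full tree traversal, which is the trade-off the paper analyzes.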
data  structures  database  theory 
december 2017 by foodbaby
Designing Access Methods: The RUM Conjecture
The fundamental challenges that every researcher, systems architect, or designer faces when designing a new access method are how to minimize: i) read times (R), ii) update cost (U), and iii) memory (or storage) overhead (M). In this paper, we conjecture that when optimizing the read-update-memory overheads, optimizing in any two areas negatively impacts the third. We present a simple model of the RUM overheads, and we articulate the RUM Conjecture. We show how the RUM Conjecture manifests in state-of-the-art access methods, and we envision a trend toward RUM-aware access methods for future data systems.
database  theory 
november 2017 by foodbaby
