[1902.06720] Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

5 hours ago

A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
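The paper's central object, the first-order Taylor expansion of the network around its initial parameters, can be sketched in a few lines of numpy. The tiny architecture, names, and perturbation size below are illustrative assumptions, not from the paper; the point is only that for a reasonably wide layer the linearized model tracks the real network under a small parameter change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer network; "width" (512) plays the role of the paper's limit.
W1 = rng.normal(size=(512, 2)) / np.sqrt(2)
w2 = rng.normal(size=512) / np.sqrt(512)

def f(params, x):
    W1, w2 = params
    return w2 @ np.tanh(W1 @ x)

def grad_params(params, x):
    # Gradient of the scalar output w.r.t. each parameter tensor.
    W1, w2 = params
    h = np.tanh(W1 @ x)
    dW1 = np.outer(w2 * (1 - h**2), x)
    dw2 = h
    return dW1, dw2

theta0 = (W1, w2)
x = np.array([0.3, -0.7])
f0 = f(theta0, x)
g0 = grad_params(theta0, x)

def f_lin(params, x):
    # First-order Taylor expansion of f around theta0.
    dW1 = params[0] - theta0[0]
    dw2 = params[1] - theta0[1]
    return f0 + np.sum(g0[0] * dW1) + np.sum(g0[1] * dw2)

# After a small parameter step, the linear model stays close to the network.
eps = 1e-3
theta = (W1 + eps * rng.normal(size=W1.shape),
         w2 + eps * rng.normal(size=w2.shape))
print(abs(f(theta, x) - f_lin(theta, x)))  # second-order small
```

The paper's claim is that as width grows, gradient descent keeps the parameters in the regime where this approximation stays accurate for the whole trajectory.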

neural-net
linear-models
gradient-descent
5 hours ago

[1902.04023] Computing Extremely Accurate Quantiles Using t-Digests

yesterday

"We present on-line algorithms for computing approximations of rank-based statistics that give high accuracy, particularly near the tails of a distribution, with very small sketches. Notably, the method allows a quantile q to be computed with an accuracy relative to max(q,1−q) rather than absolute accuracy as with most other methods. This new algorithm is robust with respect to skewed distributions or ordered datasets and allows separately computed summaries to be combined with no loss in accuracy.

An open-source Java implementation of this algorithm is available from the author. Independent implementations in Go and Python are also available."
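The core idea can be sketched in a toy, single-pass variant: compress sorted data into centroids whose sizes are capped by the t-digest scale function, which is dense near the tails and coarse in the middle. The k1-style scale function follows the paper, but the greedy merge, function names, and parameters below are simplified assumptions, not the paper's algorithm.

```python
import math
import numpy as np

def k_scale(q, delta):
    # t-digest k1-style scale function: steep near q=0 and q=1, so tail
    # centroids stay tiny while mid-distribution centroids can be large.
    return delta / (2 * math.pi) * math.asin(2 * q - 1)

def compress(values, counts, delta=200):
    # Greedy one-pass merge of sorted (value, count) pairs: merge adjacent
    # centroids while the merged centroid spans at most one k-unit.
    total = sum(counts)
    out_m, out_c = [], []
    q0 = 0.0
    cur_m, cur_c = values[0], counts[0]
    for m, c in zip(values[1:], counts[1:]):
        q2 = min(q0 + (cur_c + c) / total, 1.0)
        if k_scale(q2, delta) - k_scale(q0, delta) <= 1:
            cur_m = (cur_m * cur_c + m * c) / (cur_c + c)
            cur_c += c
        else:
            out_m.append(cur_m); out_c.append(cur_c)
            q0 += cur_c / total
            cur_m, cur_c = m, c
    out_m.append(cur_m); out_c.append(cur_c)
    return out_m, out_c

def quantile(means, counts, q):
    # Read a quantile off the sketch: walk centroids until the target rank.
    total = sum(counts)
    target = q * total
    seen = 0.0
    for m, c in zip(means, counts):
        if seen + c >= target:
            return m
        seen += c
    return means[-1]

rng = np.random.default_rng(1)
data = np.sort(rng.exponential(size=100_000))
means, counts = compress(list(data), [1] * len(data))
# A tail quantile from a sketch ~500x smaller than the data.
print(len(means), quantile(means, counts, 0.999))
```

Because centroid sizes shrink toward the tails, the error in an extreme quantile like q = 0.999 stays small even though the sketch holds only a few hundred centroids.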

algorithms
online
approximation
quantile

yesterday

Performance Evaluation in Machine Learning: The Good, The Bad, The Ugly and The Way Forward

7 days ago

"This paper gives an overview of some ways in which our understanding of performance evaluation measures for machine-learned classifiers has improved over the last twenty years. I also highlight a range of areas where this understanding is still lacking, leading to ill-advised practices in classifier evaluation. This suggests that in order to make further progress we need to develop a proper measurement theory of machine learning. I then demonstrate by example what such a measurement theory might look like and what kinds of new results it would entail. Finally, I argue that key properties such as classification ability and data set difficulty are unlikely to be directly observable, suggesting the need for latent-variable models and causal inference."

machine-learning
evaluation
measurement
7 days ago

How Not to Count the Poor by Thomas Pogge, Sanjay G. Reddy :: SSRN

7 days ago

The World Bank's approach to estimating the extent, distribution and trend of global income poverty is neither meaningful nor reliable. The Bank uses an arbitrary international poverty line that is not adequately anchored in any specification of the real requirements of human beings. Moreover, it employs a concept of purchasing power equivalence that is neither well defined nor appropriate for poverty assessment. These difficulties are inherent in the Bank's "money-metric" approach and cannot be credibly overcome without dispensing with this approach altogether. In addition, the Bank extrapolates incorrectly from limited data and thereby creates an appearance of precision that masks the high probable error of its estimates. It is difficult to judge the nature and extent of the errors in global poverty estimates that these three flaws produce. However, there is reason to believe that the Bank's approach may have led it to understate the extent of global income poverty and to infer without adequate justification that global income poverty has steeply declined in the recent period. A new methodology of global poverty assessment, focused directly on what is needed to achieve elementary human requirements, is feasible and necessary. A practical approach to implementing an alternative is described.

economics
development
poverty
thomas-pogge
sanjay-reddy
7 days ago

[1901.11373] Learning and Evaluating General Linguistic Intelligence

8 days ago

We define general linguistic intelligence as the ability to reuse previously acquired knowledge about a language's lexicon, syntax, semantics, and pragmatic conventions to adapt to new tasks quickly. Using this definition, we analyze state-of-the-art natural language understanding models and conduct an extensive empirical investigation to evaluate them against these criteria through a series of experiments that assess the task-independence of the knowledge being acquired by the learning process. In addition to task performance, we propose a new evaluation metric based on an online encoding of the test data that quantifies how quickly an existing agent (model) learns a new task. Our results show that while the field has made impressive progress in terms of model architectures that generalize to many tasks, these models still require a lot of in-domain training examples (e.g., for fine tuning, training task-specific modules), and are prone to catastrophic forgetting. Moreover, we find that far from solving general tasks (e.g., document question answering), our models are overfitting to the quirks of particular datasets (e.g., SQuAD). We discuss missing components and conjecture on how to make progress toward general linguistic intelligence.
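The proposed "online encoding" metric is prequential: predict each test example from a model fit only on the examples seen so far, update, and total the bits. A minimal sketch with a Laplace-smoothed frequency model (the model and names are illustrative stand-ins, not the paper's):

```python
import math
from collections import Counter

def online_codelength(labels, num_classes, alpha=1.0):
    # Prequential (online) code: predict each label from the labels seen
    # so far, then update; fewer total bits means faster adaptation.
    counts = Counter()
    bits = 0.0
    for i, y in enumerate(labels):
        p = (counts[y] + alpha) / (i + alpha * num_classes)
        bits += -math.log2(p)
        counts[y] += 1
    return bits

# A quickly-learnable stream costs far fewer bits than a balanced one.
easy = [0] * 100
hard = [i % 4 for i in range(100)]
print(online_codelength(easy, 4), online_codelength(hard, 4))
```

The same accounting applies to a neural model: replace the frequency table with a network fine-tuned on the prefix, and the codelength measures how quickly the model picks up the new task.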

evaluation
nlp
nlu
8 days ago

[1811.03188] Solving Jigsaw Puzzles By the Graph Connection Laplacian

9 days ago

We propose a novel mathematical framework to address the problem of automatically solving large jigsaw puzzles. This problem assumes a large image which is cut into equal square pieces that are arbitrarily rotated and shifted and asks to recover the original image given the transformed pieces. The main contribution of this work is a theoretically-guaranteed method for recovering the unknown orientations of the puzzle pieces by using the graph connection Laplacian associated with the puzzle. Iterative application of this method and other methods for recovering the unknown shifts result in a solution for the large jigsaw puzzle problem. This solution is not greedy, unlike many other solutions. Numerical experiments demonstrate the competitive performance of the proposed method.
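The orientation-recovery step can be sketched in numpy for the noiseless case: encode each unknown piece rotation as a 2x2 matrix, fill a block "connection" matrix with pairwise relative rotations, and read the orientations off its top eigenspace. This uses the connection adjacency on a complete graph as a stand-in for the paper's graph connection Laplacian; the setup and names are illustrative.

```python
import numpy as np

def rot(k):
    # 2x2 rotation by k * 90 degrees: puzzle orientations live in Z4.
    t = k * np.pi / 2
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

rng = np.random.default_rng(2)
n = 20
true = rng.integers(0, 4, size=n)   # hidden piece orientations

# Connection matrix: block (i, j) holds the measured relative rotation
# between pieces i and j (noiseless here; the paper's spectral method is
# robust to noisy and missing matches).
S = np.zeros((2 * n, 2 * n))
for i in range(n):
    for j in range(n):
        if i != j:
            S[2*i:2*i+2, 2*j:2*j+2] = rot(true[i]) @ rot(true[j]).T

# The top 2-dimensional eigenspace of S stacks every piece's rotation,
# up to a single global orthogonal ambiguity.
_, vecs = np.linalg.eigh(S)
V = vecs[:, -2:]                      # shape (2n, 2)
blocks = [V[2*i:2*i+2] for i in range(n)]
# Relative orientation of piece i w.r.t. piece 0: the ambiguity cancels.
rel = [B @ np.linalg.inv(blocks[0]) for B in blocks]
print(np.round(rel[1], 3))
```

Each recovered block `rel[i]` equals the true relative rotation of piece i with respect to piece 0, which is exactly the global, non-greedy character the abstract emphasizes.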

jigsaw
graph
laplacian
9 days ago

Homepage — Essentia 2.1-beta5-dev documentation

9 days ago

"Essentia is an open-source C++ library for audio analysis and audio-based music information retrieval. It contains an extensive collection of algorithms including audio input/output functionality, standard digital signal processing blocks, statistical characterization of data, and a large set of spectral, temporal, tonal and high-level music descriptors. [...] The library is also wrapped in Python and includes a number of command-line tools and third-party extensions, which facilitate its use for fast prototyping and allow setting up research experiments very rapidly."

python
libs
audio
dsp
music
mir
9 days ago

Understanding the bin, sbin, usr/bin , usr/sbin split

11 days ago

"You know how Ken Thompson and Dennis Ritchie created Unix on a PDP-7 in 1969? Well around 1971 they upgraded to a PDP-11 with a pair of RK05 disk packs (1.5 megabytes each) for storage. When the operating system grew too big to fit on the first RK05 disk pack (their root filesystem) they let it leak into the second one, which is where all the user home directories lived (which is why the mount was called /usr). They replicated all the OS directories under there (/bin, /sbin, /lib, /tmp...) and wrote files to those new directories because their original disk was out of space. When they got a third disk, they mounted it on /home and relocated all the user directories to there so the OS could consume all the space on both disks and grow to THREE WHOLE MEGABYTES (ooooh!)."

unix
filesystem
history
via:jm
11 days ago

Convert video to audio, Catch up on your video backlog — Listen Later

12 days ago

"Listen Later is a free service for converting videos into an audio podcast, which makes it easier to catch up on your video backlog during chores, errands and commutes. Listen Later works with services supported by youtube-dl."

podcasts
video
audio
12 days ago

Exploring random encoders for sentence classification - Facebook Code

14 days ago

"We set out to determine what was gained, if anything, by using current state-of-the-art methods rather than random methods that combine nothing but pretrained word embeddings. The power of random features has long been known in the machine learning community, so we applied it to this NLP task. We explored three methods: bag of random embedding projections, random LSTMs, and echo state networks. Our findings indicated that much of the lifting power in sentence embeddings comes from word representations. We found that random parameterizations over pretrained word embeddings constituted a very strong baseline and sometimes even matched the performance of well-known sentence encoders such as SkipThought and InferSent. These findings impose a strong baseline for research in representation learning for sentences going forward. We also made important observations about proper experimental protocol for sentence classification evaluation, together with recommendations for future research."
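The simplest of the three methods, the bag of random embedding projections, is easy to sketch: project pretrained word vectors through a fixed random matrix that is never trained, pool over the sentence, and train only a linear classifier on top. The embeddings and dimensions below are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical "pretrained" embeddings: vocab of 1000 words, 50 dims.
vocab_size, emb_dim, proj_dim = 1000, 50, 256
embeddings = rng.normal(size=(vocab_size, emb_dim))

# Bag of random embedding projections: a fixed random matrix, never trained.
W = rng.normal(size=(emb_dim, proj_dim)) / np.sqrt(emb_dim)

def encode(token_ids, pooling="max"):
    # Project each word vector, then pool over the sentence.
    h = embeddings[token_ids] @ W          # (sentence_len, proj_dim)
    return h.max(axis=0) if pooling == "max" else h.mean(axis=0)

sent = [12, 7, 404, 9]
print(encode(sent).shape)  # (256,) sentence vector from frozen random features
```

Everything upstream of the classifier is random and frozen, so any accuracy such a baseline achieves is attributable to the word representations and the pooling, not to a learned encoder.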

nlp
embedding
sentence
random-features
via:hustwj
14 days ago

The rise of the swear nerds | The Outline

15 days ago

"“Fuckbonnet” is a swear-pyrrhic compound. The double-n in the middle and stop consonant at the end make it fun to say, but — and this is crucial — the insult itself does not say anything. What is a fuckbonnet, exactly? Is it something you wear when you get…? Is it a hat that has fallen out of fashion and is now only good for…? There’s no discernible meaning behind the word; it only expresses contempt and the author’s vain originality. I submit that this aspect of the new swears is a feature, not a bug. The reason this formula has become so popular in our time is that it conveys the author’s outrage without running the risk of actually insulting anybody.

The guide to the formula embedded above points to this aspect of the new swears, describing them as “non-gendered insults” that are better than problematic old standbys like “bitch.” Coming up with insults that do not invoke gender or race or disability is good. The point of an insult is to hurt the person so insulted, not to deride an entire class. For this reason, though, the insult must describe or otherwise connect to its target. The signature feature of the new swears is that they do not carry any target-specific content."

language
words
swear
insult

15 days ago

Backreaction: Particle physicists surprised to find I am not their cheer-leader

16 days ago

"You see, the issue they have isn’t that I say particle physics has a problem. Because that’s obvious to everyone who ever had anything to do with the field. The issue is that I publicly say it."

physics
culture
criticism
16 days ago

Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet | OpenReview

16 days ago

"Deep Neural Networks (DNNs) excel on many complex perceptual tasks but it has proven notoriously difficult to understand how they reach their decisions. We here introduce a high-performance DNN architecture on ImageNet whose decisions are considerably easier to explain. Our model, a simple variant of the ResNet-50 architecture called BagNet, classifies an image based on the occurrences of small local image features without taking into account their spatial ordering. This strategy is closely related to the bag-of-feature (BoF) models popular before the onset of deep learning and reaches a surprisingly high accuracy on ImageNet (87.6% top-5 for 32 x 32 px features and AlexNet performance for 16 x 16 px features). The constraint on local features makes it straightforward to analyse how exactly each part of the image influences the classification. Furthermore, the BagNets behave similarly to state-of-the-art deep neural networks such as VGG-16, ResNet-152 or DenseNet-169 in terms of feature sensitivity, error distribution and interactions between image parts. This suggests that the improvements of DNNs over previous bag-of-feature classifiers in the last few years are mostly achieved by better fine-tuning rather than by qualitatively different decision strategies."
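The decision rule is easy to state in code: score every small patch independently with a shared local classifier, then average the class evidence over the bag of patches, discarding where each patch came from. The stand-in "local classifier" below is a random linear map, purely illustrative; in BagNet it is a deep network with a restricted receptive field.

```python
import numpy as np

rng = np.random.default_rng(4)

def bagnet_logits(image, patch_size, local_classifier):
    # Score every non-overlapping patch independently, then average:
    # spatial ordering is discarded, only the bag of patch scores matters.
    H, W = image.shape[:2]
    logits = []
    for y in range(0, H - patch_size + 1, patch_size):
        for x in range(0, W - patch_size + 1, patch_size):
            patch = image[y:y+patch_size, x:x+patch_size]
            logits.append(local_classifier(patch))
    return np.mean(logits, axis=0)

# Stand-in local classifier: a random linear map on the flattened patch.
n_classes = 10
Wc = rng.normal(size=(32 * 32, n_classes))
clf = lambda p: p.reshape(-1) @ Wc

img = rng.normal(size=(224, 224))
scores = bagnet_logits(img, 32, clf)
print(scores.shape)
```

A direct consequence, and the source of the model's interpretability, is that shuffling whole patches leaves the class scores unchanged, and each patch's contribution to the final decision can be read off individually.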

deep-learning
convnet
computer-vision
bagnet
imagenet
16 days ago

Understanding Convolutional Neural Networks for Text Classification

19 days ago

We present an analysis into the inner workings of Convolutional Neural Networks (CNNs) for processing text. CNNs used for computer vision can be interpreted by projecting filters into image space, but for discrete sequence inputs CNNs remain a mystery. We aim to understand the method by which the networks process and classify text. We examine common hypotheses to this problem: that filters, accompanied by global max-pooling, serve as ngram detectors. We show that filters may capture several different semantic classes of ngrams by using different activation patterns, and that global max-pooling induces behavior which separates important ngrams from the rest. Finally, we show practical use cases derived from our findings in the form of model interpretability (explaining a trained model by deriving a concrete identity for each filter, bridging the gap between visualization tools in vision tasks and NLP) and prediction interpretability (explaining predictions).
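The "filters as ngram detectors" hypothesis can be made concrete with a toy example: a width-2 convolution over word vectors, followed by global max-pooling, fires strongly exactly when its target bigram appears. One-hot "embeddings" are used here so the arithmetic is exact; real models use dense learned vectors, but the mechanism is the same.

```python
import numpy as np

# Toy one-hot embeddings for a four-word vocabulary.
words = "movie very good bad".split()
vocab = {w: np.eye(len(words))[i] for i, w in enumerate(words)}

# A width-2 convolutional filter acting as a detector for the bigram "very good".
filt = np.concatenate([vocab["very"], vocab["good"]])
filt = filt / np.linalg.norm(filt)

def conv_max(sentence):
    # Slide the filter over each bigram; global max-pooling keeps only the
    # single most filter-like ngram and discards the rest.
    acts = [np.concatenate([vocab[a], vocab[b]]) @ filt
            for a, b in zip(sentence, sentence[1:])]
    return max(acts)

print(conv_max("movie very good".split()))  # ~1.414: the detector fires
print(conv_max("movie very bad".split()))   # ~0.707: partial match only
```

Max-pooling is what makes the detector position-independent: only the best-matching ngram anywhere in the sentence survives to the classifier.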

nlp
convnet
19 days ago

On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis - W18-5406

19 days ago

Text preprocessing is often the first step in the pipeline of a Natural Language Processing (NLP) system, with potential impact in its final performance. Despite its importance, text preprocessing has not received much attention in the deep learning literature. In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a standard neural text classifier. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. While our experiments show that a simple tokenization of input text is generally adequate, they also highlight significant degrees of variability across preprocessing techniques. This reveals the importance of paying attention to this usually-overlooked step in the pipeline, particularly when comparing different models. Finally, our evaluation provides insights into the best preprocessing practices for training word embeddings.
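Two of the preprocessing axes studied, tokenization and lowercasing, are easy to vary in isolation; lemmatization and multiword grouping would need an NLP toolkit. The helper below is an illustrative sketch, not the paper's pipeline, showing how each choice changes the vocabulary the model sees.

```python
import re

def preprocess(text, lowercase=True, tokenize=True):
    # Vary two preprocessing decisions independently.
    if lowercase:
        text = text.lower()
    if tokenize:
        # Split punctuation off words: the "simple tokenization" baseline.
        return re.findall(r"\w+|[^\w\s]", text)
    return text.split()

doc = "The movie was great. GREAT, even!"
for lc in (True, False):
    toks = preprocess(doc, lowercase=lc)
    print(lc, sorted(set(toks)))
```

Even on one sentence, lowercasing merges "great" and "GREAT" into a single vocabulary entry; at corpus scale such choices shift vocabulary size and hence the word embeddings a classifier trains on, which is the variability the paper measures.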

neural-net
nlp
text
preprocessing
19 days ago

Gender Shades

24 days ago

"How well do IBM, Microsoft, and Face++ AI services guess the gender of a face?"

machine-learning
computer-vision
facial-recognition
gender
24 days ago

[1901.08162] Causal Reasoning from Meta-reinforcement Learning

25 days ago

Discovering and exploiting the causal structure in the environment is a crucial challenge for intelligent agents. Here we explore whether causal reasoning can emerge via meta-reinforcement learning. We train a recurrent network with model-free reinforcement learning to solve a range of problems that each contain causal structure. We find that the trained agent can perform causal reasoning in novel situations in order to obtain rewards. The agent can select informative interventions, draw causal inferences from observational data, and make counterfactual predictions. Although established formal causal reasoning algorithms also exist, in this paper we show that such reasoning can arise from model-free reinforcement learning, and suggest that causal reasoning in complex settings may benefit from the more end-to-end learning-based approaches presented here. This work also offers new strategies for structured exploration in reinforcement learning, by providing agents with the ability to perform -- and interpret -- experiments.

causality
meta-learning
reinforcement-learning
25 days ago

Enhancing human learning via spaced repetition optimization | PNAS

27 days ago

"Understanding human memory has been a long-standing problem in various scientific disciplines. Early works focused on characterizing human memory using small-scale controlled experiments and these empirical studies later motivated the design of spaced repetition algorithms for efficient memorization. However, current spaced repetition algorithms are rule-based heuristics with hard-coded parameters, which do not leverage the automated fine-grained monitoring and greater degree of control offered by modern online learning platforms. In this work, we develop a computational framework to derive optimal spaced repetition algorithms, specially designed to adapt to the learners’ performance. A large-scale natural experiment using data from a popular language-learning online platform provides empirical evidence that the spaced repetition algorithms derived using our framework are significantly superior to alternatives."
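The rule-based heuristics the paper improves on can be sketched with a classic memory model: recall probability decays exponentially, and each successful review multiplies the memory strength, so optimal review gaps grow geometrically. The threshold-and-boost scheduler below is a toy stand-in for the paper's optimal-control treatment; the parameter names are illustrative.

```python
import math

def recall_prob(elapsed, strength):
    # Exponential forgetting curve: p(t) = exp(-t / s).
    return math.exp(-elapsed / strength)

def schedule(n_reviews, strength=1.0, boost=2.0, threshold=0.5):
    # Review whenever predicted recall drops to the threshold; each
    # successful review multiplies the memory strength.
    t, times = 0.0, []
    for _ in range(n_reviews):
        # Solve exp(-dt / strength) = threshold for the next gap.
        dt = -strength * math.log(threshold)
        t += dt
        times.append(t)
        strength *= boost
    return times

times = schedule(5)
gaps = [b - a for a, b in zip([0.0] + times, times)]
print(gaps)  # gaps grow geometrically as memory strengthens
```

The paper's contribution is to replace the hard-coded `boost` and `threshold` with review intensities derived from the learner's observed performance, via stochastic optimal control of this kind of forgetting process.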

point-processes
control
spaced-repetition
duolingo
memory
27 days ago

JRMeyer/multi-task-kaldi: An example directory for running Multi-Task Learning training on Kaldi neural networks. In Kaldi-speak, this is an egs dir for nnet3 training.

29 days ago

The collection of scripts in this repository represent a template for training neural networks via Multi-Task Learning in Kaldi. This repo is heavily based on the existing Kaldi multilingual Babel example directory.

asr
kaldi
transfer-learning
multi-task
29 days ago

[1811.06031] A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks

29 days ago

Much effort has been devoted to evaluate whether multi-task learning can be leveraged to learn rich representations that can be used in various Natural Language Processing (NLP) down-stream applications. However, there is still a lack of understanding of the settings in which multi-task learning has a significant effect. In this work, we introduce a hierarchical model trained in a multi-task learning setup on a set of carefully selected semantic tasks. The model is trained in a hierarchical fashion to introduce an inductive bias by supervising a set of low level tasks at the bottom layers of the model and more complex tasks at the top layers of the model. This model achieves state-of-the-art results on a number of tasks, namely Named Entity Recognition, Entity Mention Detection and Relation Extraction without hand-engineered features or external NLP tools like syntactic parsers. The hierarchical training supervision induces a set of shared semantic representations at lower layers of the model. We show that as we move from the bottom to the top layers of the model, the hidden states of the layers tend to represent more complex semantic information.

neural-net
nlp
multi-task
29 days ago

[1705.08142] Latent Multi-task Architecture Learning

29 days ago

Multi-task learning (MTL) allows deep neural networks to learn from related tasks by sharing parameters with other networks. In practice, however, MTL involves searching an enormous space of possible parameter sharing architectures to find (a) the layers or subspaces that benefit from sharing, (b) the appropriate amount of sharing, and (c) the appropriate relative weights of the different task losses. Recent work has addressed each of the above problems in isolation. In this work we present an approach that learns a latent multi-task architecture that jointly addresses (a)--(c). We present experiments on synthetic data and data from OntoNotes 5.0, including four different tasks and seven different domains. Our extension consistently outperforms previous approaches to learning latent architectures for multi-task problems and achieves up to 15% average error reductions over common approaches to MTL.

neural-net
nlp
multi-task
29 days ago

[1711.02257] GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks

4 weeks ago

Deep multitask networks, in which one neural network produces multiple predictive outputs, can offer better speed and performance than their single-task counterparts but are challenging to train properly. We present a gradient normalization (GradNorm) algorithm that automatically balances training in deep multitask models by dynamically tuning gradient magnitudes. We show that for various network architectures, for both regression and classification tasks, and on both synthetic and real datasets, GradNorm improves accuracy and reduces overfitting across multiple tasks when compared to single-task networks, static baselines, and other adaptive multitask loss balancing techniques. GradNorm also matches or surpasses the performance of exhaustive grid search methods, despite only involving a single asymmetry hyperparameter α. Thus, what was once a tedious search process that incurred exponentially more compute for each task added can now be accomplished within a few training runs, irrespective of the number of tasks. Ultimately, we will demonstrate that gradient manipulation affords us great control over the training dynamics of multitask networks and may be one of the keys to unlocking the potential of multitask learning.
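The balancing rule can be sketched without autograd: each task's weighted gradient norm is pulled toward a common target, the mean norm scaled by that task's relative inverse training rate raised to the power α. The numpy update below is a simplified sketch of one GradNorm step, not the paper's implementation; the step size and clipping are assumptions.

```python
import numpy as np

def gradnorm_weights(w, grad_norms, loss_ratios, alpha=1.5, lr=0.025):
    # One simplified GradNorm step: pull each task's weighted gradient
    # norm toward a shared target, then renormalize the loss weights.
    w = np.asarray(w, dtype=float)
    G = np.asarray(grad_norms, dtype=float) * w      # weighted grad norms
    r = loss_ratios / np.mean(loss_ratios)           # relative inverse training rate
    target = np.mean(G) * r ** alpha                 # treated as a constant
    # Gradient of sum_i |G_i - target_i| with respect to each w_i.
    dw = np.sign(G - target) * np.asarray(grad_norms, dtype=float)
    w = np.clip(w - lr * dw, 1e-6, None)
    # Keep the weights summing to the number of tasks, as in the paper.
    return w * len(w) / w.sum()

# A task whose loss has barely improved (high loss ratio) gains weight.
w = np.ones(3)
for _ in range(50):
    w = gradnorm_weights(w, grad_norms=[1.0, 1.0, 1.0],
                         loss_ratios=np.array([1.0, 1.0, 2.0]))
print(w)  # the lagging third task ends up with the largest weight
```

The single hyperparameter α controls how aggressively slow-training tasks are boosted, which is the asymmetry knob the abstract refers to.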

neural-net
normalization
gradient
multi-task
4 weeks ago

Should Companies Be Allowed to Issue Stock with Unequal Voting Rights?

4 weeks ago

"While media companies, such as The New York Times Co., Comcast, DISH Network, AMC holdings, Liberty Media, News Corporation, and Viacom have traditionally had dual-class shares — arguably to maintain news independence — a more important recent development is the widespread adoption of dual-class structure by technology companies. Almost 50% of recent technology listings have a dual-class status. We explored reasons for the growing use of the dual-class structure in an HBS case study among technology companies. Our nickel summary is that their growing popularity is due to the increasing importance of intangible investments, the rise of activist investors, and the decline of other protection mechanisms available to existing management such as staggered boards and poison pills. A dual-class structure, offering immunity against proxy contests initiated by short-term investors, could be optimal if it enables founder-managers to ignore pressures from the capital markets and avoid myopic actions such as cutting research and development and delaying corporate restructuring."

corporation
stock
multi-class
4 weeks ago

NEMISIG 2019 | Brought to you by Brooklyn College

4 weeks ago

"NEMISIG (North East Music Information Special Interest Group) is a yearly informal meeting for Music Information Retrieval researchers who work at the intersection of computer science, mathematics, and music."

workshops
nemisig
music
ir
4 weeks ago

You don't know JAX

4 weeks ago

"JAX is a Python library which augments numpy and Python code with function transformations which make it trivial to perform operations common in machine learning programs. Concretely, this makes it simple to write standard Python/numpy code and immediately be able to

Compute the derivative of a function via a successor to autograd

Just-in-time compile a function to run efficiently on an accelerator via XLA

Automagically vectorize a function, so that e.g. you can process a “batch” of data in parallel"
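The three transformations from the quote compose freely; a minimal self-contained example (the toy loss function is mine, not from the article):

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum((x @ w) ** 2)

# 1. grad: derivative of a scalar function, via autograd's successor.
g = jax.grad(loss)

# 2. jit: compile through XLA for the accelerator (or CPU) backend.
fast_g = jax.jit(g)

# 3. vmap: vectorize over a batch dimension without writing the loop.
batched_loss = jax.vmap(loss, in_axes=(None, 0))

w = jnp.ones(3)
xs = jnp.arange(6.0).reshape(2, 3)
print(fast_g(w, xs[0]))      # [0. 6. 12.]
print(batched_loss(w, xs))   # [9. 144.]
```

Because each transformation returns an ordinary function, they nest: `jax.jit(jax.vmap(jax.grad(loss), in_axes=(None, 0)))` is a compiled per-example-gradient function.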

jax

4 weeks ago

Martians Build Two Immense Canals In Two Years - The New York Times

5 weeks ago

"Vast Engineering Works Accomplished in an Incredibly Short Time by Our Planetary Neighbors -Wonders of the September Sky."

nyt
mars
history
5 weeks ago

Compiler Explorer

5 weeks ago

"Compiler Explorer is an interactive online compiler which shows the assembly output of compiled C++, Rust, Go (and many more) code."

webapps
compiler
c
c++
go
rust
5 weeks ago

enkimute/ganja.js: Geometric Algebra for Javascript (with operator overloading and algebraic literals)

5 weeks ago

"Ganja.js is a Geometric Algebra code generator for javascript. It generates Clifford algebras and sub-algebras of any signature and implements operator overloading and algebraic constants.

(Mathematically, an algebra generated by ganja.js is a graded exterior (Grassmann) algebra (or one of its subalgebras) with a non-metric outer product, extended (Clifford) with geometric and contraction inner products, a Poincare duality operator and the main involutions and morphisms.)

(Technically, ganja.js is a code generator producing classes that reificate algebraic literals and expressions by using reflection, a built-in tokenizer and a simple AST translator to rewrite functions containing algebraic constructs to their procedural counterparts.)

(Practically, ganja.js enables real math syntax inside javascript, with element, vector and matrix operations over reals, complex numbers, dual numbers, hyperbolic numbers, vectors, spacetime events, quaternions, dual quaternions, biquaternions or any other Clifford Algebra.)"

javascript
libs
geometric-algebra

5 weeks ago

[1502.05767] Automatic differentiation in machine learning: a survey

5 weeks ago

Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD), also called algorithmic differentiation or simply "autodiff", is a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed as computer programs. AD is a small but established field with applications in areas including computational fluid dynamics, atmospheric sciences, and engineering design optimization. Until very recently, the fields of machine learning and AD have largely been unaware of each other and, in some cases, have independently discovered each other's results. Despite its relevance, general-purpose AD has been missing from the machine learning toolbox, a situation slowly changing with its ongoing adoption under the names "dynamic computational graphs" and "differentiable programming". We survey the intersection of AD and machine learning, cover applications where AD has direct relevance, and address the main implementation techniques. By precisely defining the main differentiation techniques and their interrelationships, we aim to bring clarity to the usage of the terms "autodiff", "automatic differentiation", and "symbolic differentiation" as these are encountered more and more in machine learning settings.
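The distinction the survey draws between AD and both symbolic and numerical differentiation is easiest to see in forward mode, which can be implemented in a few lines with dual numbers: carry a (value, derivative) pair through every operation. This is a standard textbook sketch, not code from the survey.

```python
import math

class Dual:
    # Forward-mode AD: propagate (value, derivative) through each operation.
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)  # product rule
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def derivative(f, x):
    # Seed the input with dot = 1 and read the derivative off the output.
    return f(Dual(x, 1.0)).dot

# d/dx [x * sin(x) + x] at x = 2: exact to machine precision,
# with no expression swell and no finite-difference truncation error.
f = lambda x: x * sin(x) + x
print(derivative(f, 2.0))
```

Reverse mode, the generalization of backpropagation, evaluates the same chain rule in the opposite order, which is why one sweep yields the gradient with respect to all inputs at once.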

surveys
automatic-differentiation
machine-learning
5 weeks ago

[1901.03403] Mean Estimation from One-Bit Measurements

5 weeks ago

We consider the problem of estimating the mean of a symmetric log-concave distribution under the following constraint: only a single bit per sample from this distribution is available to the estimator. We study the mean squared error (MSE) risk in this estimation as a function of the number of samples, and hence the number of bits, from this distribution. Under an adaptive setting in which each bit is a function of the current sample and the previously observed bits, we show that the optimal relative efficiency compared to the sample mean is the efficiency of the median. For example, in estimating the mean of a normal distribution, a constraint of one bit per sample incurs a penalty of π/2 in sample size compared to the unconstrained case. We also consider a distributed setting where each one-bit message is only a function of a single sample. We derive lower bounds on the MSE in this setting, and show that the optimal efficiency can only be attained at a finite number of points in the parameter space. Finally, we analyze a distributed setting where the bits are obtained by comparing each sample against a prescribed threshold. Consequently, we consider the threshold density that minimizes the maximal MSE. Our results indicate that estimating the mean from one-bit measurements is equivalent to estimating the sample median from these measurements. In the adaptive case, this estimate can be done with vanishing error for any point in the parameter space. In the distributed case, this estimate can be done with vanishing error only for a finite number of possible values for the unknown mean.
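The adaptive setting has a particularly clean sketch: each sender transmits only the sign of its sample relative to the current estimate, and a Robbins-Monro update drives the estimate to the median, which equals the mean for a symmetric distribution. The step-size constant below is an illustrative choice, not from the paper.

```python
import numpy as np

def one_bit_mean(samples, step_scale=2.0):
    # Adaptive one-bit estimation: bit_k = sign(x_k - theta_k), then a
    # decreasing-step stochastic approximation update toward the median.
    theta = 0.0
    for k, x in enumerate(samples, start=1):
        bit = 1.0 if x > theta else -1.0     # the single transmitted bit
        theta += step_scale / k * bit
    return theta

rng = np.random.default_rng(6)
true_mean = 1.7
samples = rng.normal(loc=true_mean, scale=1.0, size=200_000)
print(one_bit_mean(samples))  # close to 1.7 despite one bit per sample
```

The pi/2 penalty in the abstract is exactly the asymptotic efficiency of the median relative to the sample mean for a normal distribution, which is what this sign-based estimator attains.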

statistics
estimation
5 weeks ago

[1802.07810] Manipulating and Measuring Model Interpretability

5 weeks ago

Despite a growing literature on creating interpretable machine learning methods, there have been few experimental studies of their effects on end users. We present a series of large-scale, randomized, pre-registered experiments in which participants were shown functionally identical models that varied only in two factors thought to influence interpretability: the number of input features and the model transparency (clear or black-box). Participants who were shown a clear model with a small number of features were better able to simulate the model's predictions. However, contrary to what one might expect when manipulating interpretability, we found no significant difference in multiple measures of trust across conditions. Even more surprisingly, increased transparency hampered people's ability to detect when a model has made a sizeable mistake. These findings emphasize the importance of studying how models are presented to people and empirically verifying that interpretable models achieve their intended effects on end users.

machine-learning
interpretation
5 weeks ago

Designing neural networks through neuroevolution | Nature Machine Intelligence

5 weeks ago

Much of recent machine learning has focused on deep learning, in which neural network weights are trained through variants of stochastic gradient descent. An alternative approach comes from the field of neuroevolution, which harnesses evolutionary algorithms to optimize neural networks, inspired by the fact that natural brains themselves are the products of an evolutionary process. Neuroevolution enables important capabilities that are typically unavailable to gradient-based approaches, including learning neural network building blocks (for example activation functions), hyperparameters, architectures and even the algorithms for learning themselves. Neuroevolution also differs from deep learning (and deep reinforcement learning) by maintaining a population of solutions during search, enabling extreme exploration and massive parallelization. Finally, because neuroevolution research has (until recently) developed largely in isolation from gradient-based neural network research, it has developed many unique and effective techniques that should be effective in other machine learning areas too. This Review looks at several key aspects of modern neuroevolution, including large-scale computing, the benefits of novelty and diversity, the power of indirect encoding, and the field’s contributions to meta-learning and architecture search. Our hope is to inspire renewed interest in the field as it meets the potential of the increasing computation available today, to highlight how many of its ideas can provide an exciting resource for inspiration and hybridization to the deep learning, deep reinforcement learning and machine learning communities, and to explain how neuroevolution could prove to be a critical tool in the long-term pursuit of artificial general intelligence.
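The population-based, gradient-free character of neuroevolution is easy to demonstrate at toy scale: evolve the flat weight vector of a tiny network to fit XOR using nothing but mutation and selection. This bare-bones elitist evolution strategy is a pedagogical sketch, far simpler than the indirect encodings and diversity methods the review covers.

```python
import numpy as np

rng = np.random.default_rng(7)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR: not linearly separable

def forward(params, x):
    # Tiny 2-2-1 network; params is a flat "genome" of 9 weights.
    W1 = params[:4].reshape(2, 2)
    b1 = params[4:6]
    w2, b2 = params[6:8], params[8]
    h = np.tanh(x @ W1 + b1)
    return h @ w2 + b2

def fitness(params):
    return -np.mean((forward(params, X) - y) ** 2)

# Bare-bones evolution strategy: no gradients, just mutate and select.
pop = rng.normal(size=(64, 9))
for gen in range(500):
    scores = np.array([fitness(p) for p in pop])
    elite = pop[np.argsort(scores)[-16:]]            # keep the best 16
    children = elite[rng.integers(0, 16, size=48)]   # clone...
    children = children + 0.15 * rng.normal(size=children.shape)  # ...mutate
    pop = np.vstack([elite, children])

best = pop[np.argmax([fitness(p) for p in pop])]
print(fitness(best))  # negative MSE approaches 0 without any gradient step
```

Nothing here requires the fitness to be differentiable, which is why the same loop can also evolve activation functions, hyperparameters, or architectures, the capabilities the review highlights as unavailable to gradient-based training.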

surveys
neural-net
meta-learning
architecture-search
evolutionary-algorithms
5 weeks ago

Size-Independent Sample Complexity of Neural Networks

5 weeks ago

We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these complexity bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are derived using some novel techniques, which may be of independent interest.

neural-net
complexity
5 weeks ago

[1812.08951] Analysis Methods in Neural Language Processing: A Survey

5 weeks ago

"The field of natural language processing has seen impressive progress in recent years, with neural network models replacing many of the traditional systems. A plethora of new models have been proposed, many of which are thought to be opaque compared to their feature-rich counterparts. This has led researchers to analyze, interpret, and evaluate neural networks in novel and more fine-grained ways. In this survey paper, we review analysis methods in neural language processing, categorize them according to prominent research trends, highlight existing limitations, and point to potential directions for future work."

surveys
nlp
analysis
5 weeks ago

Classifying Prediction Errors

5 weeks ago

"Understanding prediction errors and determining how to fix them is critical to building effective predictive systems. In this paper, we delineate four types of prediction errors (mislabeling, representation, learner and boundary errors) and demonstrate that these four types characterize all prediction errors. In addition, we describe potential remedies and tools that can be used to reduce the uncertainty when trying to determine the source of a prediction error and when trying to take action to remove a prediction error."

machine-learning
error
interactive-learning
5 weeks ago

[1804.02476v2] Associative Compression Networks for Representation Learning

6 weeks ago

This paper introduces Associative Compression Networks (ACNs), a new framework for variational autoencoding with neural networks. The system differs from existing variational autoencoders (VAEs) in that the prior distribution used to model each code is conditioned on a similar code from the dataset. In compression terms this equates to sequentially transmitting the dataset using an ordering determined by proximity in latent space. Since the prior need only account for local, rather than global variations in the latent space, the coding cost is greatly reduced, leading to rich, informative codes. Crucially, the codes remain informative when powerful, autoregressive decoders are used, which we argue is fundamentally difficult with normal VAEs. Experimental results on MNIST, CIFAR-10, ImageNet and CelebA show that ACNs discover high-level latent features such as object class, writing style, pose and facial expression, which can be used to cluster and classify the data, as well as to generate diverse and convincing samples. We conclude that ACNs are a promising new direction for representation learning: one that steps away from IID modelling, and towards learning a structured description of the dataset as a whole.

representation-learning
acn
vae
data-ordering
6 weeks ago

The Great Suspender - Chrome Web Store

6 weeks ago

"A lightweight extension to reduce chrome's memory footprint. Perfect if you have a lot of tabs open at the same time. Tabs that have not been viewed after a configurable length of time will be automagically suspended in the background, freeing up the memory and CPU being consumed by that tab."

chrome
extensions
tab
memory
6 weeks ago

Sure thing. A few years ago, everyone switched their deep nets to "residual net... | Hacker News

9 weeks ago

resnet: "The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once.

In the last couple of years a few different groups noticed that this looks like a primitive ODE solver (Euler's method) [...]

We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net.

Now, we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: We can loosen the numerical tolerance of the solver to make our nets faster at test time. We can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time."
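The residual-step-as-Euler-step analogy above can be sketched in a few lines (a toy illustration, not the authors' code; the fixed function f stands in for a learned layer):

```python
import numpy as np

# Toy "layer": in a real network f would be learned; here it is fixed.
def f(h):
    return 0.1 * np.tanh(h)

# ResNet view: each block adds a small correction to the current state,
# h <- h + f(h), which is exactly one Euler step of the ODE dh/dt = f(h).
def resnet_forward(h, n_layers):
    for _ in range(n_layers):
        h = h + f(h)
    return h

# ODE-net view: treat depth as continuous time and integrate dh/dt = f(h)
# with a solver; here fixed-step Euler (an adaptive solver drops in the
# same way, which is where the tunable tolerance comes from).
def euler_solve(h, T=1.0, dt=0.01):
    for _ in range(int(round(T / dt))):
        h = h + dt * f(h)
    return h
```

With dt = 1 the solver reproduces the ResNet layer-by-layer update exactly; shrinking dt refines the same trajectory.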

neural-net
ode
resnet
9 weeks ago

[1706.04902] A Survey Of Cross-lingual Word Embedding Models

9 weeks ago

Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.

surveys
nlp
word-embedding
cross-embedding
via:hustwj
9 weeks ago

[1812.03253] Counterfactuals uncover the modular structure of deep generative models

9 weeks ago

Deep generative models such as Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) are important tools to capture and investigate the properties of complex empirical data. However, the complexity of their inner elements makes their functioning challenging to assess and modify. In this respect, these architectures behave as black box models. In order to better understand the function of such networks, we analyze their modularity based on the counterfactual manipulation of their internal variables. Experiments with face images support that modularity between groups of channels is achieved to some degree within convolutional layers of vanilla VAE and GAN generators. This helps understand the functional organization of these systems and allows designing meaningful transformations of the generated images without further training.

neural-net
gan
vae
analysis
interpretation
counterfactual
9 weeks ago

Adversarial Robustness - Theory and Practice

10 weeks ago

"This web page contains materials to accompany the NeurIPS 2018 tutorial, “Adversarial Robustness: Theory and Practice”, by Zico Kolter and Aleksander Madry. The notes are in very early draft form, and we will be updating them (organizing material more, writing them in a more consistent form with the relevant citations, etc) for an official release in early 2019. Until then, however, we hope they are still a useful reference that can be used to explore some of the key ideas and methodology behind adversarial robustness, from standpoints of both generating adversarial attacks on classifiers and training classifiers that are inherently robust."

adversarial-examples
10 weeks ago

Which US cities have good and bad public transportation - Vox

10 weeks ago

"Christof Spieler, a structural engineer and urban planner from Houston, has lots of opinions about public transit in America and elsewhere. In his new book, Trains, Buses, People: An Opinionated Atlas of US Transit, he maps out 47 metro areas that have rail transit or bus rapid transit, ranks the best and worst systems, and offers advice on how to build better networks."

cities
transportation
books
10 weeks ago

Compact Representation of Uncertainty in Clustering

10 weeks ago

For many classic structured prediction problems, probability distributions over the dependent variables can be efficiently computed using widely-known algorithms and data structures (such as forward-backward, and its corresponding trellis for exact probability distributions in Markov models). However, we know of no previous work studying efficient representations of exact distributions over clusterings. This paper presents definitions and proofs for a dynamic-programming inference procedure that computes the partition function, the marginal probability of a cluster, and the MAP clustering---all exactly. Rather than the Nth Bell number, these exact solutions take time and space proportional to the substantially smaller powerset of N. Indeed, we improve upon the time complexity of the algorithm introduced by Kohonen and Corander (2016) for this problem by a factor of N. While still large, this previously unknown result is intellectually interesting in its own right, makes feasible exact inference for important real-world small data applications (such as medicine), and provides a natural stepping stone towards sparse-trellis approximations that enable further scalability (which we also explore). In experiments, we demonstrate the superiority of our approach over approximate methods in analyzing real-world gene expression data used in cancer treatment.
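The 2^N-versus-Bell-number distinction can be made concrete with a bitmask recursion (a hypothetical sketch, not the paper's trellis code): fix the lowest-indexed unclustered element, enumerate every cluster containing it, and recurse on the remainder. With unit cluster weights this counts partitions, i.e. recovers the Bell numbers, while only ever memoizing the 2^N subsets.

```python
from functools import lru_cache

def partition_function(n, phi):
    """Sum over all partitions of {0..n-1} of the product of per-cluster
    weights phi(cluster), where phi maps a frozenset of indices to a
    nonnegative weight. Time/space is over the 2^n subsets, not the
    (much larger) Bell-number set of partitions."""
    @lru_cache(maxsize=None)
    def Z(mask):
        if mask == 0:
            return 1.0
        low = mask & -mask          # fix the lowest-indexed element
        rest = mask ^ low
        total = 0.0
        sub = rest                  # enumerate all clusters containing it
        while True:
            cluster = low | sub
            members = frozenset(i for i in range(n) if cluster >> i & 1)
            total += phi(members) * Z(mask ^ cluster)
            if sub == 0:
                break
            sub = (sub - 1) & rest  # next subset of `rest`
        return total
    return Z((1 << n) - 1)
```

For example, `partition_function(4, lambda c: 1.0)` evaluates to 15, the fourth Bell number.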

clustering
uncertainty
10 weeks ago

[1805.07820] Targeted Adversarial Examples for Black Box Audio Systems

10 weeks ago

The application of deep recurrent networks to audio transcription has led to impressive gains in automatic speech recognition (ASR) systems. Many have demonstrated that small adversarial perturbations can fool deep neural networks into incorrectly predicting a specified target with high confidence. Current work on fooling ASR systems has focused on white-box attacks, in which the model architecture and parameters are known. In this paper, we adopt a black-box approach to adversarial generation, combining the approaches of both genetic algorithms and gradient estimation to solve the task. We achieve an 89.25% targeted attack similarity after 3000 generations while maintaining 94.6% audio file similarity.
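The gradient-estimation half of that combination can be sketched with random-direction finite differences, which needs only loss queries and no model internals (a generic black-box illustration, not the paper's exact scheme):

```python
import numpy as np

def estimate_gradient(loss, x, sigma=1e-3, n_samples=50, rng=None):
    """Black-box gradient estimate of `loss` at x: probe random
    directions u and average u times the symmetric finite difference
    (loss(x + sigma*u) - loss(x - sigma*u)) / (2*sigma)."""
    if rng is None:
        rng = np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        g += (loss(x + sigma * u) - loss(x - sigma * u)) / (2 * sigma) * u
    return g / n_samples
```

The estimate is noisy but unbiased, so it can drive gradient steps against a model exposed only through its predictions.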

adversarial-examples
audio
black-box
10 weeks ago

[1803.01814] Norm matters: efficient and accurate normalization schemes in deep networks

10 weeks ago

Over the past few years batch-normalization has been commonly used in deep networks, allowing faster training and high performance for a wide variety of applications. However, the reasons behind its merits remained unanswered, with several shortcomings that hindered its use for certain tasks. In this work we present a novel view on the purpose and function of normalization methods and weight-decay, as tools to decouple weights' norm from the underlying optimized objective. We also improve the use of weight-normalization and show the connection between practices such as normalization, weight decay and learning-rate adjustments. Finally, we suggest several alternatives to the widely used L2 batch-norm, using normalization in L1 and L∞ spaces that can substantially improve numerical stability in low-precision implementations as well as provide computational and memory benefits. We demonstrate that such methods enable the first batch-norm alternative to work for half-precision implementations.
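The L1 variant the abstract mentions can be sketched as follows: replace the per-feature standard deviation with the mean absolute deviation, rescaled by sqrt(pi/2) so the two statistics agree for Gaussian activations (a minimal forward-pass sketch, omitting the learned scale/shift parameters and running statistics):

```python
import numpy as np

def batch_norm_l1(x, eps=1e-5):
    """L1 batch normalization over a (batch, features) array: center each
    feature, then divide by sqrt(pi/2) times its mean absolute deviation.
    For Gaussian inputs sqrt(pi/2) * MAD approximates the standard
    deviation, so the output scale matches ordinary (L2) batch norm while
    avoiding squares/square roots that overflow in low precision."""
    mu = x.mean(axis=0)
    centered = x - mu
    mad = np.abs(centered).mean(axis=0)
    return centered / (np.sqrt(np.pi / 2) * mad + eps)
```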

neural-net
normalization
10 weeks ago

[1806.10909] ResNet with one-neuron hidden layers is a Universal Approximator

10 weeks ago

We demonstrate that a very deep ResNet with stacked modules with one neuron per hidden layer and ReLU activation functions can uniformly approximate any Lebesgue integrable function in d dimensions, i.e. ℓ1(ℝ^d). Because of the identity mapping inherent to ResNets, our network has alternating layers of dimension one and d. This stands in sharp contrast to fully connected networks, which are not universal approximators if their width is the input dimension d [Lu et al, 2017; Hanin and Sellke, 2017]. Hence, our result implies an increase in representational power for narrow deep networks by the ResNet architecture.

resnet
neural-net
universal-approximator
10 weeks ago

Modern Neural Networks Generalize on Small Data Sets

10 weeks ago

In this paper, we use a linear program to empirically decompose fitted neural networks into ensembles of low-bias sub-networks. We show that these sub-networks are relatively uncorrelated which leads to an internal regularization process, very much like a random forest, which can explain why a neural network is surprisingly resistant to overfitting. We then demonstrate this in practice by applying large neural networks, with hundreds of parameters per training observation, to a collection of 116 real-world data sets from the UCI Machine Learning Repository. This collection of data sets contains a much smaller number of training examples than the types of image classification tasks generally studied in the deep learning literature, as well as non-trivial label noise. We show that even in this setting deep neural nets are capable of achieving superior classification accuracy without overfitting.

neural-net
generalization
small-data
richard-berk
10 weeks ago

[1808.01204] Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data

10 weeks ago

Neural networks have many successful applications, while much less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data comes from mixtures of well-separated distributions, we prove that SGD learns a network with a small generalization error, albeit the network has enough capacity to fit arbitrary labels. Furthermore, the analysis provides interesting insights into several aspects of learning neural networks and can be verified based on empirical studies on synthetic data and on the MNIST dataset.

neural-net
sgd
generalization
10 weeks ago

active-learning
advice
ai
ajax
algorithms
amazon
analysis
architecture
argumentation
art
asp.net
asr
audio
bayesian
bioinformatics
biology
blogs
book
books
browser
business
c
c++
classification
cli
clustering
code
color
comparison
compsci
computer-vision
concurrency
convnet
courses
critique
css
culture
d3
data
data-analysis
data-mining
database
datasets
debugging
deep-learning
design
dip
distcomp
django
dsp
dtw
economics
education
email
erlang
evolution
extension
facebook
finance
firefox
food
free
functional
funny
gan
genetics
geo
geometry
git
google
graph
graphical-models
graphics
gui
haskell
history
html
http
humor
image
information-theory
internet
ir
java
javascript
journalism
jquery
knn
language
latex
library
libs
links
linux
logic
mac
machine-learning
mapping
maps
markets
math
matlab
matplotlib
matrix
memory
mobile
model-selection
music
net
networks
neural-net
nlp
notes
numeric
numpy
nyc
opensource
optimization
papers
parallel
pdf
people
performance
philosophy
photos
physics
pkg
playlist
plc
plotting
plugins
politics
postgresql
privacy
probability
productivity
proglang
programming
psychology
python
r
read
rec
recipes
ref
reference
regression
regularization
reinforcement-learning
research
rest
reviews
rnn
ruby
scalability
scaling
scicomp
science
scifi
search
security
sgd
similarity
slides
social-software
software
speech
sql
startup
statcomp
statistics
stats
submodularity
surveys
swdev
talks
teaching
tech
tensorflow
testing
text
thesis
time-series
tips
tutorial
tutorials
twitter
ui
unix
utils
via:arthegall
via:chl
via:cshalizi
video
videos
vim
visualization
web
webapp
webapps
webdev
windows
writing