learning-from-data   64

[1111.3304] Eigenvector Synchronization, Graph Rigidity and the Molecule Problem
"The graph realization problem has received a great deal of attention in recent years, due to its importance in applications such as wireless sensor networks and structural biology.…"
algorithms  statistics  structure  learning-from-data  nudge-targets 
11 weeks ago by Vaguery
[1201.5568] Dynamic trees for streaming and massive data contexts
"Data collection at a massive scale is becoming ubiquitous in a wide variety of settings, from vast offline databases to streaming real-time information. Learning algorithms deployed in such contexts must rely on single-pass inference, where the data history is never revisited. In streaming contexts, learning must also be temporally adaptive to remain up-to-date against unforeseen changes in the data generating mechanism. Although rapidly growing, the online Bayesian inference literature remains challenged by massive data and transient, evolving data streams. Non-parametric modelling techniques can prove particularly ill-suited, as the complexity of the model is allowed to increase with the sample size. In this work, we take steps to overcome these challenges by porting standard streaming techniques, like data discarding and downweighting, into a fully Bayesian framework via the use of informative priors and active learning heuristics. We showcase our methods by augmenting a modern non-parametric modelling framework, dynamic trees, and illustrate its performance on a number of practical examples. The end product is a powerful streaming regression and classification tool, whose performance compares favourably to the state-of-the-art."
data-analysis  learning-from-data  algorithms  drinking-from-the-firehose  nudge  data-mining 
january 2012 by Vaguery
[1109.2618] Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning
We introduce a machine learning model to predict atomization energies of a diverse set of organic molecules, based on nuclear charges and atomic positions only. The problem of solving the molecular Schrödinger equation is mapped onto a non-linear statistical regression problem of reduced complexity. Regression models are trained on and compared to atomization energies computed with hybrid density-functional theory. Cross-validation over more than seven thousand small organic molecules yields a mean absolute error of ~10 kcal/mol. Applicability is demonstrated for the prediction of molecular atomization potential energy curves.
machine-learning  learning-from-data  biochemistry  computational-science  nudge-targets 
january 2012 by Vaguery
[1109.3248] Reconstruction of sequential data with density models
We introduce the problem of reconstructing a sequence of multidimensional real vectors where some of the data are missing. This problem contains regression and mapping inversion as particular cases where the pattern of missing data is independent of the sequence index. The problem is hard because it involves possibly multivalued mappings at each vector in the sequence, where the missing variables can take more than one value given the present variables; and the set of missing variables can vary from one vector to the next. To solve this problem, we propose an algorithm based on two redundancy assumptions: vector redundancy (the data live in a low-dimensional manifold), so that the present variables constrain the missing ones; and sequence redundancy (e.g. continuity), so that consecutive vectors constrain each other. We capture the low-dimensional nature of the data in a probabilistic way with a joint density model, here the generative topographic mapping, which results in a Gaussian mixture. Candidate reconstructions at each vector are obtained as all the modes of the conditional distribution of missing variables given present variables. The reconstructed sequence is obtained by minimising a global constraint, here the sequence length, by dynamic programming. We present experimental results for a toy problem and for inverse kinematics of a robot arm.
inverse-problems  statistics  algorithms  learning-from-data  nudge-targets 
january 2012 by Vaguery
[1112.5794] BATMAN-an R package for the automated quantification of metabolites from NMR spectra using a Bayesian Model
Motivation: NMR spectra are widely used in metabolomics to obtain metabolite profiles in complex biological mixtures. Common methods used to assign and estimate concentrations of metabolite involve either an expert manual peak fitting or extra pre-processing steps, such as peak alignment and binning. Peak fitting is very time consuming and is subject to human error. Conversely, alignment and binning can introduce artifacts and limit immediate biological interpretation of models. Results: We present the Bayesian AuTomated Metabolite Analyser for NMR spectra (BATMAN), an R package which deconvolves peaks from 1-dimensional NMR spectra, automatically assigns them to specific metabolites and obtains concentration estimates. The Bayesian model incorporates information on characteristic peak patterns of metabolites and is able to account for shifts in the position of peaks commonly seen in NMR spectra of biological samples. It applies a Markov Chain Monte Carlo (MCMC) algorithm to sample from a joint posterior distribution of the model parameters and obtains concentration estimates with reduced mean estimation error compared with conventional numerical integration methods.
learning-from-data  statistics  modeling  biochemistry  nudge-targets  image-segmentation 
january 2012 by Vaguery
[1105.2584] Workload Classification & Software Energy Measurement for Efficient Scheduling on Private Cloud Platforms
"At present there are a number of barriers to creating an energy efficient workload scheduler for a Private Cloud based data center. Firstly, the relationship between different workloads and power consumption must be investigated. Secondly, current hardware-based solutions to providing energy usage statistics are unsuitable in warehouse scale data centers where low cost and scalability are desirable properties. In this paper we discuss the effect of different workloads on server power consumption in a Private Cloud platform. We display a noticeable difference in energy consumption when servers are given tasks that dominate various resources (CPU, Memory, Hard Disk and Network). We then use this insight to develop CloudMonitor, a software utility that is capable of >95% accurate power predictions from monitoring resource consumption of workloads, after a "training phase" in which a dynamic power model is developed."
operations-research  cloud-computing  system-administration  learning-from-data  nudge-targets 
october 2011 by Vaguery
[1107.0674] "Memory foam" approach to unsupervised learning
"We propose an alternative approach to construct an artificial learning system, which naturally learns in an unsupervised manner. Its mathematical prototype is a dynamical system, which automatically shapes its vector field in response to the input signal. The vector field converges to a gradient of a multi-dimensional probability density distribution of the input process, taken with negative sign. The most probable patterns are represented by the stable fixed points, whose basins of attraction are formed automatically. The performance of this system is illustrated with musical signals."
machine-learning  classification  learning-from-data  algorithms  nudge-targets 
august 2011 by Vaguery
[1107.0550] 3D Terrestrial LiDAR data classification of complex natural scenes using a multi-scale dimensionality criterion: applications in geomorphology
"3D point clouds of natural environments relevant to geomorphology problems (rivers, cliffs...) often require to classify the data into elementary relevant classes. A typical example is the separation of riparian vegetation from soil in fluvial environments, the distinction between fresh surfaces and rockfall in cliff environments, or more generally the classification of surfaces according to their morphology (ripples, grain size...). Natural surfaces are very heterogeneous and their distinctive properties are seldom defined at a unique scale. We have thus defined a multi-scale measure of the point cloud dimensionality around each point. The dimensionality characterizes the local 3D organization of the point cloud and varies from being 1D (points set along a line) to really taking all 3D volume, at each scale. We present the technique and illustrate its efficiency in separating riparian vegetation from ground and classifying a mountain stream in vegetation, rock, gravel and water surface. The superiority of the multi-scale analysis in enhancing class separability and spatial resolution of the classification is also demonstrated. Large scenes can be classified on a commodity laptop in a reasonable time. The technique is robust to missing data and especially shadow zones. The classification is fast and accurate and can account for some degree of intra-class morphological variability such as different vegetation types. A probabilistic confidence in the classification result is given at each point allowing the user to remove the points for which the classification is uncertain. The process can be both fully automated but also fully customized by the user including a graphical definition of the classifiers if so desired. Although developed for fully 3D data, the method can be readily applied to 2.5D airborne LiDAR data."
image-analysis  image-segmentation  learning-from-data  classification  nudge-targets 
august 2011 by Vaguery
Language Log » Straw men and Bee Science
"Let me start by saying that there's a way to take all this that makes it entirely correct. The key motive of science is explanation, and it's often essential to abstract away from the complexities of raw observation, and so on. I took courses from Chomsky as an undergraduate and a graduate student, and I'm grateful for what I learned from him, and for the eminently fair way that he always treated me. But increasingly, it seems to me, he has been elevating his personal distaste for the complexities of the real world into a systematic philosophy. To the extent that others accept these views, it excludes them from participation in (what I think are) the most promising and exciting current directions in the sciences of speech and language."
Noam-Chomsky  theory-and-practice-sitting-in-a-tree  bias  science  learning-from-data 
june 2011 by Vaguery
Falkenblog: High Frequency Trading Paper
"The point is that in fast moving markets, one needs something a little better than simple historical moving averages of daily closing prices. This is better, and extending the idea of 'volume time' vs. 'chronological time' is an intriguing direction. But one can also look at bid-ask spreads directly, or the VIX futures, or its etf, the VXX, and combinations, to gauge intraday volatility as well. Further, one can better estimate 'buy volume' using the transaction price relative to the then extant bid-ask spread, rather than if the price was weakly increasing, though this then involves syncing the trade information with quote information, and for academics such data are often hard to come by (further, quote information is often 10 times as large)."
learning-from-data  financial-engineering  trading  analytics  nudge-targets 
june 2011 by Vaguery
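Note on the entry above: the idea of estimating 'buy volume' from the transaction price relative to the prevailing bid-ask spread can be sketched roughly as follows. This is a minimal illustration of a midpoint ("quote rule") classifier, not the method of the linked paper or post; the function names and the sample trades are hypothetical, and it assumes each trade has already been synced with the quote in force at that moment, which the post notes is the hard part.

```python
def classify_trade(price, bid, ask):
    """Sign a trade against the quote midpoint.

    Returns +1 (likely buyer-initiated), -1 (likely seller-initiated),
    or 0 when the trade prints exactly at the midpoint.
    """
    mid = (bid + ask) / 2.0
    if price > mid:
        return 1
    if price < mid:
        return -1
    return 0


def buy_volume(trades):
    """Sum the size of trades classified as buyer-initiated.

    `trades` is an iterable of (price, size, bid, ask) tuples, where bid/ask
    are assumed to be the quote prevailing when the trade occurred.
    """
    return sum(size for price, size, bid, ask in trades
               if classify_trade(price, bid, ask) > 0)


# Hypothetical sample: one trade at the ask, one at the bid, one at the midpoint.
sample_trades = [
    (10.50, 300, 10.00, 10.50),
    (10.00, 200, 10.00, 10.50),
    (10.25, 100, 10.00, 10.50),
]
```

Midpoint trades are left unclassified here; a fuller treatment would need a tie-breaking rule, which this sketch deliberately omits.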
The distribution of interestingness | (R news & tutorials)
"The longer – and far less satisfying – answer to the question of how interestingness measures should be distributed is, “it depends,” as the following discussion illustrates."
statistics  interestingness  design-of-measures  statisticians-don't-do-Pragmatism-well  learning-from-data 
may 2011 by Vaguery
Evolved Analytics' DataModeler | Evolved Analytics
The technology has been developed to withstand the challenges of the real world: in addition to handling problems of too much data, too little data, correlated data, or noisy data, DataModeler respects the cost and timeliness issues associated with modeling development.
evolutionary-algorithms  genetic-programming  learning-from-data  Mathematica 
may 2011 by Vaguery
[1008.1663] Learning Residual Finite-State Automata Using Observation Tables
"We define a two-step learner for RFSAs based on an observation table by using an algorithm for minimal DFAs to build a table for the reversal of the language in question and showing that we can derive the minimal RFSA from it after some simple modifications. We compare the algorithm to two other table-based ones of which one (by Bollig et al. 2009) infers a RFSA directly, and the other is another two-step learner proposed by the author. We focus on the criterion of query complexity."
finite-state-machine  machine-learning  algorithms  nudge-targets  learning-from-data  inference 
august 2010 by Vaguery
[1003.0470] Unsupervised Supervised Learning II: Training Margin Based Classifiers without Labels
"On a more philosophical level, our approach points at novel questions that go beyond supervised and semi-supervised learning. What benefit do labels provide over unsupervised training? Can our framework be extended to semi-supervised learning where a few labels do exist? Can it be extended to non-classification scenarios such as margin based regression or margin based structured prediction? When are the assumptions likely to hold and how can we make our framework even more resistant to deviations from them? These questions and others form new and exciting open research directions."
unsupervised-learning  supervised-learning  learning-from-data  machine-learning  regression  modeling 
august 2010 by Vaguery
[0912.4473] Learning to Predict Combinatorial Structures
"The major challenge in designing a discriminative learning algorithm for predicting structured data is to address the computational issues arising from the exponential size of the output space. Existing algorithms make different assumptions to ensure efficient, polynomial time estimation of model parameters. For several combinatorial structures, including cycles, partially ordered sets, permutations and other graph classes, these assumptions do not hold. In this thesis, we address the problem of designing learning algorithms for predicting combinatorial structures by introducing two new assumptions: (i) The first assumption is that a particular counting problem can be solved efficiently. The consequence is a generalisation of the classical ridge regression for structured prediction. (ii) The second assumption is that a particular sampling problem can be solved efficiently. …"
machine-learning  prediction  combinatorics  nudge-targets  learning-from-data 
june 2010 by Vaguery
[1006.4354] Empirical Modeling of Radiative versus Magnetic Flux for the Sun-as-a-Star
"…We find that a well-defined temporal component exists and accounts for some of the variance in the data. This temporal component arises because active regions with high magnetic field strength evolve, breaking up into small-scale magnetic elements with low field strength, and radiative and magnetic fluxes are sensitive to different active-region components. We generate empirical models that relate radiative flux to magnetic flux, allowing us to predict spectral-irradiance variations from observations of disk-averaged magnetic-flux density. In most cases, the model reconstructions can account for 85-90% of the variability of the radiative flux from the chromosphere and corona. Our results are important for understanding the relationship between magnetic and radiative measures of solar and stellar variability."
astronomy  astrophysics  modeling  learning-from-data  statistics  nudge-targets 
june 2010 by Vaguery
The Berkeley Segmentation Dataset and Benchmark
"The goal of this work is to provide an empirical basis for research on image segmentation and boundary detection. To this end, we have collected 12,000 hand-labeled segmentations of 1,000 Corel dataset images from 30 human subjects. Half of the segmentations were obtained from presenting the subject with a color image; the other half from presenting a grayscale image. The public benchmark based on this data consists of all of the grayscale and color segmentations for 300 images. The images are divided into a training set of 200 images, and a test set of 100 images."
dataset  learning-from-data  training-set  machine-learning  image-segmentation  image-processing  nudge 
june 2010 by Vaguery
A Peek Into the Future: HFT and Financial News -- Seeking Alpha
"A still more realistic and subtle, but much more troublesome scenario: Financial Undetectable Journalistic Engineering (FUJE). Financial news journalists could word the reports differently and send very different signals to the robot army. Here're two actual news headlines re. the May NFP number (incidentally, both are from the same outlet, same day, different reporter -- just a random google search):

US adds 431,000 jobs in May, unemployment down to 9.7 pct
vs.

Despite Adding 431K Jobs, May Non-Farm Payroll Figures Disappoint
The first is factual; the second contains more in-depth analysis. It takes an experienced human to parse and reconcile the two. You can see how robot readers may assign opposite signs to each."
data-mining  high-frequency-trading  trading  news  learning-from-data  boy-am-I-glad-we-folded-the-startup 
june 2010 by Vaguery
[1006.1346] C-HiLasso: A Collaborative Hierarchical Sparse Modeling Framework
"Sparse modeling is a powerful framework for data analysis and processing. Traditionally, encoding in this framework is performed by solving an L1-regularized linear regression problem, commonly referred to as Lasso or Basis Pursuit. In this work we combine the sparsity-inducing property of the Lasso model at the individual feature level, with the block-sparsity property of the Group Lasso model, where sparse groups of features are jointly encoded, obtaining a sparsity pattern hierarchically structured. This results in the Hierarchical Lasso (HiLasso), which shows important practical modeling advantages.…"
numerical-methods  statistics  learning-from-data  machine-learning  image-processing  image-segmentation  nudge-targets 
june 2010 by Vaguery
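Note on the entry above: combining feature-level sparsity (Lasso) with block sparsity (Group Lasso) can be illustrated with the shrinkage step such a combined penalty induces. The sketch below assumes a penalty of the form lam1 * ||a||_1 + lam2 * sum over groups of ||a_G||_2 with non-overlapping groups, for which the proximal step is elementwise soft-thresholding followed by groupwise shrinkage. The function names are hypothetical, and this is not claimed to be the solver used in the paper.

```python
import math


def soft_threshold(v, lam):
    """Elementwise L1 shrinkage: move each entry toward zero by lam."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in v]


def group_shrink(v, lam):
    """Block (L2) shrinkage: scale the whole group toward zero, or zero it out."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= lam:
        return [0.0 for _ in v]
    scale = 1.0 - lam / norm
    return [scale * x for x in v]


def hierarchical_prox(a, groups, lam1, lam2):
    """Apply L1 shrinkage inside each group, then L2 shrinkage on the group.

    `groups` is a list of index lists assumed not to overlap; indices not
    covered by any group are left unchanged.
    """
    out = list(a)
    for idx in groups:
        block = soft_threshold([a[i] for i in idx], lam1)
        block = group_shrink(block, lam2)
        for i, value in zip(idx, block):
            out[i] = value
    return out


# Hypothetical coefficients: a strong group and a weak group.
coeffs = [4.0, 5.0, 0.5, -0.2]
shrunk = hierarchical_prox(coeffs, [[0, 1], [2, 3]], lam1=1.0, lam2=1.0)
```

With these inputs the weak group is removed entirely while the strong group survives with reduced magnitude, which is the hierarchical sparsity pattern the abstract describes: whole groups switch off, and entries within active groups can also be zeroed.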
[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and 'generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hierarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful.…Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
clustering  algorithms  statistics  models  classification  learning-from-data 
june 2010 by Vaguery
