**cshalizi + to_teach:statcomp + optimization**

Optimization Models | Cambridge University Press

october 2014 by cshalizi

"Emphasizing practical understanding over the technicalities of specific algorithms, this elegant textbook is an accessible introduction to the field of optimization, focusing on powerful and reliable convex optimization techniques. Students and practitioners will learn how to recognize, simplify, model and solve optimization problems - and apply these principles to their own projects. A clear and self-contained introduction to linear algebra demonstrates core mathematical concepts in a way that is easy to follow, and helps students to understand their practical relevance. Requiring only a basic understanding of geometry, calculus, probability and statistics, and striking a careful balance between accessibility and rigor, it enables students to quickly understand the material, without being overwhelmed by complex mathematics. Accompanied by numerous end-of-chapter problems, an online solutions manual for instructors, and relevant examples from diverse fields including engineering, data science, economics, finance, and management, this is the perfect introduction to optimization for undergraduate and graduate students."

in_NB
optimization
convexity
books:noted
to_teach:statcomp
to_teach:freshman_seminar_on_optimization

CRAN - Package alabama

december 2013 by cshalizi

"Augmented Lagrangian Adaptive Barrier Minimization Algorithm for optimizing smooth nonlinear objective functions with constraints. Linear or nonlinear equality and inequality constraints are allowed."

- Well, that solved my problem.
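- The augmented-Lagrangian idea in a toy Python sketch (equality constraint only; this is the generic method, not alabama's adaptive-barrier variant, and every name and constant here is made up for illustration):

```python
import numpy as np

def f(x):                 # objective: ||x||^2
    return x @ x

def h(x):                 # equality constraint h(x) = 0: x1 + x2 = 1
    return x.sum() - 1.0

def auglag_solve(x0, rho=10.0, outer=20, inner=200, lr=0.01):
    """Minimize f subject to h(x) = 0 by alternating approximate
    unconstrained minimization of the augmented Lagrangian
    f(x) + lam*h(x) + (rho/2)*h(x)^2 with a multiplier update."""
    x, lam = x0.copy(), 0.0
    for _ in range(outer):
        for _ in range(inner):          # inner loop: plain gradient descent
            grad = 2 * x + (lam + rho * h(x)) * np.ones_like(x)
            x -= lr * grad
        lam += rho * h(x)               # dual (multiplier) update
    return x

x_opt = auglag_solve(np.zeros(2))       # exact answer is (0.5, 0.5)
```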

optimization
R
to_teach:statcomp

[1310.2059] Distributed Coordinate Descent Method for Learning with Big Data

october 2013 by cshalizi

"In this paper we develop and analyze Hydra: HYbriD cooRdinAte descent method for solving loss minimization problems with big data. We initially partition the coordinates (features) and assign each partition to a different node of a cluster. At every iteration, each node picks a random subset of the coordinates from those it owns, independently from the other computers, and in parallel computes and applies updates to the selected coordinates based on a simple closed-form formula. We give bounds on the number of iterations sufficient to approximately solve the problem with high probability, and show how it depends on the data and on the partitioning. We perform numerical experiments with a LASSO instance described by a 3TB matrix."
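- The single-machine core of the method, sketched in Python: randomized coordinate descent for least squares, where each update is the closed-form exact minimizer along one coordinate. (The paper's actual contribution — partitioning coordinates across cluster nodes and updating them in parallel — is elided here.)

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)

def coord_descent(A, b, iters=2000):
    """Randomized coordinate descent for (1/2)||Ax - b||^2."""
    n, p = A.shape
    x = np.zeros(p)
    r = A @ x - b                       # residual, maintained incrementally
    col_sq = (A ** 2).sum(axis=0)       # per-column squared norms
    for _ in range(iters):
        j = rng.integers(p)             # pick a coordinate at random
        step = A[:, j] @ r / col_sq[j]  # exact line minimum along e_j
        x[j] -= step
        r -= step * A[:, j]
    return x

x_cd = coord_descent(A, b)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]   # direct solve, for comparison
```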

to:NB
optimization
high-dimensional_statistics
computational_statistics
statistics
lasso
to_teach:statcomp

Convex Optimization in R

july 2013 by cshalizi

"Convex optimization now plays an essential role in many facets of statistics. We briefly survey some recent developments and describe some implementations of some methods in R."

- Really, there's that little support for semi-definite programming?

to:NB
optimization
convexity
statistics
to_teach:statcomp
have_read
re:small-area_estimation_by_smoothing

[1306.3574] Early stopping and non-parametric regression: An optimal data-dependent stopping rule

june 2013 by cshalizi

"The strategy of early stopping is a regularization technique based on choosing a stopping time for an iterative algorithm. Focusing on non-parametric regression in a reproducing kernel Hilbert space, we analyze the early stopping strategy for a form of gradient-descent applied to the least-squares loss function. We propose a data-dependent stopping rule that does not involve hold-out or cross-validation data, and we prove upper bounds on the squared error of the resulting function estimate, measured in either the $L^2(P)$ or $L^2(P_n)$ norm. These upper bounds lead to minimax-optimal rates for various kernel classes, including Sobolev smoothness classes and other forms of reproducing kernel Hilbert spaces. We show through simulation that our stopping rule compares favorably to two other stopping rules, one based on hold-out data and the other based on Stein's unbiased risk estimate. We also establish a tight connection between our early stopping strategy and the solution path of a kernel ridge regression estimator."
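- The setting in a small Python sketch: gradient descent on the least-squares loss in an RKHS, stopped early. For simplicity the sketch stops via a hold-out set — exactly the device the paper's data-dependent rule avoids — so treat it as an illustration of the iterative path, not of their rule; all the constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(X, Z, gamma=10.0):
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

# toy 1-d regression, split into training and hold-out parts
x = rng.uniform(size=80)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=80)
xt, yt, xv, yv = x[:60], y[:60], x[60:], y[60:]

K, Kv = rbf(xt, xt), rbf(xv, xt)
eta = len(yt) / np.linalg.norm(K, 2)   # step size <= n/lambda_max is stable
alpha = np.zeros(len(xt))
best_err, best_alpha = np.inf, alpha.copy()
for t in range(500):
    alpha -= eta * (K @ alpha - yt) / len(yt)   # RKHS gradient step
    err = np.mean((Kv @ alpha - yv) ** 2)       # hold-out risk of iterate t
    if err < best_err:
        best_err, best_alpha = err, alpha.copy()
```

Each iterate along this path behaves like a kernel ridge estimate with a steadily shrinking penalty, which is the connection the paper makes precise.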

in_NB
optimization
kernel_estimators
hilbert_space
nonparametrics
regression
minimax
yu.bin
wainwright.martin_j.
to_teach:statcomp
have_read

[1306.2119] Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

june 2013 by cshalizi

"We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which includes machine learning methods based on the minimization of the empirical risk. We focus on problems without strong convexity, for which all previously known algorithms achieve a convergence rate for function values of O(1/n^{1/2}). We consider and analyze two algorithms that achieve a rate of O(1/n) for classical supervised learning problems. For least-squares regression, we show that averaged stochastic gradient descent with constant step-size achieves the desired rate. For logistic regression, this is achieved by a simple novel stochastic gradient algorithm that (a) constructs successive local quadratic approximations of the loss functions, while (b) preserving the same running time complexity as stochastic gradient descent. For these algorithms, we provide a non-asymptotic analysis of the generalization error (in expectation, and also in high probability for least-squares), and run extensive experiments on standard machine learning benchmarks showing that they often outperform existing approaches."
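- The least-squares half of the result, sketched in Python: constant-step-size SGD with Polyak-Ruppert averaging of the iterates (a generic sketch, not the authors' code; the step size is an arbitrary safe choice for this toy design):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5000, 5
w_true = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = X @ w_true + 0.1 * rng.normal(size=n)

def averaged_sgd(X, y, step=0.05):
    """One pass of constant-step SGD on the squared loss; the returned
    estimate is the running average of the iterates, not the last one."""
    n, p = X.shape
    w = np.zeros(p)
    w_bar = np.zeros(p)
    for i in range(n):
        g = (X[i] @ w - y[i]) * X[i]     # unbiased stochastic gradient
        w -= step * g
        w_bar += (w - w_bar) / (i + 1)   # running (Polyak-Ruppert) average
    return w_bar

w_hat = averaged_sgd(X, y)
```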

in_NB
optimization
learning_theory
statistics
estimation
stochastic_approximation
to_teach:statcomp

[1306.1840] Loss-Proportional Subsampling for Subsequent ERM

june 2013 by cshalizi

"We propose a sampling scheme suitable for reducing a data set prior to selecting a hypothesis with minimum empirical risk. The sampling only considers a subset of the ultimate (unknown) hypothesis set, but can nonetheless guarantee that the final excess risk will compare favorably with utilizing the entire original data set. We demonstrate the practical benefits of our approach on a large dataset which we subsample and subsequently fit with boosted trees."

- To_teach is speculative. The trick is to pick some easy-to-compute hypothesis which can be applied to the whole data set, preferentially sample the points with high loss under this pilot model, and then do importance weighting. The excess risk they get for a sub-sample of size m from n observations is O(n^{-0.5}) + O(m^{-0.75}), as opposed to O(m^{-0.5}) for just naively drawing a sub-sample. I don't think they ever compare this to e.g. stochastic gradient descent.
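- The trick described above, in a Python sketch (linear regression stands in for the paper's boosted trees; the 50/50 smoothing of the sampling probabilities is an arbitrary choice to keep them bounded away from zero):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 20000, 5, 500
w_true = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = X @ w_true + 0.5 * rng.normal(size=n)

# 1. cheap pilot hypothesis, fit on a small uniform subsample
pilot_idx = rng.choice(n, 200, replace=False)
w_pilot = np.linalg.lstsq(X[pilot_idx], y[pilot_idx], rcond=None)[0]

# 2. preferentially sample the points with high loss under the pilot
loss = (y - X @ w_pilot) ** 2
prob = 0.5 * loss / loss.sum() + 0.5 / n   # mix with uniform; sums to 1
idx = rng.choice(n, m, replace=True, p=prob)

# 3. importance-weighted ERM on the subsample (weights undo the bias)
wts = 1.0 / (n * prob[idx])
sw = np.sqrt(wts)
w_hat = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)[0]
```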

in_NB
estimation
optimization
computational_statistics
statistics
learning_theory
to_teach:statcomp
have_read

[1212.4174] Feature Clustering for Accelerating Parallel Coordinate Descent

december 2012 by cshalizi

"Large-scale L1-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for L1-regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that the algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale L1-regularization problems."
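- The serial building block these parallel schemes share, sketched in Python: cyclic coordinate descent for the lasso, where each coordinate update is an exact soft-thresholding step (the block-greedy and feature-clustering machinery of the paper is not shown):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(A, b, lam, sweeps=200):
    """Cyclic coordinate descent for (1/2)||Ax - b||^2 + lam*||x||_1."""
    n, p = A.shape
    x = np.zeros(p)
    r = b - A @ x                        # residual, maintained incrementally
    col_sq = (A ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(p):
            old = x[j]
            rho = A[:, j] @ r + col_sq[j] * old   # partial-residual corr.
            x[j] = soft_threshold(rho, lam) / col_sq[j]
            r += (old - x[j]) * A[:, j]
    return x

rng = np.random.default_rng(4)
A = rng.normal(size=(100, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true + 0.1 * rng.normal(size=100)
x_hat = lasso_cd(A, b, lam=5.0)          # recovers the 3-sparse signal
```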

to:NB
optimization
lasso
regression
machine_learning
to_teach:statcomp

On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization

october 2012 by cshalizi

"It is shown that steepest descent and Newton's method for unconstrained nonconvex optimization under standard assumptions may both require a number of iterations and function evaluations arbitrarily close to O(epsilon^−2) to drive the norm of the gradient below epsilon. This shows that the upper bound of O(epsilon^−2) evaluations known for steepest descent is tight, and that Newton's method may be as slow as steepest descent in the worst case. The improved evaluation-complexity bound of O(epsilon^−3/2) evaluations known for cubically regularised Newton methods is also shown to be tight."

- Not really "to teach", but background...

in_NB
optimization
to_teach:statcomp

[1206.4602] Quasi-Newton Methods: A New Direction

june 2012 by cshalizi

"Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors."

--- Final conference version: http://jmlr.csail.mit.edu/papers/v14/hennig13a.html
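- For reference, the classical object being reinterpreted: a textbook BFGS iteration with Armijo backtracking, in Python (a generic sketch, not the paper's method; the inverse-Hessian estimate H is the "local quadratic model" the abstract reads as regression):

```python
import numpy as np

def rosenbrock(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def rosen_grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
        200 * (x[1] - x[0] ** 2),
    ])

def bfgs(f, grad, x0, iters=200, tol=1e-8):
    x, H = x0.astype(float), np.eye(len(x0))
    g = grad(x)
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                        # quasi-Newton search direction
        t = 1.0
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):
            t *= 0.5                      # Armijo backtracking line search
        s = t * d
        x_new = x + s
        g_new = grad(x_new)
        y_vec = g_new - g
        sy = s @ y_vec
        if sy > 1e-12:                    # curvature condition keeps H SPD
            rho = 1.0 / sy
            I = np.eye(len(x))
            H = (I - rho * np.outer(s, y_vec)) @ H \
                @ (I - rho * np.outer(y_vec, s)) + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

x_min = bfgs(rosenbrock, rosen_grad, np.array([-1.2, 1.0]))
```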

in_NB
optimization
machine_learning
learning_theory
to_teach:statcomp

Introduction to Online Optimization (Bubeck)

december 2011 by cshalizi

- The "to_teach" tag records a sudden brainstorm for how to make next year's statistical computing class either unbeatably awesome or an absolute disaster.

in_NB
online_learning
regression
individual_sequence_prediction
optimization
machine_learning
learning_theory
via:mraginsky
to_read
to_teach:statcomp
re:freshman_seminar_on_optimization
