cshalizi + learning_theory 155
Multiple dynamic representations in the motor cortex during sensorimotor learning : Nature : Nature Publishing Group
4 weeks ago by cshalizi
"The mechanisms linking sensation and action during learning are poorly understood. Layer 2/3 neurons in the motor cortex might participate in sensorimotor integration and learning; they receive input from sensory cortex and excite deep layer neurons, which control movement. Here we imaged activity in the same set of layer 2/3 neurons in the motor cortex over weeks, while mice learned to detect objects with their whiskers and report detection with licking. Spatially intermingled neurons represented sensory (touch) and motor behaviours (whisker movements and licking). With learning, the population-level representation of task-related licking strengthened. In trained mice, population-level representations were redundant and stable, despite dynamism of single-neuron representations. The activity of a subpopulation of neurons was consistent with touch driving licking behaviour. Our results suggest that ensembles of motor cortex neurons couple sensory input to multiple, related motor programs during learning."
to:NB
learning_theory
neuroscience
functional_connectivity
experimental_biology
4 weeks ago by cshalizi
Ockham's Razor: Foundations - Carnegie Mellon Center for Formal Epistemology
5 weeks ago by cshalizi
Despite my presence on the program, this should actually be really good.
"Scientific theory choice is guided by judgments of simplicity, a bias frequently referred to as "Ockham's Razor". But what is simplicity and how, if at all, does it help science find the truth? Should we view simple theories as means for obtaining accurate predictions, as classical statisticians recommend? Or should we believe the theories themselves, as Bayesian methods seem to justify? The aim of this workshop is to re-examine the foundations of Ockham's razor, with a firm focus on the connections, if any, between simplicity and truth. "
self-promotion
occams_razor
philosophy_of_science
epistemology
kelly.kevin_t.
kith_and_kin
mayo.deborah
vapnik.v.n.
sober.elliott
leeb.hannes
wasserman.larry
model_selection
statistics
complexity
machine_learning
learning_theory
grunwald.peter
"Scientific theory choice is guided by judgments of simplicity, a bias frequently referred to as "Ockham's Razor". But what is simplicity and how, if at all, does it help science find the truth? Should we view simple theories as means for obtaining accurate predictions, as classical statisticians recommend? Or should we believe the theories themselves, as Bayesian methods seem to justify? The aim of this workshop is to re-examine the foundations of Ockham's razor, with a firm focus on the connections, if any, between simplicity and truth. "
5 weeks ago by cshalizi
[math/0612776] Uniform error bounds for smoothing splines
6 weeks ago by cshalizi
"Almost sure bounds are established on the uniform error of smoothing spline estimators in nonparametric regression with random designs. Some results of Einmahl and Mason (2005) are used to derive uniform error bounds for the approximation of the spline smoother by an ``equivalent'' reproducing kernel regression estimator, as well as for proving uniform error bounds on the reproducing kernel regression estimator itself, uniformly in the smoothing parameter over a wide range. This admits data-driven choices of the smoothing parameter."
to:NB
splines
regression
nonparametrics
statistics
learning_theory
6 weeks ago by cshalizi
[math/0508275] Local Rademacher complexities
6 weeks ago by cshalizi
"We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular."
to:NB
learning_theory
rademacher_complexity
6 weeks ago by cshalizi
[1204.0147] Covering Numbers for Convex Functions
7 weeks ago by cshalizi
"In this paper we study the covering numbers of the space of convex and uniformly bounded functions in multi-dimension. We find optimal upper and lower bounds for the $epsilon$-covering number of $C([a, b]^d, B)$, in the $L_p$-metric, $1 le p < infty$, in terms of the relevant constants, where $d geq 1$, $a < b in mathbb{R}$, $B>0$, and $C([a,b]^d, B)$ denotes the set of all convex functions on $[a, b]^d$ that are uniformly bounded by $B$. We summarize previously known results on covering numbers for convex functions and also provide alternate proofs of some known results. Our results have direct implications in the study of rates of convergence of empirical minimization procedures as well as optimal convergence rates in the numerous convexity constrained function estimation problems."
to:NB
learning_theory
7 weeks ago by cshalizi
Structured Sparsity and Generalization
7 weeks ago by cshalizi
"We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels."
to:NB
learning_theory
regression
sparsity
statistics
lasso
7 weeks ago by cshalizi
Online Learning and Online Convex Optimization
8 weeks ago by cshalizi
"Online learning is a well established learning paradigm which has both theoretical and practical appeals. The goal of online learning is to make a sequence of accurate predictions given knowledge of the correct answer to previous prediction tasks and possibly additional available information. Online learning has been studied in several research fields including game theory, information theory, and machine learning. It also became of great interest to practitioners due the recent emergence of large scale applications such as online advertisement placement and online web ranking. In this survey we provide a modern overview of online learning. Our goal is to give the reader a sense of some of the interesting ideas and in particular to underscore the centrality of convexity in deriving efficient online learning algorithms. We do not mean to be comprehensive but rather to give a high-level, rigorous yet easy to follow, survey."
Ungated version (via shivak): http://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf
to:NB
online_learning
individual_sequence_prediction
optimization
learning_theory
machine_learning
learning_in_games
low-regret_learning
Ungated version (via shivak): http://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf
8 weeks ago by cshalizi
[1203.5245] Qualitative robustness of statistical functionals under strong mixing
9 weeks ago by cshalizi
"A new concept of qualitative robustness for plug-in estimators based on identically distributed possibly {em dependent} observations is introduced, and it is shown that Hampel's theorem for general metrics $d$ still holds. Since Hampel's theorem assumes the UGC property w.r.t. $d$, i.e. convergence in probability of the empirical probability measure to the true marginal distribution w.r.t. $d$ uniformly in the class of all admissible laws on the sample path space, this property is shown for a large class of strongly mixing laws for three different metrics $d$. For real-valued observations the UGC property is established for both the Kolomogorov $phi$-metric and the L'evy $psi$-metric, and for observations in a general locally compact and second countable Hausdorff space the UGC property is established for a certain metric generating the $psi$-weak topology. The key is a new uniform weak LLN for strongly mixing random variables. The latter is of independent interest and relies on Rio's maximal inequality."
to:NB
statistics
mixing
statistical_inference_for_stochastic_processes
learning_theory
9 weeks ago by cshalizi
[1005.0005] Extracting dynamical equations from experimental data is NP-hard
9 weeks ago by cshalizi
"The behavior of any physical system is governed by its underlying dynamical equations. Much of physics is concerned with discovering these dynamical equations and understanding their consequences. In this work, we show that, remarkably, identifying the underlying dynamical equation from any amount of experimental data, however precise, is a provably computationally hard problem (it is NP-hard), both for classical and quantum mechanical systems. As a by-product of this work, we give complexity-theoretic answers to both the quantum and classical embedding problems, two long-standing open problems in mathematics (the classical problem, in particular, dating back over 70 years)."
to:NB
learning_theory
physics
computational_complexity
9 weeks ago by cshalizi
[1203.3887] Learning Loopy Graphical Models with Latent Variables: Efficient Methods and Guarantees
9 weeks ago by cshalizi
"The problem of structure estimation in latent graphical models is considered, where some nodes are latent or hidden. A novel method is proposed which attempts to locally reconstruct latent trees and outputs a loopy graph structure with hidden variables. Correctness of the method is established when the underlying graph has a large girth and the model is in the regime of correlation decay, and PAC guarantees for the method are also derived. For the special case of the Ising model, the number of samples $n$ required for structural consistency scales as $n = Omega(theta_{min}^{-2delta eta(eta+1)-2}log p)$, where $theta_{min}$ is the minimum edge potential, $delta$ is the depth (i.e., distance from a hidden node to the nearest observed nodes), and $eta$ is a parameter which depends on the bounds on node and edge potentials in the Ising model. The results are further specialized for the case when the observed nodes are uniformly sampled from the model. Finally, necessary conditions for structural consistency under any algorithm are derived."
to:NB
graphical_models
learning_theory
machine_learning
statistics
ising_model
inference_to_latent_objects
9 weeks ago by cshalizi
Reliable Reasoning: Induction and Statistical Learning Theory // Reviews // Philosophical Reviews // University of Notre Dame
10 weeks ago by cshalizi
Have I really never bookmarked this before?
book_reviews
learning_theory
philosophy_of_science
kith_and_kin
kelly.kevin_t.
mayo-wilson.conor
induction
10 weeks ago by cshalizi
[1202.3482] The local geometry of finite mixtures
11 weeks ago by cshalizi
"We introduce a technique to obtain local (bracketing) metric entropy bounds for subsets of a normed vector space from global entropy bounds. Using this method, we establish that for q>=1, the class of convex combinations of q translates of a probability density has finite local doubling dimension under a smoothness assumption. The proof requires a detailed investigation of the local geometry of mixture classes, which is of independent interest."
in_NB
statistics
mixture_models
learning_theory
11 weeks ago by cshalizi
[1203.0193] Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts
12 weeks ago by cshalizi
Huh: "The Vapnik-Chervonenkis dimension (VC dimension) of the set of half-spaces of R^d with frontiers parallel to the axes is computed exactly. It is shown that this VC dimension is smaller than the intuitive value of d. An additional approximation using the Stirling's formula is given. This result may be used to evaluate the performance of classifiers or regressors based on dyadic partitioning of R^d for instance. Algorithms using axis-parallel cuts to partition R^d are often used to reduce the computational time of such estimators when d is large."
to:NB
learning_theory
vc-dimension
classifiers
12 weeks ago by cshalizi
[0806.4140] Optimal oracle inequalities for model selection
12 weeks ago by cshalizi
"Model selection is often performed by empirical risk minimization. The quality of selection in a given situation can be assessed by risk bounds, which require assumptions both on the margin and the tails of the losses used. Starting with examples from the 3 basic estimation problems, regression, classification and density estimation, we formulate risk bounds for empirical risk minimization under successively weakening conditions and prove them at a very general level, for general margin and power tail behavior of the excess losses."
in_NB
statistics
learning_theory
cross-validation
model_selection
van_de_geer.sara
12 weeks ago by cshalizi
[1202.4294] Prediction of quantiles by statistical learning and application to GDP forecasting
february 2012 by cshalizi
"In this paper, we tackle the problem of prediction and confidence intervals for time series using a statistical learning approach and quantile loss functions. In a first time, we show that the Gibbs estimator (also known as Exponentially Weighted aggregate) is able to predict as well as the best predictor in a given family for a wide set of loss functions. In particular, using the quantile loss function of Koenker and Bassett (1978), this allows to build confidence intervals. We apply these results to the problem of prediction and confidence regions for the French Gross Domestic Product (GDP) growth, with promising results."
in_NB
to_read
prediction
confidence_sets
learning_theory
re:your_favorite_dsge_sucks
re:growing_ensemble_project
february 2012 by cshalizi
[1202.4283] Fast rates in learning with dependent observations
february 2012 by cshalizi
"In this paper we tackle the problem of fast rates in time series forecasting from a statistical learning perspective. In a serie of papers (e.g. Meir 2000, Modha and Masry 1998, Alquier and Wintenberger 2012) it is shown that the main tools used in learning theory with iid observations can be extended to the prediction of time series. The main message of these papers is that, given a family of predictors, we are able to build a new predictor that predicts the series as well as the best predictor in the family, up to a remainder of order $1/sqrt{n}$. It is known that this rate cannot be improved in general. In this paper, we show that in the particular case of the least square loss, and under a strong assumption on the time series (phi-mixing) the remainder is actually of order $1/n$. Thus, the optimal rate for iid variables, see e.g. Tsybakov 2003, and individual sequences, see cite{lugosi} is, for the first time, achieved for uniformly mixing processes. We also show that our method is optimal for aggregating sparse linear combinations of predictors."
--- Assumes observations are in the interval [-B,B] and gets a bound which is O(B^3), and so useless for our purposes.
in_NB
learning_theory
mixing
ergodic_theory
re:your_favorite_dsge_sucks
re:XV_for_mixing
have_read
--- Assumes observations are in the interval [-B,B] and gets a bound which is O(B^3), and so useless for our purposes.
february 2012 by cshalizi
Weakly Universally Consistent Forecasting of Stationary and Ergodic Time Series
february 2012 by cshalizi
"Static forecasting of stationary and ergodic time series is considered, i.e., inference of the conditional expectation of the response variable at time zero given the infinite past. It is shown that the mean squared error of a combination of suitably defined localized least squares estimates converges to zero for all distributions where the response variable is square integrable."
to:NB
universal_prediction
stochastic_processes
ergodic_theory
statistical_inference_for_stochastic_processes
learning_theory
february 2012 by cshalizi
Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
february 2012 by cshalizi
"We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic.
"We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2) per trial, where n is the dimension of the instances."
to:NB
online_learning
dimension_reduction
machine_learning
learning_theory
warmuth.manfred
principal_components
low-regret_learning
"We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2) per trial, where n is the dimension of the instances."
february 2012 by cshalizi
Giné , Nickl : Rates of contraction for posterior distributions in Lr-metrics, 1 ≤ r ≤ ∞
january 2012 by cshalizi
"The frequentist behavior of nonparametric Bayes estimates, more specifically, rates of contraction of the posterior distributions to shrinking Lr-norm neighborhoods, 1 ≤ r ≤ ∞, of the unknown parameter, are studied. A theorem for nonparametric density estimation is proved under general approximation-theoretic assumptions on the prior. The result is applied to a variety of common examples, including Gaussian process, wavelet series, normal mixture and histogram priors. The rates of contraction are minimax-optimal for 1 ≤ r ≤ 2, but deteriorate as r increases beyond 2. In the case of Gaussian nonparametric regression a Gaussian prior is devised for which the posterior contracts at the optimal rate in all Lr-norms, 1 ≤ r ≤ ∞."
in_NB
bayesian_consistency
statistics
nonparametrics
learning_theory
re:bayes_as_evol
density_estimation
regression
january 2012 by cshalizi
[0712.1698] PAC-Bayesian Bounds for Randomized Empirical Risk Minimizers
january 2012 by cshalizi
"The aim of this paper is to generalize the PAC-Bayesian theorems proved by Catoni in the classification setting to more general problems of statistical inference. We show how to control the deviations of the risk of randomized estimators. A particular attention is paid to randomized estimators drawn in a small neighborhood of classical estimators, whose study leads to control the risk of the latter. These results allow to bound the risk of very general estimation procedures, as well as to perform model selection."
to:NB
learning_theory
statistics
machine_learning
january 2012 by cshalizi
Introduction to Online Optimization (Bubeck)
december 2011 by cshalizi
"to_teach" tag a sudden brainstorm for how to make next year's statistical computing class either unbeatably awesome or an absolute disaster
to:NB
online_learning
regression
individual_sequence_prediction
optimization
machine_learning
learning_theory
via:mraginsky
to_read
to_teach:statcomp
december 2011 by cshalizi
Audibert , Catoni : Robust linear least squares regression
december 2011 by cshalizi
"We consider the problem of robustly predicting as well as the best linear combination of d given functions in least squares regression, and variants of this problem including constraints on the parameters of the linear combination. For the ridge estimator and the ordinary least squares estimator, and their variants, we provide new risk bounds of order d/n without logarithmic factor unlike some standard results, where n is the size of the training data. We also provide a new estimator with better deviations in the presence of heavy-tailed noise. It is based on truncating differences of losses in a min–max framework and satisfies a d/n risk bound both in expectation and in deviations. The key common surprising factor of these results is the absence of exponential moment condition on the output distribution while achieving exponential deviations. All risk bounds are obtained through a PAC-Bayesian analysis on truncated differences of losses. Experimental results strongly back up our truncated min–max estimator."
in_NB
regression
statistics
linear_regression
learning_theory
catoni.olivier
december 2011 by cshalizi
[1111.5648] Falsification and future performance
december 2011 by cshalizi
"We information-theoretically reformulate two measures of capacity from statistical learning theory: empirical VC-entropy and empirical Rademacher complexity. We show these capacity measures count the number of hypotheses about a dataset that a learning algorithm falsifies when it finds the classifier in its repertoire minimizing empirical risk. It then follows from that the future performance of predictors on unseen data is controlled in part by how many hypotheses the learner falsifies. As a corollary we show that empirical VC-entropy quantifies the message length of the true hypothesis in the optimal code of a particular probability distribution, the so-called actual repertoire."
to:NB
to_read
information_theory
learning_theory
falsification
balduzzi.david
december 2011 by cshalizi
[1111.6410] Adaptive Semisupervised Inference
december 2011 by cshalizi
"Semisupervised methods inevitably invoke some assumption that links the marginal distribution of the features to the regression function of the label. Most commonly, the cluster or manifold assumptions are used which imply that the regression function is smooth over high-density clusters or manifolds supporting the data. A generalization of these assumptions is that the regression function is smooth with respect to some density sensitive distance. This motivates the use of a density based metric for semisupervised learning. We analyze this setting and make the following contributions - (a) we propose a semi-supervised learner that uses a density-sensitive kernel and show that it provides better performance than any supervised learner if the density support set has a small condition number and (b) we show that it is possible to adapt to the degree of semi-supervisedness using data-dependent choice of a parameter that controls sensitivity of the distance metric to the density. This ensures that the semisupervised learner never performs worse than a supervised learner even if the assumptions fail to hold."
to:NB
machine_learning
learning_theory
semi-supervised_learning
singh.aarti
wasserman.larry
december 2011 by cshalizi
[1111.6337] Regret Bound by Variation for Online Convex Optimization
december 2011 by cshalizi
"In citep{Hazan-2008-extract}, the authors showed that the regret of online linear optimization can be bounded by the total variation of the cost vectors. In this paper, we extend this result to general online convex optimization. We first analyze the limitations of the algorithm in citep{Hazan-2008-extract} when applied it to online convex optimization. We then present two algorithms for online convex optimization whose regrets are bounded by the variation of cost functions. We finally consider the bandit setting, and present a randomized algorithm for online bandit convex optimization with a variation-based regret bound. We show that the regret bound for online bandit convex optimization is optimal when the variation of cost functions is independent of the number of trials."
in_NB
to_read
re:growing_ensemble_project
learning_theory
individual_sequence_prediction
december 2011 by cshalizi
[1111.4470] Efficient Regression in Metric Spaces via Approximate Lipschitz Extension
november 2011 by cshalizi
"We present a framework for performing efficient regression in general metric spaces. Roughly speaking, our regressor predicts the value at a new point by computing a Lipschitz extension -- the smoothest function consistent with the observed data -- while performing an optimized structural risk minimization to avoid overfitting. The offline (learning) and online (inference) stages can be solved by convex programming, but this naive approach has runtime complexity O(n^3), which is prohibitive for large datasets. We design instead an algorithm that is fast when the the doubling dimension, which measures the "intrinsic" dimensionality of the metric space, is low.
We use the doubling dimension multiple times; first, on the statistical front, to bound fat-shattering dimension of the class of Lipschitz functions (and obtain risk bounds); and second, on the computational front, to quickly compute a hypothesis function and a prediction based on Lipschitz extension. Our resulting regressor is both asymptotically strongly consistent and comes with finite-sample risk bounds, while making minimal structural and noise assumptions."
in_NB
heard_the_talk
kith_and_kin
regression
learning_theory
statistics
kontorovich.aryeh
We use the doubling dimension multiple times; first, on the statistical front, to bound fat-shattering dimension of the class of Lipschitz functions (and obtain risk bounds); and second, on the computational front, to quickly compute a hypothesis function and a prediction based on Lipschitz extension. Our resulting regressor is both asymptotically strongly consistent and comes with finite-sample risk bounds, while making minimal structural and noise assumptions."
november 2011 by cshalizi
[1111.3404] Estimated VC dimension for risk bounds
november 2011 by cshalizi
"Vapnik-Chervonenkis (VC) dimension is a fundamental measure of the generalization capacity of learning algorithms. However, apart from a few special cases, it is hard or impossible to calculate analytically. Vapnik et al. [10] proposed a technique for estimating the VC dimension empirically. While their approach behaves well in simulations, it could not be used to bound the generalization risk of classifiers, because there were no bounds for the estimation error of the VC dimension itself. We rectify this omission, providing high probability concentration results for the proposed estimator and deriving corresponding generalization bounds."
self-promotion
learning_theory
vc-dimension
machine_learning
re:your_favorite_dsge_sucks
november 2011 by cshalizi
Generalization Bound for Infinitely Divisible Empirical Process
november 2011 by cshalizi
"In this paper, we study the generalization bound for an empirical process of samples independently drawn from an infinitely divisible (ID) distribution, which is termed as the ID empirical process. In particular, based on a martingale method, we develop deviation inequalities for the sequence of random variables of an ID distribution. By applying the obtained deviation inequalities, we then show the generalization bound for ID empirical process based on the annealed Vapnik- Chervonenkis (VC) entropy. Afterward, according to Sauer’s lemma, we get the generalization bound for ID empirical process based on the VC dimension. Finally, by using a resulted result bound, we analyze the asymptotic convergence of ID empirical process and show that the convergence rate of ID empirical process can reach O((frac{Lambda_mathcal{F}(2N)}{N})^{frac{1}{1.3}}) and it is faster than the results of the generic i.i.d. empirical process (Vapnik, 1999) "
in_NB
learning_theory
empirical_processes
stochastic_processes
levy_processes
martingales
re:almost_none
november 2011 by cshalizi
Improved Regret Guarantees for Online Smooth Convex Optimization with Bandit Feedback
november 2011 by cshalizi
The study of online convex optimization in the bandit setting was initiated by Kleinberg (2004) and Flaxman et al. (2005). Such a setting models a decision maker that has to make decisions in the face of adversarially chosen convex loss functions. Moreover, the only information the decision maker receives are the losses. The identity of the loss functions themselves is not revealed. In this setting, we reduce the gap between the best known lower and upper bounds for the class of smooth convex functions, i.e. convex functions with a Lipschitz continuous gradient. Building upon existing work on self-concordant regularizers and one-point gradient estimation, we give the first algorithm whose expected regret, ignoring constant and logarithmic factors, is O(T^{2/3}).
to:NB
decision_theory
learning_theory
machine_learning
bandit_problems
individual_sequence_prediction
november 2011 by cshalizi
Adaptive Bandits: Towards the Best History-Dependent Strategy
november 2011 by cshalizi
"We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Theta of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios: (1) The opponent is constrained, i.e.~he provides rewards that are stochastic functions of equivalence classes defined by some model theta* in Theta. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t.~the best strategy among all mappings from classes to actions (i.e.~the best history-class-based strategy) for the best model in Theta. This allows to model opponents (case 1) or strategies (case 2) which handles finite memory, periodicity, standard stochastic bandits and other situations. When Theta={theta}, i.e.~only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by tilde O(sqrt{TAC}), where C is the number of classes of theta. Now, when many models are available, all known algorithms achieving a nice regret O(sqrt{T}) are unfortunately not tractable and scale poorly with the number of models $|Theta|$. Our contribution here is to provide tractable algorithms with regret bounded by T^{2/3}C^{1/3}log(|Theta|)^{1/2}. "
to:NB
decision_theory
learning_theory
machine_learning
bandit_problems
individual_sequence_prediction
november 2011 by cshalizi
[1110.6886] PAC-Bayesian Inequalities for Martingales
november 2011 by cshalizi
"We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. We also present a comparison inequality that bounds expectation of a convex function of martingale difference type variables by expectation of the same function of independent Bernoulli variables. This inequality is applied to derive a tighter analog of Hoeffding-Azuma inequality." --- For the record: I hate martingales.
to:NB
re:almost_none
learning_theory
concentration_of_measure
martingales
november 2011 by cshalizi
Banff blog « An Ergodic Walk
october 2011 by cshalizi
Sounds delightful (and Banff is beautiful).
conferences
statistics
learning_theory
statistical_inference_for_stochastic_processes
track_down_references
markov_models
network_data_analysis
estimation
van_handel.ramon
nonparametrics
re:smoothing_adjacency_matrices
information_theory
october 2011 by cshalizi
[1110.3592] Information, learning and falsification
october 2011 by cshalizi
"Broadly speaking, there are two approaches to quantifying information. The first, Shannon information, takes events as belonging to ensembles and quantifies the information resulting from observing the given event in terms of the number of alternate events that have been ruled out. The second, algorithmic information or Kolmogorov complexity, takes events as strings and, given a universal Turing machine, quantifies the information content of a string as the length of the shortest program producing it. This note describes a new method of quantifying information, effective information, that links algorithmic information to Shannon information, and also links both to capacities arising in statistical learning theory. After introducing the measure, we show that it provides a non-universal analog of algorithmic information. We then apply it to derive basic capacities in statistical learning theory: empirical VC-entropy and empirical Rademacher complexity. A nice byproduct of our approach is an interpretation of the explanatory power of a learning algorithm in terms of the number of hypotheses it falsifies (counted in two different ways for the two different capacities). We also discuss how effective information relates to information gain, Shannon and mutual information."
to:NB
to_read
learning_theory
information_theory
balduzzi.david
october 2011 by cshalizi
[1110.2755] Efficient Tracking of Large Classes of Experts
october 2011 by cshalizi
"In the framework for prediction of individual sequences, sequential prediction methods are to be constructed that perform nearly as well as the best expert from a given class. We consider prediction strategies that compete with the class of switching strategies that can segment a given sequence into several blocks, and follow the advice of a different "base" expert in each block. As usual, the performance of the algorithm is measured by the regret defined as the excess loss relative to the best switching strategy %(with an arbitrary number of switches) selected in hindsight for the particular sequence to be predicted. In this paper we construct %strongly sequential (i.e., horizon-independent) prediction strategies of low computational cost for the case where the set of base experts is large. In particular we derive a family of efficient tracking algorithms that, for any prediction algorithm $A$ designed for the base class, can be implemented with time and space complexity $O(n^{gamma} log n)$ times larger than that of $A$, where $n$ is the time horizon and $gamma ge 0$ is a parameter of the algorithm. With $A$ properly chosen, our algorithm achieves a regret bound of optimal order for $gamma>0$, and only $O(log n)$ times larger than the optimal order for $gamma=0$ for all typical regret bound types we examined. For example, for predicting binary sequences with switching parameters, our method achieves the optimal $O(log n)$ regret rate with time complexity $O(n^{1+gamma}log n)$ for any $gammain (0,1)$."
to:NB
to_read
re:growing_ensemble_project
learning_theory
individual_sequence_prediction
lugosi.gabor
october 2011 by cshalizi
[1110.2529] The Generalization Ability of Online Algorithms for Dependent Data
october 2011 by cshalizi
"We study the generalization performance of arbitrary online learning algorithms trained on samples coming from a dependent source of data. We show that the generalization error of any stable online algorithm concentrates around its regret--an easily computable statistic of the online performance of the algorithm--when the underlying ergodic process is $beta$- or $phi$-mixing. We show high probability error bounds assuming the loss function is convex, and we also establish sharp convergence rates and deviation bounds for strongly convex losses and several linear prediction problems such as linear and logistic regression, least-squares SVM, and boosting on dependent data. In addition, our results have straightforward applications to stochastic optimization with dependent data, and our analysis requires only martingale convergence arguments; we need not rely on more powerful statistical tools such as empirical process theory."
in_NB
learning_theory
individual_sequence_prediction
ergodic_theory
mixing
re:growing_ensemble_project
re:XV_for_mixing
stability_of_learning
concentration_of_measure
have_read
re:your_favorite_dsge_sucks
october 2011 by cshalizi
IEEE Xplore - Computational Limits to Nonparametric Estimation for Ergodic Processes
october 2011 by cshalizi
"A new negative result for nonparametric distribution estimation of binary ergodic processes is shown. The problem of estimation of distribution with any degree of accuracy is studied. Then it is shown that for any countable class of estimators there is a zero-entropy binary ergodic process that is inconsistent with the class of estimators. Our result is different from other negative results for universal forecasting scheme of ergodic processes."
to:NB
universal_prediction
ergodic_theory
statistics
statistical_inference_for_stochastic_processes
learning_theory
october 2011 by cshalizi
A statistical physics approach for the analysis of machine learning algorithms on real data
august 2011 by cshalizi
"We combine the replica approach of statistical physics with a variational technique to make it applicable for the analysis of machine learning algorithms on real data. The method is applied to Gaussian process models and their relative, the support vector machine. We discuss the quality of our theoretical results in comparison to experiments. As a key result, we apply our theory on real world benchmark data and show its potential for practical applications by deriving approximate expressions for data averaged performance measures which hold for general data distributions and allow us to optimize the performance of the learning algorithm."
to_be_shot_after_a_fair_trial
machine_learning
statistical_mechanics
learning_theory
in_NB
august 2011 by cshalizi
[1012.0621] The Convex Geometry of Linear Inverse Problems
august 2011 by cshalizi
For the reading group on Wednesday
learning_theory
inverse_problems
sparsity
convexity
to_read
to:NB
august 2011 by cshalizi
[1108.0775] Optimization with Sparsity-Inducing Penalties
august 2011 by cshalizi
"Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. They were first dedicated to linear variable selection but numerous extensions have now emerged such as structured sparsity or kernel selection. It turns out that many of the related estimation problems can be cast as convex optimization problems by regularizing the empirical risk with appropriate non-smooth norms. The goal of this paper is to present from a general perspective optimization tools and techniques dedicated to such sparsity-inducing penalties. We cover proximal methods, block-coordinate descent, reweighted $\ell_2$-penalized techniques, working-set and homotopy methods, as well as non-convex formulations and extensions, and provide an extensive set of experiments to compare various algorithms from a computational point of view."
learning_theory
sparsity
optimization
lasso
to:NB
machine_learning
statistics
august 2011 by cshalizi
[1107.4080] On the Universality of Online Mirror Descent
july 2011 by cshalizi
"We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee."
optimization
learning_theory
online_learning
low-regret_learning
july 2011 by cshalizi
Differentially Private Empirical Risk Minimization
april 2011 by cshalizi
"Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). ... a new method, objective perturbation, for privacy-preserving machine learning algorithm design ... perturbing the objective function before optimizing ...our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. ... a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees ... encouraging results from ... performance on real demographic and benchmark data sets... objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance."
learning_theory
stability_of_learning
privacy
sarwate.anand
to:NB
april 2011 by cshalizi
Cross-Validation and Mean-Square Stability
march 2011 by cshalizi
It's a little boggling that they don't cite any of the modern (2000--) work on theoretical properties of CV, but oh well...
cross-validation
learning_theory
stability_of_learning
statistics
re:your_favorite_dsge_sucks
re:XV_for_mixing
re:XV_for_networks
to_read
via:nikete
march 2011 by cshalizi
[1102.5750] Neyman-Pearson classification, convexity and stochastic constraints
march 2011 by cshalizi
"Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its probability of type I error is below a pre-specified level and (ii), it has probability of type II error close to the minimum possible. The proposed classifier is obtained by solving an optimization problem with an empirical objective and an empirical constraint. New techniques to handle such problems are developed and have consequences on chance constrained programming."
Final version: http://jmlr.csail.mit.edu/papers/v12/rigollet11a.html
in_NB
learning_theory
statistics
hypothesis_testing
convexity
machine_learning
optimization
have_read
rigollet.philippe
Final version: http://jmlr.csail.mit.edu/papers/v12/rigollet11a.html
march 2011 by cshalizi
[0810.4752] Statistical Learning Theory: Models, Concepts, and Results
january 2011 by cshalizi
"Statistical learning theory provides the theoretical basis for many of today's machine learning algorithms. In this article we attempt to give a gentle, non-technical overview over the key ideas and insights of statistical learning theory. We target at a broad audience, not necessarily machine learning researchers. This paper can serve as a starting point for people who want to get an overview on the field before diving into technical details."
learning_theory
machine_learning
statistics
january 2011 by cshalizi
[1011.3396] PAC-Bayesian aggregation and multi-armed bandits
november 2010 by cshalizi
"This habilitation thesis presents several contributions to (1) the PAC-Bayesian analysis of statistical learning, (2) the three aggregation problems: given d functions, how to predict as well as (i) the best of these d functions (model selection type aggregation), (ii) the best convex combination of these d functions, (iii) the best linear combination of these d functions, (3) the multi-armed bandit problems."
statistics
learning_theory
pac-bayesian
model_selection
ensemble_methods
to:NB
november 2010 by cshalizi
Giné , Nickl : Adaptive estimation of a distribution function and its density in sup-norm loss by wavelet and spline projections
november 2010 by cshalizi
"Given an i.i.d. sample from a distribution F on ℝ with uniformly continuous density p0, purely data-driven estimators are constructed that efficiently estimate F in sup-norm loss and simultaneously estimate p0 at the best possible rate of convergence over Hölder balls, also in sup-norm loss. The estimators are obtained by applying a model selection procedure close to Lepski’s method with random thresholds to projections of the empirical measure onto spaces spanned by wavelets or B-splines. The random thresholds are based on suprema of Rademacher processes indexed by wavelet or spline projection kernels. This requires Bernstein-type analogs of the inequalities in Koltchinskii [Ann. Statist. 34 (2006) 2593–2656] for the deviation of suprema of empirical processes from their Rademacher symmetrizations."
learning_theory
density_estimation
empirical_processes
november 2010 by cshalizi
Learnability, Stability, and Uniform Convergence
november 2010 by cshalizi
"characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. ... the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression."
learning_theory
stability_of_learning
have_read
re:your_favorite_dsge_sucks
november 2010 by cshalizi
[1010.2286] Divergence-based characterization of fundamental limitations of adaptive dynamical systems
october 2010 by cshalizi
"general problem of adaptively controlling and/or identifying a stochastic dynamical system, where our {\em a priori} knowledge allows us to place the system in a subset of a metric space (the uncertainty set). We present an information-theoretic meta-theorem that captures the trade-off between the metric complexity (or richness) of the uncertainty set, the amount of information acquired online in the process of controlling and observing the system, and the residual uncertainty remaining after the observations have been collected. Following the approach of Zames, we quantify {\em a priori} information by the Kolmogorov (metric) entropy of the uncertainty set, while the information acquired online is expressed as a sum of information divergences. The general theory is used to derive new minimax lower bounds on the metric identification error, as well as to give a simple derivation of the minimum time needed to stabilize an uncertain stochastic linear system."
systems_identification
control_theory
estimation
learning_theory
minimax
heard_the_talk
raginsky.maxim
to:NB
october 2010 by cshalizi
[1009.4434] The universal Glivenko-Cantelli property
september 2010 by cshalizi
I wonder if this could be used to give a more comprehensible sufficient condition for posterior convergence in the setting of my Bayes-is-a-special-case-of-replicator-dynamics paper?
learning_theory
probability
glivenko-cantelli
re:bayes_as_evol
have_read
re:almost_none
september 2010 by cshalizi
[1009.3896] Smoothness, Low-Noise and Fast Rates
september 2010 by cshalizi
Sounded more interesting from the abstract, but still worthwhile.
learning_theory
re:your_favorite_dsge_sucks
have_read
september 2010 by cshalizi
Typical, just typical « The Information Structuralist
september 2010 by cshalizi
Maxim blogs about his own new paper, and about the "Coordination Capacity" paper, so I don't have to.
information_theory
method_of_types
learning_theory
raginsky.maxim
september 2010 by cshalizi
The Information Structuralist
september 2010 by cshalizi
Maxim's new blog. Probably of interest if mine is, or this feed is.
raginsky.maxim
information_theory
cybernetics
control_theory
stochastic_processes
kith_and_kin
via:ded-maxim
learning_theory
september 2010 by cshalizi
Risk Bounds for Levy Processes in the PAC-Learning Framework
august 2010 by cshalizi
"Levy processes play an important role in the stochastic process theory. However, since samples are non-i.i.d., statistical learning results based on the i.i.d. scenarios cannot be utilized to study the risk bounds for Levy processes. In this paper, we present risk bounds for non-i.i.d. samples drawn from Levy processes in the PAC-learning framework. In particular, by using a concentration inequality for infinitely divisible distributions, we first prove that the function of risk error is Lipschitz continuous with a high probability, and then by using a specific concentration inequality for Levy processes, we obtain the risk bounds for non-i.i.d. samples drawn from Levy processes without Gaussian components. Based on the resulted risk bounds, we analyze the factors that affect the convergence of the risk bounds and then prove the convergence."
learning_theory
stochastic_processes
levy_processes
statistical_inference_for_stochastic_processes
via:djm1107
august 2010 by cshalizi
[1007.4037] Uniform Approximation and Bracketing Properties of VC classes
july 2010 by cshalizi
"We show that the sets in a family with finite VC dimension can be uniformly approximated within a given error by a finite partition. Immediate corollaries include the fact that VC classes have finite bracketing numbers, satisfy uniform laws of averages under strong dependence, and exhibit uniform mixing. Our results are based on recent work concerning uniform laws of averages for VC classes under ergodic sampling." I.e., they repeat their previous trick: if you don't have uniform approximation, they can construct arbitrarily large shatterable sets.
learning_theory
ergodic_theory
nobel.andrew
adams.terrence
have_read
july 2010 by cshalizi
Extracting certainty from uncertainty: regret bounded by variation in costs
july 2010 by cshalizi
Cool, but not sure it works really for our setting.
learning_theory
individual_sequence_prediction
ensemble_methods
re:growing_ensemble_project
have_read
july 2010 by cshalizi
related tags
aaronson.scott ⊕ adams.terrence ⊕ analysis ⊕ asymptotics ⊕ balduzzi.david ⊕ bandit_problems ⊕ bayesianism ⊕ bayesian_consistency ⊕ biau.gerard ⊕ blanchard.gilles ⊕ blogged ⊕ books:noted ⊕ books:recommended ⊕ book_reviews ⊕ boosting ⊕ bootstrap ⊕ boucheron.stephane ⊕ bousquet.olivier ⊕ breiman.leo ⊕ calibration ⊕ carnegie_mellon ⊕ catoni.olivier ⊕ central_limit_theorem ⊕ classifiers ⊕ clustering ⊕ cognitive_science ⊕ community_discovery ⊕ complexity ⊕ compressed_sensing ⊕ computational_complexity ⊕ computational_statistics ⊕ concentration_of_measure ⊕ conferences ⊕ confidence_sets ⊕ consistency ⊕ control_theory ⊕ convexity ⊕ cross-validation ⊕ cybernetics ⊕ data_mining ⊕ decision_theory ⊕ dembo.amir ⊕ density_estimation ⊕ deviation_bounds ⊕ deviation_inequalities ⊕ devroye.luc ⊕ dimension_reduction ⊕ diversity ⊕ economics ⊕ empirical_processes ⊕ ensemble_methods ⊕ epistemology ⊕ ergodic_theory ⊕ estimation ⊕ evolutionary_search ⊕ exchangeable_sequences ⊕ experimental_biology ⊕ experimental_psychology ⊕ exploitation-exploration_tradeoff ⊕ falsification ⊕ first_draw_your_curve_then_plot_your_data ⊕ freund.yoav ⊕ functional_connectivity ⊕ function_approximation ⊕ funny:geeky ⊕ glivenko-cantelli ⊕ grants ⊕ graphical_models ⊕ grunwald.peter ⊕ harman.gilbert ⊕ have_read ⊕ heard_the_talk ⊕ hierarchical_structure ⊕ hilbert_space ⊕ histograms ⊕ hypothesis_testing ⊕ individual_sequence_prediction ⊕ induction ⊕ inference_to_latent_objects ⊕ information_criteria ⊕ information_theory ⊕ inverse_problems ⊕ in_NB ⊕ ising_model ⊕ jiang.wenxin ⊕ jordan.michael_i. ⊕ k-means ⊕ kakade.sham ⊕ kearns.michael ⊕ kelly.kevin_t. ⊕ kernel_methods ⊕ kith_and_kin ⊕ kontorovich.aryeh ⊕ kulkarni.sanjeev ⊕ large_deviations ⊕ lasso ⊕ learning_in_games ⊕ learning_theory ⊖ leeb.hannes ⊕ levy_processes ⊕ linear_regression ⊕ liu.han ⊕ low-regret_learning ⊕ lugosi.gabor ⊕ machine_learning ⊕ macroeconomics ⊕ manifold_learning ⊕ markov_models ⊕ martingales ⊕ massart.pascal ⊕ mayo-wilson.conor ⊕ mayo.deborah ⊕ measure_theory ⊕ methodology ⊕ method_of_types ⊕ minimax ⊕ mixing ⊕ mixture_models ⊕ model_selection ⊕ nadler.boaz ⊕ network_data_analysis ⊕ neural_modeling ⊕ neural_networks ⊕ neuroscience ⊕ niyogi.partha ⊕ nobel.andrew ⊕ nonparametrics ⊕ occams_razor ⊕ online_learning ⊕ optimization ⊕ oracle_inequalities ⊕ pac-bayesian ⊕ paradoxes ⊕ perturbation_theory ⊕ phase_transitions ⊕ philosophy_of_science ⊕ photos ⊕ physics ⊕ predesignation ⊕ prediction ⊕ principal_components ⊕ privacy ⊕ probability ⊕ probably_approximately_correct ⊕ quantum_computing_since_democritus ⊕ quantum_mechanics ⊕ rademacher_complexity ⊕ raginsky.maxim ⊕ random_forests ⊕ re:almost_none ⊕ re:AoS_project ⊕ re:bayes_as_evol ⊕ re:growing_ensemble_project ⊕ re:naive-semi-supervised ⊕ re:phil-of-bayes_paper ⊕ re:smoothing_adjacency_matrices ⊕ re:stacs ⊕ re:XV_for_mixing ⊕ re:XV_for_networks ⊕ re:your_favorite_dsge_sucks ⊕ regression ⊕ relational_learning ⊕ resampling ⊕ review_papers ⊕ rigollet.philippe ⊕ risk_vs_uncertainty ⊕ sarwate.anand ⊕ self-centered ⊕ self-promotion ⊕ semi-supervised_learning ⊕ singh.aarti ⊕ smola.alex ⊕ sober.elliott ⊕ sparsity ⊕ splines ⊕ stability_of_learning ⊕ state-space_models ⊕ statistical_inference_for_stochastic_processes ⊕ statistical_learning ⊕ statistical_mechanics ⊕ statistics ⊕ stochastic_differential_equations ⊕ stochastic_processes ⊕ structural_risk_minimization ⊕ support_vector_machines ⊕ systems_identification ⊕ thesis_proposals ⊕ time_series ⊕ to:blog ⊕ to:NB ⊕ to_be_shot_after_a_fair_trial ⊕ to_read ⊕ to_teach:data-mining ⊕ to_teach:statcomp ⊕ to_teach:undergrad-ADA ⊕ track_down_references ⊕ universal_prediction ⊕ van_der_vaart.aad ⊕ van_de_geer.sara ⊕ van_handel.ramon ⊕ vapnik.v.n. ⊕ vapnik.vladimir ⊕ vazirani.umesh ⊕ vc-dimension ⊕ via:? ⊕ via:arthegall ⊕ via:ded-maxim ⊕ via:djm1107 ⊕ via:klk ⊕ via:mraginsky ⊕ via:mreid ⊕ via:nikete ⊕ via:scotte ⊕ via:shivak ⊕ vishwanathan.s.v.n. ⊕ warmuth.manfred ⊕ wasserman.larry ⊕ weak_dependence ⊕ wellner.jon ⊕ zhang.tong ⊕Copy this bookmark: