cshalizi + learning_theory   155

Multiple dynamic representations in the motor cortex during sensorimotor learning : Nature : Nature Publishing Group
"The mechanisms linking sensation and action during learning are poorly understood. Layer 2/3 neurons in the motor cortex might participate in sensorimotor integration and learning; they receive input from sensory cortex and excite deep layer neurons, which control movement. Here we imaged activity in the same set of layer 2/3 neurons in the motor cortex over weeks, while mice learned to detect objects with their whiskers and report detection with licking. Spatially intermingled neurons represented sensory (touch) and motor behaviours (whisker movements and licking). With learning, the population-level representation of task-related licking strengthened. In trained mice, population-level representations were redundant and stable, despite dynamism of single-neuron representations. The activity of a subpopulation of neurons was consistent with touch driving licking behaviour. Our results suggest that ensembles of motor cortex neurons couple sensory input to multiple, related motor programs during learning."
to:NB  learning_theory  neuroscience  functional_connectivity  experimental_biology 
4 weeks ago by cshalizi
Ockham's Razor: Foundations - Carnegie Mellon Center for Formal Epistemology
Despite my presence on the program, this should actually be really good.

"Scientific theory choice is guided by judgments of simplicity, a bias frequently referred to as "Ockham's Razor". But what is simplicity and how, if at all, does it help science find the truth?  Should we view simple theories as means for obtaining accurate predictions, as classical statisticians recommend?  Or should we believe the theories themselves, as Bayesian methods seem to justify?  The aim of this workshop is to re-examine the foundations of Ockham's razor, with a firm focus on the connections, if any, between simplicity and truth. "
self-promotion  occams_razor  philosophy_of_science  epistemology  kelly.kevin_t.  kith_and_kin  mayo.deborah  vapnik.v.n.  sober.elliott  leeb.hannes  wasserman.larry  model_selection  statistics  complexity  machine_learning  learning_theory  grunwald.peter 
5 weeks ago by cshalizi
[math/0612776] Uniform error bounds for smoothing splines
"Almost sure bounds are established on the uniform error of smoothing spline estimators in nonparametric regression with random designs. Some results of Einmahl and Mason (2005) are used to derive uniform error bounds for the approximation of the spline smoother by an ``equivalent'' reproducing kernel regression estimator, as well as for proving uniform error bounds on the reproducing kernel regression estimator itself, uniformly in the smoothing parameter over a wide range. This admits data-driven choices of the smoothing parameter."
to:NB  splines  regression  nonparametrics  statistics  learning_theory 
6 weeks ago by cshalizi
[math/0508275] Local Rademacher complexities
"We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular."
to:NB  learning_theory  rademacher_complexity 
6 weeks ago by cshalizi
[1204.0147] Covering Numbers for Convex Functions
"In this paper we study the covering numbers of the space of convex and uniformly bounded functions in multi-dimension. We find optimal upper and lower bounds for the $epsilon$-covering number of $C([a, b]^d, B)$, in the $L_p$-metric, $1 le p < infty$, in terms of the relevant constants, where $d geq 1$, $a < b in mathbb{R}$, $B>0$, and $C([a,b]^d, B)$ denotes the set of all convex functions on $[a, b]^d$ that are uniformly bounded by $B$. We summarize previously known results on covering numbers for convex functions and also provide alternate proofs of some known results. Our results have direct implications in the study of rates of convergence of empirical minimization procedures as well as optimal convergence rates in the numerous convexity constrained function estimation problems."
to:NB  learning_theory 
7 weeks ago by cshalizi
Structured Sparsity and Generalization
"We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels."
to:NB  learning_theory  regression  sparsity  statistics  lasso 
7 weeks ago by cshalizi
Online Learning and Online Convex Optimization
"Online learning is a well established learning paradigm which has both theoretical and practical appeals. The goal of online learning is to make a sequence of accurate predictions given knowledge of the correct answer to previous prediction tasks and possibly additional available information. Online learning has been studied in several research fields including game theory, information theory, and machine learning. It also became of great interest to practitioners due the recent emergence of large scale applications such as online advertisement placement and online web ranking. In this survey we provide a modern overview of online learning. Our goal is to give the reader a sense of some of the interesting ideas and in particular to underscore the centrality of convexity in deriving efficient online learning algorithms. We do not mean to be comprehensive but rather to give a high-level, rigorous yet easy to follow, survey."

Ungated version (via shivak): http://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf
to:NB  online_learning  individual_sequence_prediction  optimization  learning_theory  machine_learning  learning_in_games  low-regret_learning 
8 weeks ago by cshalizi
[1203.5245] Qualitative robustness of statistical functionals under strong mixing
"A new concept of qualitative robustness for plug-in estimators based on identically distributed possibly {em dependent} observations is introduced, and it is shown that Hampel's theorem for general metrics $d$ still holds. Since Hampel's theorem assumes the UGC property w.r.t. $d$, i.e. convergence in probability of the empirical probability measure to the true marginal distribution w.r.t. $d$ uniformly in the class of all admissible laws on the sample path space, this property is shown for a large class of strongly mixing laws for three different metrics $d$. For real-valued observations the UGC property is established for both the Kolomogorov $phi$-metric and the L'evy $psi$-metric, and for observations in a general locally compact and second countable Hausdorff space the UGC property is established for a certain metric generating the $psi$-weak topology. The key is a new uniform weak LLN for strongly mixing random variables. The latter is of independent interest and relies on Rio's maximal inequality."
to:NB  statistics  mixing  statistical_inference_for_stochastic_processes  learning_theory 
9 weeks ago by cshalizi
[1005.0005] Extracting dynamical equations from experimental data is NP-hard
"The behavior of any physical system is governed by its underlying dynamical equations. Much of physics is concerned with discovering these dynamical equations and understanding their consequences. In this work, we show that, remarkably, identifying the underlying dynamical equation from any amount of experimental data, however precise, is a provably computationally hard problem (it is NP-hard), both for classical and quantum mechanical systems. As a by-product of this work, we give complexity-theoretic answers to both the quantum and classical embedding problems, two long-standing open problems in mathematics (the classical problem, in particular, dating back over 70 years)."
to:NB  learning_theory  physics  computational_complexity 
9 weeks ago by cshalizi
[1203.3887] Learning Loopy Graphical Models with Latent Variables: Efficient Methods and Guarantees
"The problem of structure estimation in latent graphical models is considered, where some nodes are latent or hidden. A novel method is proposed which attempts to locally reconstruct latent trees and outputs a loopy graph structure with hidden variables. Correctness of the method is established when the underlying graph has a large girth and the model is in the regime of correlation decay, and PAC guarantees for the method are also derived. For the special case of the Ising model, the number of samples $n$ required for structural consistency scales as $n = Omega(theta_{min}^{-2delta eta(eta+1)-2}log p)$, where $theta_{min}$ is the minimum edge potential, $delta$ is the depth (i.e., distance from a hidden node to the nearest observed nodes), and $eta$ is a parameter which depends on the bounds on node and edge potentials in the Ising model. The results are further specialized for the case when the observed nodes are uniformly sampled from the model. Finally, necessary conditions for structural consistency under any algorithm are derived."
to:NB  graphical_models  learning_theory  machine_learning  statistics  ising_model  inference_to_latent_objects 
9 weeks ago by cshalizi
[1202.3482] The local geometry of finite mixtures
"We introduce a technique to obtain local (bracketing) metric entropy bounds for subsets of a normed vector space from global entropy bounds. Using this method, we establish that for q>=1, the class of convex combinations of q translates of a probability density has finite local doubling dimension under a smoothness assumption. The proof requires a detailed investigation of the local geometry of mixture classes, which is of independent interest."
in_NB  statistics  mixture_models  learning_theory 
11 weeks ago by cshalizi
[1203.0193] Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts
Huh: "The Vapnik-Chervonenkis dimension (VC dimension) of the set of half-spaces of R^d with frontiers parallel to the axes is computed exactly. It is shown that this VC dimension is smaller than the intuitive value of d. An additional approximation using the Stirling's formula is given. This result may be used to evaluate the performance of classifiers or regressors based on dyadic partitioning of R^d for instance. Algorithms using axis-parallel cuts to partition R^d are often used to reduce the computational time of such estimators when d is large."
to:NB  learning_theory  vc-dimension  classifiers 
12 weeks ago by cshalizi
[0806.4140] Optimal oracle inequalities for model selection
"Model selection is often performed by empirical risk minimization. The quality of selection in a given situation can be assessed by risk bounds, which require assumptions both on the margin and the tails of the losses used. Starting with examples from the 3 basic estimation problems, regression, classification and density estimation, we formulate risk bounds for empirical risk minimization under successively weakening conditions and prove them at a very general level, for general margin and power tail behavior of the excess losses."
in_NB  statistics  learning_theory  cross-validation  model_selection  van_de_geer.sara 
12 weeks ago by cshalizi
[1202.4294] Prediction of quantiles by statistical learning and application to GDP forecasting
"In this paper, we tackle the problem of prediction and confidence intervals for time series using a statistical learning approach and quantile loss functions. In a first time, we show that the Gibbs estimator (also known as Exponentially Weighted aggregate) is able to predict as well as the best predictor in a given family for a wide set of loss functions. In particular, using the quantile loss function of Koenker and Bassett (1978), this allows to build confidence intervals. We apply these results to the problem of prediction and confidence regions for the French Gross Domestic Product (GDP) growth, with promising results."
in_NB  to_read  prediction  confidence_sets  learning_theory  re:your_favorite_dsge_sucks  re:growing_ensemble_project 
february 2012 by cshalizi
[1202.4283] Fast rates in learning with dependent observations
"In this paper we tackle the problem of fast rates in time series forecasting from a statistical learning perspective. In a serie of papers (e.g. Meir 2000, Modha and Masry 1998, Alquier and Wintenberger 2012) it is shown that the main tools used in learning theory with iid observations can be extended to the prediction of time series. The main message of these papers is that, given a family of predictors, we are able to build a new predictor that predicts the series as well as the best predictor in the family, up to a remainder of order $1/sqrt{n}$. It is known that this rate cannot be improved in general. In this paper, we show that in the particular case of the least square loss, and under a strong assumption on the time series (phi-mixing) the remainder is actually of order $1/n$. Thus, the optimal rate for iid variables, see e.g. Tsybakov 2003, and individual sequences, see cite{lugosi} is, for the first time, achieved for uniformly mixing processes. We also show that our method is optimal for aggregating sparse linear combinations of predictors."

--- Assumes observations are in the interval [-B,B] and gets a bound which is O(B^3), and so useless for our purposes.
in_NB  learning_theory  mixing  ergodic_theory  re:your_favorite_dsge_sucks  re:XV_for_mixing  have_read 
february 2012 by cshalizi
Weakly Universally Consistent Forecasting of Stationary and Ergodic Time Series
"Static forecasting of stationary and ergodic time series is considered, i.e., inference of the conditional expectation of the response variable at time zero given the infinite past. It is shown that the mean squared error of a combination of suitably defined localized least squares estimates converges to zero for all distributions where the response variable is square integrable."
to:NB  universal_prediction  stochastic_processes  ergodic_theory  statistical_inference_for_stochastic_processes  learning_theory 
february 2012 by cshalizi
Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
"We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic.
"We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2) per trial, where n is the dimension of the instances."
to:NB  online_learning  dimension_reduction  machine_learning  learning_theory  warmuth.manfred  principal_components  low-regret_learning 
february 2012 by cshalizi
Giné , Nickl : Rates of contraction for posterior distributions in Lr-metrics, 1 ≤ r ≤ ∞
"The frequentist behavior of nonparametric Bayes estimates, more specifically, rates of contraction of the posterior distributions to shrinking Lr-norm neighborhoods, 1 ≤ r ≤ ∞, of the unknown parameter, are studied. A theorem for nonparametric density estimation is proved under general approximation-theoretic assumptions on the prior. The result is applied to a variety of common examples, including Gaussian process, wavelet series, normal mixture and histogram priors. The rates of contraction are minimax-optimal for 1 ≤ r ≤ 2, but deteriorate as r increases beyond 2. In the case of Gaussian nonparametric regression a Gaussian prior is devised for which the posterior contracts at the optimal rate in all Lr-norms, 1 ≤ r ≤ ∞."
in_NB  bayesian_consistency  statistics  nonparametrics  learning_theory  re:bayes_as_evol  density_estimation  regression 
january 2012 by cshalizi
[0712.1698] PAC-Bayesian Bounds for Randomized Empirical Risk Minimizers
"The aim of this paper is to generalize the PAC-Bayesian theorems proved by Catoni in the classification setting to more general problems of statistical inference. We show how to control the deviations of the risk of randomized estimators. A particular attention is paid to randomized estimators drawn in a small neighborhood of classical estimators, whose study leads to control the risk of the latter. These results allow to bound the risk of very general estimation procedures, as well as to perform model selection."
to:NB  learning_theory  statistics  machine_learning 
january 2012 by cshalizi
Introduction to Online Optimization (Bubeck)
"to_teach" tag a sudden brainstorm for how to make next year's statistical computing class either unbeatably awesome or an absolute disaster
to:NB  online_learning  regression  individual_sequence_prediction  optimization  machine_learning  learning_theory  via:mraginsky  to_read  to_teach:statcomp 
december 2011 by cshalizi
Audibert , Catoni : Robust linear least squares regression
"We consider the problem of robustly predicting as well as the best linear combination of d given functions in least squares regression, and variants of this problem including constraints on the parameters of the linear combination. For the ridge estimator and the ordinary least squares estimator, and their variants, we provide new risk bounds of order d/n without logarithmic factor unlike some standard results, where n is the size of the training data. We also provide a new estimator with better deviations in the presence of heavy-tailed noise. It is based on truncating differences of losses in a min–max framework and satisfies a d/n risk bound both in expectation and in deviations. The key common surprising factor of these results is the absence of exponential moment condition on the output distribution while achieving exponential deviations. All risk bounds are obtained through a PAC-Bayesian analysis on truncated differences of losses. Experimental results strongly back up our truncated min–max estimator."
in_NB  regression  statistics  linear_regression  learning_theory  catoni.olivier 
december 2011 by cshalizi
[1111.5648] Falsification and future performance
"We information-theoretically reformulate two measures of capacity from statistical learning theory: empirical VC-entropy and empirical Rademacher complexity. We show these capacity measures count the number of hypotheses about a dataset that a learning algorithm falsifies when it finds the classifier in its repertoire minimizing empirical risk. It then follows from that the future performance of predictors on unseen data is controlled in part by how many hypotheses the learner falsifies. As a corollary we show that empirical VC-entropy quantifies the message length of the true hypothesis in the optimal code of a particular probability distribution, the so-called actual repertoire."
to:NB  to_read  information_theory  learning_theory  falsification  balduzzi.david 
december 2011 by cshalizi
[1111.6410] Adaptive Semisupervised Inference
"Semisupervised methods inevitably invoke some assumption that links the marginal distribution of the features to the regression function of the label. Most commonly, the cluster or manifold assumptions are used which imply that the regression function is smooth over high-density clusters or manifolds supporting the data. A generalization of these assumptions is that the regression function is smooth with respect to some density sensitive distance. This motivates the use of a density based metric for semisupervised learning. We analyze this setting and make the following contributions - (a) we propose a semi-supervised learner that uses a density-sensitive kernel and show that it provides better performance than any supervised learner if the density support set has a small condition number and (b) we show that it is possible to adapt to the degree of semi-supervisedness using data-dependent choice of a parameter that controls sensitivity of the distance metric to the density. This ensures that the semisupervised learner never performs worse than a supervised learner even if the assumptions fail to hold."
to:NB  machine_learning  learning_theory  semi-supervised_learning  singh.aarti  wasserman.larry 
december 2011 by cshalizi
[1111.6337] Regret Bound by Variation for Online Convex Optimization
"In citep{Hazan-2008-extract}, the authors showed that the regret of online linear optimization can be bounded by the total variation of the cost vectors. In this paper, we extend this result to general online convex optimization. We first analyze the limitations of the algorithm in citep{Hazan-2008-extract} when applied it to online convex optimization. We then present two algorithms for online convex optimization whose regrets are bounded by the variation of cost functions. We finally consider the bandit setting, and present a randomized algorithm for online bandit convex optimization with a variation-based regret bound. We show that the regret bound for online bandit convex optimization is optimal when the variation of cost functions is independent of the number of trials."
in_NB  to_read  re:growing_ensemble_project  learning_theory  individual_sequence_prediction 
december 2011 by cshalizi
[1111.4470] Efficient Regression in Metric Spaces via Approximate Lipschitz Extension
"We present a framework for performing efficient regression in general metric spaces. Roughly speaking, our regressor predicts the value at a new point by computing a Lipschitz extension -- the smoothest function consistent with the observed data -- while performing an optimized structural risk minimization to avoid overfitting. The offline (learning) and online (inference) stages can be solved by convex programming, but this naive approach has runtime complexity O(n^3), which is prohibitive for large datasets. We design instead an algorithm that is fast when the the doubling dimension, which measures the "intrinsic" dimensionality of the metric space, is low.
We use the doubling dimension multiple times; first, on the statistical front, to bound fat-shattering dimension of the class of Lipschitz functions (and obtain risk bounds); and second, on the computational front, to quickly compute a hypothesis function and a prediction based on Lipschitz extension. Our resulting regressor is both asymptotically strongly consistent and comes with finite-sample risk bounds, while making minimal structural and noise assumptions."
in_NB  heard_the_talk  kith_and_kin  regression  learning_theory  statistics  kontorovich.aryeh 
november 2011 by cshalizi
[1111.3404] Estimated VC dimension for risk bounds
"Vapnik-Chervonenkis (VC) dimension is a fundamental measure of the generalization capacity of learning algorithms. However, apart from a few special cases, it is hard or impossible to calculate analytically. Vapnik et al. [10] proposed a technique for estimating the VC dimension empirically. While their approach behaves well in simulations, it could not be used to bound the generalization risk of classifiers, because there were no bounds for the estimation error of the VC dimension itself. We rectify this omission, providing high probability concentration results for the proposed estimator and deriving corresponding generalization bounds."
self-promotion  learning_theory  vc-dimension  machine_learning  re:your_favorite_dsge_sucks 
november 2011 by cshalizi
Generalization Bound for Infinitely Divisible Empirical Process
"In this paper, we study the generalization bound for an empirical process of samples independently drawn from an infinitely divisible (ID) distribution, which is termed as the ID empirical process. In particular, based on a martingale method, we develop deviation inequalities for the sequence of random variables of an ID distribution. By applying the obtained deviation inequalities, we then show the generalization bound for ID empirical process based on the annealed Vapnik- Chervonenkis (VC) entropy. Afterward, according to Sauer’s lemma, we get the generalization bound for ID empirical process based on the VC dimension. Finally, by using a resulted result bound, we analyze the asymptotic convergence of ID empirical process and show that the convergence rate of ID empirical process can reach O((frac{Lambda_mathcal{F}(2N)}{N})^{frac{1}{1.3}}) and it is faster than the results of the generic i.i.d. empirical process (Vapnik, 1999) "
in_NB  learning_theory  empirical_processes  stochastic_processes  levy_processes  martingales  re:almost_none 
november 2011 by cshalizi
Improved Regret Guarantees for Online Smooth Convex Optimization with Bandit Feedback
The study of online convex optimization in the bandit setting was initiated by Kleinberg (2004) and Flaxman et al. (2005). Such a setting models a decision maker that has to make decisions in the face of adversarially chosen convex loss functions. Moreover, the only information the decision maker receives are the losses. The identity of the loss functions themselves is not revealed. In this setting, we reduce the gap between the best known lower and upper bounds for the class of smooth convex functions, i.e. convex functions with a Lipschitz continuous gradient. Building upon existing work on self-concordant regularizers and one-point gradient estimation, we give the first algorithm whose expected regret, ignoring constant and logarithmic factors, is O(T^{2/3}).
to:NB  decision_theory  learning_theory  machine_learning  bandit_problems  individual_sequence_prediction 
november 2011 by cshalizi
Adaptive Bandits: Towards the Best History-Dependent Strategy
"We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Theta of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios: (1) The opponent is constrained, i.e.~he provides rewards that are stochastic functions of equivalence classes defined by some model theta* in Theta. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t.~the best strategy among all mappings from classes to actions (i.e.~the best history-class-based strategy) for the best model in Theta. This allows to model opponents (case 1) or strategies (case 2) which handles finite memory, periodicity, standard stochastic bandits and other situations. When Theta={theta}, i.e.~only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by tilde O(sqrt{TAC}), where C is the number of classes of theta. Now, when many models are available, all known algorithms achieving a nice regret O(sqrt{T}) are unfortunately not tractable and scale poorly with the number of models $|Theta|$. Our contribution here is to provide tractable algorithms with regret bounded by T^{2/3}C^{1/3}log(|Theta|)^{1/2}. "
to:NB  decision_theory  learning_theory  machine_learning  bandit_problems  individual_sequence_prediction 
november 2011 by cshalizi
[1110.6886] PAC-Bayesian Inequalities for Martingales
"We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. We also present a comparison inequality that bounds expectation of a convex function of martingale difference type variables by expectation of the same function of independent Bernoulli variables. This inequality is applied to derive a tighter analog of Hoeffding-Azuma inequality." --- For the record: I hate martingales.
to:NB  re:almost_none  learning_theory  concentration_of_measure  martingales 
november 2011 by cshalizi
[1110.3592] Information, learning and falsification
"Broadly speaking, there are two approaches to quantifying information. The first, Shannon information, takes events as belonging to ensembles and quantifies the information resulting from observing the given event in terms of the number of alternate events that have been ruled out. The second, algorithmic information or Kolmogorov complexity, takes events as strings and, given a universal Turing machine, quantifies the information content of a string as the length of the shortest program producing it. This note describes a new method of quantifying information, effective information, that links algorithmic information to Shannon information, and also links both to capacities arising in statistical learning theory. After introducing the measure, we show that it provides a non-universal analog of algorithmic information. We then apply it to derive basic capacities in statistical learning theory: empirical VC-entropy and empirical Rademacher complexity. A nice byproduct of our approach is an interpretation of the explanatory power of a learning algorithm in terms of the number of hypotheses it falsifies (counted in two different ways for the two different capacities). We also discuss how effective information relates to information gain, Shannon and mutual information."
to:NB  to_read  learning_theory  information_theory  balduzzi.david 
october 2011 by cshalizi
[1110.2755] Efficient Tracking of Large Classes of Experts
"In the framework for prediction of individual sequences, sequential prediction methods are to be constructed that perform nearly as well as the best expert from a given class. We consider prediction strategies that compete with the class of switching strategies that can segment a given sequence into several blocks, and follow the advice of a different "base" expert in each block. As usual, the performance of the algorithm is measured by the regret defined as the excess loss relative to the best switching strategy %(with an arbitrary number of switches) selected in hindsight for the particular sequence to be predicted. In this paper we construct %strongly sequential (i.e., horizon-independent) prediction strategies of low computational cost for the case where the set of base experts is large. In particular we derive a family of efficient tracking algorithms that, for any prediction algorithm $A$ designed for the base class, can be implemented with time and space complexity $O(n^{gamma} log n)$ times larger than that of $A$, where $n$ is the time horizon and $gamma ge 0$ is a parameter of the algorithm. With $A$ properly chosen, our algorithm achieves a regret bound of optimal order for $gamma>0$, and only $O(log n)$ times larger than the optimal order for $gamma=0$ for all typical regret bound types we examined. For example, for predicting binary sequences with switching parameters, our method achieves the optimal $O(log n)$ regret rate with time complexity $O(n^{1+gamma}log n)$ for any $gammain (0,1)$."
to:NB  to_read  re:growing_ensemble_project  learning_theory  individual_sequence_prediction  lugosi.gabor 
october 2011 by cshalizi
[1110.2529] The Generalization Ability of Online Algorithms for Dependent Data
"We study the generalization performance of arbitrary online learning algorithms trained on samples coming from a dependent source of data. We show that the generalization error of any stable online algorithm concentrates around its regret--an easily computable statistic of the online performance of the algorithm--when the underlying ergodic process is $beta$- or $phi$-mixing. We show high probability error bounds assuming the loss function is convex, and we also establish sharp convergence rates and deviation bounds for strongly convex losses and several linear prediction problems such as linear and logistic regression, least-squares SVM, and boosting on dependent data. In addition, our results have straightforward applications to stochastic optimization with dependent data, and our analysis requires only martingale convergence arguments; we need not rely on more powerful statistical tools such as empirical process theory."
in_NB  learning_theory  individual_sequence_prediction  ergodic_theory  mixing  re:growing_ensemble_project  re:XV_for_mixing  stability_of_learning  concentration_of_measure  have_read  re:your_favorite_dsge_sucks 
october 2011 by cshalizi
IEEE Xplore - Computational Limits to Nonparametric Estimation for Ergodic Processes
"A new negative result for nonparametric distribution estimation of binary ergodic processes is shown. The problem of estimation of distribution with any degree of accuracy is studied. Then it is shown that for any countable class of estimators there is a zero-entropy binary ergodic process that is inconsistent with the class of estimators. Our result is different from other negative results for universal forecasting scheme of ergodic processes."
to:NB  universal_prediction  ergodic_theory  statistics  statistical_inference_for_stochastic_processes  learning_theory 
october 2011 by cshalizi
A statistical physics approach for the analysis of machine learning algorithms on real data
"We combine the replica approach of statistical physics with a variational technique to make it applicable for the analysis of machine learning algorithms on real data. The method is applied to Gaussian process models and their relative, the support vector machine. We discuss the quality of our theoretical results in comparison to experiments. As a key result, we apply our theory on real world benchmark data and show its potential for practical applications by deriving approximate expressions for data averaged performance measures which hold for general data distributions and allow us to optimize the performance of the learning algorithm."
to_be_shot_after_a_fair_trial  machine_learning  statistical_mechanics  learning_theory  in_NB 
august 2011 by cshalizi
[1108.0775] Optimization with Sparsity-Inducing Penalties
"Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. They were first dedicated to linear variable selection but numerous extensions have now emerged such as structured sparsity or kernel selection. It turns out that many of the related estimation problems can be cast as convex optimization problems by regularizing the empirical risk with appropriate non-smooth norms. The goal of this paper is to present from a general perspective optimization tools and techniques dedicated to such sparsity-inducing penalties. We cover proximal methods, block-coordinate descent, reweighted $\ell_2$-penalized techniques, working-set and homotopy methods, as well as non-convex formulations and extensions, and provide an extensive set of experiments to compare various algorithms from a computational point of view."
learning_theory  sparsity  optimization  lasso  to:NB  machine_learning  statistics 
august 2011 by cshalizi
[1107.4080] On the Universality of Online Mirror Descent
"We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee."
optimization  learning_theory  online_learning  low-regret_learning 
july 2011 by cshalizi
Differentially Private Empirical Risk Minimization
"Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). ... a new method, objective perturbation, for privacy-preserving machine learning algorithm design ... perturbing the objective function before optimizing ...our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. ... a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees ... encouraging results from ... performance on real demographic and benchmark data sets... objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance."
learning_theory  stability_of_learning  privacy  sarwate.anand  to:NB 
april 2011 by cshalizi
Cross-Validation and Mean-Square Stability
It's a little boggling that they don't cite any of the modern (2000--) work on theoretical properties of CV, but oh well...
cross-validation  learning_theory  stability_of_learning  statistics  re:your_favorite_dsge_sucks  re:XV_for_mixing  re:XV_for_networks  to_read  via:nikete 
march 2011 by cshalizi
[1102.5750] Neyman-Pearson classification, convexity and stochastic constraints
"Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its probability of type I error is below a pre-specified level and (ii), it has probability of type II error close to the minimum possible. The proposed classifier is obtained by solving an optimization problem with an empirical objective and an empirical constraint. New techniques to handle such problems are developed and have consequences on chance constrained programming."
Final version: http://jmlr.csail.mit.edu/papers/v12/rigollet11a.html
in_NB  learning_theory  statistics  hypothesis_testing  convexity  machine_learning  optimization  have_read  rigollet.philippe 
march 2011 by cshalizi
[0810.4752] Statistical Learning Theory: Models, Concepts, and Results
"Statistical learning theory provides the theoretical basis for many of today's machine learning algorithms. In this article we attempt to give a gentle, non-technical overview over the key ideas and insights of statistical learning theory. We target at a broad audience, not necessarily machine learning researchers. This paper can serve as a starting point for people who want to get an overview on the field before diving into technical details."
learning_theory  machine_learning  statistics 
january 2011 by cshalizi
[1011.3396] PAC-Bayesian aggregation and multi-armed bandits
"This habilitation thesis presents several contributions to (1) the PAC-Bayesian analysis of statistical learning, (2) the three aggregation problems: given d functions, how to predict as well as (i) the best of these d functions (model selection type aggregation), (ii) the best convex combination of these d functions, (iii) the best linear combination of these d functions, (3) the multi-armed bandit problems."
statistics  learning_theory  pac-bayesian  model_selection  ensemble_methods  to:NB 
november 2010 by cshalizi
Giné , Nickl : Adaptive estimation of a distribution function and its density in sup-norm loss by wavelet and spline projections
"Given an i.i.d. sample from a distribution F on ℝ with uniformly continuous density p0, purely data-driven estimators are constructed that efficiently estimate F in sup-norm loss and simultaneously estimate p0 at the best possible rate of convergence over Hölder balls, also in sup-norm loss. The estimators are obtained by applying a model selection procedure close to Lepski’s method with random thresholds to projections of the empirical measure onto spaces spanned by wavelets or B-splines. The random thresholds are based on suprema of Rademacher processes indexed by wavelet or spline projection kernels. This requires Bernstein-type analogs of the inequalities in Koltchinskii [Ann. Statist. 34 (2006) 2593–2656] for the deviation of suprema of empirical processes from their Rademacher symmetrizations."
learning_theory  density_estimation  empirical_processes 
november 2010 by cshalizi
Learnability, Stability, and Uniform Convergence
"characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. ... the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression."
learning_theory  stability_of_learning  have_read  re:your_favorite_dsge_sucks 
november 2010 by cshalizi
[1010.2286] Divergence-based characterization of fundamental limitations of adaptive dynamical systems
"general problem of adaptively controlling and/or identifying a stochastic dynamical system, where our {\em a priori} knowledge allows us to place the system in a subset of a metric space (the uncertainty set). We present an information-theoretic meta-theorem that captures the trade-off between the metric complexity (or richness) of the uncertainty set, the amount of information acquired online in the process of controlling and observing the system, and the residual uncertainty remaining after the observations have been collected. Following the approach of Zames, we quantify {\em a priori} information by the Kolmogorov (metric) entropy of the uncertainty set, while the information acquired online is expressed as a sum of information divergences. The general theory is used to derive new minimax lower bounds on the metric identification error, as well as to give a simple derivation of the minimum time needed to stabilize an uncertain stochastic linear system."
systems_identification  control_theory  estimation  learning_theory  minimax  heard_the_talk  raginsky.maxim  to:NB 
october 2010 by cshalizi
[1009.4434] The universal Glivenko-Cantelli property
I wonder if this could be used to give a more comprehensible sufficient condition for posterior convergence in the setting of my Bayes-is-a-special-case-of-replicator-dynamics paper?
learning_theory  probability  glivenko-cantelli  re:bayes_as_evol  have_read  re:almost_none 
september 2010 by cshalizi
Typical, just typical « The Information Structuralist
Maxim blogs about his own new paper, and about the "Coordination Capacity" paper, so I don't have to.
information_theory  method_of_types  learning_theory  raginsky.maxim 
september 2010 by cshalizi
Risk Bounds for Levy Processes in the PAC-Learning Framework
"Levy processes play an important role in the stochastic process theory. However, since samples are non-i.i.d., statistical learning results based on the i.i.d. scenarios cannot be utilized to study the risk bounds for Levy processes. In this paper, we present risk bounds for non-i.i.d. samples drawn from Levy processes in the PAC-learning framework. In particular, by using a concentration inequality for infinitely divisible distributions, we first prove that the function of risk error is Lipschitz continuous with a high probability, and then by using a specific concentration inequality for Levy processes, we obtain the risk bounds for non-i.i.d. samples drawn from Levy processes without Gaussian components. Based on the resulted risk bounds, we analyze the factors that affect the convergence of the risk bounds and then prove the convergence."
learning_theory  stochastic_processes  levy_processes  statistical_inference_for_stochastic_processes  via:djm1107 
august 2010 by cshalizi
[1007.4037] Uniform Approximation and Bracketing Properties of VC classes
"We show that the sets in a family with finite VC dimension can be uniformly approximated within a given error by a finite partition. Immediate corollaries include the fact that VC classes have finite bracketing numbers, satisfy uniform laws of averages under strong dependence, and exhibit uniform mixing. Our results are based on recent work concerning uniform laws of averages for VC classes under ergodic sampling." I.e., they repeat their previous trick: if you don't have uniform approximation, they can construct arbitrarily large shatterable sets.
learning_theory  ergodic_theory  nobel.andrew  adams.terrence  have_read 
july 2010 by cshalizi
« earlier      

related tags

aaronson.scott  adams.terrence  analysis  asymptotics  balduzzi.david  bandit_problems  bayesianism  bayesian_consistency  biau.gerard  blanchard.gilles  blogged  books:noted  books:recommended  book_reviews  boosting  bootstrap  boucheron.stephane  bousquet.olivier  breiman.leo  calibration  carnegie_mellon  catoni.olivier  central_limit_theorem  classifiers  clustering  cognitive_science  community_discovery  complexity  compressed_sensing  computational_complexity  computational_statistics  concentration_of_measure  conferences  confidence_sets  consistency  control_theory  convexity  cross-validation  cybernetics  data_mining  decision_theory  dembo.amir  density_estimation  deviation_bounds  deviation_inequalities  devroye.luc  dimension_reduction  diversity  economics  empirical_processes  ensemble_methods  epistemology  ergodic_theory  estimation  evolutionary_search  exchangeable_sequences  experimental_biology  experimental_psychology  exploitation-exploration_tradeoff  falsification  first_draw_your_curve_then_plot_your_data  freund.yoav  functional_connectivity  function_approximation  funny:geeky  glivenko-cantelli  grants  graphical_models  grunwald.peter  harman.gilbert  have_read  heard_the_talk  hierarchical_structure  hilbert_space  histograms  hypothesis_testing  individual_sequence_prediction  induction  inference_to_latent_objects  information_criteria  information_theory  inverse_problems  in_NB  ising_model  jiang.wenxin  jordan.michael_i.  k-means  kakade.sham  kearns.michael  kelly.kevin_t.  kernel_methods  kith_and_kin  kontorovich.aryeh  kulkarni.sanjeev  large_deviations  lasso  learning_in_games  learning_theory  leeb.hannes  levy_processes  linear_regression  liu.han  low-regret_learning  lugosi.gabor  machine_learning  macroeconomics  manifold_learning  markov_models  martingales  massart.pascal  mayo-wilson.conor  mayo.deborah  measure_theory  methodology  method_of_types  minimax  mixing  mixture_models  model_selection  nadler.boaz  network_data_analysis  neural_modeling  neural_networks  neuroscience  niyogi.partha  nobel.andrew  nonparametrics  occams_razor  online_learning  optimization  oracle_inequalities  pac-bayesian  paradoxes  perturbation_theory  phase_transitions  philosophy_of_science  photos  physics  predesignation  prediction  principal_components  privacy  probability  probably_approximately_correct  quantum_computing_since_democritus  quantum_mechanics  rademacher_complexity  raginsky.maxim  random_forests  re:almost_none  re:AoS_project  re:bayes_as_evol  re:growing_ensemble_project  re:naive-semi-supervised  re:phil-of-bayes_paper  re:smoothing_adjacency_matrices  re:stacs  re:XV_for_mixing  re:XV_for_networks  re:your_favorite_dsge_sucks  regression  relational_learning  resampling  review_papers  rigollet.philippe  risk_vs_uncertainty  sarwate.anand  self-centered  self-promotion  semi-supervised_learning  singh.aarti  smola.alex  sober.elliott  sparsity  splines  stability_of_learning  state-space_models  statistical_inference_for_stochastic_processes  statistical_learning  statistical_mechanics  statistics  stochastic_differential_equations  stochastic_processes  structural_risk_minimization  support_vector_machines  systems_identification  thesis_proposals  time_series  to:blog  to:NB  to_be_shot_after_a_fair_trial  to_read  to_teach:data-mining  to_teach:statcomp  to_teach:undergrad-ADA  track_down_references  universal_prediction  van_der_vaart.aad  van_de_geer.sara  van_handel.ramon  vapnik.v.n.  vapnik.vladimir  vazirani.umesh  vc-dimension  via:?  via:arthegall  via:ded-maxim  via:djm1107  via:klk  via:mraginsky  via:mreid  via:nikete  via:scotte  via:shivak  vishwanathan.s.v.n.  warmuth.manfred  wasserman.larry  weak_dependence  wellner.jon  zhang.tong 

Copy this bookmark:



description:


tags: