cshalizi + cross-validation   28

[0803.2963] Consistency of cross validation for comparing regression procedures
"Theoretical developments on cross validation (CV) have mainly focused on selecting one among a list of finite-dimensional models (e.g., subset or order selection in linear regression) or selecting a smoothing parameter (e.g., bandwidth for kernel smoothing). However, little is known about consistency of cross validation when applied to compare between parametric and nonparametric methods or within nonparametric methods. We show that under some conditions, with an appropriate choice of data splitting ratio, cross validation is consistent in the sense of selecting the better procedure with probability approaching 1. Our results reveal interesting behavior of cross validation. When comparing two models (procedures) converging at the same nonparametric rate, in contrast to the parametric case, it turns out that the proportion of data used for evaluation in CV does not need to be dominating in size. Furthermore, it can even be of a smaller order than the proportion for estimation while not affecting the consistency property."
to:NB  statistics  to_read  cross-validation  model_selection  nonparametrics  to_teach:undergrad-ADA  re:stacs 
11 weeks ago by cshalizi
[0806.4140] Optimal oracle inequalities for model selection
"Model selection is often performed by empirical risk minimization. The quality of selection in a given situation can be assessed by risk bounds, which require assumptions both on the margin and the tails of the losses used. Starting with examples from the 3 basic estimation problems, regression, classification and density estimation, we formulate risk bounds for empirical risk minimization under successively weakening conditions and prove them at a very general level, for general margin and power tail behavior of the excess losses."
in_NB  statistics  learning_theory  cross-validation  model_selection  van_de_geer.sara 
12 weeks ago by cshalizi
Model Selection in Kernel Based Regression using the Influence Function
"Recent results about the robustness of kernel methods involve the analysis of influence functions. By definition the influence function is closely related to leave-one-out criteria. In statistical learning, the latter is often used to assess the generalization of a method. In statistics, the influence function is used in a similar way to analyze the statistical efficiency of a method. Links between both worlds are explored. The influence function is related to the first term of a Taylor expansion. Higher order influence functions are calculated. A recursive relation between these terms is found characterizing the full Taylor expansion. It is shown how to evaluate influence functions at a specific sample distribution to obtain an approximation of the leave-one-out error. A specific implementation is proposed using a L1 loss in the selection of the hyperparameters and a Huber loss in the estimation procedure. The parameter in the Huber loss controlling the degree of robustness is optimized as well. The resulting procedure gives good results, even when outliers are present in the data."
to:NB  statistics  regression  kernel_estimators  model_selection  robustness  nonparametrics  cross-validation 
february 2012 by cshalizi
Shen, Welch, Hughes-Oliver: Efficient, adaptive cross-validation for tuning and comparing models, with application to drug discovery
"Cross-validation (CV) is widely used for tuning a model with respect to user-selected parameters and for selecting a “best” model. For example, the method of k-nearest neighbors requires the user to choose k, the number of neighbors, and a neural network has several tuning parameters controlling the network complexity. Once such parameters are optimized for a particular data set, the next step is often to compare the various optimized models and choose the method with the best predictive performance. Both tuning and model selection boil down to comparing models, either across different values of the tuning parameters or across different classes of statistical models and/or sets of explanatory variables. For multiple large sets of data, like the PubChem drug discovery cheminformatics data which motivated this work, reliable CV comparisons are computationally demanding, or even infeasible. In this paper we develop an efficient sequential methodology for model comparison based on CV. It also takes into account the randomness in CV. The number of models is reduced via an adaptive, multiplicity-adjusted sequential algorithm, where poor performers are quickly eliminated. By exploiting matching of individual observations, it is sometimes even possible to establish the statistically significant inferiority of some models with just one execution of CV."
in_NB  model_selection  statistics  cross-validation  machine_learning 
december 2011 by cshalizi
Variance estimation using refitted cross-validation in ultrahigh dimensional regression - Fan - 2011 - Journal of the Royal Statistical Society: Series B (Statistical Methodology)
"Variance estimation is a fundamental problem in statistical modelling. In ultrahigh dimensional linear regression where the dimensionality is much larger than the sample size, traditional variance estimation techniques are not applicable. Recent advances in variable selection in ultrahigh dimensional linear regression make this problem accessible. One of the major problems in ultrahigh dimensional regression is the high spurious correlation between the unobserved realized noise and some of the predictors. As a result, the realized noises are actually predicted when extra irrelevant variables are selected, leading to a serious underestimate of the level of noise. We propose a two-stage refitted procedure via a data splitting technique, called refitted cross-validation, to attenuate the influence of irrelevant variables with high spurious correlations. Our asymptotic results show that the resulting procedure performs as well as the oracle estimator, which knows in advance the mean regression function. The simulation studies lend further support to our theoretical claims. The naive two-stage estimator and the plug-in one-stage estimators using the lasso and smoothly clipped absolute deviation are also studied and compared. Their performances can be improved by the refitted cross-validation method proposed."
statistics  regression  variable_selection  cross-validation  estimation  to:NB  fan.jianqing 
october 2011 by cshalizi
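A rough sketch of the two-stage refitted cross-validation idea described in the Fan abstract above: split the data in half, select variables on one half, refit by least squares on the other half using only those variables, and average the two resulting variance estimates. This is an illustration under stated assumptions, not the paper's implementation. The selector `screen_top_k` (marginal-correlation screening) is a stand-in chosen for brevity, where the abstract mentions the lasso and SCAD; all function names, the simulated data, and the choice `k=10` are hypothetical.

```python
import numpy as np

def screen_top_k(X, y, k):
    """Hypothetical selector: keep the k predictors most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) + 1e-12)
    return np.argsort(score)[-k:]

def refit_variance(X, y, selected):
    """OLS on the selected columns plus an intercept; RSS / residual d.o.f."""
    A = np.column_stack([np.ones(len(y)), X[:, selected]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return resid @ resid / (len(y) - A.shape[1])

def rcv_variance(X, y, k, rng):
    """Select on one half, refit on the other, swap roles, and average."""
    idx = rng.permutation(len(y))
    a, b = idx[: len(y) // 2], idx[len(y) // 2:]
    sel_a = screen_top_k(X[a], y[a], k)   # variables chosen using half a only
    sel_b = screen_top_k(X[b], y[b], k)   # variables chosen using half b only
    return 0.5 * (refit_variance(X[b], y[b], sel_a)
                  + refit_variance(X[a], y[a], sel_b))

def naive_variance(X, y, k):
    """Select and refit on the same data (the estimator RCV is meant to fix)."""
    return refit_variance(X, y, screen_top_k(X, y, k))

# Simulated example: n = 200 observations, p = 1000 predictors,
# three true signals, noise variance 1 (all values are assumptions).
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_normal((n, p))
y = 3.0 * X[:, :3].sum(axis=1) + rng.standard_normal(n)

est_rcv = rcv_variance(X, y, k=10, rng=rng)
est_naive = naive_variance(X, y, k=10)
print("refitted CV:", est_rcv, "naive:", est_naive)
```

Because each refit uses data that played no part in choosing the variables, spuriously correlated predictors selected on one half have no special tendency to soak up noise on the other half, which is the mechanism the abstract credits for avoiding underestimation.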
Cross-Validation and Mean-Square Stability
It's a little boggling that they don't cite any of the modern (2000--) work on theoretical properties of CV, but oh well...
cross-validation  learning_theory  stability_of_learning  statistics  re:your_favorite_dsge_sucks  re:XV_for_mixing  re:XV_for_networks  to_read  via:nikete 
march 2011 by cshalizi
[1010.6202] Sequential Data-Adaptive Bandwidth Selection by Cross-Validation for Nonparametric Prediction
"We consider the problem of bandwidth selection by cross-validation from a sequential point of view in a nonparametric regression model. Having in mind that in applications one often aims at estimation, prediction and change detection simultaneously, we investigate that approach for sequential kernel smoothers in order to base these tasks on a single statistic. We provide uniform weak laws of large numbers and weak consistency results for the cross-validated bandwidth. Extensions to weakly dependent error terms are discussed as well. The errors may be α-mixing or L2-near epoch dependent, which guarantees that the uniform convergence of the cross validation sum and the consistency of the cross-validated bandwidth hold true for a large class of time series. The method is illustrated by analyzing photovoltaic data."
cross-validation  prediction  time_series  model_selection  to_read 
november 2010 by cshalizi
Commenges: Statistical models: Conventional, penalized and hierarchical likelihood
"We give an overview of statistical models and likelihood, together with two of its variants: penalized and hierarchical likelihood. The Kullback-Leibler divergence is referred to repeatedly in the literature, for defining the misspecification risk of a model and for grounding the likelihood and the likelihood cross-validation, which can be used for choosing weights in penalized likelihood. Families of penalized likelihood and particular sieves estimators are shown to be equivalent. The similarity of these likelihoods with a posteriori distributions in a Bayesian approach is considered."
statistics  likelihood  cross-validation  re:phil-of-bayes_paper  to_read 
december 2009 by cshalizi
Arlot, Blanchard, Roquain: Some nonasymptotic results on resampling in high dimension, I: Confidence regions
"We study generalized bootstrap confidence regions for the mean of a random vector whose coordinates have an unknown dependency structure. The random vector is supposed to be either Gaussian or to have a symmetric and bounded distribution. The dimensionality of the vector can possibly be much larger than the number of observations and we focus on a nonasymptotic control of the confidence level, following ideas inspired by recent results in learning theory. We consider two approaches, the first based on a concentration principle (valid for a large class of resampling weights) and the second on a resampled quantile, specifically using Rademacher weights. Several intermediate results established in the approach based on concentration principles are of interest in their own right. We also discuss the question of accuracy when using Monte Carlo approximations of the resampled quantities."
statistics  resampling  bootstrap  cross-validation  confidence_sets  to_read  re:XV_for_mixing  concentration_of_measure  learning_theory 
december 2009 by cshalizi
Sensitivity Analysis of k-fold Cross-Validation in Prediction Error Estimation
Apparently IEEE makes this available solely to tease me, since, while we have a fully paid-up electronic subscription, I can't get access.
machine_learning  statistics  cross-validation  to_read  re:XV_for_mixing  re:XV_for_networks 
november 2009 by cshalizi
A Cross-Validation Filter for Time Series Models (Piet De Jong, 1988)
"A filter is presented which computes cross-validation errors and associated statistics for an arbitrary state space model. The procedure is more efficient than an existing approach. Diffuse initial conditions are easily handled using a minor extension. The relationship to the fixed interval smoothing algorithm is investigated."
cross-validation  state-space_models  markov_models  time_series  have_read 
december 2008 by cshalizi
Cross-Validation and the Estimation of Conditional Probability Densities
Nice. Definitely needs to be included next time I teach data-mining. (The method is implemented in the "np" package on CRAN.) In particular worth comparing to logistic regression and logistic GAMs for binary conditional probability estimation/classification.
statistics  density_estimation  kernel_methods  cross-validation  to_teach:data-mining  have_read  to_teach:undergrad-ADA 
october 2008 by cshalizi