cshalizi + ensemble_methods   50

Rigollet : Kullback–Leibler aggregation and misspecified generalized linear models
"In a regression setup with deterministic design, we study the pure aggregation problem and introduce a natural extension from the Gaussian distribution to distributions in the exponential family. While this extension bears strong connections with generalized linear models, it does not require identifiability of the parameter or even that the model on the systematic component is true. It is shown that this problem can be solved by constrained and/or penalized likelihood maximization and we derive sharp oracle inequalities that hold both in expectation and with high probability. Finally all the bounds are proved to be optimal in a minimax sense."
to:NB  regression  ensemble_methods  statistics 
9 days ago by cshalizi
[1203.0697] Learning High-Dimensional Mixtures of Graphical Models
"We consider the problem of learning mixtures of discrete graphical models in high dimensions and propose a novel method for estimating the mixture components with provable guarantees. The method proceeds mainly in three stages. In the first stage, it estimates the union of the Markov graphs of the mixture components (referred to as the union graph) via a series of rank tests. It then uses this estimated union graph to compute the mixture components via a spectral decomposition method. The spectral decomposition method was originally proposed for latent class models, and we adapt this method for learning the more general class of graphical model mixtures. In the end, the method produces tree approximations of the mixture components via the Chow-Liu algorithm. Our output is thus a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. When the union graph has sparse node separators, we prove that our method has sample and computational complexities scaling as poly(p, d, r), for an r-component mixture of p-variate graphical models, where d is the cardinality of the sample space of each node variable. We also extend our results to the case when the union graph has sparse local separators, which is a weaker criterion than having sparse exact separators, and when the mixture components are in the regime of correlation decay. The computational and sample complexities of our method for this class are significantly improved, since they involve an upper bound on the cardinality of local separators (as opposed to exact separators). Our results push the realm of tractable model classes for high-dimensional learning, which includes the class of tree mixtures."
in_NB  mixture_models  ensemble_methods  graphical_models  machine_learning  to_read  chow-liu_algorithm 
7 weeks ago by cshalizi
[1203.5829] Ensemble estimators for multivariate entropy estimation
"The problem of estimation of density functionals like entropy and mutual information has received much attention in the statistics and information theory communities. A large class of estimators of functionals of the probability density suffer from the curse of dimensionality, wherein the exponent in the MSE rate of convergence decays increasingly slowly as the dimension $d$ of the samples increases. In particular, the rate is often glacially slow of order $O(T^{-{gamma}/{d}})$, where $T$ is the number of samples, and $gamma>0$ is a rate parameter. Examples of such estimators include kernel density estimators, $k$-NN density estimators, $k$-NN entropy estimators, intrinsic dimension estimators and other examples. In this paper, we propose a weighted convex combination of an ensemble of such estimators, where optimal weights can be chosen such that the weighted estimator converges at a much faster dimension invariant rate of $O(T^{-1})$. Furthermore, we show that these optimal weights can be determined by solving a convex optimization problem which can be performed offline and does not require training data. We illustrate the superior performance of our weighted estimator for two important applications: (i) estimating the Panter-Dite distortion-rate factor and (ii) estimating the Shannon entropy for testing the probability distribution of a random sample."
in_NB  ensemble_methods  entropy_estimation  statistics 
8 weeks ago by cshalizi
[1202.3716] Boosting as a Product of Experts
"In this paper, we derive a novel probabilistic model of boosting as a Product of Experts. We re-derive the boosting algorithm as a greedy incremental model selection procedure which ensures that addition of new experts to the ensemble does not decrease the likelihood of the data. These learning rules lead to a generic boosting algorithm - POE- Boost which turns out to be similar to the AdaBoost algorithm under certain assumptions on the expert probabilities. The paper then extends the POEBoost algorithm to POEBoost.CS which handles hypothesis that produce probabilistic predictions. This new algorithm is shown to have better generalization performance compared to other state of the art algorithms."
to:NB  boosting  ensemble_methods  machine_learning  re:democratic_cognition 
12 weeks ago by cshalizi
[0805.3949] A Few Results about the Geometry of Model Averages
"Given a collection of computational models that all estimate values of the same natural process, we compare the performance of the average of the collection to the individual member whose estimates are nearest a given set of observations. Performance is the ability of a model, or average, to reproduce a sequence of observations of the process. We identify a condition that determines if a single model performs better than the average. That result also yields a necessary condition for when the average performs better than any individual model. We also give sharp bounds for the performance of the average on a given interval. Since the observation interval is fixed, performance is evaluated in a vector space, and we can add intuition to our results by explaining them geometrically. We conclude with some comments on directions statistical tests of performance might take."
to:NB  machine_learning  ensemble_methods 
12 weeks ago by cshalizi
[0810.5288] Aggregation of penalized empirical risk minimizers in regression
"We give a general result concerning the rates of convergence of penalized empirical risk minimizers (PERM) in the regression model. Then, we consider the problem of agnostic learning of the regression, and give in this context an oracle inequality and a lower bound for PERM over a finite class. These results hold for a general multivariate random design, the only assumption being the compactness of the support of its law (allowing discrete distributions for instance). Then, using these results, we construct adaptive estimators. We consider as examples adaptive estimation over anisotropic Besov spaces or reproductive kernel Hilbert spaces. Finally, we provide an empirical evidence that aggregation leads to more stable estimators than more standard cross-validation or generalized cross-validation methods for the selection of the smoothing parameter, when the number of observation is small."
to:NB  statistics  ensemble_methods  model_selection 
12 weeks ago by cshalizi
The Asymmetric Business Cycle
"The business cycle is a fundamental yet elusive concept in macroeconomics. In this paper, we consider the problem of measuring the business cycle. First, we argue for the output-gap view that the business cycle corresponds to transitory deviations in economic activity away from a permanent, or trend, level. Then we investigate the extent to which a general model-based approach to estimating trend and cycle for the U.S. economy leads to measures of the business cycle that reflect models versus the data. We find empirical support for a nonlinear time series model that produces a business cycle measure with an asymmetric shape across NBER expansion and recession phases. Specifically, this business cycle measure suggests that recessions are periods of relatively large and negative transitory fluctuations in output. However, several close competitors to the nonlinear model produce business cycle measures of widely differing shapes and magnitudes. Given this model-based uncertainty, we construct a model-averaged measure of the business cycle. This measure also displays an asymmetric shape and is closely related to other measures of economic slack such as the unemployment rate and capacity utilization."
--- Worthy, but at the same time makes me want to lock them in a room with a copy of Li and Racine's _Nonparametric Econometrics_, or even _The Elements of Statistical Learning_, and not let them out until they understand it.
in_NB  time_series  statistics  economics  macroeconomics  inference_to_latent_objects  re:your_favorite_dsge_sucks  morley.james  have_read  ensemble_methods  model_selection 
february 2012 by cshalizi
[1112.6390] Early Warning with Calibrated and Sharper Probabilistic Forecasts
"Given a nonlinear model, a probabilistic forecast may be obtained by Monte Carlo simulations. At a given forecast horizon, Monte Carlo simulations yield sets of discrete forecasts, which can be converted to density forecasts. The resulting density forecasts will inevitably be downgraded by model mis-specification. In order to enhance the quality of the density forecasts, one can mix them with the unconditional density. This paper examines the value of combining conditional density forecasts with the unconditional density. The findings have positive implications for issuing early warnings in different disciplines including economics and meteorology, but UK inflation forecasts are considered as an example." --- Better than conformal predictors?
to:NB  prediction  statistics  ensemble_methods  density_estimation 
december 2011 by cshalizi
[1111.6174] Resolving conflicts between statistical methods by probability combination: Application to empirical Bayes analyses of genomic data
"In the typical analysis of a data set, a single method is selected for statistical reporting even when equally applicable methods yield very different results. Examples of equally applicable methods can correspond to those of different ancillary statistics in frequentist inference and of different prior distributions in Bayesian inference. More broadly, choices are made between parametric and nonparametric methods and between frequentist and Bayesian methods.
Rather than choosing a single method, it can be safer, in a game-theoretic sense, to combine those that are equally appropriate in light of the available information. Since methods of combining subjectively assessed probability distributions are not objective enough for that purpose, this paper introduces a method of distribution combination that does not require any assignment of distribution weights. It does so by formalizing a hedging strategy in terms of a game between three players: nature, a statistician combining distributions, and a statistician refusing to combine distributions. The optimal move of the first statistician reduces to the solution of a simpler problem of selecting an estimating distribution that minimizes the Kullback-Leibler loss maximized over the plausible distributions to be combined. The resulting combined distribution is a linear combination of the most extreme of the distributions to be combined that are scientifically plausible. The optimal weights are close enough to each other that no extreme distribution dominates the others.
The new methodology is illustrated by combining conflicting empirical Bayes methodologies in the context of gene expression data analysis."
in_NB  ensemble_methods  statistics  prediction  bickel.david 
december 2011 by cshalizi
[1111.5312] Representations and Ensemble Methods for Dynamic Relational Classification
"Temporal networks are ubiquitous and evolve over time by the addition, deletion, and changing of links, nodes, and attributes. Although many relational datasets contain temporal information, the majority of existing techniques in relational learning focus on static snapshots and ignore the temporal dynamics. We propose a framework for discovering temporal representations of relational data to increase the accuracy of statistical relational learning algorithms. The temporal relational representations serve as a basis for classification, ensembles, and pattern mining in evolving domains. The framework includes (1) selecting the time-varying relational components (links, attributes, nodes), (2) selecting the temporal granularity, (3) predicting the temporal influence of each time-varying relational component, and (4) choosing the weighted relational classifier. Additionally, we propose temporal ensemble methods that exploit the temporal-dimension of relational data. These ensembles outperform traditional and more sophisticated relational ensembles while avoiding the issue of learning the most optimal representation. Finally, the space of temporal-relational models are evaluated using a sample of classifiers. In all cases, the proposed temporal-relational classifiers outperform competing models that ignore the temporal information. The results demonstrate the capability and necessity of the temporal-relational representations for classification, ensembles, and for mining temporal datasets."
in_NB  to_read  relational_learning  network_data_analysis  transaction_networks  neville.jennifer  machine_learning  ensemble_methods  time_series  classifiers 
november 2011 by cshalizi
Boosting - The MIT Press
"Boosting is an approach to machine learning based on the idea of creating a highly accurate predictor by combining many weak and inaccurate “rules of thumb.” A remarkably rich theory has evolved around boosting, with connections to a range of topics, including statistics, game theory, convex optimization, and information geometry. Boosting algorithms have also enjoyed practical success in such fields as biology, vision, and speech processing. At various times in its history, boosting has been perceived as mysterious, controversial, even paradoxical.

This book, written by the inventors of the method, brings together, organizes, simplifies, and substantially extends two decades of research on boosting, presenting both theory and applications in a way that is accessible to readers from diverse backgrounds while also providing an authoritative reference for advanced researchers. With its introductory treatment of all material and its inclusion of exercises in every chapter, the book is appropriate for course use as well.

The book begins with a general introduction to machine learning algorithms and their analysis; then explores the core theory of boosting, especially its ability to generalize; examines some of the myriad other theoretical viewpoints that help to explain and understand boosting; provides practical extensions of boosting for more complex learning problems; and finally presents a number of advanced theoretical topics. Numerous applications and practical illustrations are offered throughout."
in_NB  books:noted  coveted  machine_learning  ensemble_methods  re:democratic_cognition  collective_cognition  classifiers  regression 
november 2011 by cshalizi
[1111.2664] A Collaborative Mechanism for Crowdsourcing Prediction Problems
"Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of "crowdsourcing" prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively "learn" a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set."
to:NB  ensemble_methods  collective_cognition  machine_learning  mechanism_design 
november 2011 by cshalizi
CAKE: Convex Adaptive Kernel Density Estimation
"In this paper we present a generalization of kernel density estimation called Convex Adaptive Kernel Density Estimation (CAKE) that replaces single bandwidth se- lection by a convex aggregation of kernels at all scales, where the convex aggregation is allowed to vary from one training point to another, treating the fundamental problem of heterogeneous smoothness in a novel way. Learning the CAKE estimator given a training set reduces to solving a single con- vex quadratic programming problem. We derive rates of convergence of CAKE like estimator to the true underlying density under smoothness assumptions on the class and show that given a sufficiently large sample the mean squared error of such estimators is optimal in a minimax sense. We also give a risk bound of the CAKE estimator in terms of its empirical risk. We empirically compare CAKE to other density estimators proposed in the statistics literature for handling heterogeneous smoothness on different synthetic and natural distributions. "
to:NB  have_read  density_estimation  ensemble_methods  kernel_estimators  statistics 
november 2011 by cshalizi
[1011.3396] PAC-Bayesian aggregation and multi-armed bandits
"This habilitation thesis presents several contributions to (1) the PAC-Bayesian analysis of statistical learning, (2) the three aggregation problems: given d functions, how to predict as well as (i) the best of these d functions (model selection type aggregation), (ii) the best convex combination of these d functions, (iii) the best linear combination of these d functions, (3) the multi-armed bandit problems."
statistics  learning_theory  pac-bayesian  model_selection  ensemble_methods  to:NB 
november 2010 by cshalizi
IEEE Xplore - On the generalization ability of on-line learning algorithms
"how to extract a hypothesis with small risk from the ensemble of hypotheses generated by an arbitrary on-line learning algorithm run on [IID data]. ... a simple large deviation argument [proves] tight data-dependent bounds for the risk of this hypothesis in terms of an easily computable statistic Mn associated with the on-line performance of the ensemble. Via sharp pointwise bounds on Mn, we then obtain risk tail bounds for kernel perceptron algorithms in terms of the spectrum of the empirical kernel matrix. ... A distinctive feature of our approach is that the key tools for our analysis come from the model of prediction of individual sequences; i.e., a model making no probabilistic assumptions on the source generating the data. In fact, these tools turn out to be so powerful that we only need very elementary statistical facts to obtain our final risk bounds." Bounced off this 2004; try again.
have_read  learning_theory  large_deviations  online_learning  individual_sequence_prediction  via:djm1107  re:your_favorite_dsge_sucks  re:XV_for_mixing  ensemble_methods  low-regret_learning 
july 2010 by cshalizi
[0704.2500] A universal procedure for aggregating estimators
"Assume that we have a family of estimators $\mathcal{F}$ built on the basis of available observations. The goal is to construct a new estimator whose risk is as close as possible to that of the best estimator in the family. We propose a general aggregation scheme that is universal in the following sense: it applies for families of arbitrary estimators and a wide variety of models and global risk measures. The procedure is based on comparison of empirical estimates of certain linear functionals with estimates induced by the family $\mathcal{F}$. We derive oracle inequalities and show that they are unimprovable in some sense. Numerical results demonstrate good practical behavior of the procedure."
statistics  ensemble_methods  model_averaging  to_read 
december 2009 by cshalizi
Aggregation via Empirical Risk Minimization
"Given a finite set F of estimators, the problem of aggregation is to construct a new estimator whose risk is as close as possible to the risk of the best estimator in F. It was conjectured that empirical minimization performed in the convex hull of F is an optimal aggregation method, but we show that this conjecture is false. Despite that, we prove that empirical minimization in the convex hull of a well chosen, empirically determined subset of F is an optimal aggregation method."
ensemble_methods  learning_theory 
august 2009 by cshalizi
[0906.3590] Forest Garrote
We have got to do something about the nams of techniques in this area. I don't mind the whimsy, it's just that combinations like this don't work, metaphorically.
ensemble_methods  classifiers  statistics  machine_learning  sparsity  variable_selection  lasso 
june 2009 by cshalizi
Splines for Financial Volatility
"We propose a flexible generalized auto-regressive conditional heteroscedasticity type of model for the prediction of volatility in financial time series. The approach relies on the idea of using multivariate B-splines of lagged observations and volatilities. Estimation of such a B-spline basis expansion is constructed within the likelihood framework for non-Gaussian observations. As the dimension of the B-spline basis is large, i.e. many parameters, we use regularized and sparse model fitting with a boosting algorithm"
splines  statistics  finance  stochastic_volatility  in_NB  ensemble_methods  boosting  buhlmann.peter  have_read  time_series 
june 2009 by cshalizi
Consistency of Random Forests and Other Averaging Classifiers (Biau, Devroye and Lugosi)
"In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent."

--- The actual nuts and bolts here are too complex for teaching in 350, but I should look carefully at what version of random forests I teach them.
ensemble_methods  random_forests  classifiers  machine_learning  learning_theory  statistics  breiman.leo  biau.gerard  devroye.luc  lugosi.gabor  to_read  to_teach:data-mining 
may 2009 by cshalizi
Stacked generalization
I read this a long time ago, and then forgot about it (except for vague comments to students).
ensemble_methods  machine_learning  wolpert.david  via:arthegall  to_teach:data-mining 
february 2009 by cshalizi
natural language processing blog: Boosting, neural networks and deep learning
That's an ... interesting thought. (Not flagged for 350 because if I tried to teach it to undergrads their heads would explode, and not in a good way.)
boosting  neural_networks  ensemble_methods  machine_learning 
february 2009 by cshalizi
Introduction to Machine Learning - Ethem Alpaydin [@Labyrinth]
Useful introductory survey/reference; mathematically undemanding. Nice that it has chapters on hidden Markov models and various model-combination methods.
books:recommended  machine_learning  classifiers  regression  clustering  markov_models  ensemble_methods 
august 2008 by cshalizi
[0804.0650] Storms prediction : Logistic regression vs random forest for unbalanced data
"satellite measurements of cloud systems which are to be classified either in convective or non convective systems. Convective cloud systems correspond to lightning and detecting such systems is of main importance for thunderstorm monitoring and warning"
have_read  machine_learning  weather_prediction  decision_trees  to_teach:data-mining  ensemble_methods 
april 2008 by cshalizi

related tags

biau.gerard  bickel.david  books:noted  books:recommended  boosting  bootstrap  breiman.leo  buhlmann.peter  calibration  CART  caruana.rich  chow-liu_algorithm  classifiers  clustering  collective_cognition  coveted  cross-validation  data_mining  decision_trees  density_estimation  devroye.luc  dimension_reduction  econometrics  economics  ensemble_methods  entropy_estimation  estimation  finance  fink.daniel  freund.yoav  graphical_models  hansen.bruce  have_read  individual_sequence_prediction  inference_to_latent_objects  in_NB  jordan.michael_i.  kernel_estimators  kith_and_kin  lafferty.john  large_deviations  lasso  learning_theory  liu.han  low-regret_learning  lugosi.gabor  machine_learning  macroeconomics  map-reduce  markov_models  mechanism_design  mixture_models  model_averaging  model_selection  morley.james  network_data_analysis  neural_networks  neville.jennifer  non-stationarity  online_learning  pac-bayesian  parallel_computing  prediction  racine.jeffrey  random_forests  re:AoS_project  re:democratic_cognition  re:growing_ensemble_project  re:XV_for_mixing  re:XV_for_networks  re:your_favorite_dsge_sucks  regression  relational_learning  riedewald.mirek  rigollet.philippe  sarkar.purnamrita  self-promotion  sorokina.daria  sparsity  splines  statistical_interaction  statistics  stochastic_volatility  time_series  to:NB  to_read  to_teach:complexity-and-inference  to_teach:data-mining  transaction_networks  variable_selection  via:arthegall  via:djm1107  via:shreejoy  wasserman.larry  weather_prediction  wolpert.david  zhang.tong 

Copy this bookmark:



description:


tags: