cshalizi + to_teach:undergrad-ada   151

[1205.3208] A New Family of Generalized 3D Cat Maps
"Since the 1990s chaotic cat maps are widely used in data encryption, for their very complicated dynamics within a simple model and desired characteristics related to requirements of cryptography. The number of cat map parameters and the map period length after discretization are two major concerns in many applications for security reasons. In this paper, we propose a new family of 36 distinctive 3D cat maps with different spatial configurations taking existing 3D cat maps [1]-[4] as special cases. Our analysis and comparisons show that this new 3D cat maps family has more independent map parameters and much longer averaged period lengths than existing 3D cat maps. The presented cat map family can be extended to higher dimensional cases."

(to_teach tags for classes which use the cat map as an example)
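
(A minimal sketch for class use, not taken from the paper. It uses the classical 2D Arnold cat map, (x, y) -> (x + y, x + 2y) mod N, rather than any of the paper's 3D maps, and the function name is my own. It computes the period after discretization, one of the two concerns the abstract names.)

```python
def cat_map_period(n):
    """Period of the discretized 2D cat map on an n x n grid.

    Returns the smallest k >= 1 with M^k = I (mod n), where
    M = [[1, 1], [1, 2]] is the classical Arnold cat map matrix.
    (Assumption: the 2D textbook map, not the paper's 3D family.)
    """
    if n < 2:
        raise ValueError("grid size must be at least 2")
    # Current power of M, stored as (a, b, c, d) = [[a, b], [c, d]] mod n.
    a, b, c, d = 1 % n, 1 % n, 1 % n, 2 % n
    k = 1
    # det(M) = 1, so M is invertible mod n and its powers must return to I.
    while (a, b, c, d) != (1, 0, 0, 1):
        # Right-multiply the current power by M.
        a, b, c, d = (a + b) % n, (a + 2 * b) % n, (c + d) % n, (c + 2 * d) % n
        k += 1
    return k
```

Tabulating cat_map_period(n) over a range of n shows how irregularly the period depends on the grid size, which is the classroom point.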
to:NB  cat_map  dynamical_systems  cryptography  to_teach:complexity-and-inference  to_teach:statcomp  to_teach:undergrad-ADA 
12 days ago by cshalizi
Likelihood inference for discriminating between long-memory and change-point models - Yau - 2012 - Journal of Time Series Analysis - Wiley Online Library
"We develop a likelihood ratio (LR) test procedure for discriminating between a short-memory time series with a change-point (CP) and a long-memory (LM) time series. Under the null hypothesis, the time series consists of two segments of short-memory time series with different means and possibly different covariance functions. The location of the shift in the mean is unknown. Under the alternative, the time series has no shift in mean but rather is LM. The LR statistic is defined as the normalized log-ratio of the Whittle likelihood between the CP model and the LM model, which is asymptotically normally distributed under the null. The LR test provides a parametric alternative to the CUSUM test proposed by Berkes et al. (2006). Moreover, the LR test is more general than the CUSUM test in the sense that it is applicable to changes in other marginal or dependence features other than a change-in-mean. We show its good performance in simulations and apply it to two data examples."
to:NB  time_series  change-point_problem  long-range_dependence  statistics  to_teach:undergrad-ADA  hypothesis_testing 
13 days ago by cshalizi
[1203.3504] On Measurement Bias in Causal Inference
"This paper addresses the problem of measurement errors in causal inference and highlights several algebraic and graphical methods for eliminating systematic bias induced by such errors. In particular, the paper discusses the control of partially observable confounders in parametric and nonparametric models and the computational problem of obtaining bias-free effect estimates in such models."
to:NB  causal_inference  inference_to_latent_objects  pearl.judea  to_teach:undergrad-ADA  statistics  error_in_variables  via:arthegall 
18 days ago by cshalizi
Clarke , Clarke : Prediction in several conventional contexts
"We review predictive techniques from several traditional branches of statistics. Starting with prediction based on the normal model and on the empirical distribution function, we proceed to techniques for various forms of regression and classification. Then, we turn to time series, longitudinal data, and survival analysis. Our focus throughout is on the mechanics of prediction more than on the properties of predictors."

(to_teach tags are tentative.)
to:NB  prediction  statistics  classifiers  regression  to_teach:undergrad-ADA  to_teach:data-mining 
20 days ago by cshalizi
Testing parametric conditional distributions using the nonparametric smoothing method
"This paper proposes a new goodness-of-fit test for parametric conditional probability distributions using the nonparametric smoothing methodology. An asymptotic normal distribution is established for the test statistic under the null hypothesis of correct specification of the parametric distribution. The test is shown to have power against local alternatives converging to the null at certain rates. The test can be applied to testing for possible misspecifications in a wide variety of parametric models. A bootstrap procedure is provided for obtaining more accurate critical values for the test. Monte Carlo simulations show that the test has good power against some common alternatives."
to:NB  misspecification  density_estimation  smoothing  statistics  to_teach:undergrad-ADA 
22 days ago by cshalizi
Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
"We present methods able to predict the presence and strength of conditional and unconditional dependencies (correlations) between two variables Y and Z never jointly measured on the same samples, based on multiple data sets measuring a set of common variables. The algorithms are specializations of prior work on learning causal structures from overlapping variable sets. This problem has also been addressed in the field of statistical matching. The proposed methods are applied to a wide range of domains and are shown to accurately predict the presence of thousands of dependencies. Compared against prototypical statistical matching algorithms and within the scope of our experiments, the proposed algorithms make predictions that are better correlated with the sample estimates of the unknown parameters on test data ; this is particularly the case when the number of commonly measured variables is low.
"The enabling idea behind the methods is to induce one or all causal models that are simultaneously consistent with (fit) all available data sets and prior knowledge and reason with them. This allows constraints stemming from causal assumptions (e.g., Causal Markov Condition, Faithfulness) to propagate. Several methods have been developed based on this idea, for which we propose the unifying name Integrative Causal Analysis (INCA). A contrived example is presented demonstrating the theoretical potential to develop more general methods for co-analyzing heterogeneous data sets. The computational experiments with the novel methods provide evidence that causally-inspired assumptions such as Faithfulness often hold to a good degree of approximation in many real systems and could be exploited for statistical inference. Code, scripts, and data are available at www.mensxmachina.org."
to:NB  to_read  causal_inference  graphical_models  to_teach:undergrad-ADA 
25 days ago by cshalizi
"The huge Package for High-dimensional Undirected Graph Estimation in R"
"We describe an R package named huge which provides easy-to-use functions for estimating high dimensional undirected graphs from data. This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010). Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortran, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high dimensional semiparametric Gaussian copula models; (3) more functions like data-dependent model selection, data generation and graph visualization; (4) a minor convergence problem of the graphical lasso algorithm is corrected; (5) the package allows the user to apply both lossless and lossy screening rules to scale up large-scale problems, making a tradeoff between computational and statistical efficiency."
to:NB  to_teach:undergrad-ADA  graphical_models  statistics  kith_and_kin  wasserman.larry  roeder.kathryn  liu.han 
25 days ago by cshalizi
README: installing Rgraphviz
Install graphviz, then Rgraphviz, then (?) re-start R. Or at least that worked with the student in office hours. (I swear it's painless on a Mac.)
to_teach:undergrad-ADA 
27 days ago by cshalizi
Assessing gross domestic product and inflation probability forecasts derived from Bank of England fan charts - Galbraith - 2011 - Journal of the Royal Statistical Society: Series A (Statistics in Society) - Wiley Online Library
"Density forecasts, including the pioneering Bank of England ‘fan charts’, are often used to produce forecast probabilities of a particular event. We use the Bank of England's forecast densities to calculate the forecast probability that annual rates of inflation and output growth exceed given thresholds. We subject these implicit probability forecasts to graphical and numerical diagnostic checks. We measure both their calibration and their resolution, providing both statistical and graphical interpretations of the results. The results reinforce earlier evidence on limitations of these forecasts and provide new evidence on their information content and on the relative performance of inflation and gross domestic product growth forecasts. In particular, gross domestic product forecasts show little or no ability to predict periods of low growth beyond the current quarter, in part because of the important role of data revisions."
to:NB  prediction  statistics  calibration  macroeconomics  to_teach:undergrad-ADA 
6 weeks ago by cshalizi
[math/0603130] Nonparametric methods for inference in the presence of instrumental variables
"We suggest two nonparametric approaches, based on kernel methods and orthogonal series to estimating regression functions in the presence of instrumental variables. For the first time in this class of problems, we derive optimal convergence rates, and show that they are attained by particular estimators. In the presence of instrumental variables the relation that identifies the regression function also defines an ill-posed inverse problem, the ``difficulty'' of which depends on eigenvalues of a certain integral operator which is determined by the joint density of endogenous and instrumental variables. We delineate the role played by problem difficulty in determining both the optimal convergence rate and the appropriate choice of smoothing parameter."
to:NB  to_read  regression  statistics  instrumental_variables  nonparametrics  to_teach:undergrad-ADA 
6 weeks ago by cshalizi
Colombo , Maathuis , Kalisch , Richardson : Learning high-dimensional directed acyclic graphs with latent and selection variables
"We consider the problem of learning causal information between random variables in directed acyclic graphs (DAGs) when allowing arbitrarily many latent and selection variables. The FCI (Fast Causal Inference) algorithm has been explicitly designed to infer conditional independence and causal information in such settings. However, FCI is computationally infeasible for large graphs. We therefore propose the new RFCI algorithm, which is much faster than FCI. In some situations the output of RFCI is slightly less informative, in particular with respect to conditional independence information. However, we prove that any causal information in the output of RFCI is correct in the asymptotic limit. We also define a class of graphs on which the outputs of FCI and RFCI are identical. We prove consistency of FCI and RFCI in sparse high-dimensional settings, and demonstrate in simulations that the estimation performances of the algorithms are very similar. All software is implemented in the R-package pcalg."

--- Too complicated to actually teach, but should be mentioned in the lecture notes on causal discovery, along with FCI.
in_NB  have_read  statistics  graphical_models  causal_inference  sparsity  to_teach:undergrad-ADA 
7 weeks ago by cshalizi
The benchden Package: Benchmark Densities for Nonparametric Density Estimation
"This article describes the benchden package which implements a set of 28 example densities for nonparametric density estimation in R. In addition to the usual functions that evaluate the density, distribution and quantile functions or generate random variates, a function designed to be specifically useful for larger simulation studies has been added. After describing the set of densities and the usage of the package, a small toy example of a simulation study conducted using the benchden package is given."
to:NB  computational_statistics  R  density_estimation  nonparametrics  to_teach:undergrad-ADA 
7 weeks ago by cshalizi
[no title]
"Conditional independence relations involving latent variables do not necessarily imply observable independences. They may imply inequality constraints on observable parameters and causal bounds, which can be used for falsification and identification. The literature on computing such constraints often involve a deterministic underlying data generating process in a counterfactual framework. If an analyst is ignorant of the nature of the underlying mechanisms then they may wish to use a model which allows the underlying mechanisms to be probabilistic. A method of computation for a weaker model without any determinism is given here and demonstrated for the instrumental variable model, though applicable to other models. The approach is based on the analysis of mappings with convex polytopes in a decision theoretic framework and can be implemented in readily available polyhedral computation software. Well known constraints and bounds are replicated in a probabilistic model and novel ones are computed for instrumental variable models without non-deterministic versions of the randomization, exclusion restriction and monotonicity assumptions respectively."

(From a quick scan, this looks too heavy to actually teach in ADAfaEPoV, but it's so tagged to remind me to include a reference.)
to:NB  causal_inference  partial_identification  statistics  instrumental_variables  to_teach:undergrad-ADA 
7 weeks ago by cshalizi
[0803.0402] A note on sensitivity of principal component subspaces and the efficient detection of influential observations in high dimensions
"In this paper we introduce an influence measure based on second order expansion of the RV and GCD measures for the comparison between unperturbed and perturbed eigenvectors of a symmetric matrix estimator. Example estimators are considered to highlight how this measure complements recent influence analysis. Importantly, we also show how a sample based version of this measure can be used to accurately and efficiently detect influential observations in practice."
to:NB  principal_components  statistics  to_teach:undergrad-ADA 
8 weeks ago by cshalizi
Taylor & Francis Online :: Graphical Diagnostics for Markov Models for Categorical Data - Journal of Computational and Graphical Statistics - Volume 20, Issue 2
"Markov models are widely used as a method for describing categorical data that exhibit stationary and nonstationary autocorrelation. However, diagnostic methods are a largely overlooked topic for Markov models. We introduce two types of residuals for this purpose: one for assessing the length of runs between state changes, and the other for assessing the frequency with which the process moves from any given state to the other states. Methods for calculating the sampling distribution of both types of residuals are presented, enabling objective interpretation through graphical summaries. The graphical summaries are formed using a modification of the probability integral transformation that is applicable for discrete data. Residuals from simulated datasets are presented to demonstrate when the model is, and is not, adequate for the data. The two types of residuals are used to highlight inadequacies of a model posed for real data on seabed fauna from the marine environment."
to:NB  visual_display_of_quantitative_information  statistics  markov_models  to_teach:undergrad-ADA 
8 weeks ago by cshalizi
Stock Market Behavior Predicted by Rat Neurons
"We here report for the first time, to the best of our knowledge, rat motor cortex neurons predicting the behavior of the American stock market. We implanted the motor cortex of the brains of rats with silicon electrodes. Using the correlation technique, we monitored the activity of neurons in our rats while simultaneously tracking the activity of stocks in the U.S. stock market."
have_read  to:NB  neuroscience  finance  statistics  prediction  multiple_testing  bad_data_analysis  funny:geeky  funny:malicious  via:mejn  to:blog  to_teach:undergrad-ADA 
8 weeks ago by cshalizi
Greetings, Philosophers - Kieran Healy
But what _kind_ of bootstrap? It's clustered data (raters x schools), which raises interesting technical issues!
philosophy  academia  data_analysis  healy.kieran  bootstrap  to_teach:undergrad-ADA 
10 weeks ago by cshalizi
[0803.2963] Consistency of cross validation for comparing regression procedures
"Theoretical developments on cross validation (CV) have mainly focused on selecting one among a list of finite-dimensional models (e.g., subset or order selection in linear regression) or selecting a smoothing parameter (e.g., bandwidth for kernel smoothing). However, little is known about consistency of cross validation when applied to compare between parametric and nonparametric methods or within nonparametric methods. We show that under some conditions, with an appropriate choice of data splitting ratio, cross validation is consistent in the sense of selecting the better procedure with probability approaching 1. Our results reveal interesting behavior of cross validation. When comparing two models (procedures) converging at the same nonparametric rate, in contrast to the parametric case, it turns out that the proportion of data used for evaluation in CV does not need to be dominating in size. Furthermore, it can even be of a smaller order than the proportion for estimation while not affecting the consistency property."
to:NB  statistics  to_read  cross-validation  model_selection  nonparametrics  to_teach:undergrad-ADA  re:stacs 
11 weeks ago by cshalizi
[0803.2984] Conditional density estimation in a regression setting
"Regression problems are traditionally analyzed via univariate characteristics like the regression function, scale function and marginal density of regression errors. These characteristics are useful and informative whenever the association between the predictor and the response is relatively simple. More detailed information about the association can be provided by the conditional density of the response given the predictor. For the first time in the literature, this article develops the theory of minimax estimation of the conditional density for regression settings with fixed and random designs of predictors, bounded and unbounded responses and a vast set of anisotropic classes of conditional densities. The study of fixed design regression is of special interest and novelty because the known literature is devoted to the case of random predictors. For the aforementioned models, the paper suggests a universal adaptive estimator which (i) matches performance of an oracle that knows both an underlying model and an estimated conditional density; (ii) is sharp minimax over a vast class of anisotropic conditional densities; (iii) is at least rate minimax when the response is independent of the predictor and thus a bivariate conditional density becomes a univariate density; (iv) is adaptive to an underlying design (fixed or random) of predictors."
in_NB  statistics  nonparametrics  regression  density_estimation  minimax  to_read  to_teach:undergrad-ADA 
11 weeks ago by cshalizi
Rainfall and Conflict - Heather Sarsons
"Starting with Miguel, Satyanath, and Sergenti (2004), a large literature has used rainfall variation as an instrument to study the impacts of income shocks on civil war and conflict. These studies argue that in agriculturally-dependent regions, negative rain shocks lower income levels, which in turn incites violence. This identification strategy relies on the assumption that rainfall shocks affect conflict only through their impacts on income. I evaluate this exclusion restriction by identifying districts that are downstream from dams in India. In downstream districts, income is much less sensitive to rainfall fluctuations. However, rain shocks remain equally strong predictors of riot incidence in these districts. These results suggest that rainfall affects rioting through a channel other than income and cast doubt on the conclusion that income shocks incite riots."

Cute.
to:NB  have_read  instrumental_variables  causal_inference  statistics  to_teach:undergrad-ADA  sociology  to:blog 
11 weeks ago by cshalizi
Analyzing Released NYC Value-Added Data Part 3 | Gary Rubinstein's Blog
This actually looks more like a job for nonparametric regression, or even relative distribution comparisons, but still...
bad_data_analysis  education  evisceration  to_teach:undergrad-ADA  via:mathbabe 
11 weeks ago by cshalizi
Analyzing Released NYC Value-Added Data Part 2 | Gary Rubinstein's Blog
It's the comparison of the same teacher in the same year on the same subject but in different grades which clinches the model being an EPIC FAIL.
bad_data_analysis  education  evisceration  to_teach:undergrad-ADA  via:mathbabe 
11 weeks ago by cshalizi
Analyzing Released NYC Value-Added Data Part 1 | Gary Rubinstein's Blog
To be clear, the bad data analysis is on the part of whatever hacks came up with the value-added model being used here. These results are insane.
bad_data_analysis  evisceration  education  via:mathbabe  to_teach:undergrad-ADA 
11 weeks ago by cshalizi
[0805.2490] Using statistical smoothing to date medieval manuscripts
"We discuss the use of multivariate kernel smoothing methods to date manuscripts dating from the 11th to the 15th centuries, in the English county of Essex. The dataset consists of some 3300 dated and 5000 undated manuscripts, and the former are used as a training sample for imputing dates for the latter. It is assumed that two manuscripts that are ``close'', in a sense that may be defined by a vector of measures of distance for documents, will have close dates. Using this approach, statistical ideas are used to assess ``similarity'', by smoothing among distance measures, and thus to estimate dates for the 5000 undated manuscripts by reference to the dated ones."

Can we get data?
to:NB  statistics  smoothing  kernel_estimators  medieval_european_history  text_mining  to_teach:undergrad-ADA 
12 weeks ago by cshalizi
[1202.3775] Kernel-based Conditional Independence Test and Application in Causal Discovery
"Conditional independence testing is an important problem, especially in Bayesian network learning and causal discovery. Due to the curse of dimensionality, testing for conditional independence of continuous variables is particularly challenging. We propose a Kernel-based Conditional Independence test (KCI-test), by constructing an appropriate test statistic and deriving its asymptotic distribution under the null hypothesis of conditional independence. The proposed method is computationally efficient and easy to implement. Experimental results show that it outperforms other methods, especially when the conditioning set is large or the sample size is not very large, in which case other methods encounter difficulties."
statistics  kernel_estimators  independence_testing  hypothesis_testing  causal_inference  in_NB  have_read  to:blog  to_teach:undergrad-ADA 
12 weeks ago by cshalizi
[0808.1010] Confidence bands in nonparametric time series regression
"We consider nonparametric estimation of mean regression and conditional variance (or volatility) functions in nonlinear stochastic regression models. Simultaneous confidence bands are constructed and the coverage probabilities are shown to be asymptotically correct. The imposed dependence structure allows applications in many linear and nonlinear auto-regressive processes. The results are applied to the S&P 500 Index data."
to:NB  statistics  regression  time_series  confidence_sets  to_teach:undergrad-ADA 
12 weeks ago by cshalizi
[0805.3032] Testing earthquake predictions
"Statistical tests of earthquake predictions require a null hypothesis to model occasional chance successes. To define and quantify `chance success' is knotty. Some null hypotheses ascribe chance to the Earth: Seismicity is modeled as random. The null distribution of the number of successful predictions -- or any other test statistic -- is taken to be its distribution when the fixed set of predictions is applied to random seismicity. Such tests tacitly assume that the predictions do not depend on the observed seismicity. Conditioning on the predictions in this way sets a low hurdle for statistical significance. Consider this scheme: When an earthquake of magnitude 5.5 or greater occurs anywhere in the world, predict that an earthquake at least as large will occur within 21 days and within an epicentral distance of 50 km. We apply this rule to the Harvard centroid-moment-tensor (CMT) catalog for 2000--2004 to generate a set of predictions. The null hypothesis is that earthquake times are exchangeable conditional on their magnitudes and locations and on the predictions--a common ``nonparametric'' assumption in the literature. We generate random seismicity by permuting the times of events in the CMT catalog. We consider an event successfully predicted only if (i) it is predicted and (ii) there is no larger event within 50 km in the previous 21 days. The $P$-value for the observed success rate is $<0.001$: The method successfully predicts about 5% of earthquakes, far better than `chance,' because the predictor exploits the clustering of earthquakes -- occasional foreshocks -- which the null hypothesis lacks. Rather than condition on the predictions and use a stochastic model for seismicity, it is preferable to treat the observed seismicity as fixed, and to compare the success rate of the predictions to the success rate of simple-minded predictions like those just described. If the proffered predictions do no better than a simple scheme, they have little value."
have_read  to:NB  statistics  geology  prediction  earthquakes  to_teach:undergrad-ADA  to_teach:data-mining 
12 weeks ago by cshalizi
[0801.0327] Nonparametric sequential prediction of time series
"Time series prediction covers a vast field of every-day statistical applications in medical, environmental and economic domains. In this paper we develop nonparametric prediction strategies based on the combination of a set of 'experts' and show the universal consistency of these strategies under a minimum of conditions. We perform an in-depth analysis of real-world data sets and show that these nonparametric strategies are more flexible, faster and generally outperform ARMA methods in terms of normalized cumulative prediction error."
in_NB  time_series  nonparametrics  prediction  statistics  to_teach:undergrad-ADA  re:growing_ensemble_project 
february 2012 by cshalizi
Bootstrapping clustered data - Field - 2007 - Journal of the Royal Statistical Society: Series B (Statistical Methodology) - Wiley Online Library
"Various bootstraps have been proposed for bootstrapping clustered data from one-way arrays. The simulation results in the literature suggest that some of these methods work quite well in practice; the theoretical results are limited and more mixed in their conclusions. For example, McCullagh reached negative conclusions about the use of non-parametric bootstraps for one-way arrays. The purpose of this paper is to extend our understanding of the issues by discussing the effect of different ways of modelling clustered data, the criteria for successful bootstraps used in the literature and extending the theory from functions of the sample mean to include functions of the between and within sums of squares and non-parametric bootstraps to include model-based bootstraps. We determine that the consistency of variance estimates for a bootstrap method depends on the choice of model with the residual bootstrap giving consistency under the transformation model whereas the cluster bootstrap gives consistent estimates under both the transformation and the random-effect model. In addition we note that the criteria based on the distribution of the bootstrap observations are not really useful in assessing consistency."
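
(A minimal sketch of the cluster bootstrap the abstract refers to, as I understand it: resample whole clusters with replacement and keep each cluster's observations together. The function name, arguments, and the choice to pool before computing the statistic are my own assumptions, not the paper's notation.)

```python
import random


def cluster_bootstrap(clusters, statistic, n_boot=1000, seed=0):
    """Bootstrap replicates of `statistic` under cluster resampling.

    clusters  : list of lists, one inner list of observations per cluster
    statistic : function taking a flat list of observations
    Whole clusters are drawn with replacement, so the dependence
    within each cluster is preserved in every replicate.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        # Draw as many clusters as we started with, with replacement.
        resampled = [rng.choice(clusters) for _ in clusters]
        # Pool the observations of the drawn clusters.
        pooled = [x for cluster in resampled for x in cluster]
        replicates.append(statistic(pooled))
    return replicates
```

The standard deviation of the replicates then serves as a standard error. Resampling individual observations instead would ignore the within-cluster dependence, which is the contrast worth showing in class.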
in_NB  have_read  statistics  bootstrap  to_teach:undergrad-ADA  hierarchical_models 
february 2012 by cshalizi
An Alternative Asymptotic Analysis of Residual-Based Statistics
"This paper presents an alternative method to derive the limiting distribution of residual-based statistics. Our method does not impose an explicit assumption of (asymptotic) smoothness of the statistic of interest with respect to the model's parameters and thus is especially useful in cases where such smoothness is difficult to establish. Instead, we use a locally uniform convergence in distribution condition, which is automatically satisfied by residual-based specification test statistics. To illustrate, we derive the limiting distribution of a new functional form specification test for discrete choice models, as well as a runs-based tests for conditional symmetry in dynamic volatility models." (To-teach tag is tentative.)
in_NB  statistics  regression  model-checking  to_teach:undergrad-ADA 
february 2012 by cshalizi
Plausibly Exogenous
"Instrumental variable (IV) methods are widely used to identify causal effects in models with endogenous explanatory variables. Often the instrument exclusion restriction that underlies the validity of the usual IV inference is suspect; that is, instruments are only plausibly exogenous. We present practical methods for performing inference while relaxing the exclusion restriction. We illustrate the approaches with empirical examples that examine the effect of 401(k) participation on asset accumulation, price elasticity of demand for margarine, and returns to schooling. We find that inference is informative even with a substantial relaxation of the exclusion restriction in two of the three cases."
to:NB  to_read  causal_inference  regression  statistics  economics  social_science_methodology  instrumental_variables  to_teach:undergrad-ADA  hansen.christian 
february 2012 by cshalizi
Empirical Legal Studies: How the "Cravath System" Created the Bi-Modal Distribution
See if the analysis holds up after tracking down paper and if data is available; if so may make it an assignment (or even an exam?) for uADA.
law  inequality  economics  track_down_references  to_teach:undergrad-ADA  via:unfogged 
february 2012 by cshalizi
On a New Method of Graduation
Whittaker introduces spline smoothing in 1922, complete with the Bayesian derivation. Does not use the word "spline", however --- when did that come in?
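
(A sketch of the discrete form of the idea, under stated assumptions: choose graduated values z to minimize sum (y_i - z_i)^2 + lam * sum (d-th difference of z)^2. The default d = 3 is my recollection of Whittaker's choice and should be checked against the paper; lam and all function names are hypothetical.)

```python
from math import comb


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def whittaker_smooth(y, lam, d=3):
    """Minimize sum (y - z)^2 + lam * sum (d-th differences of z)^2.

    The minimizer solves (I + lam * D'D) z = y, where D is the
    d-th order difference matrix.
    """
    n = len(y)
    # Coefficients of the d-th forward difference, e.g. [-1, 3, -3, 1] for d = 3.
    coeffs = [(-1) ** (d - k) * comb(d, k) for k in range(d + 1)]
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n - d):  # one row of D per available difference
        for p in range(d + 1):
            for q in range(d + 1):
                A[i + p][i + q] += lam * coeffs[p] * coeffs[q]
    return solve(A, [float(v) for v in y])
```

With d = 3 any exactly quadratic sequence has zero penalty and passes through unchanged, while larger lam pulls noisy data toward a locally quadratic curve.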
in_NB  to_teach:undergrad-ADA  splines  smoothing  regression  statistics  have_read 
january 2012 by cshalizi
[1201.0224] Estimation of Treatment Effects with High-Dimensional Controls
"We propose methods for inference on the average effect of a treatment on a scalar outcome in the presence of very many controls. Our setting is a partially linear regression model containing the treatment/policy variable and a large number $p$ of controls or series terms, with $p$ that is possibly much larger than the sample size $n$, but where only $s < n$ unknown controls or series terms are needed to approximate the regression function accurately. The latter sparsity condition makes it possible to estimate the entire regression function as well as the average treatment effect by selecting an approximately the right set of controls using Lasso and related methods. We develop estimation and inference methods for the average treatment effect in this setting, proposing a novel "post double selection" method that provides attractive inferential and estimation properties. In our analysis, in order to cover realistic applications, we expressly allow for imperfect selection of the controls and account for the impact of selection errors on estimation and inference. In order to cover typical applications in economics, we employ the selection methods designed to deal with non-Gaussian and heteroscedastic disturbances. We illustrate the use of new methods with numerical simulations and an application to the effect of abortion on crime rates."
to:NB  to_teach:undergrad-ADA  regression  causal_inference  lasso  sparsity  econometrics  instrumental_variables  hansen.christian 
january 2012 by cshalizi
[1201.0220] Inference for High-Dimensional Sparse Econometric Models
"This article is about estimation and inference methods for high dimensional sparse (HDS) regression models in econometrics. High dimensional sparse models arise in situations where many regressors (or series terms) are available and the regression function is well-approximated by a parsimonious, yet unknown set of regressors. The latter condition makes it possible to estimate the entire regression function effectively by searching for approximately the right set of regressors. We discuss methods for identifying this set of regressors and estimating their coefficients based on $\ell_1$-penalization and describe key theoretical results. In order to capture realistic practical situations, we expressly allow for imperfect selection of regressors and study the impact of this imperfect selection on estimation and inference results. We focus the main part of the article on the use of HDS models and methods in the instrumental variables model and the partially linear model. We present a set of novel inference results for these models and illustrate their use with applications to returns to schooling and growth regression."
to:NB  regression  sparsity  instrumental_variables  econometrics  to_teach:undergrad-ADA  lasso  hansen.christian 
january 2012 by cshalizi
A Method of Handling Curvilinear Correlation for Any Number of Variables (Ezekiel, 1924)
Additive regression models as a general statistical method, complete with a successive-approximation algorithm that's really damn close to modern back-fitting, and a plea for economists to use it. In 1924!
in_NB  to_teach:undergrad-ADA  regression  additive_models  statistics  have_read 
january 2012 by cshalizi
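A rough sketch of the back-fitting idea mentioned in the Ezekiel note above: cycle over the predictors, smoothing the partial residuals against each one in turn. This is a modern textbook-style version, not Ezekiel's 1924 procedure; the Gaussian-kernel smoother, the bandwidth, the fixed iteration count, and the function names (including the toy-data helper) are all assumptions made for illustration.

```python
import math
import random


def smooth(xs, values, bandwidth):
    """Gaussian-kernel (Nadaraya-Watson) smoother, evaluated at the data points."""
    out = []
    for x0 in xs:
        w = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
        out.append(sum(wi * v for wi, v in zip(w, values)) / sum(w))
    return out


def backfit(columns, y, bandwidth=0.2, n_iter=20):
    """Fit y ~ alpha + sum_j f_j(x_j) by repeatedly smoothing partial residuals."""
    n = len(y)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in columns]
    for _ in range(n_iter):
        for j, xj in enumerate(columns):
            # Residuals after removing the intercept and every other component.
            partial = [
                y[i] - alpha - sum(f[k][i] for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            fj = smooth(xj, partial, bandwidth)
            mean_fj = sum(fj) / n
            # Center each component so the intercept stays identifiable.
            f[j] = [v - mean_fj for v in fj]
    return alpha, f


def toy_additive_data(n, seed):
    """Hypothetical test data: y = x1^2 + sin(3 * x2) + small Gaussian noise."""
    rng = random.Random(seed)
    x1 = [rng.uniform(-1, 1) for _ in range(n)]
    x2 = [rng.uniform(-1, 1) for _ in range(n)]
    y = [a * a + math.sin(3 * b) + rng.gauss(0, 0.05) for a, b in zip(x1, x2)]
    return x1, x2, y
```

Calling `backfit([x1, x2], y)` returns the intercept and one centered partial-response function per predictor, each evaluated at the observed data points.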
"Sinners in the hands of an angry God": Jonathan Edwards, 1741
"The God that holds you over the pit of hell, much as one holds a spider, or some loathsome insect over the fire, abhors you, and is dreadfully provoked: his wrath towards you burns like fire; he looks upon you as worthy of nothing else, but to be cast into the fire; he is of purer eyes than to bear to have you in his sight; you are ten thousand times more abominable in his eyes, than the most hateful venomous serpent is in ours. You have offended him infinitely more than ever a stubborn rebel did his prince; and yet it is nothing but his hand that holds you from falling into the fire every moment. It is to be ascribed to nothing else, that you did not go to hell the last night; that you was suffered to awake again in this world, after you closed your eyes to sleep. And there is no other reason to be given, why you have not dropped into hell since you arose in the morning, but that God's hand has held you up. There is no other reason to be given why you have not gone to hell, since you have sat here in the house of God, provoking his pure eyes by your sinful wicked manner of attending his solemn worship. Yea, there is nothing else that is to be given as a reason why you do not this very moment drop down into hell."
christianity  edwards.jonathan  something_about_america  preaching_to_the_choir  to_teach:undergrad-ADA 
january 2012 by cshalizi
Nonlinear Models of Measurement Errors
"Measurement errors in economic data are pervasive and nontrivial in size. The presence of measurement errors causes biased and inconsistent parameter estimates and leads to erroneous conclusions to various degrees in economic analysis. While linear errors-in-variables models are usually handled with well-known instrumental variable methods, this article provides an overview of recent research papers that derive estimation methods that provide consistent estimates for nonlinear models with measurement errors. We review models with both classical and nonclassical measurement errors, and with misclassification of discrete variables. For each of the methods surveyed, we describe the key ideas for identification and estimation, and discuss its application whenever it is currently available." (Not read, reconsider to_teach tag later.)
to:NB  statistics  latent_variables  inference_to_latent_objects  instrumental_variables  econometrics  to_teach:undergrad-ADA 
december 2011 by cshalizi
Instruments, Randomization, and Learning about Development (Deaton, 2010)
"There is currently much debate about the effectiveness of foreign aid and about what kind of projects can engender economic development. There is skepticism about the ability of econometric analysis to resolve these issues or of development agencies to learn from their own experience. In response, there is increasing use in development economics of randomized controlled trials (RCTs) to accumulate credible knowledge of what works, without overreliance on questionable theory or statistical methods. When RCTs are not possible, the proponents of these methods advocate quasi-randomization through instrumental variable (IV) techniques or natural experiments. I argue that many of these applications are unlikely to recover quantities that are useful for policy or understanding: two key issues are the misunderstanding of exogeneity and the handling of heterogeneity. I illustrate from the literature on aid and growth. Actual randomization faces similar problems as does quasi-randomization, notwithstanding rhetoric to the contrary. I argue that experiments have no special ability to produce more credible knowledge than other methods, and that actual experiments are frequently subject to practical problems that undermine any claims to statistical or epistemic superiority. I illustrate using prominent experiments in development and elsewhere. As with IV methods, RCT-based evaluation of projects, without guidance from an understanding of underlying mechanisms, is unlikely to lead to scientific progress in the understanding of economic development. I welcome recent trends in development experimentation away from the evaluation of projects and toward the evaluation of theoretical mechanisms."
causal_inference  experimental_economics  experimental_sociology  economics  development_economics  social_science_methodology  explanation_by_mechanisms  to_teach:undergrad-ADA  instrumental_variables  have_read  evisceration  in_NB  randomization  to:blog 
december 2011 by cshalizi
Improving Causal Inference: Strengths and Limitations of Natural Experiments (Dunning, 2008)
"Social scientists increasingly exploit natural experiments in their research. This article surveys recent applications in political science, with the goal of illustrating the inferential advantages provided by this research design. When treatment assignment is less than “as if” random, studies may be something less than natural experiments, and familiar threats to valid causal inference in observational settings can arise. The author proposes a continuum of plausibility for natural experiments, defined by the extent to which treatment assignment is plausibly “as if” random, and locates several leading studies along this continuum."
in_NB  causal_inference  social_science_methodology  to_teach:undergrad-ADA  instrumental_variables 
december 2011 by cshalizi
[1111.6201] Learning a Factor Model via Regularized PCA
"We consider the problem of learning a linear factor model with an unknown number of factors. We propose a regularized form of principal component analysis (PCA) and demonstrate through experiments with synthetic and real data the superiority of resulting estimates to those produced by pre-existing factor analysis approaches. We also establish theoretical results that elucidate the manner in which our algorithm corrects biases induced by conventional PCA. An important feature of our algorithm is its computational efficiency, which is close to that of PCA, which enjoys wide use in large part due to its efficiency."
to:NB  factor_analysis  principal_components  statistics  have_read  to_teach:undergrad-ADA  van_roy.benjamin 
december 2011 by cshalizi
Prediction-based regularization using data augmented regression - Statistics and Computing, Volume 22, Number 1
"The role of regularization is to control fitted model complexity and variance by penalizing (or constraining) models to be in an area of model space that is deemed reasonable, thus facilitating good predictive performance. This is typically achieved by penalizing a parametric or non-parametric representation of the model. In this paper we advocate instead the use of prior knowledge or expectations about the predictions of models for regularization. This has the twofold advantage of allowing a more intuitive interpretation of penalties and priors and explicitly controlling model extrapolation into relevant regions of the feature space. This second point is especially critical in high-dimensional modeling situations, where the curse of dimensionality implies that new prediction points usually require extrapolation. We demonstrate that prediction-based regularization can, in many cases, be stochastically implemented by simply augmenting the dataset with Monte Carlo pseudo-data. We investigate the range of applicability of this implementation. An asymptotic analysis of the performance of Data Augmented Regression (DAR) in parametric and non-parametric linear regression, and in nearest neighbor regression, clarifies the regularizing behavior of DAR. We apply DAR to simulated and real data, and show that it is able to control the variance of extrapolation, while maintaining, and often improving, predictive accuracy."
in_NB  to_read  statistics  prediction  estimation  hooker.giles  regression  to_teach:undergrad-ADA  to_teach:data-mining  curse_of_dimensionality 
december 2011 by cshalizi
Nonparametric estimation of the link function including variable selection - Gerhard Tutz and Sebastian Petry - Statistics and Computing, Volume 22, Number 2
"Nonparametric methods for the estimation of the link function in generalized linear models are able to avoid bias in the regression parameters. But for the estimation of the link typically the full model, which includes all predictors, has been used. When the number of predictors is large these methods fail since the full model cannot be estimated. In the present article a boosting type method is proposed that simultaneously selects predictors and estimates the link function. The method performs quite well in simulations and real data examples." (The "to teach" tag is conjectural.)
in_NB  regression  variable_selection  statistics  nonparametrics  to_read  to_teach:undergrad-ADA 
december 2011 by cshalizi
Lai, Gross, Shen: Evaluating probability forecasts
"Probability forecasts of events are routinely used in climate predictions, in forecasting default probabilities on bank loans or in estimating the probability of a patient’s positive response to treatment. Scoring rules have long been used to assess the efficacy of the forecast probabilities after observing the occurrence, or nonoccurrence, of the predicted events. We develop herein a statistical theory for scoring rules and propose an alternative approach to the evaluation of probability forecasts. This approach uses loss functions relating the predicted to the actual probabilities of the events and applies martingale theory to exploit the temporal structure between the forecast and the subsequent occurrence or nonoccurrence of the event."
in_NB  statistics  prediction  calibration  to_read  to_teach:undergrad-ADA 
november 2011 by cshalizi
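For orientation on the Lai, Gross and Shen entry above, here is a minimal sketch of two conventional scoring rules, the Brier (quadratic) score and the logarithmic score, of the kind the abstract says have long been used. It does not implement the paper's alternative, which compares predicted to actual event probabilities; the function names and the clipping constant `eps` are assumptions.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_score(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood of the 0/1 outcomes under the forecasts.

    Forecasts are clipped to [eps, 1 - eps] (an assumption of this sketch)
    so that a forecast of exactly 0 or 1 cannot produce an infinite score.
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)
```

Both scores are "lower is better", and both are proper: a forecaster minimizes the expected score by reporting the true event probability.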
The World Top Incomes Database - G-MonD, PSE-Paris School of Economics
Possible computational project: code up estimating a Pareto tail for income (all sources) from these statistics, and tracking evolution over time (and perhaps across countries).

Or, an ADA project, suggested by conversation with John B.: look for correlation between (lack of) progressive taxation and job creation, as predicted by the usual right-wing suspects.
inequality  economics  data_sets  to_teach:undergrad-ADA  to_teach:statcomp 
october 2011 by cshalizi
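A possible starting point for the Pareto-tail project in the World Top Incomes entry above. The sketch gives two estimators of the tail exponent: the conditional maximum-likelihood estimate from individual incomes above a threshold, and the estimate implied by a tabulated threshold together with the mean income above it (for a Pareto tail, mean above t equals alpha * t / (alpha - 1)). Whether the database supplies either kind of input in this form is an assumption, and all function names are hypothetical.

```python
import math
import random


def pareto_tail_mle(incomes, threshold):
    """Maximum-likelihood Pareto exponent from the observations above threshold."""
    tail = [x for x in incomes if x > threshold]
    if not tail:
        raise ValueError("no observations above the threshold")
    return len(tail) / sum(math.log(x / threshold) for x in tail)


def alpha_from_tabulation(threshold, mean_above):
    """Pareto exponent implied by a threshold and the mean income above it.

    Uses E[X | X > t] = alpha * t / (alpha - 1), which requires alpha > 1.
    """
    if mean_above <= threshold:
        raise ValueError("mean above the threshold must exceed the threshold")
    return mean_above / (mean_above - threshold)


def simulate_pareto(n, alpha, x_min, rng):
    """Draw n Pareto(alpha, x_min) variates by inverting the CDF (for checking)."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]
```

Tracking the evolution over time would then amount to applying one of these estimators year by year (and country by country) and plotting the resulting exponents.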
Population Value Decomposition, a Framework for the Analysis of Image Populations - Journal of the American Statistical Association - 106(495):775
"Images, often stored in multidimensional arrays, are fast becoming ubiquitous in medical and public health research. Analyzing populations of images is a statistical problem that raises a host of daunting challenges. The most significant challenge is the massive size of the datasets incorporating images recorded for hundreds or thousands of subjects at multiple visits. We introduce the population value decomposition (PVD), a general method for simultaneous dimensionality reduction of large populations of massive images. We show how PVD can be seamlessly incorporated into statistical modeling, leading to a new, transparent, and rapid inferential framework. Our PVD methodology was motivated by and applied to the Sleep Heart Health Study, the largest community-based cohort study of sleep containing more than 85 billion observations on thousands of subjects at two visits. This article has supplementary material online." --- Presumably just some form of SVD for higher-dimensional arrays.
to:NB  principal_components  data_analysis  to_read  to_teach:data-mining  to_teach:undergrad-ADA 
october 2011 by cshalizi
Density Estimation in Several Populations With Uncertain Population Membership
"We devise methods to estimate probability density functions of several populations using observations with uncertain population membership, meaning from which population an observation comes is unknown. The probability of an observation being sampled from any given population can be calculated. We develop general estimation procedures and bandwidth selection methods for our setting. We establish large-sample properties and study finite-sample performance using simulation studies. We illustrate our methods with data from a nutrition study."
in_NB  density_estimation  mixture_models  to_teach:undergrad-ADA  to_teach:data-mining 
october 2011 by cshalizi
Robustification of the PC Algorithm for Directed Acyclic Graphs
"The PC-algorithm was shown to be a powerful method for estimating the equivalence class of a potentially very high-dimensional acyclic directed graph (DAG) with the corresponding Gaussian distribution. Here we propose a computationally efficient robustification of the PC-algorithm and prove its consistency. Furthermore, we compare the robustified and standard version of the PC-algorithm on simulated data using the new corresponding R package pcalg."
statistics  causal_inference  graphical_models  buhlmann.peter  in_NB  to_read  to_teach:data-mining  to_teach:undergrad-ADA  kalisch.markus 
october 2011 by cshalizi
Draw - Google Correlate
So cool: draw a curve free-hand, get the keywords whose time series correlate best with it.  I can't go below a correlation of 0.70.
google  information_retrieval  spurious_correlations  to_teach:undergrad-ADA  to_teach:data-mining  to:blog  via:vqv  rademacher_complexity 
october 2011 by cshalizi
Reality Checks and Comparisons of Nested Predictive Models - Journal of Business and Economic Statistics - 0(0):1
"This article develops a simple bootstrap method for simulating asymptotic critical values for tests of equal forecast accuracy and encompassing among many nested models. Our method combines elements of fixed regressor and wild bootstraps. We first derive the asymptotic distributions of tests of equal forecast accuracy and encompassing applied to forecasts from multiple models that nest the benchmark model—that is, reality check tests. We then prove the validity of the bootstrap for these tests. Monte Carlo experiments indicate that our proposed bootstrap has better finite-sample size and power than other methods designed for comparison of nonnested models."
statistics  model_checking  model_selection  time_series  bootstrap  to_read  to_teach:undergrad-ADA  encompassing 
september 2011 by cshalizi
[1104.5617] Learning high-dimensional directed acyclic graphs with latent and selection variables
"We consider the problem of learning causal information between random variables in directed acyclic graphs (DAGs) when allowing arbitrarily many latent and selection variables. The FCI algorithm (Spirtes et al., 1999) has been explicitly designed to infer conditional independence and causal information in such settings. However, FCI is computationally infeasible for large graphs. We therefore propose a new algorithm, the RFCI algorithm, which is much faster than FCI. In some situations the output of RFCI is slightly less informative, in particular with respect to conditional independence information. However, we prove that any causal information in the output of RFCI is correct. We also define a class of graphs on which the outputs of FCI and RFCI are identical. We prove consistency of FCI and RFCI in sparse high-dimensional settings, and demonstrate in simulations that the estimation performances of the algorithms are very similar. All software is implemented in the R-package pcalg."
have_read  to_teach:undergrad-ADA  graphical_models  causal_inference  in_NB  kalisch.markus  richardson.thomas_s. 
september 2011 by cshalizi
http://marketing.wharton.upenn.edu/documents/research/Adoption_Velocity.pdf
Superficial comment, from glancing through the paper: Why oh why would you look at a cloud of data like the scatter plot in Figure 3, and say "This looks like a job for ordinary least squares"?  Use a kernel smoother and bootstrap to get confidence bands.
names  diffusion_of_innovations  to_read  sociology  via:gelman  to_teach:undergrad-ADA 
august 2011 by cshalizi
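The suggestion in the note above (kernel smoother plus bootstrap) could look something like the following sketch. It uses a Nadaraya-Watson smoother with a Gaussian kernel and resamples (x, y) pairs to get pointwise percentile bands. The bandwidth is fixed by hand here, where in practice it would be chosen by cross-validation, and the bands are pointwise, not simultaneous. The function names and the toy-data generator are assumptions, and nothing here uses the paper's actual data.

```python
import math
import random


def kernel_smooth(xs, ys, grid, bandwidth):
    """Nadaraya-Watson regression estimate at each grid point (Gaussian kernel)."""
    fitted = []
    for g in grid:
        w = [math.exp(-0.5 * ((x - g) / bandwidth) ** 2) for x in xs]
        fitted.append(sum(wi * y for wi, y in zip(w, ys)) / sum(w))
    return fitted


def bootstrap_bands(xs, ys, grid, bandwidth, n_boot=200, level=0.95, seed=0):
    """Pointwise percentile bands from resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    curves = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        curves.append(
            kernel_smooth([xs[i] for i in idx], [ys[i] for i in idx], grid, bandwidth)
        )
    tail = (1.0 - level) / 2.0
    lower, upper = [], []
    for j in range(len(grid)):
        vals = sorted(curve[j] for curve in curves)
        lower.append(vals[int(tail * (n_boot - 1))])
        upper.append(vals[int((1.0 - tail) * (n_boot - 1))])
    return lower, upper


def toy_scatter(n, seed):
    """Hypothetical noisy scatter: y = sin(2 * pi * x) + Gaussian noise, x in [0, 1]."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys
```

Resampling pairs, instead of residuals, avoids assuming that the noise has constant variance, which seems the safer default for a cloud of points like the one described.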
"Smooth Regression Analysis" (G. S. Watson, 1964) JSTOR: Sankhyā: The Indian Journal of Statistics, Series A, Vol. 26, No. 4 (Dec., 1964), pp. 359-372
The abstract is great: "Few would deny that the most powerful statistical tool is graph paper. When however there are many observations (and/or many variables) graphical procedures become tedious. It seems to the author that the most characteristic problem for statisticians at the moment is the development of methods for analyzing the data poured out by electronic observing systems. The present paper gives a simple computer method for obtaining a "graph" from a large number of observations."
smoothing  regression  kernel_estimators  data_mining  to_teach:undergrad-ADA  to_teach:data-mining  via:gmg 
june 2011 by cshalizi
Principles of Applied Statistics - Academic and Professional Books - Cambridge University Press
"Applied statistics is more than data analysis, but it is easy to lose sight of the big picture. David Cox and Christl Donnelly distil decades of scientific experience into usable principles for the successful application of statistics, showing how good statistical strategy shapes every stage of an investigation. As you advance from research or policy question, to study design, through modelling and interpretation, and finally to meaningful conclusions, this book will be a valuable guide. Over a hundred illustrations from a wide variety of real applications make the conceptual points concrete, illuminating your path and deepening your understanding. This book is essential reading for anyone who makes extensive use of statistical methods in their work."
books:recommended  statistics  data_analysis  to:NB  to_teach:undergrad-ADA  coveted  cox.david_r. 
may 2011 by cshalizi
Statistical Prediction Analysis (Aitchison and Dunsmore, 1980)
Ancient, but I should see if there are examples or simple tools worth stealing for ADA.
books:noted  statistics  prediction  to:NB  to_teach:undergrad-ADA 
may 2011 by cshalizi
Reason Foundation - No Booze? You May Lose
Exercise for the student: Devise at least two reasons why the causality might run from high income to frequent social drinking, rather than vice versa.  (This is I think too elementary to make a good problem for ADA.)
bad_data_analysis  booze  via:tony_lin  causal_inference  to_teach:undergrad-ADA 
april 2011 by cshalizi
Dani Rodrik-Research
The "Real Exchange Rate and Economic Growth" paper would make a good exam for undergraduate ADA, but I don't have the time this year to prepare it suitably.  Next year.
rodrik.dani  economics  economic_policy  economic_growth  trade  to_teach:undergrad-ADA 
april 2011 by cshalizi
Western on Strikes
Missing the union density variable.  Wrote to ask about it.  Referenced paper is http://www.jstor.org/stable/271022, which seems to me exactly the kind of thing Andy and I should mention in "Philosophy and Practice".  --- ETA: Prof. Western wrote back within hours with the union density data, but I'm not sure I can make it public...
to_teach:undergrad-ADA  strikes  data_sets 
april 2011 by cshalizi
[0812.2749] Nonparametric inference of a trend using functional data
I guess I've been more or less presuming this was true.  (And I'd have been wrong about the form of the simultaneous CI, actually.)  Worth trying to work into the final exam for The Kids?
curve_fitting  gaussian_processes  time_series  statistics  nonparametrics  have_read  confidence_sets  to_teach:undergrad-ADA 
april 2011 by cshalizi

