cshalizi + to_teach:data-mining   217

Clarke , Clarke : Prediction in several conventional contexts
"We review predictive techniques from several traditional branches of statistics. Starting with prediction based on the normal model and on the empirical distribution function, we proceed to techniques for various forms of regression and classification. Then, we turn to time series, longitudinal data, and survival analysis. Our focus throughout is on the mechanics of prediction more than on the properties of predictors."

(to_teach tags are tentative.)
to:NB  prediction  statistics  classifiers  regression  to_teach:undergrad-ADA  to_teach:data-mining 
20 days ago by cshalizi
[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and `generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hieirarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful. In particular, since BHV have shown that the space of trees is negatively curved (a CAT(0) space), a natural representation of a collection of trees is a tree. We compare this representation to the Euclidean approximations of treespace made available through Multidimensional Scaling of the matrix of distances between trees. We also provide applications of the distances between trees to hierarchical clustering trees constructed from microarrays. Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
to:NB  clustering  hierarchical_structure  holmes.susan  data_mining  statistics  to_teach:data-mining  gene_expression_data_analysis  via:ryan_t 
4 weeks ago by cshalizi
The Electronic Text Corpus of Sumerian Literature
"Sumerian is the first language for which we have written evidence and its literature the earliest known. The Electronic Text Corpus of Sumerian Literature (ETCSL), a project of the University of Oxford, comprises a selection of nearly 400 literary compositions recorded on sources which come from ancient Mesopotamia (modern Iraq) and date to the late third and early second millennia BCE.
"The corpus contains Sumerian texts in transliteration, English prose translations and bibliographical information for each composition. The transliterations and the translations can be searched, browsed and read online using the tools of the website."

(Re to_teach:data_mining tag: here are some bags of words for classification, principal components, topic models, maybe even manifold learning...)
sumer  mesopotamia  archaeology  history_of_ideas  data_sets  to_teach:data-mining  via:? 
6 weeks ago by cshalizi
Taylor & Francis Online :: Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering - Journal of Computational and Graphical Statistics - Volume 20, Issue 2
"For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality."
to:NB  visual_display_of_quantitative_information  clustering  data_mining  to_teach:data-mining 
8 weeks ago by cshalizi
[0805.3032] Testing earthquake predictions
"Statistical tests of earthquake predictions require a null hypothesis to model occasional chance successes. To define and quantify `chance success' is knotty. Some null hypotheses ascribe chance to the Earth: Seismicity is modeled as random. The null distribution of the number of successful predictions -- or any other test statistic -- is taken to be its distribution when the fixed set of predictions is applied to random seismicity. Such tests tacitly assume that the predictions do not depend on the observed seismicity. Conditioning on the predictions in this way sets a low hurdle for statistical significance. Consider this scheme: When an earthquake of magnitude 5.5 or greater occurs anywhere in the world, predict that an earthquake at least as large will occur within 21 days and within an epicentral distance of 50 km. We apply this rule to the Harvard centroid-moment-tensor (CMT) catalog for 2000--2004 to generate a set of predictions. The null hypothesis is that earthquake times are exchangeable conditional on their magnitudes and locations and on the predictions--a common ``nonparametric'' assumption in the literature. We generate random seismicity by permuting the times of events in the CMT catalog. We consider an event successfully predicted only if (i) it is predicted and (ii) there is no larger event within 50 km in the previous 21 days. The $P$-value for the observed success rate is $<0.001$: The method successfully predicts about 5% of earthquakes, far better than `chance,' because the predictor exploits the clustering of earthquakes -- occasional foreshocks -- which the null hypothesis lacks. Rather than condition on the predictions and use a stochastic model for seismicity, it is preferable to treat the observed seismicity as fixed, and to compare the success rate of the predictions to the success rate of simple-minded predictions like those just described. If the proffered predictions do no better than a simple scheme, they have little value."
have_read  to:NB  statistics  geology  prediction  earthquakes  to_teach:undergrad-ADA  to_teach:data-mining 
12 weeks ago by cshalizi
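(Teaching sketch for the earthquake entry above; this is not the paper's code. It assumes a toy catalog with event times in days and magnitudes only, and it drops the 50 km distance condition entirely. The function names, the defaults, and the seed are made up for illustration. The automatic-alarm rule is scored against the permutation null, which shuffles event times while keeping magnitudes fixed.)

```python
import random


def count_successes(times, mags, window=21.0, threshold=5.5):
    """Count events 'predicted' by the automatic-alarm rule.

    Every event with magnitude >= threshold raises an alarm predicting
    an event at least as large within `window` days. An event counts as
    a success if some earlier alarm covers it. Location is ignored here,
    which is a simplification of the scheme in the abstract.
    """
    events = sorted(zip(times, mags))
    successes = 0
    for j, (tj, mj) in enumerate(events):
        for ti, mi in events[:j]:
            if mi >= threshold and mj >= mi and 0 < tj - ti <= window:
                successes += 1
                break
    return successes


def permutation_p_value(times, mags, n_perm=999, seed=0, **rule):
    """Permute event times, keep magnitudes fixed, and re-score the rule."""
    rng = random.Random(seed)
    observed = count_successes(times, mags, **rule)
    shuffled = list(times)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if count_successes(shuffled, mags, **rule) >= observed:
            at_least += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (1 + at_least) / (1 + n_perm)
```

(On a catalog with real foreshock clustering, a rule this dumb should still look "significant" against the permutation null, which is the abstract's point: the null lacks the clustering that the rule exploits.)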
[1202.1523] Information Forests
"We describe Information Forests, an approach to classification that generalizes Random Forests by replacing the splitting criterion of non-leaf nodes from a discriminative one -- based on the entropy of the label distribution -- to a generative one -- based on maximizing the information divergence between the class-conditional distributions in the resulting partitions. The basic idea consists of deferring classification until a measure of "classification confidence" is sufficiently high, and instead breaking down the data so as to maximize this measure. In an alternative interpretation, Information Forests attempt to partition the data into subsets that are "as informative as possible" for the purpose of the task, which is to classify the data. Classification confidence, or informative content of the subsets, is quantified by the Information Divergence. Our approach relates to active learning, semi-supervised learning, mixed generative/discriminative learning."

After reading: meh.
have_read  decision_trees  information_theory  classifiers  machine_learning  to_teach:data-mining  re:AoS_project 
february 2012 by cshalizi
Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
"We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?". To answer this, we adopt a different strategy than is usual in the feature selection literature−instead of trying to define a criterion, we derive one, directly from a clearly specified objective function: the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimize a definition of feature 'relevancy' and 'redundancy', our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence to favour certain classes of criteria, in particular those that balance the relative size of the relevancy/redundancy terms. Overall we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best tradeoff in terms of accuracy, stability, and flexibility with small data samples."
in_NB  information_theory  statistics  variable_selection  model_selection  to_teach:data-mining  to:blog  machine_learning  classifiers  have_read  graphical_models 
february 2012 by cshalizi
Sun , Wang , Fang : Regularized k-means clustering of high-dimensional data and its asymptotic consistency
"K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering."
in_NB  clustering  statistics  lasso  data_mining  to_teach:data-mining 
february 2012 by cshalizi
A General Framework for Dimensionality-Reducing Data Visualization Mapping
"In recent years, a wealth of dimension-reduction techniques for data visualization and preprocessing has been established. Nonparametric methods require additional effort for out-of-sample extensions, because they provide only a mapping of a given finite set of points. In this letter, we propose a general view on nonparametric dimension reduction based on the concept of cost functions and properties of the data. Based on this general principle, we transfer nonparametric dimension reduction to explicit mappings of the data manifold such that direct out-of-sample extensions become possible. Furthermore, this concept offers the possibility of investigating the generalization ability of data visualization to new data points. We demonstrate the approach based on a simple global linear mapping, as well as prototype-based local linear mappings. In addition, we can bias the functional form according to given auxiliary information. This leads to explicit supervised visualization mappings with discriminative properties comparable to state-of-the-art approaches."
in_NB  dimension_reduction  visual_display_of_quantitative_information  data_analysis  data_mining  manifold_learning  to_teach:data-mining 
february 2012 by cshalizi
Building Consistent Regression Trees from Complex Sample Data
"In the past several years a wide range of methods for the construction of regression trees and other estimators based on the recursive partitioning of samples have appeared in the statistics literature. Many applications involve data collected through a complex sample design. At present, however, relatively little is known regarding the properties of these methods under complex designs. This article proposes a method for incorporating information about the complex sample design when building a regression tree using a recursive partitioning algorithm. Sufficient conditions are established for asymptotic design L2 consistency of these regression trees as estimators for an arbitrary regression function. The proposed method is illustrated with Occupational Employment Statistics establishment survey data linked to Quarterly Census of Employment and Wage payroll data of the Bureau of Labor Statistics. Performance of the nonparametric estimator is investigated through a simulation study based on this example."
to:NB  regression  prediction_trees  statistics  machine_learning  to_teach:data-mining  nonparametrics 
january 2012 by cshalizi
Mining of Massive Datasets - Academic and Professional Books - Cambridge University Press
"The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike."

--- What a remarkably hideous cover!
to:NB  books:noted  data_mining  to_teach:data-mining  machine_learning  computational_statistics 
january 2012 by cshalizi
Modified Locally Linear Embedding Using Multiple Weights
Stabilizing LLE by (as it were) model averaging. Needs at least a reference in the data-mining lecture on LLE.

"The locally linear embedding (LLE) is improved by introducing multiple linearly independent local weight vectors for each neighborhood. We characterize the reconstruction weights and show the existence of the linearly independent weight vectors at each neighborhood. The modified locally linear embedding (MLLE) proposed in this paper is much stable. It can retrieve the ideal embedding if MLLE is applied on data points sampled from an isometric manifold. MLLE is also compared with the local tangent space alignment (LTSA). Numerical examples are given that show the improvement and efficiency of MLLE."
to:NB  to_teach:data-mining  dimension_reduction  manifold_learning  via:gmg  spectral_clustering 
december 2011 by cshalizi
Prediction-based regularization using data augmented regression - Statistics and Computing, Volume 22, Number 1
"The role of regularization is to control fitted model complexity and variance by penalizing (or constraining) models to be in an area of model space that is deemed reasonable, thus facilitating good predictive performance. This is typically achieved by penalizing a parametric or non-parametric representation of the model. In this paper we advocate instead the use of prior knowledge or expectations about the predictions of models for regularization. This has the twofold advantage of allowing a more intuitive interpretation of penalties and priors and explicitly controlling model extrapolation into relevant regions of the feature space. This second point is especially critical in high-dimensional modeling situations, where the curse of dimensionality implies that new prediction points usually require extrapolation. We demonstrate that prediction-based regularization can, in many cases, be stochastically implemented by simply augmenting the dataset with Monte Carlo pseudo-data. We investigate the range of applicability of this implementation. An asymptotic analysis of the performance of Data Augmented Regression (DAR) in parametric and non-parametric linear regression, and in nearest neighbor regression, clarifies the regularizing behavior of DAR. We apply DAR to simulated and real data, and show that it is able to control the variance of extrapolation, while maintaining, and often improving, predictive accuracy."
in_NB  to_read  statistics  prediction  estimation  hooker.giles  regression  to_teach:undergrad-ADA  to_teach:data-mining  curse_of_dimensionality 
december 2011 by cshalizi
[1110.3917] How to Evaluate Dimensionality Reduction? - Improving the Co-ranking Matrix
"The growing number of dimensionality reduction methods available for data visualization has recently inspired the development of quality assessment measures, in order to evaluate the resulting low-dimensional representation independently from a methods' inherent criteria. Several (existing) quality measures can be (re)formulated based on the so-called co-ranking matrix, which subsumes all rank errors (i.e. differences between the ranking of distances from every point to all others, comparing the low-dimensional representation to the original data). The measures are often based on the partioning of the co-ranking matrix into 4 submatrices, divided at the K-th row and column, calculating a weighted combination of the sums of each submatrix. Hence, the evaluation process typically involves plotting a graph over several (or even all possible) settings of the parameter K. Considering simple artificial examples, we argue that this parameter controls two notions at once, that need not necessarily be combined, and that the rectangular shape of submatrices is disadvantageous for an intuitive interpretation of the parameter. We debate that quality measures, as general and flexible evaluation tools, should have parameters with a direct and intuitive interpretation as to which specific error types are tolerated or penalized. Therefore, we propose to replace K with two parameters to control these notions separately, and introduce a differently shaped weighting on the co-ranking matrix. The two new parameters can then directly be interpreted as a threshold up to which rank errors are tolerated, and a threshold up to which the rank-distances are significant for the evaluation. Moreover, we propose a color representation of local quality to visually support the evaluation process for a given mapping, where every point in the mapping is colored according to its local contribution to the overall quality." 
--- Look at this carefully, and see if it could be taught in data mining (and whether it's worth doing so.)
to:NB  dimension_reduction  statistics  data_analysis  visual_display_of_quantitative_information  to_teach:data-mining 
october 2011 by cshalizi
Population Value Decomposition, a Framework for the Analysis of Image Populations - Journal of the American Statistical Association - 106(495):775
"Images, often stored in multidimensional arrays, are fast becoming ubiquitous in medical and public health research. Analyzing populations of images is a statistical problem that raises a host of daunting challenges. The most significant challenge is the massive size of the datasets incorporating images recorded for hundreds or thousands of subjects at multiple visits. We introduce the population value decomposition (PVD), a general method for simultaneous dimensionality reduction of large populations of massive images. We show how PVD can be seamlessly incorporated into statistical modeling, leading to a new, transparent, and rapid inferential framework. Our PVD methodology was motivated by and applied to the Sleep Heart Health Study, the largest community-based cohort study of sleep containing more than 85 billion observations on thousands of subjects at two visits. This article has supplementary material online." --- Presumably just some form of SVD for higher-dimensional arrays.
to:NB  principal_components  data_analysis  to_read  to_teach:data-mining  to_teach:undergrad-ADA 
october 2011 by cshalizi
Density Estimation in Several Populations With Uncertain Population Membership
"We devise methods to estimate probability density functions of several populations using observations with uncertain population membership, meaning from which population an observation comes is unknown. The probability of an observation being sampled from any given population can be calculated. We develop general estimation procedures and bandwidth selection methods for our setting. We establish large-sample properties and study finite-sample performance using simulation studies. We illustrate our methods with data from a nutrition study."
in_NB  density_estimation  mixture_models  to_teach:undergrad-ADA  to_teach:data-mining 
october 2011 by cshalizi
Robustification of the PC Algorithm for Directed Acyclic Graphs
"The PC-algorithm was shown to be a powerful method for estimating the equivalence class of a potentially very high-dimensional acyclic directed graph (DAG) with the corresponding Gaussian distribution. Here we propose a computationally eficient robustification of the PC-algorithm and prove its consistency. Furthermore, we compare the robustified and standard version of the PC-algorithm on simulated data using the new corresponding R package pcalg."
statistics  causal_inference  graphical_models  buhlmann.peter  in_NB  to_read  to_teach:data-mining  to_teach:undergrad-ADA  kalisch.markus 
october 2011 by cshalizi
The Fans Are All Right (Pinboard Blog)
"I learned a lot about fandom couple of years ago in conversations with my friend Britta, who was working at the time as community manager for Delicious. She taught me that fans were among the heaviest users of the bookmarking site, and had constructed an edifice of incredibly elaborate tagging conventions, plugins, and scripts to organize their output along a bewildering number of dimensions. If you wanted to read a 3000 word fic where Picard forces Gandalf into sexual bondage, and it seems unconsensual but secretly both want it, and it's R-explicit but not NC-17 explicit, all you had to do was search along the appropriate combination of tags (and if you couldn't find it, someone would probably write it for you). By 2008 a whole suite of theoretical ideas about folksonomy, crowdsourcing, faceted infomation retrieval, collaborative editing and emergent ontology had been implemented by a bunch of friendly people so that they could read about Kirk drilling Spock." --- See also the very last link.
fandom  social_life_of_the_mind  social_media  information_retrieval  tagging  pinboard  delicious.com  via:arsyed  to_teach:data-mining  ok_maybe_not_really_to_teach 
october 2011 by cshalizi
Draw - Google Correlate
So cool: draw a curve free-hand, get the keywords whose time series correlate best with it.  I can't go below a correlation of 0.70.
google  information_retrieval  spurious_correlations  to_teach:undergrad-ADA  to_teach:data-mining  to:blog  via:vqv  rademacher_complexity 
october 2011 by cshalizi
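(Sketch for the Google Correlate entry above, for class: the "draw a curve, get keywords" idea amounts to ranking candidate series by their correlation with the drawn one. This is a guess at the principle, not Google's implementation. The function names and the `series_by_keyword` dictionary are hypothetical, and the correlation used is plain Pearson.)

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # treat a constant series as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def best_matches(drawn, series_by_keyword, top=3):
    """Return the `top` (keyword, correlation) pairs, best first."""
    scored = [(pearson(drawn, s), k) for k, s in series_by_keyword.items()]
    scored.sort(reverse=True)
    return [(k, r) for r, k in scored[:top]]
```

(With enough candidate series, something will always correlate well with whatever you draw, hence the spurious_correlations tag.)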
The Meta-Activism Project | A Non-Traditional Digital Activism Think Tank
Flagged "to_teach:data-mining" if I can think of a good project for students with this.
networked_life  politics  data_sets  to_teach:data-mining 
september 2011 by cshalizi
"Smooth Regression Analysis" (G. S. Watson, 1964) JSTOR: Sankhyā: The Indian Journal of Statistics, Series A, Vol. 26, No. 4 (Dec., 1964), pp. 359-372
The abstract is great: "Few would deny that the most powerful statistical tool is graph paper. When however there are many observations (and/or many variables) graphical procedures become tedious. It seems to the author that the most characteristic problem for statisticians at the moment is the development of methods for analyzing the data poured out by electronic observing systems. The present paper gives a simple computer method for obtaining a "graph" from a large number of observations."
smoothing  regression  kernel_estimators  data_mining  to_teach:undergrad-ADA  to_teach:data-mining  via:gmg 
june 2011 by cshalizi
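(Sketch for the Watson entry above: the estimator now usually called Nadaraya-Watson kernel regression, a locally weighted average of the responses. The Gaussian kernel, the default bandwidth, and the function names are my choices for illustration, not taken from the paper.)

```python
import math


def nadaraya_watson(x0, xs, ys, bandwidth=1.0):
    """Kernel-weighted average of ys, with weights centered at x0.

    Uses a Gaussian kernel. If x0 is so far from every x that all the
    weights underflow to zero, the division below will fail.
    """
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


def smooth_curve(grid, xs, ys, bandwidth=1.0):
    """Evaluate the smoother on a grid of points, i.e. draw the 'graph'."""
    return [nadaraya_watson(g, xs, ys, bandwidth) for g in grid]
```

(The bandwidth does all the work: too small and the curve chases noise, too large and it flattens toward the overall mean.)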
Accuracy and reliability of forensic latent fingerprint decisions
"first large-scale study of the accuracy and reliability of latent print examiners’ decisions ... 169 latent print examiners each compared approximately 100 pairs of latent and exemplar fingerprints from a pool of 744 pairs. ... range of attributes and quality encountered in forensic casework ... . Five examiners made false positive errors for an overall false positive rate of 0.1%. Eighty-five percent of examiners made at least one false negative error for an overall false negative rate of 7.5%. Independent examination of the same comparisons by different participants (analogous to blind verification) [detected] all false positive errors and [most[ false negative errors... Examiners frequently differed on whether fingerprints were suitable for reaching a conclusion."
forensics  fingerprints  pattern_recognition  to_teach:data-mining  to:NB 
may 2011 by cshalizi
Efficient probabilistic forecasts for counts - McCabe et al., 2011 - JRSS-B
"Efficient probabilistic forecasts of integer-valued random variables are derived. The optimality is achieved by estimating the forecast distribution non-parametrically over a given broad model class and proving asymptotic (non-parametric) efficiency in that setting. The method is developed within the context of the integer auto-regressive class of models, which is a suitable class for any count data that can be interpreted as a queue, stock, birth-and-death process or branching process. The theoretical proofs of asymptotic efficiency are supplemented by simulation results that demonstrate the overall superiority of the non-parametric estimator relative to a misspecified parametric alternative, in large but finite samples. The method is applied to counts of stock market iceberg orders. A subsampling method is used to assess sampling variation in the full estimated forecast distribution and a proof of its validity is given."  (Dunno about the to_teach tags, I haven't read this yet.)
statistics  prediction  density_estimation  time_series  stochastic_processes  branching_processes  to_teach:data-mining  to_teach:undergrad-ADA 
march 2011 by cshalizi
A stable estimator of the information matrix under EM for dependent data
"This article develops a new and stable estimator for information matrix when the EM algorithm is used in maximum likelihood estimation. This estimator is constructed using the smoothed individual complete-data scores that are readily available from running the EM algorithm. The method works for dependent data sets and when the expectation step is an irregular function of the conditioning parameters."  (When I teach EM, I should say something about how to get uncertainty estimates...)
fisher_information  em_algorithm  estimation  statistics  to_teach:data-mining  to_teach:undergrad-ADA 
december 2010 by cshalizi
Rule generation for categorical time series with Markov assumptions
"Several procedures of sequential pattern analysis are designed to detect frequently occurring patterns in a single categorical time series (episode mining). Based on these frequent patterns, rules are generated and evaluated, for example, in terms of their confidence. The confidence value is commonly interpreted as an estimate of a conditional probability, so some kind of stochastic model has to be assumed. The model is identified as a variable length Markov model. With this assumption, the usual confidences are maximum likelihood estimates of the transition probabilities of the Markov model. We discuss possibilities of how to efficiently fit an appropriate model to the data. Based on this model, rules are formulated. It is demonstrated that this new approach generates noticeably less and more reliable rules." --- I should really add some time series stuff to data mining...
data_mining  markov_models  time_series  in_NB  to_teach:data-mining  variable-length_markov_models 
december 2010 by cshalizi
Consistent selection of the number of clusters via crossvalidation — Biometrika
"In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling.Anovel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split."
clustering  stability_of_learning  data_mining  statistics  to_teach:data-mining  to_teach:undergrad-ADA 
december 2010 by cshalizi
Predicting consumer behavior with Web search — PNAS
What search can and cannot predict. They mention, but I think could have stressed even more, that the search data is generated _automatically_ as a by-product of now-ordinary social life, rather than a deliberate construction on the part of public or private data-collecting agencies, so it is very, very, very cheap.
internet  data_mining  to_teach:data-mining  kith_and_kin  watts.duncan  hofman.jake  sociology  information_retrieval  networked_life  have_read 
october 2010 by cshalizi
Bickel, Li: Local polynomial regression on unknown manifolds
"We reveal the phenomenon that “naive” multivariate local polynomial regression can adapt to local smooth lower dimensional structure in the sense that it achieves the optimal convergence rate for nonparametric estimation of regression functions belonging to a Sobolev space when the predictor variables live on or close to a lower dimensional manifold." Need to mention this when I talk about the curse of dimensionality in data mining...
via:students  regression  manifold_learning  statistics  to_teach:data-mining  curse_of_dimensionality 
october 2010 by cshalizi
A Cautionary Note on the Use of Matching to Estimate Causal Effects: An Empirical Example Comparing Matching Estimates to an Experimental Benchmark — Sociological Methods Research
"...social scientists have increasingly turned to matching [to draw] causal inferences from observational data. Matching compares those who receive a treatment to those with similar background attributes who do not receive a treatment. ... Drawing on a randomized voter mobilization experiment ... compare matching [estimates] to an experimental benchmark. ... enormous sample size .... exactly match each treated subject to 40 untreated subjects. Matching greatly exaggerates the effectiveness of pre-election phone calls encouraging voter participation. ... Matching suggests that another pre-election phone call that encouraged people to wear their seat belts also generated huge increases in voter turnout. ... caution is warranted when applying matching estimators to observational data, particularly when one is uncertain about the potential for biased inference." Ouch!
have_read  to_teach:data-mining  causal_inference  matching  experimental_political_science  evisceration  to:blog  to_teach:undergrad-ADA 
october 2010 by cshalizi
[1010.0499] Statistical analysis of $k$-nearest neighbor collaborative recommendation
"Collaborative recommendation is an information-filtering technique that attempts to present information items that are likely of interest to an Internet user. Traditionally, collaborative systems deal with situations with two types of variables, users and items. In its most common form, the problem is framed as trying to estimate ratings for items that have not yet been consumed by a user. Despite wide-ranging literature, little is known about the statistical properties of recommendation systems. In fact, no clear probabilistic model even exists which would allow us to precisely describe the mathematical forces driving collaborative filtering. ... [We] set out a general sequential stochastic model for collaborative recommendation. ... in-depth analysis of the so-called cosine-type nearest neighbor ... method ... asymptotic performance as the number of users grows. We establish consistency ... under mild assumptions... Rates of convergence and examples ..."
collaborative_filtering  information_retrieval  stochastic_models  nearest_neighbors  to_teach:data-mining 
october 2010 by cshalizi
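Teaching sketch for the entry above: a minimal cosine-similarity nearest-neighbor rating predictor. This is a generic illustration of the idea, not the estimator or the sequential stochastic model analyzed in the paper; the function names, the toy ratings, and the choice to compute similarity over co-rated items only are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two users' ratings, over items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict_rating(target, others, item, k):
    """Average the item's ratings among the k users most similar to the target."""
    candidates = [
        (cosine_similarity(target, user), user[item])
        for user in others
        if item in user
    ]
    if not candidates:
        return None  # nobody has rated this item yet
    nearest = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(rating for _, rating in nearest) / len(nearest)

# Hypothetical toy data: item -> rating on a 1-5 scale.
target = {"A": 5, "B": 1}
others = [
    {"A": 5, "B": 1, "C": 4},  # same tastes as the target
    {"A": 1, "B": 5, "C": 1},  # opposite tastes
    {"A": 4, "B": 2, "C": 5},  # similar tastes
]

predicted = predict_rating(target, others, "C", k=2)
print(predicted)
```

With k=2 the opposite-taste user is excluded, so the prediction for item "C" is driven by the two similar users alone.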
D-squared Digest -- I've seen this film before
"will assert with truculence and jutted jaw that the derivation of commercially relevant and actionable insurance information from genetics is much, much more difficult than that, and furthermore, will tentatively and politely advance the possibility that to "accurately assess the cost of medical treatment over a lifetime" might end up being at least as much of a tough-nut as the socialist calculation problem.  As the rant linked above tries to point out, the big issue here is the "Titanic problem" (from Hitchcock's aphorism about it being possible to make a suspenseful film about the Titanic given that everyone knows that it sinks - they don't know when). Knowing genetic propensities to develop various conditions is just about one tiny baby step along the way to making a cost estimate ..."

Especially: not knowing what treatments will be available, or how costly, 20--40 years after the genetic test is performed! (Another teaching example for 350, I think.)
dsquared  human_genetics  market_failures_in_everything  to_teach:data-mining 
september 2010 by cshalizi
Suykens, Alzate, Pelckmans: Primal and dual model representations in kernel-based learning
"This paper discusses the role of primal and (Lagrange) dual model representations in problems of supervised and unsupervised learning. The specification of the estimation problem is conceived at the primal level as a constrained optimization problem. The constraints relate to the model which is expressed in terms of the feature map. From the conditions for optimality one jointly finds the optimal model representation and the model estimate. At the dual level the model is expressed in terms of a positive definite kernel function, which is characteristic for a support vector machine methodology. It is discussed how least squares support vector machines are playing a central role as core models across problems of regression, classification, principal component analysis, spectral clustering, canonical correlation analysis, dimensionality reduction and data visualization."
kernel_methods  statistics  machine_learning  data_mining  to_teach:data-mining 
august 2010 by cshalizi
Unit Testing in R: The Bare Minimum
I hesitate about the teaching tag, this seems quite clunky --- but perhaps it's not that bad when you try it.
via:arsyed  programming  R  to_teach:data-mining  to_teach:statcomp 
august 2010 by cshalizi
depmixS4: An R Package for Hidden Markov Models
"depmixS4 implements a general framework for defining and estimating dependent mixture models in the R programming language. This includes standard Markov models, latent/hidden Markov models, and latent class and finite mixture distribution models. The models can be fitted on mixed multivariate data with distributions from the glm family, the (logistic) multinomial, or the multivariate normal distribution. Other distributions can be added easily, and an example is provided with the exgaus distribution. Parameters are estimated by the expectation-maximization (EM) algorithm or, when (linear) constraints are imposed on the parameters, by direct numerical optimization with the Rsolnp or Rdonlp2 routines."
statistics  computational_statistics  R  markov_models  mixture_models  to_teach:data-mining  to_teach:complexity-and-inference  to_teach:undergrad-ADA 
august 2010 by cshalizi
"A Locally Adaptive Penalty for Estimation of Functions of Varying Roughness"
"We propose a new regularization method called Loco-Spline for nonparametric function estimation. Loco-Spline uses a penalty which is data driven and locally adaptive. This allows for more flexible estimation of the function in regions of the domain where it has more curvature, without over fitting in regions that have little curvature. This methodology is also transferred into higher dimensions via the Smoothing Spline ANOVA framework. General conditions for optimal MSE rate of convergence are given and the Loco-Spline is shown to achieve this rate. In our simulation study, the Loco-Spline substantially outperforms the traditional smoothing spline and the locally adaptive kernel smoother. Code to fit Loco-Spline models is included with the Supplemental Materials for this article which are available online." Teach? But I'd need to explain more about splines.
splines  curve_fitting  smoothing  regression  statistics  to_teach:data-mining  to_read  to_teach:undergrad-ADA 
august 2010 by cshalizi
The SHOGUN Machine Learning Toolbox
C++ library with R interface, supposedly good for Really Big data. Consider for 350?
machine_learning  computational_statistics  programming  to_read  to_teach:data-mining  R  c++ 
july 2010 by cshalizi
ILI 2009 Presentation – "Self-plagiarism is style"
Cool effects achieved by applying basic data mining to libraries. To be used as teaching fodder, but honestly I should also find the time to suggest it to our librarians.
libraries  data_mining  information_retrieval  collaborative_filtering  via:magistra_et_mater  to_teach:data-mining 
june 2010 by cshalizi
Phantom of Heilbronn - Wikipedia, the free encyclopedia
In which the combined police forces of Europe spend years chasing a female serial killer known only from DNA evidence, only to find that it's all down to contaminated cotton swabs from a single supplier!

Teaching note for data mining: This should make a great example of the importance of getting the data right, before worrying about the statistical processing...
via:arsyed  serial_killers  to_teach:data-mining  bad_data  DNA_testing  forensics  wtf  inference_to_latent_objects  blogged 
may 2010 by cshalizi
A dissection of John Gottman's love lab. - By Laurie Abraham - Slate Magazine
This is confused, or at least confusingly written. Is the objection to not evaluating the classifier out of sample? Or that the success of even a very stupid rule should be high (because most couples don't get divorced within five years)? (That would be a valid point, but it's not "base-rate neglect".) Or what?
marriage  classifiers  to_teach:data-mining  data_analysis 
march 2010 by cshalizi
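Teaching sketch for the entry above, on the "very stupid rule" point: when most couples do not divorce within the window, a rule that always predicts "no divorce" already scores high accuracy, so a classifier's raw accuracy means little until it is compared against that baseline. The numbers below are made up for illustration and are not taken from the Slate article or from Gottman's studies.

```python
def accuracy(predictions, outcomes):
    """Fraction of cases where the prediction matches the outcome."""
    matches = sum(p == o for p, o in zip(predictions, outcomes))
    return matches / len(outcomes)

# Hypothetical sample: 100 couples, 15 of whom divorce (made-up numbers).
outcomes = [True] * 15 + [False] * 85

# The "very stupid rule": predict that nobody divorces.
always_no = [False] * len(outcomes)

stupid_rule_accuracy = accuracy(always_no, outcomes)
print(stupid_rule_accuracy)
```

Any claimed predictive success would need to beat this baseline, and be measured on couples not used to build the rule.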
[1003.0529] A Unified Algorithmic Framework for Multi-Dimensional Scaling
"In this paper, we propose a unified algorithmic framework for solving many known variants of MDS. Our algorithm is a simple iterative scheme with guaranteed convergence, and is modular; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods, in comparable time."
multidimensional_scaling  dimension_reduction  visual_display_of_quantitative_information  to_teach:data-mining  data_mining 
march 2010 by cshalizi

