cshalizi + to_teach:data-mining 217
Clarke, Clarke: Prediction in several conventional contexts
20 days ago by cshalizi
"We review predictive techniques from several traditional branches of statistics. Starting with prediction based on the normal model and on the empirical distribution function, we proceed to techniques for various forms of regression and classification. Then, we turn to time series, longitudinal data, and survival analysis. Our focus throughout is on the mechanics of prediction more than on the properties of predictors."
(to_teach tags are tentative.)
to:NB
prediction
statistics
classifiers
regression
to_teach:undergrad-ADA
to_teach:data-mining
20 days ago by cshalizi
Attractive Models - Kieran Healy
29 days ago by cshalizi
Have I really not bookmarked this before?
p-values
statistics
political_science
social_science_methodology
bad_data_analysis
to_teach:undergrad-ADA
to_teach:data-mining
re:neutral_model_of_inquiry
healy.kieran
29 days ago by cshalizi
[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
4 weeks ago by cshalizi
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and `generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hieirarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful. In particular, since BHV have shown that the space of trees is negatively curved (a CAT(0) space), a natural representation of a collection of trees is a tree. We compare this representation to the Euclidean approximations of treespace made available through Multidimensional Scaling of the matrix of distances between trees. We also provide applications of the distances between trees to hierarchical clustering trees constructed from microarrays. Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
to:NB
clustering
hierarchical_structure
holmes.susan
data_mining
statistics
to_teach:data-mining
gene_expression_data_analysis
via:ryan_t
4 weeks ago by cshalizi
The Electronic Text Corpus of Sumerian Literature
6 weeks ago by cshalizi
"Sumerian is the first language for which we have written evidence and its literature the earliest known. The Electronic Text Corpus of Sumerian Literature (ETCSL), a project of the University of Oxford, comprises a selection of nearly 400 literary compositions recorded on sources which come from ancient Mesopotamia (modern Iraq) and date to the late third and early second millennia BCE.
"The corpus contains Sumerian texts in transliteration, English prose translations and bibliographical information for each composition. The transliterations and the translations can be searched, browsed and read online using the tools of the website."
(Re to_teach:data_mining tag: here are some bags of words for classification, principal components, topic models, maybe even manifold learning...)
sumer
mesopotamia
archaeology
history_of_ideas
data_sets
to_teach:data-mining
via:?
"The corpus contains Sumerian texts in transliteration, English prose translations and bibliographical information for each composition. The transliterations and the translations can be searched, browsed and read online using the tools of the website."
(Re to_teach:data_mining tag: here are some bags of words for classification, principal components, topic models, maybe even manifold learning...)
6 weeks ago by cshalizi
Taylor & Francis Online :: Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering - Journal of Computational and Graphical Statistics - Volume 20, Issue 2
8 weeks ago by cshalizi
"For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality."
to:NB
visual_display_of_quantitative_information
clustering
data_mining
to_teach:data-mining
8 weeks ago by cshalizi
[0805.3032] Testing earthquake predictions
12 weeks ago by cshalizi
"Statistical tests of earthquake predictions require a null hypothesis to model occasional chance successes. To define and quantify `chance success' is knotty. Some null hypotheses ascribe chance to the Earth: Seismicity is modeled as random. The null distribution of the number of successful predictions -- or any other test statistic -- is taken to be its distribution when the fixed set of predictions is applied to random seismicity. Such tests tacitly assume that the predictions do not depend on the observed seismicity. Conditioning on the predictions in this way sets a low hurdle for statistical significance. Consider this scheme: When an earthquake of magnitude 5.5 or greater occurs anywhere in the world, predict that an earthquake at least as large will occur within 21 days and within an epicentral distance of 50 km. We apply this rule to the Harvard centroid-moment-tensor (CMT) catalog for 2000--2004 to generate a set of predictions. The null hypothesis is that earthquake times are exchangeable conditional on their magnitudes and locations and on the predictions--a common ``nonparametric'' assumption in the literature. We generate random seismicity by permuting the times of events in the CMT catalog. We consider an event successfully predicted only if (i) it is predicted and (ii) there is no larger event within 50 km in the previous 21 days. The $P$-value for the observed success rate is $<0.001$: The method successfully predicts about 5% of earthquakes, far better than `chance,' because the predictor exploits the clustering of earthquakes -- occasional foreshocks -- which the null hypothesis lacks. Rather than condition on the predictions and use a stochastic model for seismicity, it is preferable to treat the observed seismicity as fixed, and to compare the success rate of the predictions to the success rate of simple-minded predictions like those just described. If the proffered predictions do no better than a simple scheme, they have little value."
have_read
to:NB
statistics
geology
prediction
earthquakes
to_teach:undergrad-ADA
to_teach:data-mining
12 weeks ago by cshalizi
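(A rough sketch of the simple scheme and permutation test from the "Testing earthquake predictions" abstract, for possible classroom use. Everything about the data layout is an assumption made for illustration: events are hypothetical tuples (time_in_days, x_km, y_km, magnitude) with planar coordinates rather than real epicentral distances, and the function names are invented here. It is not the authors' code.)

```python
import math
import random

def successes(events, min_mag=5.5, window=21.0, radius=50.0):
    """Count events that are 'successfully predicted' by the simple rule.

    Each event is a tuple (time_days, x_km, y_km, magnitude); this layout
    is an assumption for illustration, not the CMT catalog format.
    """
    count = 0
    for t, x, y, m in events:
        predicted = False
        larger_nearby = False
        for t2, x2, y2, m2 in events:
            recent = 0 < t - t2 <= window                    # earlier, within 21 days
            near = math.hypot(x - x2, y - y2) <= radius      # within 50 km (planar)
            if recent and near:
                if m2 >= min_mag and m >= m2:
                    predicted = True       # (i) a trigger event predicted this one
                if m2 > m:
                    larger_nearby = True   # (ii) violated: a larger event came first
        if predicted and not larger_nearby:
            count += 1
    return count

def permutation_p_value(events, n_perm=999, rng=None):
    """Permute event times, keeping magnitudes and locations fixed."""
    rng = rng or random.Random()
    observed = successes(events)
    times = [e[0] for e in events]
    at_least = 0
    for _ in range(n_perm):
        shuffled = times[:]
        rng.shuffle(shuffled)
        permuted = [(t, x, y, m) for t, (_, x, y, m) in zip(shuffled, events)]
        if successes(permuted) >= observed:
            at_least += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (1 + at_least) / (1 + n_perm)
```

(The point of the abstract is that a small p-value from this kind of test says little, since even this naive rule beats "random" seismicity.)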
[1202.1523] Information Forests
february 2012 by cshalizi
"We describe Information Forests, an approach to classification that generalizes Random Forests by replacing the splitting criterion of non-leaf nodes from a discriminative one -- based on the entropy of the label distribution -- to a generative one -- based on maximizing the information divergence between the class-conditional distributions in the resulting partitions. The basic idea consists of deferring classification until a measure of "classification confidence" is sufficiently high, and instead breaking down the data so as to maximize this measure. In an alternative interpretation, Information Forests attempt to partition the data into subsets that are "as informative as possible" for the purpose of the task, which is to classify the data. Classification confidence, or informative content of the subsets, is quantified by the Information Divergence. Our approach relates to active learning, semi-supervised learning, mixed generative/discriminative learning."
After reading: meh.
have_read
decision_trees
information_theory
classifiers
machine_learning
to_teach:data-mining
re:AoS_project
february 2012 by cshalizi
Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
february 2012 by cshalizi
"We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?". To answer this, we adopt a different strategy than is usual in the feature selection literature−instead of trying to define a criterion, we derive one, directly from a clearly specified objective function: the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimize a definition of feature 'relevancy' and 'redundancy', our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence to favour certain classes of criteria, in particular those that balance the relative size of the relevancy/redundancy terms. Overall we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best tradeoff in terms of accuracy, stability, and flexibility with small data samples."
in_NB
information_theory
statistics
variable_selection
model_selection
to_teach:data-mining
to:blog
machine_learning
classifiers
have_read
graphical_models
february 2012 by cshalizi
Sun, Wang, Fang: Regularized k-means clustering of high-dimensional data and its asymptotic consistency
february 2012 by cshalizi
"K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering."
in_NB
clustering
statistics
lasso
data_mining
to_teach:data-mining
february 2012 by cshalizi
A General Framework for Dimensionality-Reducing Data Visualization Mapping
february 2012 by cshalizi
"In recent years, a wealth of dimension-reduction techniques for data visualization and preprocessing has been established. Nonparametric methods require additional effort for out-of-sample extensions, because they provide only a mapping of a given finite set of points. In this letter, we propose a general view on nonparametric dimension reduction based on the concept of cost functions and properties of the data. Based on this general principle, we transfer nonparametric dimension reduction to explicit mappings of the data manifold such that direct out-of-sample extensions become possible. Furthermore, this concept offers the possibility of investigating the generalization ability of data visualization to new data points. We demonstrate the approach based on a simple global linear mapping, as well as prototype-based local linear mappings. In addition, we can bias the functional form according to given auxiliary information. This leads to explicit supervised visualization mappings with discriminative properties comparable to state-of-the-art approaches."
in_NB
dimension_reduction
visual_display_of_quantitative_information
data_analysis
data_mining
manifold_learning
to_teach:data-mining
february 2012 by cshalizi
Building Consistent Regression Trees from Complex Sample Data
january 2012 by cshalizi
"In the past several years a wide range of methods for the construction of regression trees and other estimators based on the recursive partitioning of samples have appeared in the statistics literature. Many applications involve data collected through a complex sample design. At present, however, relatively little is known regarding the properties of these methods under complex designs. This article proposes a method for incorporating information about the complex sample design when building a regression tree using a recursive partitioning algorithm. Sufficient conditions are established for asymptotic design L2 consistency of these regression trees as estimators for an arbitrary regression function. The proposed method is illustrated with Occupational Employment Statistics establishment survey data linked to Quarterly Census of Employment and Wage payroll data of the Bureau of Labor Statistics. Performance of the nonparametric estimator is investigated through a simulation study based on this example."
to:NB
regression
prediction_trees
statistics
machine_learning
to_teach:data-mining
nonparametrics
january 2012 by cshalizi
Mining of Massive Datasets - Academic and Professional Books - Cambridge University Press
january 2012 by cshalizi
"The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike."
--- What a remarkably hideous cover!
to:NB
books:noted
data_mining
to_teach:data-mining
machine_learning
computational_statistics
january 2012 by cshalizi
Modified Locally Linear Embedding Using Multiple Weights
december 2011 by cshalizi
Stabilizing LLE by (as it were) model averaging. Needs at least a reference in the data-mining lecture on LLE.
"The locally linear embedding (LLE) is improved by introducing multiple linearly independent local weight vectors for each neighborhood. We characterize the reconstruction weights and show the existence of the linearly independent weight vectors at each neighborhood. The modified locally linear embedding (MLLE) proposed in this paper is much stable. It can retrieve the ideal embedding if MLLE is applied on data points sampled from an isometric manifold. MLLE is also compared with the local tangent space alignment (LTSA). Numerical examples are given that show the improvement and efficiency of MLLE."
to:NB
to_teach:data-mining
dimension_reduction
manifold_learning
via:gmg
spectral_clustering
"The locally linear embedding (LLE) is improved by introducing multiple linearly independent local weight vectors for each neighborhood. We characterize the reconstruction weights and show the existence of the linearly independent weight vectors at each neighborhood. The modified locally linear embedding (MLLE) proposed in this paper is much stable. It can retrieve the ideal embedding if MLLE is applied on data points sampled from an isometric manifold. MLLE is also compared with the local tangent space alignment (LTSA). Numerical examples are given that show the improvement and efficiency of MLLE."
december 2011 by cshalizi
Prediction-based regularization using data augmented regression - Statistics and Computing, Volume 22, Number 1
december 2011 by cshalizi
"The role of regularization is to control fitted model complexity and variance by penalizing (or constraining) models to be in an area of model space that is deemed reasonable, thus facilitating good predictive performance. This is typically achieved by penalizing a parametric or non-parametric representation of the model. In this paper we advocate instead the use of prior knowledge or expectations about the predictions of models for regularization. This has the twofold advantage of allowing a more intuitive interpretation of penalties and priors and explicitly controlling model extrapolation into relevant regions of the feature space. This second point is especially critical in high-dimensional modeling situations, where the curse of dimensionality implies that new prediction points usually require extrapolation. We demonstrate that prediction-based regularization can, in many cases, be stochastically implemented by simply augmenting the dataset with Monte Carlo pseudo-data. We investigate the range of applicability of this implementation. An asymptotic analysis of the performance of Data Augmented Regression (DAR) in parametric and non-parametric linear regression, and in nearest neighbor regression, clarifies the regularizing behavior of DAR. We apply DAR to simulated and real data, and show that it is able to control the variance of extrapolation, while maintaining, and often improving, predictive accuracy."
in_NB
to_read
statistics
prediction
estimation
hooker.giles
regression
to_teach:undergrad-ADA
to_teach:data-mining
curse_of_dimensionality
december 2011 by cshalizi
[1110.3917] How to Evaluate Dimensionality Reduction? - Improving the Co-ranking Matrix
october 2011 by cshalizi
"The growing number of dimensionality reduction methods available for data visualization has recently inspired the development of quality assessment measures, in order to evaluate the resulting low-dimensional representation independently from a methods' inherent criteria. Several (existing) quality measures can be (re)formulated based on the so-called co-ranking matrix, which subsumes all rank errors (i.e. differences between the ranking of distances from every point to all others, comparing the low-dimensional representation to the original data). The measures are often based on the partioning of the co-ranking matrix into 4 submatrices, divided at the K-th row and column, calculating a weighted combination of the sums of each submatrix. Hence, the evaluation process typically involves plotting a graph over several (or even all possible) settings of the parameter K. Considering simple artificial examples, we argue that this parameter controls two notions at once, that need not necessarily be combined, and that the rectangular shape of submatrices is disadvantageous for an intuitive interpretation of the parameter. We debate that quality measures, as general and flexible evaluation tools, should have parameters with a direct and intuitive interpretation as to which specific error types are tolerated or penalized. Therefore, we propose to replace K with two parameters to control these notions separately, and introduce a differently shaped weighting on the co-ranking matrix. The two new parameters can then directly be interpreted as a threshold up to which rank errors are tolerated, and a threshold up to which the rank-distances are significant for the evaluation. Moreover, we propose a color representation of local quality to visually support the evaluation process for a given mapping, where every point in the mapping is colored according to its local contribution to the overall quality." 
--- Look at this carefully, and see if it could be taught in data mining (and whether it's worth doing so.)
to:NB
dimension_reduction
statistics
data_analysis
visual_display_of_quantitative_information
to_teach:data-mining
october 2011 by cshalizi
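(A small sketch of the co-ranking matrix that the "How to Evaluate Dimensionality Reduction?" abstract builds on, in case it is worth showing in class. Entry [k-1][l-1] counts the ordered pairs (i, j) where j is the k-th nearest point to i in the original data and the l-th nearest in the low-dimensional map. The function names, the tie-breaking by index, and the particular summary computed at the end are assumptions made here; the paper's own two-parameter weighting is not reproduced.)

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def ranks_from(i, points):
    """Map each other index j to its distance rank from point i (1 = nearest).

    Ties are broken by index, which is an arbitrary choice for this sketch.
    """
    order = sorted((sq_dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return {j: r for r, (_, j) in enumerate(order, start=1)}

def coranking_matrix(high, low):
    """Co-ranking matrix of an embedding `low` of the data `high`."""
    n = len(high)
    Q = [[0] * (n - 1) for _ in range(n - 1)]
    for i in range(n):
        rh = ranks_from(i, high)
        rl = ranks_from(i, low)
        for j in rh:
            Q[rh[j] - 1][rl[j] - 1] += 1
    return Q

def upper_left_fraction(Q, K):
    """Share of K-nearest-neighbour relations preserved: mass of the
    upper-left K-by-K block, divided by K * n. One simple K-based summary."""
    n = len(Q) + 1
    return sum(Q[k][l] for k in range(K) for l in range(K)) / (K * n)
```

(A perfect embedding puts all the mass on the diagonal, so the summary equals 1 for every K; off-diagonal mass is the "rank errors" of the abstract.)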
Population Value Decomposition, a Framework for the Analysis of Image Populations - Journal of the American Statistical Association - 106(495):775
october 2011 by cshalizi
"Images, often stored in multidimensional arrays, are fast becoming ubiquitous in medical and public health research. Analyzing populations of images is a statistical problem that raises a host of daunting challenges. The most significant challenge is the massive size of the datasets incorporating images recorded for hundreds or thousands of subjects at multiple visits. We introduce the population value decomposition (PVD), a general method for simultaneous dimensionality reduction of large populations of massive images. We show how PVD can be seamlessly incorporated into statistical modeling, leading to a new, transparent, and rapid inferential framework. Our PVD methodology was motivated by and applied to the Sleep Heart Health Study, the largest community-based cohort study of sleep containing more than 85 billion observations on thousands of subjects at two visits. This article has supplementary material online." --- Presumably just some form of SVD for higher-dimensional arrays.
to:NB
principal_components
data_analysis
to_read
to_teach:data-mining
to_teach:undergrad-ADA
october 2011 by cshalizi
Density Estimation in Several Populations With Uncertain Population Membership
october 2011 by cshalizi
"We devise methods to estimate probability density functions of several populations using observations with uncertain population membership, meaning from which population an observation comes is unknown. The probability of an observation being sampled from any given population can be calculated. We develop general estimation procedures and bandwidth selection methods for our setting. We establish large-sample properties and study finite-sample performance using simulation studies. We illustrate our methods with data from a nutrition study."
in_NB
density_estimation
mixture_models
to_teach:undergrad-ADA
to_teach:data-mining
october 2011 by cshalizi
Robustification of the PC Algorithm for Directed Acyclic Graphs
october 2011 by cshalizi
"The PC-algorithm was shown to be a powerful method for estimating the equivalence class of a potentially very high-dimensional acyclic directed graph (DAG) with the corresponding Gaussian distribution. Here we propose a computationally eficient robustification of the PC-algorithm and prove its consistency. Furthermore, we compare the robustified and standard version of the PC-algorithm on simulated data using the new corresponding R package pcalg."
statistics
causal_inference
graphical_models
buhlmann.peter
in_NB
to_read
to_teach:data-mining
to_teach:undergrad-ADA
kalisch.markus
october 2011 by cshalizi
k-means++: The Advantages of Careful Seeding
october 2011 by cshalizi
Why hadn't I heard of this before?
k-means
clustering
to_teach:data-mining
have_read
in_NB
via:georg
approximation_algorithms
machine_learning
october 2011 by cshalizi
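(For reference, a minimal sketch of the k-means++ seeding step: the first center is drawn uniformly from the data, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function names and the pure-Python, list-of-tuples data layout are assumptions for illustration; the usual k-means iterations would then start from these centers.)

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers by D^2 sampling."""
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # First center: uniform over the data.
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            new = rng.choice(points)
        else:
            # Draw a point with probability proportional to d2.
            new = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new)
        d2 = [min(old, sq_dist(p, new)) for old, p in zip(d2, points)]
    return centers
```

(Points that already sit on a center have weight zero, so the seeds spread out across the data instead of piling up in one cluster.)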
The Fans Are All Right (Pinboard Blog)
october 2011 by cshalizi
"I learned a lot about fandom couple of years ago in conversations with my friend Britta, who was working at the time as community manager for Delicious. She taught me that fans were among the heaviest users of the bookmarking site, and had constructed an edifice of incredibly elaborate tagging conventions, plugins, and scripts to organize their output along a bewildering number of dimensions. If you wanted to read a 3000 word fic where Picard forces Gandalf into sexual bondage, and it seems unconsensual but secretly both want it, and it's R-explicit but not NC-17 explicit, all you had to do was search along the appropriate combination of tags (and if you couldn't find it, someone would probably write it for you). By 2008 a whole suite of theoretical ideas about folksonomy, crowdsourcing, faceted infomation retrieval, collaborative editing and emergent ontology had been implemented by a bunch of friendly people so that they could read about Kirk drilling Spock." --- See also the very last link.
fandom
social_life_of_the_mind
social_media
information_retrieval
tagging
pinboard
delicious.com
via:arsyed
to_teach:data-mining
ok_maybe_not_really_to_teach
october 2011 by cshalizi
Draw - Google Correlate
october 2011 by cshalizi
So cool: draw a curve free-hand, get the keywords whose time series correlate best with it. I can't go below a correlation of 0.70.
google
information_retrieval
spurious_correlations
to_teach:undergrad-ADA
to_teach:data-mining
to:blog
via:vqv
rademacher_complexity
october 2011 by cshalizi
The Meta-Activism Project | A Non-Traditional Digital Activism Think Tank
september 2011 by cshalizi
Flagged "to_teach:data-mining" if I can think of a good project for students with this.
networked_life
politics
data_sets
to_teach:data-mining
september 2011 by cshalizi
"Smooth Regression Analysis" (G. S. Watson, 1964) JSTOR: Sankhyā: The Indian Journal of Statistics, Series A, Vol. 26, No. 4 (Dec., 1964), pp. 359-372
june 2011 by cshalizi
The abstract is great: "Few would deny that the most powerful statistical tool is graph paper. When however there are many observations (and/or many variables) graphical procedures become tedious. It seems to the author that the most characteristic problem for statisticians at the moment is the development of methods for analyzing the data poured out by electronic observing systems. The present paper gives a simple computer method for obtaining a "graph" from a large number of observations."
smoothing
regression
kernel_estimators
data_mining
to_teach:undergrad-ADA
to_teach:data-mining
via:gmg
june 2011 by cshalizi
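(The "simple computer method" of Watson's paper is what is now usually called the Nadaraya-Watson kernel regression estimator: a locally weighted average of the responses. A minimal sketch follows; the Gaussian kernel, the function name and the argument names are choices made here for illustration, not taken from the paper.)

```python
import math

def nadaraya_watson(x0, xs, ys, bandwidth):
    """Kernel-weighted average of ys, evaluated at the point x0.

    Uses a Gaussian kernel; `bandwidth` controls how local the average is.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: x0 is too far from the data for this bandwidth.
        raise ValueError("no observations carry weight at x0")
    return sum(w * y for w, y in zip(weights, ys)) / total
```

(Evaluating this on a grid of x0 values and joining the results gives the smooth "graph" of the abstract; a small bandwidth follows the data closely, a large one tends toward the overall mean.)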
Accuracy and reliability of forensic latent fingerprint decisions
may 2011 by cshalizi
"first large-scale study of the accuracy and reliability of latent print examiners’ decisions ... 169 latent print examiners each compared approximately 100 pairs of latent and exemplar fingerprints from a pool of 744 pairs. ... range of attributes and quality encountered in forensic casework ... . Five examiners made false positive errors for an overall false positive rate of 0.1%. Eighty-five percent of examiners made at least one false negative error for an overall false negative rate of 7.5%. Independent examination of the same comparisons by different participants (analogous to blind verification) [detected] all false positive errors and [most[ false negative errors... Examiners frequently differed on whether fingerprints were suitable for reaching a conclusion."
forensics
fingerprints
pattern_recognition
to_teach:data-mining
to:NB
may 2011 by cshalizi
Adventures in Data Land, Graphical Models for the Internet
march 2011 by cshalizi
Look at this later and re-consider the to_teach tags.
clustering
graphical_models
tutorials
expectation-maximization
internet
text_mining
to_teach:data-mining
to_teach:undergrad-ADA
smola.alex
ahmed.amr
heard_the_talk
march 2011 by cshalizi
Baluja, S.: The Silicon Jungle: A Novel of Deception, Power, and Internet Intrigue.
march 2011 by cshalizi
To assign in the data mining class? (Only if it's good, obviously.)
books:noted
data_mining
novels
to_teach:data-mining
march 2011 by cshalizi
Efficient probabilistic forecasts for counts - McCabe et al., 2011 - JRSS-B
march 2011 by cshalizi
"Efficient probabilistic forecasts of integer-valued random variables are derived. The optimality is achieved by estimating the forecast distribution non-parametrically over a given broad model class and proving asymptotic (non-parametric) efficiency in that setting. The method is developed within the context of the integer auto-regressive class of models, which is a suitable class for any count data that can be interpreted as a queue, stock, birth-and-death process or branching process. The theoretical proofs of asymptotic efficiency are supplemented by simulation results that demonstrate the overall superiority of the non-parametric estimator relative to a misspecified parametric alternative, in large but finite samples. The method is applied to counts of stock market iceberg orders. A subsampling method is used to assess sampling variation in the full estimated forecast distribution and a proof of its validity is given." (Dunno about the to_teach tags, I haven't read this yet.)
statistics
prediction
density_estimation
time_series
stochastic_processes
branching_processes
to_teach:data-mining
to_teach:undergrad-ADA
march 2011 by cshalizi
A stable estimator of the information matrix under EM for dependent data
december 2010 by cshalizi
"This article develops a new and stable estimator for information matrix when the EM algorithm is used in maximum likelihood estimation. This estimator is constructed using the smoothed individual complete-data scores that are readily available from running the EM algorithm. The method works for dependent data sets and when the expectation step is an irregular function of the conditioning parameters." (When I teach EM, I should say something about how to get uncertainty estimates...)
fisher_information
em_algorithm
estimation
statistics
to_teach:data-mining
to_teach:undergrad-ADA
december 2010 by cshalizi
Rule generation for categorical time series with Markov assumptions
december 2010 by cshalizi
"Several procedures of sequential pattern analysis are designed to detect frequently occurring patterns in a single categorical time series (episode mining). Based on these frequent patterns, rules are generated and evaluated, for example, in terms of their confidence. The confidence value is commonly interpreted as an estimate of a conditional probability, so some kind of stochastic model has to be assumed. The model is identified as a variable length Markov model. With this assumption, the usual confidences are maximum likelihood estimates of the transition probabilities of the Markov model. We discuss possibilities of how to efficiently fit an appropriate model to the data. Based on this model, rules are formulated. It is demonstrated that this new approach generates noticeably less and more reliable rules." --- I should really add some time series stuff to data mining...
data_mining
markov_models
time_series
in_NB
to_teach:data-mining
variable-length_markov_models
december 2010 by cshalizi
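(To make the link between rule confidence and Markov transition probabilities in the rule-generation abstract concrete, here is a sketch for the simplest case only: rules of the form "a is followed by b" in a single categorical series, which corresponds to a first-order chain. The paper's variable length Markov models allow longer contexts and are not implemented here; the function name is invented for illustration.)

```python
from collections import Counter

def rule_confidences(sequence):
    """Confidence of every observed rule 'a -> b' in one categorical series.

    confidence(a -> b) = (# times a is immediately followed by b)
                         / (# times a is followed by anything),
    which is also the maximum likelihood estimate of the transition
    probability P(next = b | current = a) in a first-order Markov chain.
    """
    pair_counts = Counter(zip(sequence, sequence[1:]))
    # The last symbol has no successor, so it is not counted as an antecedent.
    antecedent_counts = Counter(sequence[:-1])
    return {(a, b): n / antecedent_counts[a]
            for (a, b), n in pair_counts.items()}
```

(For the series "ababac", the rule a -> b has confidence 2/3 and b -> a has confidence 1, and the confidences for each antecedent sum to one, as transition probabilities must.)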
Consistent selection of the number of clusters via crossvalidation — Biometrika
december 2010 by cshalizi
"In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling. A novel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split."
clustering
stability_of_learning
data_mining
statistics
to_teach:data-mining
to_teach:undergrad-ADA
december 2010 by cshalizi
Predicting consumer behavior with Web search — PNAS
october 2010 by cshalizi
What search can and cannot predict. They mention, but I think could have stressed even more, that the search data is generated _automatically_ as a by-product of now-ordinary social life, rather than a deliberate construction on the part of public or private data-collecting agencies, so it is very, very, very cheap.
internet
data_mining
to_teach:data-mining
kith_and_kin
watts.duncan
hofman.jake
sociology
information_retrieval
networked_life
have_read
october 2010 by cshalizi
Bickel, Li: Local polynomial regression on unknown manifolds
october 2010 by cshalizi
"We reveal the phenomenon that “naive” multivariate local polynomial regression can adapt to local smooth lower dimensional structure in the sense that it achieves the optimal convergence rate for nonparametric estimation of regression functions belonging to a Sobolev space when the predictor variables live on or close to a lower dimensional manifold." Need to mention this when I talk about the curse of dimensionality in data mining...
via:students
regression
manifold_learning
statistics
to_teach:data-mining
curse_of_dimensionality
october 2010 by cshalizi
A Cautionary Note on the Use of Matching to Estimate Causal Effects: An Empirical Example Comparing Matching Estimates to an Experimental Benchmark — Sociological Methods Research
october 2010 by cshalizi
"...social scientists have increasingly turned to matching [to draw] causal inferences from observational data. Matching compares those who receive a treatment to those with similar background attributes who do not receive a treatment. ... Drawing on a randomized voter mobilization experiment ... compare matching [estimates] to an experimental benchmark. ... enormous sample size .... exactly match each treated subject to 40 untreated subjects. Matching greatly exaggerates the effectiveness of pre-election phone calls encouraging voter participation. ... Matching suggests that another pre-election phone call that encouraged people to wear their seat belts also generated huge increases in voter turnout. ... caution is warranted when applying matching estimators to observational data, particularly when one is uncertain about the potential for biased inference." Ouch!
have_read
to_teach:data-mining
causal_inference
matching
experimental_political_science
evisceration
to:blog
to_teach:undergrad-ADA
october 2010 by cshalizi
[1010.0499] Statistical analysis of $k$-nearest neighbor collaborative recommendation
october 2010 by cshalizi
"Collaborative recommendation is an information-filtering technique that attempts to present information items that are likely of interest to an Internet user. Traditionally, collaborative systems deal with situations with two types of variables, users and items. In its most common form, the problem is framed as trying to estimate ratings for items that have not yet been consumed by a user. Despite wide-ranging literature, little is known about the statistical properties of recommendation systems. In fact, no clear probabilistic model even exists which would allow us to precisely describe the mathematical forces driving collaborative filtering. ... [We] set out a general sequential stochastic model for collaborative recommendation. ... in-depth analysis of the so-called cosine-type nearest neighbor ... method ... asymptotic performance as the number of users grows. We establish consistency ... under mild assumptions ... Rates of convergence and examples ..."
collaborative_filtering
information_retrieval
stochastic_models
nearest_neighbors
to_teach:data-mining
october 2010 by cshalizi
D-squared Digest -- I've seen this film before
september 2010 by cshalizi
"will assert with truculence and jutted jaw that the derivation of commercially relevant and actionable insurance information from genetics is much, much more difficult than that, and furthermore, will tentatively and politely advance the possibility that to "accurately assess the cost of medical treatment over a lifetime" might end up being at least as much of a tough-nut as the socialist calculation problem. As the rant linked above tries to point out, the big issue here is the "Titanic problem" (from Hitchcock's aphorism about it being possible to make a suspenseful film about the Titanic given that everyone knows that it sinks - they don't know when). Knowing genetic propensities to develop various conditions is just about one tiny baby step along the way to making a cost estimate ..."
Especially: not knowing what treatments will be available, or how costly, 20--40 years after the genetic test is performed! (Another teaching example for 350, I think.)
dsquared
human_genetics
market_failures_in_everything
to_teach:data-mining
september 2010 by cshalizi
Suykens, Alzate, Pelckmans: Primal and dual model representations in kernel-based learning
august 2010 by cshalizi
"This paper discusses the role of primal and (Lagrange) dual model representations in problems of supervised and unsupervised learning. The specification of the estimation problem is conceived at the primal level as a constrained optimization problem. The constraints relate to the model which is expressed in terms of the feature map. From the conditions for optimality one jointly finds the optimal model representation and the model estimate. At the dual level the model is expressed in terms of a positive definite kernel function, which is characteristic for a support vector machine methodology. It is discussed how least squares support vector machines are playing a central role as core models across problems of regression, classification, principal component analysis, spectral clustering, canonical correlation analysis, dimensionality reduction and data visualization."
kernel_methods
statistics
machine_learning
data_mining
to_teach:data-mining
august 2010 by cshalizi
Unit Testing in R: The Bare Minimum
august 2010 by cshalizi
I hesitate over the teaching tag, since this seems quite clunky --- but perhaps it's not that bad when you try it.
via:arsyed
programming
R
to_teach:data-mining
to_teach:statcomp
august 2010 by cshalizi
Practical Approaches to Principal Component Analysis in the Presence of Missing Values
august 2010 by cshalizi
From a quick skim, it looks too advanced to actually teach in 350, but potentially a handy reference.
principal_components
dimension_reduction
to_teach:data-mining
statistics
data_mining
to_teach:undergrad-ADA
august 2010 by cshalizi
depmixS4: An R Package for Hidden Markov Models
august 2010 by cshalizi
"depmixS4 implements a general framework for defining and estimating dependent mixture models in the R programming language. This includes standard Markov models, latent/hidden Markov models, and latent class and finite mixture distribution models. The models can be fitted on mixed multivariate data with distributions from the glm family, the (logistic) multinomial, or the multivariate normal distribution. Other distributions can be added easily, and an example is provided with the exgaus distribution. Parameters are estimated by the expectation-maximization (EM) algorithm or, when (linear) constraints are imposed on the parameters, by direct numerical optimization with the Rsolnp or Rdonlp2 routines."
statistics
computational_statistics
R
markov_models
mixture_models
to_teach:data-mining
to_teach:complexity-and-inference
to_teach:undergrad-ADA
august 2010 by cshalizi
"A Locally Adaptive Penalty for Estimation of Functions of Varying Roughness"
august 2010 by cshalizi
"We propose a new regularization method called Loco-Spline for nonparametric function estimation. Loco-Spline uses a penalty which is data driven and locally adaptive. This allows for more flexible estimation of the function in regions of the domain where it has more curvature, without over fitting in regions that have little curvature. This methodology is also transferred into higher dimensions via the Smoothing Spline ANOVA framework. General conditions for optimal MSE rate of convergence are given and the Loco-Spline is shown to achieve this rate. In our simulation study, the Loco-Spline substantially outperforms the traditional smoothing spline and the locally adaptive kernel smoother. Code to fit Loco-Spline models is included with the Supplemental Materials for this article which are available online." Teach? But I'd need to explain more about splines.
splines
curve_fitting
smoothing
regression
statistics
to_teach:data-mining
to_read
to_teach:undergrad-ADA
august 2010 by cshalizi
The SHOGUN Machine Learning Toolbox
july 2010 by cshalizi
C++ library with R interface, supposedly good for Really Big data. Consider for 350?
machine_learning
computational_statistics
programming
to_read
to_teach:data-mining
R
c++
july 2010 by cshalizi
ILI 2009 Presentation – "Self-plagiarism is style"
june 2010 by cshalizi
Cool effects achieved by applying basic data mining to libraries. To be used as teaching fodder, but honestly I should also find the time to suggest it to our librarians.
libraries
data_mining
information_retrieval
collaborative_filtering
via:magistra_et_mater
to_teach:data-mining
june 2010 by cshalizi
Phantom of Heilbronn - Wikipedia, the free encyclopedia
may 2010 by cshalizi
In which the combined police forces of Europe spend years chasing a female serial killer known only from DNA evidence, only to find that it's all down to contaminated cotton swabs from a single supplier!
Teaching note for data mining: This should make a great example of the importance of getting the data right, before worrying about the statistical processing...
via:arsyed
serial_killers
to_teach:data-mining
bad_data
DNA_testing
forensics
wtf
inference_to_latent_objects
blogged
may 2010 by cshalizi
[1004.3101] Strong Consistency of Prototype Based Clustering in Probabilistic Space
april 2010 by cshalizi
Not clear at first glance exactly what they're doing. Read before considering teaching.
clustering
k-means
learning_theory
to_teach:data-mining
april 2010 by cshalizi
A dissection of John Gottman's love lab. - By Laurie Abraham - Slate Magazine
march 2010 by cshalizi
This is confused, or at least confusingly written. Is the objection to not evaluating the classifier out of sample? Or that the success of even a very stupid rule should be high (because most couples don't get divorced within five years)? (That would be a valid point, but it's not "base-rate neglect".) Or what?
marriage
classifiers
to_teach:data-mining
data_analysis
march 2010 by cshalizi
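A small sketch of the second possible objection in the note above: when most couples stay married, even a rule that ignores all the data scores well, so raw accuracy says little by itself. The 15-in-100 divorce figure is invented purely for illustration and is not taken from the article or from Gottman's studies.

```python
def accuracy(predictions, outcomes):
    """Fraction of cases where the prediction matches the outcome."""
    return sum(p == o for p, o in zip(predictions, outcomes)) / len(outcomes)

# Hypothetical sample: 15 of 100 couples divorce within five years (made-up numbers).
outcomes = [True] * 15 + [False] * 85

# The "very stupid rule": predict that nobody divorces, ignoring all predictors.
always_no = [False] * len(outcomes)

baseline = accuracy(always_no, outcomes)
print(f"accuracy of always predicting 'no divorce': {baseline:.2f}")
```

Any classifier's reported accuracy needs to be compared against this kind of baseline, and ideally measured on couples not used to fit it.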
[1003.0529] A Unified Algorithmic Framework for Multi-Dimensional Scaling
march 2010 by cshalizi
"In this paper, we propose a unified algorithmic framework for solving many known variants of MDS. Our algorithm is a simple iterative scheme with guaranteed convergence, and is _modular_; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods, in comparable time."
multidimensional_scaling
dimension_reduction
visual_display_of_quantitative_information
to_teach:data-mining
data_mining
march 2010 by cshalizi
[1003.0783] Supervised Topic Models
march 2010 by cshalizi
What a coincidence, some of the kids in 490 have labeled documents...
latent_dirichlet_allocation
text_mining
classifiers
machine_learning
statistics
to_teach:data-mining
to_teach:undergrad-research
topic_models
blei.david
march 2010 by cshalizi