cshalizi + data_mining 99
[1204.6441] "I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" -- A Balanced Survey on Election Prediction using Twitter Data
24 days ago by cshalizi
"Predicting X from Twitter is a popular fad within the Twitter research subculture. It seems both appealing and relatively easy. Among such kind of studies, electoral prediction is maybe the most attractive, and at this moment there is a growing body of literature on such a topic. This is not only an interesting research problem but, above all, it is extremely difficult. However, most of the authors seem to be more interested in claiming positive results than in providing sound and reproducible methods. It is also especially worrisome that many recent papers seem to only acknowledge those studies supporting the idea of Twitter predicting elections, instead of conducting a balanced literature review showing both sides of the matter. After reading many of such papers I have decided to write such a survey myself. Hence, in this paper, every study relevant to the matter of electoral prediction using social media is commented. From this review it can be concluded that the predictive power of Twitter regarding elections has been greatly exaggerated, and that hard research problems still lie ahead."
to:NB
social_media
data_mining
prediction
have_read
24 days ago by cshalizi
[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
4 weeks ago by cshalizi
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and 'generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hierarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful. In particular, since BHV have shown that the space of trees is negatively curved (a CAT(0) space), a natural representation of a collection of trees is a tree. We compare this representation to the Euclidean approximations of treespace made available through Multidimensional Scaling of the matrix of distances between trees. We also provide applications of the distances between trees to hierarchical clustering trees constructed from microarrays. Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
to:NB
clustering
hierarchical_structure
holmes.susan
data_mining
statistics
to_teach:data-mining
gene_expression_data_analysis
via:ryan_t
4 weeks ago by cshalizi
Game-powered machine learning
4 weeks ago by cshalizi
"Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the “wisdom of the crowds.” Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., “funky jazz with saxophone,” “spooky electronica,” etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data."
--- This is more than a bit of a stunt, but it points in an interesting direction.
to:NB
to_read
data_mining
collective_cognition
active_learning
tagging
classifiers
re:democratic_cognition
4 weeks ago by cshalizi
[1204.1002] Fast Multi-Scale Detection of Relevant Communities
7 weeks ago by cshalizi
"Nowadays, networks are almost ubiquitous. In the past decade, community detection received an increasing interest as a way to uncover the structure of networks by grouping nodes into communities more densely connected internally than externally. Yet most of the effective methods available do not consider the potential levels of organisation, or scales, a network may encompass and are therefore limited. In this paper we present a method compatible with global and local criteria that enables fast multi-scale community detection. The method is derived in two algorithms, one for each type of criterion, and implemented with 6 known criteria. Uncovering communities at various scales is a computationally expensive task. Therefore this work puts a strong emphasis on the reduction of computational complexity. Some heuristics are introduced for speed-up purposes. Experiments demonstrate the efficiency and accuracy of our method with respect to each algorithm and criterion by testing them against large generated multi-scale networks. This study also offers a comparison between criteria and between the global and local approaches."
to:NB
community_discovery
data_mining
7 weeks ago by cshalizi
[1203.2200] Role-Dynamics: Fast Mining of Large Dynamic Networks
7 weeks ago by cshalizi
"To understand the structural dynamics of a large-scale social, biological or technological network, it may be useful to discover behavioral roles representing the main connectivity patterns present over time. In this paper, we propose a scalable non-parametric approach to automatically learn the structural dynamics of the network and individual nodes. Roles may represent structural or behavioral patterns such as the center of a star, peripheral nodes, or bridge nodes that connect different communities. Our novel approach learns the appropriate structural role dynamics for any arbitrary network and tracks the changes over time. In particular, we uncover the specific global network dynamics and the local node dynamics of a technological, communication, and social network. We identify interesting node and network patterns such as stationary and non-stationary roles, spikes/steps in role-memberships (perhaps indicating anomalies), increasing/decreasing role trends, among many others. Our results indicate that the nodes in each of these networks have distinct connectivity patterns that are non-stationary and evolve considerably over time. Overall, the experiments demonstrate the effectiveness of our approach for fast mining and tracking of the dynamics in large networks. Furthermore, the dynamic structural representation provides a basis for building more sophisticated models and tools that are fast for exploring large dynamic networks."
in_NB
network_data_analysis
data_mining
community_discovery
neville.jennifer
7 weeks ago by cshalizi
[1203.1647] A Survey of Prediction Using Social Media
7 weeks ago by cshalizi
"Social media comprises interactive applications and platforms for creating, sharing and exchange of user-generated contents. The past ten years have brought huge growth in social media, especially online social networking services, and it is changing our ways to organize and communicate. It aggregates opinions and feelings of diverse groups of people at low cost. Mining the attributes and contents of social media gives us an opportunity to discover social structure characteristics, analyze action patterns qualitatively and quantitatively, and sometimes the ability to predict future human related events. In this paper, we firstly discuss the realms which can be predicted with current social media, then overview available predictors and techniques of prediction, and finally discuss challenges and possible future directions."
to:NB
social_media
re:social-networks-as-sensor-networks
data_mining
7 weeks ago by cshalizi
Not an April Fool - Charlie's Diary
8 weeks ago by cshalizi
"It's easy to imagine how we could make something worse than "Girls Around Me"—something much worse. Facebook encourages us to disclose a wide range of information about ourselves, including our religion and a photograph. Religion is obvious: "Yids Among Us" would obviously be one of the go-to tools of choice for Neo-Nazis. As for skin colour, ethnicity identification from face images is out there already. Want to go queer bashing? There's an algorithm out there for guessing sexual orientation based on the network graph of the target's facebook friends. It's probably possible to apply this sort of data mining exercise to determine whether a woman has had an abortion or is pro-choice.
"In the worst case, it's possible to envisage geolocation and data aggregation apps being designed to facilitate the identification and elimination of some ethnic or class enemy, not only by making it easy for users to track them down, but by making it easy for users to identify each other and form ad-hoc lynch mobs. (Hence my reference to the Rwandan Genocide earlier. Think it couldn't happen? Look at Iran and imagine an app written for the Basij to make it easy to identify dissidents and form ad-hoc goon squads to proactively hunt them down. Or any other organization in the post-networked world that has a social role corresponding to the Red Guards.)
"But as I said earlier, the app is not the problem. The problem is the deployment by profit-oriented corporations of behavioural psychology techniques to induce people to over-share information which can then be aggregated and disclosed to third parties for targeted marketing purposes."
Comment: Stross is not, sadly, exaggerating.
networked_life
data_mining
social_networks
moral_responsibility
you_are_the_product
stross.charlie
8 weeks ago by cshalizi
[1203.6093] Consensus clustering in complex networks
8 weeks ago by cshalizi
"The community structure of complex networks reveals both their organization and hidden relationships among their constituents. Most community detection methods currently available are not deterministic, and their results typically depend on the specific random seeds, initial conditions and tie-break rules adopted for their execution. Consensus clustering is used in data analysis to generate stable results out of a set of partitions delivered by stochastic methods. Here we show that consensus clustering can be combined with any existing method in a self-consistent way, enhancing considerably both the stability and the accuracy of the resulting partitions. This framework is also particularly suitable to monitor the evolution of community structure in temporal networks. An application of consensus clustering to a large citation network of physics papers demonstrates its capability to keep track of the birth, death and diversification of topics."
in_NB
community_discovery
network_data_analysis
clustering
data_mining
8 weeks ago by cshalizi
Taylor & Francis Online :: Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering - Journal of Computational and Graphical Statistics - Volume 20, Issue 2
8 weeks ago by cshalizi
"For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality."
to:NB
visual_display_of_quantitative_information
clustering
data_mining
to_teach:data-mining
8 weeks ago by cshalizi
"Local equivalences of distances between clusterings—a geometric perspective" --- Marina Meilă
february 2012 by cshalizi
"In comparing clusterings, several different distances and indices are in use. We prove that the Misclassification Error distance, the Hamming distance (equivalent to the unadjusted Rand index), and the χ² distance between partitions are equivalent in the neighborhood of 0. In other words, if two partitions are very similar, then one distance defines upper and lower bounds on the other and vice versa. The proofs are geometric and rely on the concavity of the distances. The geometric intuitions themselves advance the understanding of the space of all clusterings. To our knowledge, this is the first result of its kind.
Practically, distances are frequently used to compare two clusterings of a set of observations. But the motivation for this work is in the theoretical study of data clustering. Distances between partitions are involved in constructing new methods for cluster validation, determining the number of clusters, and analyzing clustering algorithms. From a probability theory point of view, the present results apply to any pair of finite valued random variables, and provide simple yet tight upper and lower bounds on the χ² measure of (in)dependence valid when the two variables are strongly dependent."
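For readers who have not met two of the distances named in this abstract, here is a minimal sketch of how they can be computed for small label vectors. The function names and normalizations are assumptions made for illustration (conventions differ between papers), the label matching is brute force and only sensible for a handful of clusters, and the χ² distance is left out because its exact normalization is not given in the abstract.

```python
from itertools import combinations, permutations

def misclassification_error(a, b):
    """1 - (largest overlap achievable by matching cluster labels) / n."""
    n = len(a)
    la, lb = sorted(set(a)), sorted(set(b))
    if len(la) > len(lb):
        # Make `a` the clustering with fewer clusters so each of its
        # labels can be matched to a distinct label of the other one.
        a, b, la, lb = b, a, lb, la
    counts = {}
    for x, y in zip(a, b):
        counts[(x, y)] = counts.get((x, y), 0) + 1
    # Brute-force search over one-to-one matchings of labels.
    best = max(
        sum(counts.get((x, y), 0) for x, y in zip(la, perm))
        for perm in permutations(lb, len(la))
    )
    return 1 - best / n

def hamming_distance(a, b):
    """Fraction of point pairs grouped together by one clustering
    but split by the other (that is, 1 - unadjusted Rand index)."""
    pairs = list(combinations(range(len(a)), 2))
    disagree = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return disagree / len(pairs)

# Hypothetical clusterings of six points.
a = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]   # same partition, labels swapped
c = [0, 0, 1, 1, 1, 1]            # one point moved

print(misclassification_error(a, relabelled), hamming_distance(a, relabelled))
print(misclassification_error(a, c), hamming_distance(a, c))
```

Both functions depend only on the partitions, not on the label names, which is why the relabelled copy is at distance zero from the original.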
in_NB
clustering
data_mining
meila.marina
february 2012 by cshalizi
[1202.1561] Tree Models for Difference and Change Detection in a Complex Environment
february 2012 by cshalizi
"A new family of tree models is proposed, which we call "differential trees." A differential tree model is constructed from multiple data sets and aims to detect distributional differences between them. The new methodology differs from the existing difference and change detection techniques in its nonparametric nature, model construction from multiple data sets, and applicability to high-dimensional data. Through a detailed study of an arson case in New Zealand, where an individual is known to have been laying vegetation fires within a certain time period, we illustrate how these models can help detect changes in the frequencies of event occurrences and uncover unusual clusters of events in a complex environment."
--- After reading, I think their exposition is needlessly hard to follow, but let me take a stab at it. In an ordinary classification tree, we are interested in the distribution of the class labels Y given the predictors X, i.e., Pr(Y|X), and make splits on X so that (in essence) the conditional entropy H[Y|X] becomes small. This is of course equivalent to making splits so that the divergence of Pr(Y|X) from Pr(Y) is maximized. What they are interested in is not classification but _describing_ how the different classes are distinct, so the relevant distribution is Pr(X|Y), and they want a big divergence between Pr(X) and Pr(X|Y).
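A small numerical check of the equivalence claimed in the note above may help. This is only a sketch: the joint distribution `joint` and all function names are invented for illustration and are not taken from the paper. Because H[Y] does not depend on the split, making H[Y|X] small is the same as making the expected divergence of Pr(Y|X) from Pr(Y) large. The last function computes the other quantity the note mentions, the divergence of Pr(X|Y=y) from Pr(X), one class at a time.

```python
import math

# Hypothetical joint distribution Pr(X, Y): X is the side of a split a case
# falls on, Y is its class label. The numbers are invented for illustration.
joint = {
    ("left", "a"): 0.30,
    ("left", "b"): 0.10,
    ("right", "a"): 0.15,
    ("right", "b"): 0.45,
}

def marginal(joint, axis):
    """Marginal distribution of X (axis=0) or Y (axis=1)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """H[Y|X] in bits."""
    px = marginal(joint, 0)
    return -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items() if p > 0)

def expected_divergence(joint):
    """Average over X of the KL divergence of Pr(Y|X) from Pr(Y)."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return sum(
        p * math.log2((p / px[x]) / py[y])
        for (x, y), p in joint.items() if p > 0
    )

def class_divergence(joint, label):
    """KL divergence of Pr(X|Y=label) from Pr(X)."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return sum(
        (p / py[label]) * math.log2((p / py[label]) / px[x])
        for (x, y), p in joint.items() if y == label and p > 0
    )

# Reduction in entropy from the split, H[Y] - H[Y|X].
info_gain = entropy(marginal(joint, 1)) - conditional_entropy(joint)

print(info_gain, expected_divergence(joint))   # the two agree
print({y: class_divergence(joint, y) for y in marginal(joint, 1)})
```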
to:NB
re:network_differences
statistics
hypothesis_testing
density_estimation
decision_trees
have_read
data_mining
two-sample_tests
february 2012 by cshalizi
Sun, Wang, Fang: Regularized k-means clustering of high-dimensional data and its asymptotic consistency
february 2012 by cshalizi
"K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering."
in_NB
clustering
statistics
lasso
data_mining
to_teach:data-mining
february 2012 by cshalizi
A General Framework for Dimensionality-Reducing Data Visualization Mapping
february 2012 by cshalizi
"In recent years, a wealth of dimension-reduction techniques for data visualization and preprocessing has been established. Nonparametric methods require additional effort for out-of-sample extensions, because they provide only a mapping of a given finite set of points. In this letter, we propose a general view on nonparametric dimension reduction based on the concept of cost functions and properties of the data. Based on this general principle, we transfer nonparametric dimension reduction to explicit mappings of the data manifold such that direct out-of-sample extensions become possible. Furthermore, this concept offers the possibility of investigating the generalization ability of data visualization to new data points. We demonstrate the approach based on a simple global linear mapping, as well as prototype-based local linear mappings. In addition, we can bias the functional form according to given auxiliary information. This leads to explicit supervised visualization mappings with discriminative properties comparable to state-of-the-art approaches."
in_NB
dimension_reduction
visual_display_of_quantitative_information
data_analysis
data_mining
manifold_learning
to_teach:data-mining
february 2012 by cshalizi
[1201.5568] Dynamic trees for streaming and massive data contexts
january 2012 by cshalizi
"Data collection at a massive scale is becoming ubiquitous in a wide variety of settings, from vast offline databases to streaming real-time information. Learning algorithms deployed in such contexts must rely on single-pass inference, where the data history is never revisited. In streaming contexts, learning must also be temporally adaptive to remain up-to-date against unforeseen changes in the data generating mechanism. Although rapidly growing, the online Bayesian inference literature remains challenged by massive data and transient, evolving data streams. Non-parametric modelling techniques can prove particularly ill-suited, as the complexity of the model is allowed to increase with the sample size. In this work, we take steps to overcome these challenges by porting standard streaming techniques, like data discarding and downweighting, into a fully Bayesian framework via the use of informative priors and active learning heuristics. We showcase our methods by augmenting a modern non-parametric modelling framework, dynamic trees, and illustrate its performance on a number of practical examples. The end product is a powerful streaming regression and classification tool, whose performance compares favourably to the state-of-the-art."
to:NB
machine_learning
non-stationarity
statistics
data_mining
to_read
re:growing_ensemble_project
january 2012 by cshalizi
RE-EM Trees: A Data Mining Approach for Longitudinal and Clustered Data
january 2012 by cshalizi
"Longitudinal data refer to the situation where repeated observations are available for each sampled object. Clustered data, where observations are nested in a hierarchical structure within objects (without time necessarily being involved) represent a similar type of situation. Methodologies that take this structure into account allow for the possibilities of systematic differences between objects that are not related to attributes and autocorrelation within objects across time periods. A standard methodology in the statistics literature for this type of data is the mixed effects model, where these differences between objects are represented by so-called “random effects” that are estimated from the data (population-level relationships are termed “fixed effects,” together resulting in a mixed effects model). This paper presents a methodology that combines the structure of mixed effects models for longitudinal and clustered data with the flexibility of tree-based estimation methods. We apply the resulting estimation method, called the RE-EM tree, to pricing in online transactions, showing that the RE-EM tree is less sensitive to parametric assumptions and provides improved predictive power compared to linear models with random effects and regression trees without random effects. We also apply it to a smaller data set examining accident fatalities, and show that the RE-EM tree strongly outperforms a tree without random effects while performing comparably to a linear model with random effects. We also perform extensive simulation experiments to show that the estimator improves predictive performance relative to regression trees without random effects and is comparable or superior to using linear models with random effects in more general situations."
to:NB
machine_learning
decision_trees
data_mining
statistics
hierarchical_models
january 2012 by cshalizi
Mining of Massive Datasets - Academic and Professional Books - Cambridge University Press
january 2012 by cshalizi
"The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike."
--- What a remarkably hideous cover!
to:NB
books:noted
data_mining
to_teach:data-mining
machine_learning
computational_statistics
january 2012 by cshalizi
Graph-based Natural Language Processing and Information Retrieval - Mihalcea and Radev
december 2011 by cshalizi
"Graph theory and the fields of natural language processing and information retrieval are well-studied disciplines. Traditionally, these areas have been perceived as distinct, with different algorithms, different applications, and different potential end-users. However, recent research has shown that these disciplines are intimately connected, with a large variety of natural language processing and information retrieval applications finding efficient solutions within graph-theoretical frameworks. This book extensively covers the use of graph-based algorithms for natural language processing and information retrieval. It brings together topics as diverse as lexical semantics, text summarization, text mining, ontology construction, text classification, and information retrieval, which are connected by the common underlying theme of the use of graph-theoretical methods for text and information processing tasks. Readers will come away with a firm understanding of the major methods and applications in natural language processing and information retrieval that rely on graph-based representations and algorithms."
in_NB
books:noted
natural_language_processing
graph_theory
data_mining
text_mining
radev.dragomir
december 2011 by cshalizi
[1110.4851] Leveraging User Diversity to Harvest Knowledge on the Social Web
october 2011 by cshalizi
"Social web users are a very diverse group with varying interests, levels of expertise, enthusiasm, and expressiveness. As a result, the quality of content and annotations they create to organize content is also highly variable. While several approaches have been proposed to mine social annotations, for example, to learn folksonomies that reflect how people relate narrower concepts to broader ones, these methods treat all users and the annotations they create uniformly. We propose a framework to automatically identify experts, i.e., knowledgeable users who create high quality annotations, and use their knowledge to guide folksonomy learning. We evaluate the approach on a large body of social annotations extracted from the photosharing site Flickr. We show that using expert knowledge leads to more detailed and accurate folksonomies. Moreover, we show that including annotations from non-expert, or novice, users leads to more comprehensive folksonomies than experts' knowledge alone."
to:NB
data_mining
social_life_of_the_mind
social_media
kith_and_kin
lerman.kristina
tagging
october 2011 by cshalizi
[1110.3225] Mining Patterns in Networks using Homomorphism
october 2011 by cshalizi
"In recent years many algorithms have been developed for finding patterns in graphs and networks. A disadvantage of these algorithms is that they use subgraph isomorphism to determine the support of a graph pattern; subgraph isomorphism is a well-known NP complete problem. In this paper, we propose an alternative approach which mines tree patterns in networks by using subgraph homomorphism. The advantage of homomorphism is that it can be computed in polynomial time, which allows us to develop an algorithm that mines tree patterns in arbitrary graphs in incremental polynomial time. Homomorphism however entails two problems not found when using isomorphism: (1) two patterns of different size can be equivalent; (2) patterns of unbounded size can be frequent. In this paper we formalize these problems and study solutions that easily fit within our algorithm."
in_NB
to_read
re:smoothing_adjacency_matrices
network_data_analysis
data_mining
graph_theory
october 2011 by cshalizi
[1110.2515] Normalized Mutual Information to evaluate overlapping community finding algorithms
october 2011 by cshalizi
"Given the increasing popularity of algorithms for overlapping clustering, in particular in social network analysis, quantitative measures are needed to measure the accuracy of a method. Given a set of true clusters, and the set of clusters found by an algorithm, these sets of clusters must be compared to see how similar or different the sets are. A normalized measure is desirable in many contexts, for example assigning a value of 0 where the two sets are totally dissimilar, and 1 where they are identical. A measure based on normalized mutual information, [1], has recently become popular. We demonstrate unintuitive behaviour of this measure, and show how this can be corrected by using a more conventional normalization. We compare the results to that of other measures, such as the Omega index [2]."
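To make the "0 where totally dissimilar, 1 where identical" requirement concrete, here is a sketch of normalized mutual information for ordinary, non-overlapping partitions. It is not the overlapping-cluster measure the paper examines, and dividing by the larger of the two entropies is just one possible normalization, chosen here as an assumption. All names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (bits) of the cluster-size distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (bits) between two labelings of the same points."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log2((c / n) / ((ca[x] / n) * (cb[y] / n)))
        for (x, y), c in cab.items()
    )

def nmi(a, b):
    """Mutual information divided by the larger entropy (an assumed choice)."""
    denom = max(entropy(a), entropy(b))
    # If both partitions are a single cluster they are identical.
    return mutual_information(a, b) / denom if denom > 0 else 1.0

# Hypothetical labelings of four points.
truth = [0, 0, 1, 1]
crossed = [0, 1, 0, 1]   # statistically independent of `truth`

print(nmi(truth, truth))
print(nmi(truth, crossed))
```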
in_NB
community_discovery
information_theory
clustering
data_mining
october 2011 by cshalizi
http://www.dtic.mil/descriptivesum/Y2012/DARPA/0602702E_2_PB_2012.pdf
july 2011 by cshalizi
"develop tools [for] automated interpretation, quantitative analysis, and visualization of social networks.... social networks [are models for] terrorist cells, insurgent groups, and other stateless actors whose connectedness is established not [by] shared geography but [by] correlat[ed] participation in coordinated activities ... apply emerging methods for edge finding and cluster analysis to detect, characterize, and predict the dynamics of social networks. ... application in tactical contexts... foundation for cultural intelligence - understanding the stability, governance, and economic indicators of a region ... 2012 Plans: Develop techniques for simulation, visualization, inference, and prediction of social network dynamics; ... for modeling the interactions between and within cooperating/competing/conflicting social networks, sub- networks, and super-networks and for predicting the merging and splitting of social networks; Evaluate ... on real-world social-cultural-network data."
darpa
nexus-7
afghanistan
data_mining
counter-insurgency
network_data_analysis
to:blog
july 2011 by cshalizi
Exclusive: Inside Darpa’s Secret Afghan Spy Machine | Danger Room | Wired.com
july 2011 by cshalizi
I must say that the few lines in the budget document were a hell of a lot clearer about what this is _supposed_ to achieve than the Wired article itself. But I do not have a good feeling about this project, at all. (At the very least, for $30 million, you could teach a lot of soldiers Dari and Pashto, or recruit a lot of Afghan informants.)
afghanistan
darpa
military_industrial_complex
data_mining
network_data_analysis
us_military
counter-insurgency
to:blog
july 2011 by cshalizi
"Smooth Regression Analysis" (G. S. Watson, 1964) JSTOR: Sankhyā: The Indian Journal of Statistics, Series A, Vol. 26, No. 4 (Dec., 1964), pp. 359-372
june 2011 by cshalizi
The abstract is great: "Few would deny that the most powerful statistical tool is graph paper. When however there are many observations (and/or many variables) graphical procedures become tedious. It seems to the author that the most characteristic problem for statisticians at the moment is the development of methods for analyzing the data poured out by electronic observing systems. The present paper gives a simple computer method for obtaining a "graph" from a large number of observations."
smoothing
regression
kernel_estimators
data_mining
to_teach:undergrad-ADA
to_teach:data-mining
via:gmg
june 2011 by cshalizi
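A minimal sketch of the sort of kernel smoother this paper is known for, now usually called the Nadaraya-Watson estimator: the fitted value at a point is a weighted average of the observed responses, with weights that decay with distance from that point. The Gaussian kernel, the `bandwidth` parameter, and the function name are assumptions made for illustration, not details taken from the paper.

```python
import math


def nadaraya_watson(xs, ys, x0, bandwidth):
    """Kernel-weighted average of ys at the point x0 (Gaussian kernel, assumed)."""
    # Each observation gets a weight that shrinks as its x moves away from x0.
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed to zero; the bandwidth is too small for x0.
        raise ValueError("no usable kernel weight at x0; widen the bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Toy data (made up): evaluate the smooth "graph" on a grid of points.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.2]
grid = [i * 0.5 for i in range(9)]
curve = [nadaraya_watson(xs, ys, g, bandwidth=0.5) for g in grid]
```

A larger bandwidth gives a flatter curve and a smaller one tracks the data more closely; choosing it (for example by cross-validation) is a separate question.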
Baluja, S.: The Silicon Jungle: A Novel of Deception, Power, and Internet Intrigue.
march 2011 by cshalizi
To assign in the data mining class? (Only if it's good, obviously.)
books:noted
data_mining
novels
to_teach:data-mining
march 2011 by cshalizi
Friedman, Yu: Leo Breiman (1929–2005)
january 2011 by cshalizi
Introduction to the special section of _Annals of Applied Statistics_ in memory of Leo Breiman.
breiman.leo
lives_of_the_scientists
statistics
machine_learning
data_mining
CART
january 2011 by cshalizi
Quantitative Analysis of Culture Using Millions of Digitized Books | Science/AAAS
december 2010 by cshalizi
What the bleep? Not included in CMU's subscription to _Science_?!?
data_mining
corpus_linguistics
to_read
december 2010 by cshalizi
Rule generation for categorical time series with Markov assumptions
december 2010 by cshalizi
"Several procedures of sequential pattern analysis are designed to detect frequently occurring patterns in a single categorical time series (episode mining). Based on these frequent patterns, rules are generated and evaluated, for example, in terms of their confidence. The confidence value is commonly interpreted as an estimate of a conditional probability, so some kind of stochastic model has to be assumed. The model is identified as a variable length Markov model. With this assumption, the usual confidences are maximum likelihood estimates of the transition probabilities of the Markov model. We discuss possibilities of how to efficiently fit an appropriate model to the data. Based on this model, rules are formulated. It is demonstrated that this new approach generates noticeably less and more reliable rules." --- I should really add some time series stuff to data mining...
data_mining
markov_models
time_series
in_NB
to_teach:data-mining
variable-length_markov_models
december 2010 by cshalizi
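A small sketch of the point made in the abstract above, that a rule's confidence is the maximum likelihood estimate of a Markov transition probability: count how often a context occurs with a successor, and how often that successor is the predicted symbol. The function name and the toy series are hypothetical; the paper's procedure for choosing context lengths in the variable length model is not reproduced here.

```python
def rule_confidence(series, context, nxt):
    """Confidence of the rule `context -> nxt` in a categorical series.

    This is count(context followed by nxt) / count(context followed by anything),
    i.e. the maximum likelihood estimate of P(nxt | context) under a Markov
    model whose order equals len(context).
    """
    context = tuple(context)
    k = len(context)
    context_count = 0
    hit_count = 0
    # Only positions where the context has a successor are counted.
    for i in range(len(series) - k):
        if tuple(series[i:i + k]) == context:
            context_count += 1
            if series[i + k] == nxt:
                hit_count += 1
    if context_count == 0:
        return None  # the context never occurs, so the rule cannot be scored
    return hit_count / context_count


# Toy series (made up for illustration).
series = list("abababac")
short_rule = rule_confidence(series, ("a",), "b")      # context of length 1
long_rule = rule_confidence(series, ("b", "a"), "b")   # context of length 2
```

Comparing the two estimates shows why context length matters: lengthening the context changes both the count in the denominator and the estimated probability.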
Consistent selection of the number of clusters via crossvalidation — Biometrika
december 2010 by cshalizi
"In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling. A novel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split."
clustering
stability_of_learning
data_mining
statistics
to_teach:data-mining
to_teach:undergrad-ADA
december 2010 by cshalizi
Predicting consumer behavior with Web search — PNAS
october 2010 by cshalizi
What search can and cannot predict. They mention, but I think could have stressed even more, that the search data is generated _automatically_ as a by-product of now-ordinary social life, rather than a deliberate construction on the part of public or private data-collecting agencies, so it is very, very, very cheap.
internet
data_mining
to_teach:data-mining
kith_and_kin
watts.duncan
hofman.jake
sociology
information_retrieval
networked_life
have_read
october 2010 by cshalizi
Suykens, Alzate, Pelckmans: Primal and dual model representations in kernel-based learning
august 2010 by cshalizi
"This paper discusses the role of primal and (Lagrange) dual model representations in problems of supervised and unsupervised learning. The specification of the estimation problem is conceived at the primal level as a constrained optimization problem. The constraints relate to the model which is expressed in terms of the feature map. From the conditions for optimality one jointly finds the optimal model representation and the model estimate. At the dual level the model is expressed in terms of a positive definite kernel function, which is characteristic for a support vector machine methodology. It is discussed how least squares support vector machines are playing a central role as core models across problems of regression, classification, principal component analysis, spectral clustering, canonical correlation analysis, dimensionality reduction and data visualization."
kernel_methods
statistics
machine_learning
data_mining
to_teach:data-mining
august 2010 by cshalizi
Practical Approaches to Principal Component Analysis in the Presence of Missing Values
august 2010 by cshalizi
From a quick skim, it looks too advanced to actually teach in 350, but potentially a handy reference.
principal_components
dimension_reduction
to_teach:data-mining
statistics
data_mining
to_teach:undergrad-ADA
august 2010 by cshalizi
ILI 2009 Presentation – "Self-plagiarism is style"
june 2010 by cshalizi
Cool effects achieved by applying basic data mining to libraries. To be used as teaching fodder, but honestly I should also find the time to suggest it to our librarians.
libraries
data_mining
information_retrieval
collaborative_filtering
via:magistra_et_mater
to_teach:data-mining
june 2010 by cshalizi
[0901.2735] State Space Realization Theorems For Data Mining
may 2010 by cshalizi
"In this paper, we consider formal series associated with events, profiles derived from events, and statistical models that make predictions about events. We prove theorems about realizations for these formal series using the language and tools of Hopf algebras."
state-space_models
data_mining
may 2010 by cshalizi
Desiderata for a Predictive Theory of Statistics - Clarke, 2010
march 2010 by cshalizi
"In many contexts the predictive validation of models or their associated prediction strategies is of greater importance than model identification which may be practically impossible. This is particularly so in fields involving complex or high dimensional data where model selection, or more generally predictor selection is the main focus of effort. This paper suggests a unified treatment for predictive analyses based on six `desiderata'. These desiderata are an effort to clarify what criteria a good predictive theory of statistics should satisfy." --- I presume (I haven't read the paper yet) that he means a theory of statistical predictions, and not a theory which tries to predict future developments within statistics.
statistics
prediction
methodology
to_read
data_mining
march 2010 by cshalizi
[1003.0529] A Unified Algorithmic Framework for Multi-Dimensional Scaling
march 2010 by cshalizi
"In this paper, we propose a unified algorithmic framework for solving many known variants of MDS. Our algorithm is a simple iterative scheme with guaranteed convergence, and is modular; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods, in comparable time."
multidimensional_scaling
dimension_reduction
visual_display_of_quantitative_information
to_teach:data-mining
data_mining
march 2010 by cshalizi
Beyond DCG: User Behavior as a Predictor of a Successful Search
february 2010 by cshalizi
Yay Kristina! (Not sure I could actually teach this in 350.)
search_engines
markov_models
data_mining
klinkner.kristina
information_retrieval
to_teach:data-mining
kith_and_kin
february 2010 by cshalizi
Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining
december 2009 by cshalizi
I should really add a rule-learning segment to the class.
to_teach:data-mining
data_mining
machine_learning
december 2009 by cshalizi
Clustering: Art or Science?
november 2009 by cshalizi
I think this crystallizes why I never like teaching clustering.
clustering
data_mining
statistics
machine_learning
data_analysis
guyon.isabelle
luxburg.ulrike_von
williamson.robert
via:chl
to_teach:data-mining
november 2009 by cshalizi
[0910.2340] A Stochastic Model for Collaborative Recommendation
october 2009 by cshalizi
"Collaborative recommendation is an information-filtering technique that attempts to present ... movies, music, books, news, images, Web pages, etc. that are likely of interest to [users]. ... In its most common form, the problem is framed as trying to estimate ratings for items that have not yet been consumed by a user. Despite wide-ranging literature, little is known about the statistical properties of recommendation systems. In fact, no clear probabilistic model even exists allowing us to precisely describe the mathematical forces driving collaborative filtering. To provide an initial contribution to this, we propose to set out a general sequential stochastic model for collaborative recommendation and analyze its asymptotic performance as the number of users grows.... analysis of the so-called cosine-type nearest neighbor collaborative method .... consistency of the procedure under mild assumptions on the model. Rates of convergence and examples..."
collaborative_filtering
information_retrieval
data_mining
to_read
to:NB
to_teach:data-mining
october 2009 by cshalizi
Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition.
october 2009 by cshalizi
Free PDF! (Still, I find my bound physical copy much more convenient.)
books:recommended
machine_learning
data_mining
statistics
learning_theory
estimation
cross-validation
ensemble_methods
classifiers
regression
graphical_models
clustering
dimension_reduction
bootstrap
via:arthegall
have_read
october 2009 by cshalizi
Powell's Books - Principles and Theory for Data Mining and Machine Learning (Springer Series in Statistics) by Bertrand Clarke
july 2009 by cshalizi
Too late to consider using as a textbook for 36-350, but I should ask for an examination copy. Update: bought it. Pretty good but way too advanced mathematically for my class; more "If you liked _The Elements of Statistical Learning_, but wish it had more traditional statistical theory, have we got a book for you."
books:noted
data_mining
statistics
machine_learning
to_teach:data-mining
july 2009 by cshalizi
All we want are the facts, ma'am
february 2009 by cshalizi
When I wrote about Chris Anderson's idiotic piece back in the spring, I didn't say anything about the quote from Norvig, because it sounded very strange and not at all like Norvig. And, indeed, he now says "That's a silly statement, I didn't say it, and I disagree with it." Ah, Wired!
why_oh_why_cant_we_have_a_better_press_corps
anderson.chris
statistics
modeling
data_mining
norvig.peter
machine_learning
bad_science_journalism
fact_checking
via:arthegall
via:shivak
february 2009 by cshalizi
Margaret Ackerman and Shai Ben-David, "Measures of Clustering Quality: A Working Set of Axioms for Clustering"
december 2008 by cshalizi
A rebuttal to Kleinberg's impossibility theorem for clustering (bookmarked earlier). There are measures of _cluster quality_ which satisfy all the natural axioms, which is good enough.
clustering
to_teach:data-mining
via:arthegall
via:vielmetti
data_mining
ackerman.margaret
ben-david.shai
kleinberg.jon
december 2008 by cshalizi
Whimsley: Theses on Netflix
november 2008 by cshalizi
Mostly good, except for the last thesis: "Recommender systems only filter culture. The point, in various ways, is to create environments in which artists can prosper." No! The point is to create environments in which CULTURE can prosper; professional artists are something else.
slee.tom
collaborative_filtering
data_mining
november 2008 by cshalizi
The Screens Issue - If You Liked This, Sure to Love That - Winning the Netflix Prize - NYTimes.com
november 2008 by cshalizi
What the ******* ****, Netflix wasn't using singular value decomposition? Can that really be true? (The hope that the report massively misunderstood is the only thing saving this from an "utter_stupidity" tag.)
netflix_prize
data_mining
collaborative_filtering
to_teach:data-mining
principal_components
november 2008 by cshalizi
Notional Slurry » Is this a good time to reveal credit card terms?
november 2008 by cshalizi
In which Bill proposes that customers start data-mining the credit-card companies.
credit_cards
data_mining
modest_proposals
tozier.william
november 2008 by cshalizi