cshalizi + data_mining   99

[1204.6441] "I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" -- A Balanced Survey on Election Prediction using Twitter Data
"Predicting X from Twitter is a popular fad within the Twitter research subculture. It seems both appealing and relatively easy. Among such kind of studies, electoral prediction is maybe the most attractive, and at this moment there is a growing body of literature on such a topic. This is not only an interesting research problem but, above all, it is extremely difficult. However, most of the authors seem to be more interested in claiming positive results than in providing sound and reproducible methods. It is also especially worrisome that many recent papers seem to only acknowledge those studies supporting the idea of Twitter predicting elections, instead of conducting a balanced literature review showing both sides of the matter. After reading many of such papers I have decided to write such a survey myself. Hence, in this paper, every study relevant to the matter of electoral prediction using social media is commented. From this review it can be concluded that the predictive power of Twitter regarding elections has been greatly exaggerated, and that hard research problems still lie ahead."
to:NB  social_media  data_mining  prediction  have_read 
24 days ago by cshalizi
[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and `generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hieirarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful. In particular, since BHV have shown that the space of trees is negatively curved (a CAT(0) space), a natural representation of a collection of trees is a tree. We compare this representation to the Euclidean approximations of treespace made available through Multidimensional Scaling of the matrix of distances between trees. We also provide applications of the distances between trees to hierarchical clustering trees constructed from microarrays. Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
to:NB  clustering  hierarchical_structure  holmes.susan  data_mining  statistics  to_teach:data-mining  gene_expression_data_analysis  via:ryan_t 
4 weeks ago by cshalizi
Game-powered machine learning
"Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the “wisdom of the crowds.” Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., “funky jazz with saxophone,” “spooky electronica,” etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data."

--- This is more than a bit of a stunt, but it points in an interesting direction.
to:NB  to_read  data_mining  collective_cognition  active_learning  tagging  classifiers  re:democratic_cognition 
4 weeks ago by cshalizi
[1204.1002] Fast Multi-Scale Detection of Relevant Communities
"Nowadays, networks are almost ubiquitous. In the past decade, community detection received an increasing interest as a way to uncover the structure of networks by grouping nodes into communities more densely connected internally than externally. Yet most of the effective methods available do not consider the potential levels of organisation, or scales, a network may encompass and are therefore limited. In this paper we present a method compatible with global and local criteria that enables fast multi-scale community detection. The method is derived in two algorithms, one for each type of criterion, and implemented with 6 known criteria. Uncovering communities at various scales is a computationally expensive task. Therefore this work puts a strong emphasis on the reduction of computational complexity. Some heuristics are introduced for speed-up purposes. Experiments demonstrate the efficiency and accuracy of our method with respect to each algorithm and criterion by testing them against large generated multi-scale networks. This study also offers a comparison between criteria and between the global and local approaches."
to:NB  community_discovery  data_mining 
7 weeks ago by cshalizi
[1203.2200] Role-Dynamics: Fast Mining of Large Dynamic Networks
"To understand the structural dynamics of a large-scale social, biological or technological network, it may be useful to discover behavioral roles representing the main connectivity patterns present over time. In this paper, we propose a scalable non-parametric approach to automatically learn the structural dynamics of the network and individual nodes. Roles may represent structural or behavioral patterns such as the center of a star, peripheral nodes, or bridge nodes that connect different communities. Our novel approach learns the appropriate structural role dynamics for any arbitrary network and tracks the changes over time. In particular, we uncover the specific global network dynamics and the local node dynamics of a technological, communication, and social network. We identify interesting node and network patterns such as stationary and non-stationary roles, spikes/steps in role-memberships (perhaps indicating anomalies), increasing/decreasing role trends, among many others. Our results indicate that the nodes in each of these networks have distinct connectivity patterns that are non-stationary and evolve considerably over time. Overall, the experiments demonstrate the effectiveness of our approach for fast mining and tracking of the dynamics in large networks. Furthermore, the dynamic structural representation provides a basis for building more sophisticated models and tools that are fast for exploring large dynamic networks."
in_NB  network_data_analysis  data_mining  community_discovery  neville.jennifer 
7 weeks ago by cshalizi
[1203.1647] A Survey of Prediction Using Social Media
"Social media comprises interactive applications and platforms for creating, sharing and exchange of user-generated contents. The past ten years have brought huge growth in social media, especially online social networking services, and it is changing our ways to organize and communicate. It aggregates opinions and feelings of diverse groups of people at low cost. Mining the attributes and contents of social media gives us an opportunity to discover social structure characteristics, analyze action patterns qualitatively and quantitatively, and sometimes the ability to predict future human related events. In this paper, we firstly discuss the realms which can be predicted with current social media, then overview available predictors and techniques of prediction, and finally discuss challenges and possible future directions."
to:NB  social_media  re:social-networks-as-sensor-networks  data_mining 
7 weeks ago by cshalizi
Not an April Fool - Charlie's Diary
"It's easy to imagine how we could make something worse than "Girls Around Me"—something much worse. Facebook encourages us to disclose a wide range of information about ourselves, including our religion and a photograph. Religion is obvious: "Yids Among Us" would obviously be one of the go-to tools of choice for Neo-Nazis. As for skin colour, ethnicity identification from face images is out there already. Want to go queer bashing? There's an algorithm out there for guessing sexual orientation based on the network graph of the target's facebook friends. It's probably possible to apply this sort of data mining exercise to determine whether a woman has had an abortion or is pro-choice.
"In the worst case, it's possible to envisage geolocation and data aggregation apps being designed to facilitate the identification and elimination of some ethnic or class enemy, not only by making it easy for users to track them down, but by making it easy for users to identify each other and form ad-hoc lynch mobs. (Hence my reference to the Rwandan Genocide earlier. Think it couldn't happen? Look at Iran and imagine an app written for the Basij to make it easy to identify dissidents and form ad-hoc goon squads to proactively hunt them down. Or any other organization in the post-networked world that has a social role corresponding to the Red Guards.)
"But as I said earlier, the app is not the problem. The problem is the deployment by profit-oriented corporations of behavioural psychology techniques to induce people to over-share information which can then be aggregated and disclosed to third parties for targeted marketing purposes."

Comment: Stross is not, sadly, exaggerating.
networked_life  data_mining  social_networks  moral_responsibility  you_are_the_product  stross.charlie 
8 weeks ago by cshalizi
[1203.6093] Consensus clustering in complex networks
"The community structure of complex networks reveals both their organization and hidden relationships among their constituents. Most community detection methods currently available are not deterministic, and their results typically depend on the specific random seeds, initial conditions and tie-break rules adopted for their execution. Consensus clustering is used in data analysis to generate stable results out of a set of partitions delivered by stochastic methods. Here we show that consensus clustering can be combined with any existing method in a self-consistent way, enhancing considerably both the stability and the accuracy of the resulting partitions. This framework is also particularly suitable to monitor the evolution of community structure in temporal networks. An application of consensus clustering to a large citation network of physics papers demonstrates its capability to keep track of the birth, death and diversification of topics."
in_NB  community_discovery  network_data_analysis  clustering  data_mining 
8 weeks ago by cshalizi
Taylor & Francis Online :: Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering - Journal of Computational and Graphical Statistics - Volume 20, Issue 2
"For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality."
to:NB  visual_display_of_quantitative_information  clustering  data_mining  to_teach:data-mining 
8 weeks ago by cshalizi
"Local equivalences of distances between clusterings—a geometric perspective" --- Marina Meilă
"In comparing clusterings, several different distances and indices are in use. We prove that the Misclassification Error distance, the Hamming distance (equivalent to the unadjusted Rand index), and the χ 2 distance between partitions are equivalent in the neighborhood of 0. In other words, if two partitions are very similar, then one distance defines upper and lower bounds on the other and viceversa. The proofs are geometric and rely on the concavity of the distances. The geometric intuitions themselves advance the understanding of the space of all clusterings. To our knowledge, this is the first result of its kind.
Practically, distances are frequently used to compare two clusterings of a set of observations. But the motivation for this work is in the theoretical study of data clustering. Distances between partitions are involved in constructing new methods for cluster validation, determining the number of clusters, and analyzing clustering algorithms. From a probability theory point of view, the present results apply to any pair of finite valued random variables, and provide simple yet tight upper and lower bounds on the χ 2 measure of (in)dependence valid when the two variables are strongly dependent."
in_NB  clustering  data_mining  meila.marina 
february 2012 by cshalizi
[1202.1561] Tree Models for Difference and Change Detection in a Complex Environment
"A new family of tree models is proposed, which we call "differential trees." A differential tree model is constructed from multiple data sets and aims to detect distributional differences between them. The new methodology differs from the existing difference and change detection techniques in its nonparametric nature, model construction from multiple data sets, and applicability to high-dimensional data. Through a detailed study of an arson case in New Zealand, where an individual is known to have been laying vegetation fires within a certain time period, we illustrate how these models can help detect changes in the frequencies of event occurrences and uncover unusual clusters of events in a complex environment."

--- After reading, I think their exposition is needlessly hard to follow, but let me take a stab at it. In an ordinary classification tree, we are interested in the distribution of the class labels Y given the predictors X, i.e., Pr(Y|X), and make splits on X so that (in essence) the conditional entropy H[Y|X] becomes small. This is of course equivalent to making splits so that the divergence of Pr(Y|X) from Pr(Y) is maximized. What they are interested in is not classification but _describing_ how the different classes are distinct, so the relevant distribution is Pr(X|Y), and they want a big divergence between Pr(X) and Pr(X|Y).
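--- A minimal sketch of the equivalence claimed above, using a made-up joint distribution over three leaves and two classes (the numbers and function names are illustrative, not the paper's). It checks numerically that the drop in entropy, H[Y] - H[Y|X], equals the leaf-weighted average divergence of Pr(Y|X) from Pr(Y).

```python
from math import log2

def entropy(p):
    """Shannon entropy, in bits, of a probability vector."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def kl(p, q):
    """Kullback-Leibler divergence D(p || q), in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joint distribution: rows are leaves (cells of X), columns are classes Y.
joint = [[0.30, 0.10],
         [0.05, 0.25],
         [0.15, 0.15]]

p_leaf = [sum(row) for row in joint]        # Pr(X in leaf)
p_y = [sum(col) for col in zip(*joint)]     # Pr(Y)
p_y_given_leaf = [[v / w for v in row] for row, w in zip(joint, p_leaf)]

# H[Y|X]: leaf-weighted average of the within-leaf class entropies.
cond_entropy = sum(w * entropy(row) for w, row in zip(p_leaf, p_y_given_leaf))
info_gain = entropy(p_y) - cond_entropy

# Leaf-weighted average divergence of Pr(Y|X) from Pr(Y).
expected_kl = sum(w * kl(row, p_y) for w, row in zip(p_leaf, p_y_given_leaf))

print(info_gain, expected_kl)  # the two agree up to floating-point error
```

Both quantities are the mutual information between leaf and class. Since H[Y] does not depend on the split, minimizing the conditional entropy and maximizing the average divergence pick the same splits.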
to:NB  re:network_differences  statistics  hypothesis_testing  density_estimation  decision_trees  have_read  data_mining  two-sample_tests 
february 2012 by cshalizi
Sun, Wang, Fang: Regularized k-means clustering of high-dimensional data and its asymptotic consistency
"K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering."
in_NB  clustering  statistics  lasso  data_mining  to_teach:data-mining 
february 2012 by cshalizi
A General Framework for Dimensionality-Reducing Data Visualization Mapping
"In recent years, a wealth of dimension-reduction techniques for data visualization and preprocessing has been established. Nonparametric methods require additional effort for out-of-sample extensions, because they provide only a mapping of a given finite set of points. In this letter, we propose a general view on nonparametric dimension reduction based on the concept of cost functions and properties of the data. Based on this general principle, we transfer nonparametric dimension reduction to explicit mappings of the data manifold such that direct out-of-sample extensions become possible. Furthermore, this concept offers the possibility of investigating the generalization ability of data visualization to new data points. We demonstrate the approach based on a simple global linear mapping, as well as prototype-based local linear mappings. In addition, we can bias the functional form according to given auxiliary information. This leads to explicit supervised visualization mappings with discriminative properties comparable to state-of-the-art approaches."
in_NB  dimension_reduction  visual_display_of_quantitative_information  data_analysis  data_mining  manifold_learning  to_teach:data-mining 
february 2012 by cshalizi
[1201.5568] Dynamic trees for streaming and massive data contexts
"Data collection at a massive scale is becoming ubiquitous in a wide variety of settings, from vast offline databases to streaming real-time information. Learning algorithms deployed in such contexts must rely on single-pass inference, where the data history is never revisited. In streaming contexts, learning must also be temporally adaptive to remain up-to-date against unforeseen changes in the data generating mechanism. Although rapidly growing, the online Bayesian inference literature remains challenged by massive data and transient, evolving data streams. Non-parametric modelling techniques can prove particularly ill-suited, as the complexity of the model is allowed to increase with the sample size. In this work, we take steps to overcome these challenges by porting standard streaming techniques, like data discarding and downweighting, into a fully Bayesian framework via the use of informative priors and active learning heuristics. We showcase our methods by augmenting a modern non-parametric modelling framework, dynamic trees, and illustrate its performance on a number of practical examples. The end product is a powerful streaming regression and classification tool, whose performance compares favourably to the state-of-the-art."
to:NB  machine_learning  non-stationarity  statistics  data_mining  to_read  re:growing_ensemble_project 
january 2012 by cshalizi
RE-EM Trees: A Data Mining Approach for Longitudinal and Clustered Data
"Longitudinal data refer to the situation where repeated observations are available for each sampled object. Clustered data, where observations are nested in a hierarchical structure within objects (without time necessarily being involved) represent a similar type of situation. Methodologies that take this structure into account allow for the possibilities of systematic differences between objects that are not related to attributes and autocorrelation within objects across time periods. A standard methodology in the statistics literature for this type of data is the mixed effects model, where these differences between objects are represented by so-called “random effects” that are estimated from the data (population-level relationships are termed “fixed effects,” together resulting in a mixed effects model). This paper presents a methodology that combines the structure of mixed effects models for longitudinal and clustered data with the flexibility of tree-based estimation methods. We apply the resulting estimation method, called the RE-EM tree, to pricing in online transactions, showing that the RE-EM tree is less sensitive to parametric assumptions and provides improved predictive power compared to linear models with random effects and regression trees without random effects. We also apply it to a smaller data set examining accident fatalities, and show that the RE-EM tree strongly outperforms a tree without random effects while performing comparably to a linear model with random effects. We also perform extensive simulation experiments to show that the estimator improves predictive performance relative to regression trees without random effects and is comparable or superior to using linear models with random effects in more general situations."
to:NB  machine_learning  decision_trees  data_mining  statistics  hierarchical_models 
january 2012 by cshalizi
Mining of Massive Datasets - Academic and Professional Books - Cambridge University Press
"The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike."

--- What a remarkably hideous cover!
to:NB  books:noted  data_mining  to_teach:data-mining  machine_learning  computational_statistics 
january 2012 by cshalizi
Graph-based Natural Language Processing and Information Retrieval - Mihalcea and Radev
"Graph theory and the fields of natural language processing and information retrieval are well-studied disciplines. Traditionally, these areas have been perceived as distinct, with different algorithms, different applications, and different potential end-users. However, recent research has shown that these disciplines are intimately connected, with a large variety of natural language processing and information retrieval applications finding efficient solutions within graph-theoretical frameworks. This book extensively covers the use of graph-based algorithms for natural language processing and information retrieval. It brings together topics as diverse as lexical semantics, text summarization, text mining, ontology construction, text classification, and information retrieval, which are connected by the common underlying theme of the use of graph-theoretical methods for text and information processing tasks. Readers will come away with a firm understanding of the major methods and applications in natural language processing and information retrieval that rely on graph-based representations and algorithms."
in_NB  books:noted  natural_language_processing  graph_theory  data_mining  text_mining  radev.dragomir 
december 2011 by cshalizi
[1110.4851] Leveraging User Diversity to Harvest Knowledge on the Social Web
"Social web users are a very diverse group with varying interests, levels of expertise, enthusiasm, and expressiveness. As a result, the quality of content and annotations they create to organize content is also highly variable. While several approaches have been proposed to mine social annotations, for example, to learn folksonomies that reflect how people relate narrower concepts to broader ones, these methods treat all users and the annotations they create uniformly. We propose a framework to automatically identify experts, i.e., knowledgeable users who create high quality annotations, and use their knowledge to guide folksonomy learning. We evaluate the approach on a large body of social annotations extracted from the photosharing site Flickr. We show that using expert knowledge leads to more detailed and accurate folksonomies. Moreover, we show that including annotations from non-expert, or novice, users leads to more comprehensive folksonomies than experts' knowledge alone."
to:NB  data_mining  social_life_of_the_mind  social_media  kith_and_kin  lerman.kristina  tagging 
october 2011 by cshalizi
[1110.3225] Mining Patterns in Networks using Homomorphism
"In recent years many algorithms have been developed for finding patterns in graphs and networks. A disadvantage of these algorithms is that they use subgraph isomorphism to determine the support of a graph pattern; subgraph isomorphism is a well-known NP complete problem. In this paper, we propose an alternative approach which mines tree patterns in networks by using subgraph homomorphism. The advantage of homomorphism is that it can be computed in polynomial time, which allows us to develop an algorithm that mines tree patterns in arbitrary graphs in incremental polynomial time. Homomorphism however entails two problems not found when using isomorphism: (1) two patterns of different size can be equivalent; (2) patterns of unbounded size can be frequent. In this paper we formalize these problems and study solutions that easily fit within our algorithm."
in_NB  to_read  re:smoothing_adjacency_matrices  network_data_analysis  data_mining  graph_theory 
october 2011 by cshalizi
[1110.2515] Normalized Mutual Information to evaluate overlapping community finding algorithms
"Given the increasing popularity of algorithms for overlapping clustering, in particular in social network analysis, quantitative measures are needed to measure the accuracy of a method. Given a set of true clusters, and the set of clusters found by an algorithm, these sets of clusters must be compared to see how similar or different the sets are. A normalized measure is desirable in many contexts, for example assigning a value of 0 where the two sets are totally dissimilar, and 1 where they are identical. A measure based on normalized mutual information, [1], has recently become popular. We demonstrate unintuitive behaviour of this measure, and show how this can be corrected by using a more conventional normalization. We compare the results to that of other measures, such as the Omega index [2]."
in_NB  community_discovery  information_theory  clustering  data_mining 
october 2011 by cshalizi
http://www.dtic.mil/descriptivesum/Y2012/DARPA/0602702E_2_PB_2012.pdf
"develop tools [for] automated interpretation, quantitative analysis, and visualization of social networks.... social networks [are models for] terrorist cells, insurgent groups, and other stateless actors whose connectedness is established not [by] shared geography but [by] correlat[ed] participation in coordinated activities ... apply emerging methods for edge finding and cluster analysis to detect, characterize, and predict the dynamics of social networks. ... application in tactical contexts... foundation for cultural intelligence - understanding the stability, governance, and economic indicators of a region ... 2012 Plans: Develop techniques for simulation, visualization, inference, and prediction of social network dynamics; ... for modeling the interactions between and within cooperating/competing/conflicting social networks, sub- networks, and super-networks and for predicting the merging and splitting of social networks; Evaluate ... on real-world social-cultural-network data."
darpa  nexus-7  afghanistan  data_mining  counter-insurgency  network_data_analysis  to:blog 
july 2011 by cshalizi
Exclusive: Inside Darpa’s Secret Afghan Spy Machine | Danger Room | Wired.com
I must say that the few lines in the budget document were a hell of a lot clearer about what this is _supposed_ to achieve than the Wired article itself.  But I do not have a good feeling about this project, at all.  (At the very least, for $30 million, you could teach a lot of soldiers Dari and Pashto, or recruit a lot of Afghan informants.)
afghanistan  darpa  military_industrial_complex  data_mining  network_data_analysis  us_military  counter-insurgency  to:blog 
july 2011 by cshalizi
"Smooth Regression Analysis" (G. S. Watson, 1964) JSTOR: Sankhyā: The Indian Journal of Statistics, Series A, Vol. 26, No. 4 (Dec., 1964), pp. 359-372
The abstract is great: "Few would deny that the most powerful statistical tool is graph paper. When however there are many observations (and/or many variables) graphical procedures become tedious. It seems to the author that the most characteristic problem for statisticians at the moment is the development of methods for analyzing the data poured out by electronic observing systems. The present paper gives a simple computer method for obtaining a "graph" from a large number of observations."
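The estimator from this paper is what is now usually called Nadaraya-Watson kernel regression: the "graph" at a point is a locally weighted average of the observed responses. A minimal sketch, assuming a Gaussian kernel and a hand-picked bandwidth (both illustrative choices, as are the function names):

```python
from math import exp

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the constant cancels in the ratio below."""
    return exp(-0.5 * u * u)

def kernel_smooth(x_obs, y_obs, x0, bandwidth):
    """Kernel-weighted average of y_obs, with weights centered at x0."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    weights = [gaussian_kernel((x0 - xi) / bandwidth) for xi in x_obs]
    total = sum(weights)
    if total == 0:
        raise ValueError("x0 is too far from every observation for this bandwidth")
    return sum(w * yi for w, yi in zip(weights, y_obs)) / total

# Toy data: evaluate the smooth curve on a small grid.
x_obs = [0.0, 1.0, 2.0, 3.0, 4.0]
y_obs = [0.1, 0.9, 2.2, 2.8, 4.1]
grid = [0.5 * k for k in range(9)]
curve = [kernel_smooth(x_obs, y_obs, g, bandwidth=0.75) for g in grid]
```

The bandwidth controls how far the averaging reaches: small values track the data closely, large values flatten the curve toward the overall mean. In practice it would be chosen by something like cross-validation rather than by hand.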
smoothing  regression  kernel_estimators  data_mining  to_teach:undergrad-ADA  to_teach:data-mining  via:gmg 
june 2011 by cshalizi
Friedman, Yu: Leo Breiman (1929–2005)
Introduction to the special section of _Annals of Applied Statistics_ in memory of Leo Breiman.
breiman.leo  lives_of_the_scientists  statistics  machine_learning  data_mining  CART 
january 2011 by cshalizi
Rule generation for categorical time series with Markov assumptions
"Several procedures of sequential pattern analysis are designed to detect frequently occurring patterns in a single categorical time series (episode mining). Based on these frequent patterns, rules are generated and evaluated, for example, in terms of their confidence. The confidence value is commonly interpreted as an estimate of a conditional probability, so some kind of stochastic model has to be assumed. The model is identified as a variable length Markov model. With this assumption, the usual confidences are maximum likelihood estimates of the transition probabilities of the Markov model. We discuss possibilities of how to efficiently fit an appropriate model to the data. Based on this model, rules are formulated. It is demonstrated that this new approach generates noticeably less and more reliable rules." --- I should really add some time series stuff to data mining...
data_mining  markov_models  time_series  in_NB  to_teach:data-mining  variable-length_markov_models 
december 2010 by cshalizi
Consistent selection of the number of clusters via crossvalidation — Biometrika
"In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling.Anovel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split."
clustering  stability_of_learning  data_mining  statistics  to_teach:data-mining  to_teach:undergrad-ADA 
december 2010 by cshalizi
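A rough sketch of the instability idea in the entry above, for teaching purposes. Everything here is an assumption rather than the paper's estimator: k-means with a deterministic farthest-first start stands in for "any clustering algorithm", the data are split once into two training sets and one validation set, and instability is measured as the fraction of validation pairs on which the two fitted clusterings disagree about co-membership. The names `kmeans`, `assign` and `instability` are hypothetical.

```python
import random
from itertools import combinations

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(p, centers):
    """Index of the center nearest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with a deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign(p, centers)].append(p)
        # Recompute each center as its group's mean; keep the old center if the group is empty.
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers

def instability(points, k, seed=0):
    """Fraction of validation pairs whose co-membership differs between two fits."""
    rng = random.Random(seed)
    pts = points[:]
    rng.shuffle(pts)
    third = len(pts) // 3
    train1, train2, valid = pts[:third], pts[third:2 * third], pts[2 * third:]
    c1, c2 = kmeans(train1, k), kmeans(train2, k)
    l1 = [assign(p, c1) for p in valid]
    l2 = [assign(p, c2) for p in valid]
    pairs = list(combinations(range(len(valid)), 2))
    disagree = sum((l1[i] == l1[j]) != (l2[i] == l2[j]) for i, j in pairs)
    return disagree / len(pairs)

# Hypothetical data: three well-separated blobs in the plane.
rng = random.Random(1)
data = [(cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5))
        for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(30)]
scores = {k: instability(data, k) for k in range(2, 6)}
best_k = min(scores, key=scores.get)
```

The selected number of clusters is whichever `k` gives the smallest score. With blobs this well separated, both fits at `k = 3` recover the same partition, so the score there is zero; a real application would average over many random splits instead of using just one.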
Predicting consumer behavior with Web search — PNAS
What search can and cannot predict. They mention, but I think could have stressed even more, that the search data is generated _automatically_ as a by-product of now-ordinary social life, rather than a deliberate construction on the part of public or private data-collecting agencies, so it is very, very, very cheap.
internet  data_mining  to_teach:data-mining  kith_and_kin  watts.duncan  hofman.jake  sociology  information_retrieval  networked_life  have_read 
october 2010 by cshalizi
Suykens, Alzate, Pelckmans: Primal and dual model representations in kernel-based learning
"This paper discusses the role of primal and (Lagrange) dual model representations in problems of supervised and unsupervised learning. The specification of the estimation problem is conceived at the primal level as a constrained optimization problem. The constraints relate to the model which is expressed in terms of the feature map. From the conditions for optimality one jointly finds the optimal model representation and the model estimate. At the dual level the model is expressed in terms of a positive definite kernel function, which is characteristic for a support vector machine methodology. It is discussed how least squares support vector machines are playing a central role as core models across problems of regression, classification, principal component analysis, spectral clustering, canonical correlation analysis, dimensionality reduction and data visualization."
kernel_methods  statistics  machine_learning  data_mining  to_teach:data-mining 
august 2010 by cshalizi
ILI 2009 Presentation – "Self-plagiarism is style"
Cool effects achieved by applying basic data mining to libraries. To be used as teaching fodder, but honestly I should also find the time to suggest it to our librarians.
libraries  data_mining  information_retrieval  collaborative_filtering  via:magistra_et_mater  to_teach:data-mining 
june 2010 by cshalizi
[0901.2735] State Space Realization Theorems For Data Mining
"In this paper, we consider formal series associated with events, profiles derived from events, and statistical models that make predictions about events. We prove theorems about realizations for these formal series using the language and tools of Hopf algebras."
state-space_models  data_mining 
may 2010 by cshalizi
Desiderata for a Predictive Theory of Statistics - Clarke, 2010
"In many contexts the predictive validation of models or their associated prediction strategies is of greater importance than model identification which may be practically impossible. This is particularly so in fields involving complex or high dimensional data where model selection, or more generally predictor selection is the main focus of effort. This paper suggests a unified treatment for predictive analyses based on six `desiderata'. These desiderata are an effort to clarify what criteria a good predictive theory of statistics should satisfy." --- I presume (I haven't read the paper yet) that he means a theory of statistical predictions, and not a theory which tries to predict future developments within statistics.
statistics  prediction  methodology  to_read  data_mining 
march 2010 by cshalizi
[1003.0529] A Unified Algorithmic Framework for Multi-Dimensional Scaling
"In this paper, we propose a unified algorithmic framework for solving many known variants of MDS. Our algorithm is a simple iterative scheme with guaranteed convergence, and is _modular_; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods, in comparable time."
multidimensional_scaling  dimension_reduction  visual_display_of_quantitative_information  to_teach:data-mining  data_mining 
march 2010 by cshalizi
[0910.2340] A Stochastic Model for Collaborative Recommendation
"Collaborative recommendation is an information-filtering technique that attempts to present ... movies, music, books, news, images, Web pages, etc. that are likely of interest to [users]. ... In its most common form, the problem is framed as trying to estimate ratings for items that have not yet been consumed by a user. Despite wide-ranging literature, little is known about the statistical properties of recommendation systems. In fact, no clear probabilistic model even exists allowing us to precisely describe the mathematical forces driving collaborative filtering. To provide an initial contribution to this, we propose to set out a general sequential stochastic model for collaborative recommendation and analyze its asymptotic performance as the number of users grows.... analysis of the so-called cosine-type nearest neighbor collaborative method .... consistency of the procedure under mild assumptions on the model. Rates of convergence and examples..."
collaborative_filtering  information_retrieval  data_mining  to_read  to:NB  to_teach:data-mining 
october 2009 by cshalizi
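For the entry above, a minimal sketch of what a cosine-type nearest-neighbor recommender might look like. This is one common variant and not necessarily the paper's exact definition: similarity is computed over co-rated items only, ratings are assumed strictly positive, and the prediction is a plain average over the `k` most similar users who rated the item. The names `cosine` and `predict` and the ratings table are invented for illustration.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two users, restricted to items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item, k=2):
    """Average the item's rating over the k users most similar to `user` who rated it."""
    candidates = [(cosine(ratings[user], r), r[item])
                  for name, r in ratings.items()
                  if name != user and item in r]
    neighbors = sorted(candidates, reverse=True)[:k]
    if not neighbors:
        return None  # nobody else has rated this item
    return sum(rating for _, rating in neighbors) / len(neighbors)

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "ann": {"a": 5, "b": 1},
    "bob": {"a": 5, "b": 1, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
}
print(predict(ratings, "ann", "c", k=1))  # 4.0 (bob is ann's nearest neighbor)
print(predict(ratings, "ann", "c", k=2))  # 3.0 (average of bob and cat)
```

The asymptotic question in the abstract is what happens to predictions like these as the number of rows in `ratings` grows.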
Powell's Books - Principles and Theory for Data Mining and Machine Learning (Springer Series in Statistics) by Bertrand Clarke
Too late to consider using as a textbook for 36-350, but I should ask for an examination copy. Update: bought it. Pretty good but way too advanced mathematically for my class; more "If you liked _The Elements of Statistical Learning_, but wish it had more traditional statistical theory, have we got a book for you."
books:noted  data_mining  statistics  machine_learning  to_teach:data-mining 
july 2009 by cshalizi
All we want are the facts, ma'am
When I wrote about Chris Anderson's idiotic piece back in the spring, I didn't say anything about the quote from Norvig, because it sounded very strange and not at all like Norvig. And, indeed, he now says "That's a silly statement, I didn't say it, and I disagree with it." Ah, Wired!
why_oh_why_cant_we_have_a_better_press_corps  anderson.chris  statistics  modeling  data_mining  norvig.peter  machine_learning  bad_science_journalism  fact_checking  via:arthegall  via:shivak 
february 2009 by cshalizi
Margaret Ackerman and Shai Ben-David, "Measures of Clustering Quality: A Working Set of Axioms for Clustering"
A rebuttal to Kleinberg's impossibility theorem for clustering (bookmarked earlier). There are measures of _cluster quality_ which satisfy all the natural axioms, which is good enough.
clustering  to_teach:data-mining  via:arthegall  via:vielmetti  data_mining  ackerman.margaret  ben-david.shai  kleinberg.jon 
december 2008 by cshalizi
Whimsley: Theses on Netflix
Mostly good, except for the last thesis: "Recommender systems only filter culture. The point, in various ways, is to create environments in which artists can prosper." No! The point is to create environments in which CULTURE can prosper; professional artists are something else.
slee.tom  collaborative_filtering  data_mining 
november 2008 by cshalizi
The Screens Issue - If You Liked This, Sure to Love That - Winning the Netflix Prize - NYTimes.com
What the ******* ****, Netflix wasn't using singular value decomposition? Can that really be true? (The hope that the report massively misunderstood is the only thing saving this from an "utter_stupidity" tag.)
netflix_prize  data_mining  collaborative_filtering  to_teach:data-mining  principal_components 
november 2008 by cshalizi
Notional Slurry » Is this a good time to reveal credit card terms?
In which Bill proposes that customers start data-mining the credit-card companies.
credit_cards  data_mining  modest_proposals  tozier.william 
november 2008 by cshalizi

