cshalizi + clustering   51

[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and `generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hieirarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful. In particular, since BHV have shown that the space of trees is negatively curved (a CAT(0) space), a natural representation of a collection of trees is a tree. We compare this representation to the Euclidean approximations of treespace made available through Multidimensional Scaling of the matrix of distances between trees. We also provide applications of the distances between trees to hierarchical clustering trees constructed from microarrays. Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
to:NB  clustering  hierarchical_structure  holmes.susan  data_mining  statistics  to_teach:data-mining  gene_expression_data_analysis  via:ryan_t 
4 weeks ago by cshalizi
[1204.2296] Co-clustering for Directed Graphs; the Stochastic Co-Blockmodel and a Spectral Algorithm
"This paper extends the spectral clustering algorithm to directed networks in a way that co-clusters or bi-clusters the rows and columns of a graph Laplacian. Co-clustering leverages the increased complexity of asymmetric relationships to gain new insight into the structure of the directed network. To understand this algorithm and to study its asymptotic properties in a canonical setting, we propose the Stochastic Co-Blockmodel to encode co-clustering structure. This is the first statistical model of co-clustering and it is derived using the concept of stochastic equivalence that motivated the original Stochastic Blockmodel. Although directed spectral clustering is not derived from the Stochastic Co-Blockmodel, we show that, asymptotically, the algorithm can estimate the blocks in a high dimensional asymptotic setting in which the number of blocks grows with the number of nodes. The algorithm, model, and asymptotic results can all be extended to bipartite graphs."
in_NB  relational_learning  network_data_analysis  statistics  clustering  community_discovery  spectral_clustering  yu.bin 
6 weeks ago by cshalizi
[0804.0678] Consistency of spectral clustering
"Consistency is a key property of all statistical procedures analyzing randomly sampled data. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that, for increasing sample size, those eigenvectors converge to the eigenvectors of certain limit operators. As a result, we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering."
to:NB  statistics  machine_learning  clustering  spectral_clustering 
8 weeks ago by cshalizi
[1203.6093] Consensus clustering in complex networks
"The community structure of complex networks reveals both their organization and hidden relationships among their constituents. Most community detection methods currently available are not deterministic, and their results typically depend on the specific random seeds, initial conditions and tie-break rules adopted for their execution. Consensus clustering is used in data analysis to generate stable results out of a set of partitions delivered by stochastic methods. Here we show that consensus clustering can be combined with any existing method in a self-consistent way, enhancing considerably both the stability and the accuracy of the resulting partitions. This framework is also particularly suitable to monitor the evolution of community structure in temporal networks. An application of consensus clustering to a large citation network of physics papers demonstrates its capability to keep track of the birth, death and diversification of topics."
in_NB  community_discovery  network_data_analysis  clustering  data_mining 
8 weeks ago by cshalizi
Taylor & Francis Online :: Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering - Journal of Computational and Graphical Statistics - Volume 20, Issue 2
"For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality."
to:NB  visual_display_of_quantitative_information  clustering  data_mining  to_teach:data-mining 
8 weeks ago by cshalizi
Time-series clustering via quasi U-statistics - Valk - 2012 - Journal of Time Series Analysis - Wiley Online Library
"The problem of time-series discrimination and classification is discussed. We propose a novel clustering algorithm based on a class of quasi U-statistics and subgroup decomposition tests. The decomposition may be applied to any concave time-series distance. The resulting test statistics are proven to be asymptotically normal for either i.i.d. or non-identically distributed groups of time-series under mild conditions. We illustrate its empirical performance on a simulation study and a real data analysis. The simulation setup includes stationary vs. stationary and stationary vs. non-stationary cases. The performance of the proposed method is favourably compared with some of the most common clustering measures available."
to:NB  clustering  time_series  statistics  classifiers 
8 weeks ago by cshalizi
[1203.5181] $k$-MLE: A fast algorithm for learning statistical mixture models
"We describe $k$-MLE, a fast and efficient local search algorithm for learning finite statistical mixtures of exponential families such as Gaussian mixture models. Mixture models are traditionally learned using the expectation-maximization (EM) soft clustering technique that monotonically increases the incomplete (expected complete) likelihood. Given prescribed mixture weights, the hard clustering $k$-MLE algorithm iteratively assigns data to the most likely weighted component and update the component models using Maximum Likelihood Estimators (MLEs). Using the duality between exponential families and Bregman divergences, we prove that the local convergence of the complete likelihood of $k$-MLE follows directly from the convergence of a dual additively weighted Bregman hard clustering. The inner loop of $k$-MLE can be implemented using any $k$-means heuristic like the celebrated Lloyd's batched or Hartigan's greedy swap updates. We then show how to update the mixture weights by minimizing a cross-entropy criterion that implies to update weights by taking the relative proportion of cluster points, and reiterate the mixture parameter update and mixture weight update processes until convergence. Hard EM is interpreted as a special case of $k$-MLE when both the component update and the weight update are performed successively in the inner loop. To initialize $k$-MLE, we propose $k$-MLE++, a careful initialization of $k$-MLE guaranteeing probabilistically a global bound on the best possible complete likelihood."
in_NB  em_algorithm  mixture_models  statistics  machine_learning  clustering  estimation 
9 weeks ago by cshalizi
[0803.1628] Component models for large networks
"Being among the easiest ways to find meaningful structure from discrete data, Latent Dirichlet Allocation (LDA) and related component models have been applied widely. They are simple, computationally fast and scalable, interpretable, and admit nonparametric priors. In the currently popular field of network modeling, relatively little work has taken uncertainty of data seriously in the Bayesian sense, and component models have been introduced to the field only recently, by treating each node as a bag of out-going links. We introduce an alternative, interaction component model for communities (ICMc), where the whole network is a bag of links, stemming from different components. The former finds both disassortative and assortative structure, while the alternative assumes assortativity and finds community-like structures like the earlier methods motivated by physics. With Dirichlet Process priors and an efficient implementation the models are highly scalable, as demonstrated with a social network from the Last.fm web site, with 670,000 nodes and 1.89 million links."
in_NB  community_discovery  statistics  nonparametrics  clustering 
10 weeks ago by cshalizi
"Local equivalences of distances between clusterings—a geometric perspective" --- Marina Meilă
"In comparing clusterings, several different distances and indices are in use. We prove that the Misclassification Error distance, the Hamming distance (equivalent to the unadjusted Rand index), and the χ 2 distance between partitions are equivalent in the neighborhood of 0. In other words, if two partitions are very similar, then one distance defines upper and lower bounds on the other and viceversa. The proofs are geometric and rely on the concavity of the distances. The geometric intuitions themselves advance the understanding of the space of all clusterings. To our knowledge, this is the first result of its kind.
Practically, distances are frequently used to compare two clusterings of a set of observations. But the motivation for this work is in the theoretical study of data clustering. Distances between partitions are involved in constructing new methods for cluster validation, determining the number of clusters, and analyzing clustering algorithms. From a probability theory point of view, the present results apply to any pair of finite valued random variables, and provide simple yet tight upper and lower bounds on the χ 2 measure of (in)dependence valid when the two variables are strongly dependent."
in_NB  clustering  data_mining  meila.marina 
february 2012 by cshalizi
Sun , Wang , Fang : Regularized k-means clustering of high-dimensional data and its asymptotic consistency
"K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering."
in_NB  clustering  statistics  lasso  data_mining  to_teach:data-mining 
february 2012 by cshalizi
[0805.0463] Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions
"We propose a distance between two realizations of a random process where for each realization only sparse and irregularly spaced measurements with additional measurement errors are available. Such data occur commonly in longitudinal studies and online trading data. A distance measure then makes it possible to apply distance-based analysis such as classification, clustering and multidimensional scaling for irregularly sampled longitudinal data. Once a suitable distance measure for sparsely sampled longitudinal trajectories has been found, we apply distance-based clustering methods to eBay online auction data. We identify six distinct clusters of bidding patterns. Each of these bidding patterns is found to be associated with a specific chance to obtain the auctioned item at a reasonable price."
to:NB  statistics  clustering  time_series 
december 2011 by cshalizi
[1110.2515] Normalized Mutual Information to evaluate overlapping community finding algorithms
"Given the increasing popularity of algorithms for overlapping clustering, in particular in social network analysis, quantitative measures are needed to measure the accuracy of a method. Given a set of true clusters, and the set of clusters found by an algorithm, these sets of clusters must be compared to see how similar or different the sets are. A normalized measure is desirable in many contexts, for example assigning a value of 0 where the two sets are totally dissimilar, and 1 where they are identical. A measure based on normalized mutual information, [1], has recently become popular. We demonstrate unintuitive behaviour of this measure, and show how this can be corrected by using a more conventional normalization. We compare the results to that of other measures, such as the Omega index [2]."
in_NB  community_discovery  information_theory  clustering  data_mining 
october 2011 by cshalizi
Bayesian Checking for Topic Models
"Real document collections do not fit the inde- pendence assumptions asserted by most statistical topic models, but how badly do they violate them? We present a Bayesian method for measuring how well a topic model fits a corpus. Our approach is based on posterior predictive checking, a method for diagnosing Bayesian models in user-defined ways. Our method can identify where a topic model fits the data, where it falls short, and in which directions it might be improved."
topic_models  model-checking  blei.david  in_NB  via:ariddell  statistics  machine_learning  information_retrieval  clustering  have_read 
july 2011 by cshalizi
[1011.2624] Clustering using Unsupervised Binary Trees: CUBT
"We introduce a new clustering method based on unsupervised binary trees. It is a three stages procedure, which performs on a first stage recursive binary splits reducing the heterogeneity of the data within the new subsamples. On the second stage (pruning) adjacent nodes are considered to be aggregated. Finally, on the third stage (joining) similar clusters are joined even if they do not descend from the same node. Consistency results are obtained and the procedure is tested on simulated and real data sets"
clustering  re:AoS_project 
december 2010 by cshalizi
Consistent selection of the number of clusters via crossvalidation — Biometrika
"In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling.Anovel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split."
clustering  stability_of_learning  data_mining  statistics  to_teach:data-mining  to_teach:undergrad-ADA 
december 2010 by cshalizi
[1009.2470] Significance analysis and statistical mechanics: an application to clustering
"This paper addresses the statistical significance of structures in random data: Given a set of vectors and a measure of mutual similarity, how likely does a subset of these vectors form a cluster with enhanced similarity among its elements? The computation of this cluster p-value for randomly distributed vectors is mapped onto a well-defined problem of statistical mechanics. We solve this problem analytically, establishing a connection between the physics of quenched disorder and multiple testing statistics in clustering and related problems. In an application to gene expression data, we find a remarkable link between the statistical significance of a cluster and the functional relationships between its genes"
to_be_shot_after_a_fair_trial  to_read  clustering  p-values  statistical_mechanics  gene_expression_data_analysis  machine_learning 
september 2010 by cshalizi
[0911.2634] Local statistical modeling by cluster-weighted
Revisiting Gershenfeld et al.'s "cluster-weighted modeling" from a more properly statistical perspective.
statistics  mixture_models  regression  clustering  to_read  to_teach:data-mining 
november 2009 by cshalizi
Laplacian Eigenmaps for Dimensionality Reduction and Data Representation (Belkin and Niyogi)
Using the spectral graph theory to get new features, for visualization, clustering, &c. Also includes a good discussion of the local linear embedding algorithm. Will try to teach this in 350 but may make heads explode.
to_teach:data-mining  clustering  graph_theory  spectral_clustering  dimension_reduction  niyogi.partha 
august 2009 by cshalizi
MALBEC: Jerry Zhu, Michael Coen, how to say snake in gibbon « Quomodocumque
The clustering comparison method sounds like something worth teaching in 350, but apparently its work-in-prep without a published version.
machine_learning  phonology  primates  gibbons  clustering  coen.michael  zhu.jerry  grammar_induction  to_teach:data-mining 
may 2009 by cshalizi
Partisan Influence in Congress and Institutional Change
I am not surprised that Nominate is unstable under subsampling, but I had no idea it was _that_ unstable.
congress  nominate  clustering  statistics  political_science  latent_variables  via:justin 
may 2009 by cshalizi
Margaret Ackerman and Shai Ben-David, "Measures of Clustering Quality: A Working Set of Axioms for Clustering"
A rebuttal to Kleinberg's impossibility theory for clustering (bookmarked earlier). There are measures of _cluster quality_ which satisfy all the natural axioms, which is good enough.
clustering  to_teach:data-mining  via:arthegall  via:vielmetti  data_mining  ackerman.margaret  ben-david.shai  kleinberg.jon 
december 2008 by cshalizi
An Impossibility Theorem for Clustering
Proof that it's impossible to build a clustering function which is at once scale-invariant, able to give all possible partitions as appropriate, and "consistent", meaning not statistical consistency but rather that shrinking within-cluster distances and expanding between-cluster distances should not change the partitioning. Very clever; I wish I'd know this before teaching clustering.
clustering  kleinberg.jon  via:aaron_clauset  to_teach:data-mining  machine_learning 
october 2008 by cshalizi
Introduction to Machine Learning - Ethem Alpaydin [@Labyrinth]
Useful introductory survey/reference; mathematically undemanding. Nice that it has chapters on hidden Markov models and various model-combination methods.
books:recommended  machine_learning  classifiers  regression  clustering  markov_models  ensemble_methods 
august 2008 by cshalizi

related tags

ackerman.margaret  ahmed.amr  approximation_algorithms  ben-david.shai  bergstrom.carl  bibliometry  blei.david  books:recommended  bootstrap  buhlmann.peter  category_theory  citation_networks  classifiers  clustering  coen.michael  community_discovery  congress  cross-validation  dasgupta.anirban  data_analysis  data_mining  density_estimation  dimension_reduction  dsm  em_algorithm  ensemble_methods  ergodic_theory  estimation  expectation-maximization  gene_expression_data_analysis  gibbons  grammar_induction  graphical_models  graph_theory  guyon.isabelle  have_read  heard_the_talk  heavy_tails  hierarchical_structure  holmes.susan  hopcroft.john  hypothesis_testing  information_retrieval  information_theory  internet  in_NB  jordan.michael_i.  k-means  kith_and_kin  kleinberg.jon  lasso  latent_variables  learning_theory  luxburg.ulrike_von  machine_learning  manifold_learning  markov_models  meila.marina  meinshausen.nicolai  minimum_description_length  mixture_models  model-checking  model_selection  network_data_analysis  niyogi.partha  nominate  nonparametrics  p-values  phonology  political_science  popular_social_science  primates  programming  R  re:AoS_project  re:network_differences  re:stacs  regression  relational_learning  rinaldo.alessandro  rosvall.martin  ryabko.daniil  sandler.mark  singh.aarti  smola.alex  spectral_clustering  stability_of_learning  statistical_inference_for_stochastic_processes  statistical_mechanics  statistics  text_mining  time_series  tishby.naftali  to:NB  topic_models  to_be_shot_after_a_fair_trial  to_read  to_teach:complexity-and-inference  to_teach:data-mining  to_teach:undergrad-ADA  tutorials  via:aaron_clauset  via:ariddell  via:arthegall  via:chl  via:georg  via:justin  via:ryan_t  via:vielmetti  visual_display_of_quantitative_information  wasserman.larry  williamson.robert  yu.bin  zhu.jerry 

Copy this bookmark:



description:


tags: