cshalizi + clustering 51
[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
4 weeks ago by cshalizi
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and `generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hieirarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful. In particular, since BHV have shown that the space of trees is negatively curved (a CAT(0) space), a natural representation of a collection of trees is a tree. We compare this representation to the Euclidean approximations of treespace made available through Multidimensional Scaling of the matrix of distances between trees. We also provide applications of the distances between trees to hierarchical clustering trees constructed from microarrays. Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
to:NB
clustering
hierarchical_structure
holmes.susan
data_mining
statistics
to_teach:data-mining
gene_expression_data_analysis
via:ryan_t
4 weeks ago by cshalizi
[1204.2296] Co-clustering for Directed Graphs; the Stochastic Co-Blockmodel and a Spectral Algorithm
6 weeks ago by cshalizi
"This paper extends the spectral clustering algorithm to directed networks in a way that co-clusters or bi-clusters the rows and columns of a graph Laplacian. Co-clustering leverages the increased complexity of asymmetric relationships to gain new insight into the structure of the directed network. To understand this algorithm and to study its asymptotic properties in a canonical setting, we propose the Stochastic Co-Blockmodel to encode co-clustering structure. This is the first statistical model of co-clustering and it is derived using the concept of stochastic equivalence that motivated the original Stochastic Blockmodel. Although directed spectral clustering is not derived from the Stochastic Co-Blockmodel, we show that, asymptotically, the algorithm can estimate the blocks in a high dimensional asymptotic setting in which the number of blocks grows with the number of nodes. The algorithm, model, and asymptotic results can all be extended to bipartite graphs."
in_NB
relational_learning
network_data_analysis
statistics
clustering
community_discovery
spectral_clustering
yu.bin
6 weeks ago by cshalizi
[0804.0678] Consistency of spectral clustering
8 weeks ago by cshalizi
"Consistency is a key property of all statistical procedures analyzing randomly sampled data. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that, for increasing sample size, those eigenvectors converge to the eigenvectors of certain limit operators. As a result, we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering."
to:NB
statistics
machine_learning
clustering
spectral_clustering
8 weeks ago by cshalizi
[1203.6093] Consensus clustering in complex networks
8 weeks ago by cshalizi
"The community structure of complex networks reveals both their organization and hidden relationships among their constituents. Most community detection methods currently available are not deterministic, and their results typically depend on the specific random seeds, initial conditions and tie-break rules adopted for their execution. Consensus clustering is used in data analysis to generate stable results out of a set of partitions delivered by stochastic methods. Here we show that consensus clustering can be combined with any existing method in a self-consistent way, enhancing considerably both the stability and the accuracy of the resulting partitions. This framework is also particularly suitable to monitor the evolution of community structure in temporal networks. An application of consensus clustering to a large citation network of physics papers demonstrates its capability to keep track of the birth, death and diversification of topics."
in_NB
community_discovery
network_data_analysis
clustering
data_mining
8 weeks ago by cshalizi
Taylor & Francis Online :: Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering - Journal of Computational and Graphical Statistics - Volume 20, Issue 2
8 weeks ago by cshalizi
"For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality."
to:NB
visual_display_of_quantitative_information
clustering
data_mining
to_teach:data-mining
8 weeks ago by cshalizi
Time-series clustering via quasi U-statistics - Valk - 2012 - Journal of Time Series Analysis - Wiley Online Library
8 weeks ago by cshalizi
"The problem of time-series discrimination and classification is discussed. We propose a novel clustering algorithm based on a class of quasi U-statistics and subgroup decomposition tests. The decomposition may be applied to any concave time-series distance. The resulting test statistics are proven to be asymptotically normal for either i.i.d. or non-identically distributed groups of time-series under mild conditions. We illustrate its empirical performance on a simulation study and a real data analysis. The simulation setup includes stationary vs. stationary and stationary vs. non-stationary cases. The performance of the proposed method is favourably compared with some of the most common clustering measures available."
to:NB
clustering
time_series
statistics
classifiers
8 weeks ago by cshalizi
[1203.5181] $k$-MLE: A fast algorithm for learning statistical mixture models
9 weeks ago by cshalizi
"We describe $k$-MLE, a fast and efficient local search algorithm for learning finite statistical mixtures of exponential families such as Gaussian mixture models. Mixture models are traditionally learned using the expectation-maximization (EM) soft clustering technique that monotonically increases the incomplete (expected complete) likelihood. Given prescribed mixture weights, the hard clustering $k$-MLE algorithm iteratively assigns data to the most likely weighted component and update the component models using Maximum Likelihood Estimators (MLEs). Using the duality between exponential families and Bregman divergences, we prove that the local convergence of the complete likelihood of $k$-MLE follows directly from the convergence of a dual additively weighted Bregman hard clustering. The inner loop of $k$-MLE can be implemented using any $k$-means heuristic like the celebrated Lloyd's batched or Hartigan's greedy swap updates. We then show how to update the mixture weights by minimizing a cross-entropy criterion that implies to update weights by taking the relative proportion of cluster points, and reiterate the mixture parameter update and mixture weight update processes until convergence. Hard EM is interpreted as a special case of $k$-MLE when both the component update and the weight update are performed successively in the inner loop. To initialize $k$-MLE, we propose $k$-MLE++, a careful initialization of $k$-MLE guaranteeing probabilistically a global bound on the best possible complete likelihood."
in_NB
em_algorithm
mixture_models
statistics
machine_learning
clustering
estimation
9 weeks ago by cshalizi
[0803.1628] Component models for large networks
10 weeks ago by cshalizi
"Being among the easiest ways to find meaningful structure from discrete data, Latent Dirichlet Allocation (LDA) and related component models have been applied widely. They are simple, computationally fast and scalable, interpretable, and admit nonparametric priors. In the currently popular field of network modeling, relatively little work has taken uncertainty of data seriously in the Bayesian sense, and component models have been introduced to the field only recently, by treating each node as a bag of out-going links. We introduce an alternative, interaction component model for communities (ICMc), where the whole network is a bag of links, stemming from different components. The former finds both disassortative and assortative structure, while the alternative assumes assortativity and finds community-like structures like the earlier methods motivated by physics. With Dirichlet Process priors and an efficient implementation the models are highly scalable, as demonstrated with a social network from the Last.fm web site, with 670,000 nodes and 1.89 million links."
in_NB
community_discovery
statistics
nonparametrics
clustering
10 weeks ago by cshalizi
"Local equivalences of distances between clusterings—a geometric perspective" --- Marina Meilă
february 2012 by cshalizi
"In comparing clusterings, several different distances and indices are in use. We prove that the Misclassification Error distance, the Hamming distance (equivalent to the unadjusted Rand index), and the χ 2 distance between partitions are equivalent in the neighborhood of 0. In other words, if two partitions are very similar, then one distance defines upper and lower bounds on the other and viceversa. The proofs are geometric and rely on the concavity of the distances. The geometric intuitions themselves advance the understanding of the space of all clusterings. To our knowledge, this is the first result of its kind.
Practically, distances are frequently used to compare two clusterings of a set of observations. But the motivation for this work is in the theoretical study of data clustering. Distances between partitions are involved in constructing new methods for cluster validation, determining the number of clusters, and analyzing clustering algorithms. From a probability theory point of view, the present results apply to any pair of finite valued random variables, and provide simple yet tight upper and lower bounds on the χ 2 measure of (in)dependence valid when the two variables are strongly dependent."
in_NB
clustering
data_mining
meila.marina
Practically, distances are frequently used to compare two clusterings of a set of observations. But the motivation for this work is in the theoretical study of data clustering. Distances between partitions are involved in constructing new methods for cluster validation, determining the number of clusters, and analyzing clustering algorithms. From a probability theory point of view, the present results apply to any pair of finite valued random variables, and provide simple yet tight upper and lower bounds on the χ 2 measure of (in)dependence valid when the two variables are strongly dependent."
february 2012 by cshalizi
Sun , Wang , Fang : Regularized k-means clustering of high-dimensional data and its asymptotic consistency
february 2012 by cshalizi
"K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering."
in_NB
clustering
statistics
lasso
data_mining
to_teach:data-mining
february 2012 by cshalizi
[0805.0463] Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions
december 2011 by cshalizi
"We propose a distance between two realizations of a random process where for each realization only sparse and irregularly spaced measurements with additional measurement errors are available. Such data occur commonly in longitudinal studies and online trading data. A distance measure then makes it possible to apply distance-based analysis such as classification, clustering and multidimensional scaling for irregularly sampled longitudinal data. Once a suitable distance measure for sparsely sampled longitudinal trajectories has been found, we apply distance-based clustering methods to eBay online auction data. We identify six distinct clusters of bidding patterns. Each of these bidding patterns is found to be associated with a specific chance to obtain the auctioned item at a reasonable price."
to:NB
statistics
clustering
time_series
december 2011 by cshalizi
[1110.2515] Normalized Mutual Information to evaluate overlapping community finding algorithms
october 2011 by cshalizi
"Given the increasing popularity of algorithms for overlapping clustering, in particular in social network analysis, quantitative measures are needed to measure the accuracy of a method. Given a set of true clusters, and the set of clusters found by an algorithm, these sets of clusters must be compared to see how similar or different the sets are. A normalized measure is desirable in many contexts, for example assigning a value of 0 where the two sets are totally dissimilar, and 1 where they are identical. A measure based on normalized mutual information, [1], has recently become popular. We demonstrate unintuitive behaviour of this measure, and show how this can be corrected by using a more conventional normalization. We compare the results to that of other measures, such as the Omega index [2]."
in_NB
community_discovery
information_theory
clustering
data_mining
october 2011 by cshalizi
k-means++: The Advantages of Careful Seeding
october 2011 by cshalizi
Why hadn't I heard of this before?
k-means
clustering
to_teach:data-mining
have_read
in_NB
via:georg
approximation_algorithms
machine_learning
october 2011 by cshalizi
Bayesian Checking for Topic Models
july 2011 by cshalizi
"Real document collections do not fit the inde- pendence assumptions asserted by most statistical topic models, but how badly do they violate them? We present a Bayesian method for measuring how well a topic model fits a corpus. Our approach is based on posterior predictive checking, a method for diagnosing Bayesian models in user-defined ways. Our method can identify where a topic model fits the data, where it falls short, and in which directions it might be improved."
topic_models
model-checking
blei.david
in_NB
via:ariddell
statistics
machine_learning
information_retrieval
clustering
have_read
july 2011 by cshalizi
Adventures in Data Land, Graphical Models for the Internet
march 2011 by cshalizi
Look at this later and re-consider the to_teach tags.
clustering
graphical_models
tutorials
expectation-maximization
internet
text_mining
to_teach:data-mining
to_teach:undergrad-ADA
smola.alex
ahmed.amr
heard_the_talk
march 2011 by cshalizi
[1011.2624] Clustering using Unsupervised Binary Trees: CUBT
december 2010 by cshalizi
"We introduce a new clustering method based on unsupervised binary trees. It is a three stages procedure, which performs on a first stage recursive binary splits reducing the heterogeneity of the data within the new subsamples. On the second stage (pruning) adjacent nodes are considered to be aggregated. Finally, on the third stage (joining) similar clusters are joined even if they do not descend from the same node. Consistency results are obtained and the procedure is tested on simulated and real data sets"
clustering
re:AoS_project
december 2010 by cshalizi
Consistent selection of the number of clusters via crossvalidation — Biometrika
december 2010 by cshalizi
"In cluster analysis, one of the major challenges is to estimate the number of clusters. Most existing approaches attempt to minimize some distance-based dissimilarity measure within clusters. This article proposes a novel selection criterion that is applicable to all kinds of clustering algorithms, including distance based or non-distance based algorithms. The key idea is to select the number of clusters that minimizes the algorithm's instability, which measures the robustness of any given clustering algorithm against the randomness in sampling.Anovel estimation scheme for clustering instability is developed based on crossvalidation. The proposed selection criterion's effectiveness is demonstrated on a variety of numerical experiments, and its asymptotic selection consistency is established when the dataset is properly split."
clustering
stability_of_learning
data_mining
statistics
to_teach:data-mining
to_teach:undergrad-ADA
december 2010 by cshalizi
[1009.2470] Significance analysis and statistical mechanics: an application to clustering
september 2010 by cshalizi
"This paper addresses the statistical significance of structures in random data: Given a set of vectors and a measure of mutual similarity, how likely does a subset of these vectors form a cluster with enhanced similarity among its elements? The computation of this cluster p-value for randomly distributed vectors is mapped onto a well-defined problem of statistical mechanics. We solve this problem analytically, establishing a connection between the physics of quenched disorder and multiple testing statistics in clustering and related problems. In an application to gene expression data, we find a remarkable link between the statistical significance of a cluster and the functional relationships between its genes"
to_be_shot_after_a_fair_trial
to_read
clustering
p-values
statistical_mechanics
gene_expression_data_analysis
machine_learning
september 2010 by cshalizi
[1004.3101] Strong Consistency of Prototype Based Clustering in Probabilistic Space
april 2010 by cshalizi
Not clear at first glance exactly what they're doing. Read before considering teaching.
clustering
k-means
learning_theory
to_teach:data-mining
april 2010 by cshalizi
Clustering: Art or Science?
november 2009 by cshalizi
I think this crystallize why I never like teaching clustering.
clustering
data_mining
statistics
machine_learning
data_analysis
guyon.isabelle
luxburg.ulrike_von
williamson.robert
via:chl
to_teach:data-mining
november 2009 by cshalizi
[0911.2634] Local statistical modeling by cluster-weighted
november 2009 by cshalizi
Revisiting Gershenfeld et al.'s "cluster-weighted modeling" from a more properly statistical perspective.
statistics
mixture_models
regression
clustering
to_read
to_teach:data-mining
november 2009 by cshalizi
Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition.
october 2009 by cshalizi
Free PDF! (Still, I find my bound physical copy much more convenient.)
books:recommended
machine_learning
data_mining
statistics
learning_theory
estimation
cross-validation
ensemble_methods
classifiers
regression
graphical_models
clustering
dimension_reduction
bootstrap
via:arthegall
have_read
october 2009 by cshalizi
Laplacian Eigenmaps for Dimensionality Reduction and Data Representation (Belkin and Niyogi)
august 2009 by cshalizi
Using the spectral graph theory to get new features, for visualization, clustering, &c. Also includes a good discussion of the local linear embedding algorithm. Will try to teach this in 350 but may make heads explode.
to_teach:data-mining
clustering
graph_theory
spectral_clustering
dimension_reduction
niyogi.partha
august 2009 by cshalizi
MALBEC: Jerry Zhu, Michael Coen, how to say snake in gibbon « Quomodocumque
may 2009 by cshalizi
The clustering comparison method sounds like something worth teaching in 350, but apparently its work-in-prep without a published version.
machine_learning
phonology
primates
gibbons
clustering
coen.michael
zhu.jerry
grammar_induction
to_teach:data-mining
may 2009 by cshalizi
Partisan Influence in Congress and Institutional Change
may 2009 by cshalizi
I am not surprised that Nominate is unstable under subsampling, but I had no idea it was _that_ unstable.
congress
nominate
clustering
statistics
political_science
latent_variables
via:justin
may 2009 by cshalizi
Margaret Ackerman and Shai Ben-David, "Measures of Clustering Quality: A Working Set of Axioms for Clustering"
december 2008 by cshalizi
A rebuttal to Kleinberg's impossibility theory for clustering (bookmarked earlier). There are measures of _cluster quality_ which satisfy all the natural axioms, which is good enough.
clustering
to_teach:data-mining
via:arthegall
via:vielmetti
data_mining
ackerman.margaret
ben-david.shai
kleinberg.jon
december 2008 by cshalizi
[0812.1242] Mapping change in large networks
december 2008 by cshalizi
Heard Martin talk about this at SFI last week. Nice, though I think the MDL frame-tale needs some work.
The "alluvial diagrams" are very pretty.
minimum_description_length
rosvall.martin
bergstrom.carl
kith_and_kin
network_data_analysis
have_read
re:network_differences
community_discovery
visual_display_of_quantitative_information
bootstrap
statistics
clustering
hypothesis_testing
re:stacs
to_teach:complexity-and-inference
citation_networks
bibliometry
The "alluvial diagrams" are very pretty.
december 2008 by cshalizi
An Impossibility Theorem for Clustering
october 2008 by cshalizi
Proof that it's impossible to build a clustering function which is at once scale-invariant, able to give all possible partitions as appropriate, and "consistent", meaning not statistical consistency but rather that shrinking within-cluster distances and expanding between-cluster distances should not change the partitioning. Very clever; I wish I'd know this before teaching clustering.
clustering
kleinberg.jon
via:aaron_clauset
to_teach:data-mining
machine_learning
october 2008 by cshalizi
Introduction to Machine Learning - Ethem Alpaydin [@Labyrinth]
august 2008 by cshalizi
Useful introductory survey/reference; mathematically undemanding. Nice that it has chapters on hidden Markov models and various model-combination methods.
books:recommended
machine_learning
classifiers
regression
clustering
markov_models
ensemble_methods
august 2008 by cshalizi
related tags
ackerman.margaret ⊕ ahmed.amr ⊕ approximation_algorithms ⊕ ben-david.shai ⊕ bergstrom.carl ⊕ bibliometry ⊕ blei.david ⊕ books:recommended ⊕ bootstrap ⊕ buhlmann.peter ⊕ category_theory ⊕ citation_networks ⊕ classifiers ⊕ clustering ⊖ coen.michael ⊕ community_discovery ⊕ congress ⊕ cross-validation ⊕ dasgupta.anirban ⊕ data_analysis ⊕ data_mining ⊕ density_estimation ⊕ dimension_reduction ⊕ dsm ⊕ em_algorithm ⊕ ensemble_methods ⊕ ergodic_theory ⊕ estimation ⊕ expectation-maximization ⊕ gene_expression_data_analysis ⊕ gibbons ⊕ grammar_induction ⊕ graphical_models ⊕ graph_theory ⊕ guyon.isabelle ⊕ have_read ⊕ heard_the_talk ⊕ heavy_tails ⊕ hierarchical_structure ⊕ holmes.susan ⊕ hopcroft.john ⊕ hypothesis_testing ⊕ information_retrieval ⊕ information_theory ⊕ internet ⊕ in_NB ⊕ jordan.michael_i. ⊕ k-means ⊕ kith_and_kin ⊕ kleinberg.jon ⊕ lasso ⊕ latent_variables ⊕ learning_theory ⊕ luxburg.ulrike_von ⊕ machine_learning ⊕ manifold_learning ⊕ markov_models ⊕ meila.marina ⊕ meinshausen.nicolai ⊕ minimum_description_length ⊕ mixture_models ⊕ model-checking ⊕ model_selection ⊕ network_data_analysis ⊕ niyogi.partha ⊕ nominate ⊕ nonparametrics ⊕ p-values ⊕ phonology ⊕ political_science ⊕ popular_social_science ⊕ primates ⊕ programming ⊕ R ⊕ re:AoS_project ⊕ re:network_differences ⊕ re:stacs ⊕ regression ⊕ relational_learning ⊕ rinaldo.alessandro ⊕ rosvall.martin ⊕ ryabko.daniil ⊕ sandler.mark ⊕ singh.aarti ⊕ smola.alex ⊕ spectral_clustering ⊕ stability_of_learning ⊕ statistical_inference_for_stochastic_processes ⊕ statistical_mechanics ⊕ statistics ⊕ text_mining ⊕ time_series ⊕ tishby.naftali ⊕ to:NB ⊕ topic_models ⊕ to_be_shot_after_a_fair_trial ⊕ to_read ⊕ to_teach:complexity-and-inference ⊕ to_teach:data-mining ⊕ to_teach:undergrad-ADA ⊕ tutorials ⊕ via:aaron_clauset ⊕ via:ariddell ⊕ via:arthegall ⊕ via:chl ⊕ via:georg ⊕ via:justin ⊕ via:ryan_t ⊕ via:vielmetti ⊕ visual_display_of_quantitative_information ⊕ wasserman.larry ⊕ williamson.robert ⊕ yu.bin ⊕ zhu.jerry ⊕Copy this bookmark: