Vaguery + clustering 27
[1203.1065] Subspace clustering of high-dimensional data: a predictive approach
10 weeks ago by Vaguery
"In several application domains, high-dimensional observations are collected and then analysed in search for naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC) partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying our simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets for which PSC often provides state-of-art results."
ain't-performance-space
statistics
clustering
cure-for-dimensionality
algorithms
10 weeks ago by Vaguery
[1202.0077] An Interacting Particle Model for Clustering Euclidean Datasets
february 2012 by Vaguery
"In this paper we propose a method based on interacting particle physics, devised for clustering Euclidean datasets without initial constraints or conditions. We model any dataset as an interacting particle system, whose elements correspond to particles that interact through a simplified version of Lennard-Jones potentials. In so doing, mutual attractive interactions allow to identify groups of proximal particles. The main outcome of this modeling task is an adjacency matrix, taken as input by a community detection algorithm aimed to identify different partitions. The underlying conjecture is that, using a multiresolution analysis, the adopted model allows to find the right number of clusters for any given dataset. Experimental results, performed in comparison with a classical clustering algorithm, confirm this assumption."
clustering
data-analysis
algorithms
nudge-targets
distributed-processing
february 2012 by Vaguery
[1112.6045] Comparing intermittency and network measurements of words and their dependency on authorship
january 2012 by Vaguery
Many features from texts and languages can now be inferred from statistical analyses using concepts from complex networks and dynamical systems. In this paper we quantify how topological properties of word co-occurrence networks and intermittency (or burstiness) in word distribution depend on the style of authors. Our database contains 40 books from 8 authors who lived in the 19th and 20th centuries, for which the following network measurements were obtained: clustering coefficient, average shortest path lengths, and betweenness. We found that the two factors with stronger dependency on the authors were the skewness in the distribution of word intermittency and the average shortest paths. Other factors such as the betweeness and the Zipf's law exponent show only weak dependency on authorship. Also assessed was the contribution from each measurement to authorship recognition using three machine learning methods. The best performance was a ca. 65 % accuracy upon combining complex network and intermittency features with the nearest neighbor algorithm. From a detailed analysis of the interdependence of the various metrics it is concluded that the methods used here are complementary for providing short- and long-scale perspectives of texts, which are useful for applications such as identification of topical words and information retrieval.
natural-language-processing
document-clustering
clustering
feature-selection
algorithms
nudge-targets
january 2012 by Vaguery
[1112.2316] Complexity-entropy causality plane: a useful approach for distinguishing songs
january 2012 by Vaguery
Nowadays we are often faced with huge databases resulting from the rapid growth of data storage technologies. This is particularly true when dealing with music databases. In this context, it is essential to have techniques and tools able to discriminate properties from these massive sets. In this work, we report on a statistical analysis of more than ten thousand songs aiming to obtain a complexity hierarchy. Our approach is based on the estimation of the permutation entropy combined with an intensive complexity measure, building up the complexity-entropy causality plane. The results obtained indicate that this representation space is very promising to discriminate songs as well as to allow a relative quantitative comparison among songs. Additionally, we believe that the here-reported method may be applied in practical situations since it is simple, robust and has a fast numerical implementation.
signal-processing
classification
data-analysis
clustering
representation
music
nudge-targets
january 2012 by Vaguery
[1112.0826] Clustering under Perturbation Resilience
january 2012 by Vaguery
Recently, Bilu and Linial formalized an implicit assumption often made when choosing a clustering objective: that the optimum clustering to the objective should be preserved under small multiplicative perturbations to distances between points. They showed that for max-cut clustering it is possible to circumvent NP-hardness and obtain polynomial-time algorithms for instances resilient to large (factor $O(sqrt{n})$) perturbations, and subsequently Awasthi et al. considered center-based objectives, giving algorithms for instances resilient to O(1) factor perturbations.
In this paper, we greatly advance this line of work. For center-based objectives, we present an algorithm that can optimally cluster instances resilient to $(1 + sqrt{2})$-factor perturbations, solving an open problem of Awasthi et al. For a commonly used center-based objective $k$-median, we additionally give algorithms for a more relaxed assumption in which we allow the optimal solution to change in a small $epsilon$ fraction of the points after perturbation. We give the first bounds known for this more realistic and more general setting. We also provide positive results for min-sum clustering which is a generally much harder objective than $k$-median (and also non-center-based). Our algorithms are based on new linkage criteria that may be of independent interest.
Additionally, we give sublinear-time algorithms, showing algorithms that can return an implicit clustering from only access to a small random sample.
clustering
statistics
nonparametric-methods
robustness
resilience
algorithms
nudge-targets
In this paper, we greatly advance this line of work. For center-based objectives, we present an algorithm that can optimally cluster instances resilient to $(1 + sqrt{2})$-factor perturbations, solving an open problem of Awasthi et al. For a commonly used center-based objective $k$-median, we additionally give algorithms for a more relaxed assumption in which we allow the optimal solution to change in a small $epsilon$ fraction of the points after perturbation. We give the first bounds known for this more realistic and more general setting. We also provide positive results for min-sum clustering which is a generally much harder objective than $k$-median (and also non-center-based). Our algorithms are based on new linkage criteria that may be of independent interest.
Additionally, we give sublinear-time algorithms, showing algorithms that can return an implicit clustering from only access to a small random sample.
january 2012 by Vaguery
[1109.5664] Deterministic Feature Selection for $k$-means Clustering
december 2011 by Vaguery
"We study feature selection for $k$-means clustering. Although the literature contains many methods with good empirical performance, algorithms with provable theoretical behavior have only recently been developed. Unfortunately, these algorithms are randomized and fail with, say, a constant probability. We address this issue by presenting a emph{deterministic} feature selection algorithm for $k$-means with theoretical guarantees. At the heart of our algorithm lies a deterministic method for decompositions of the identity."
clustering
statistics
algorithms
nudge-targets
december 2011 by Vaguery
[1107.2379] Data Stability in Clustering: A Closer Look
december 2011 by Vaguery
"This paper considers the model introduced by Bilu and Linial (2010), who study problems for which the optimal clustering does not change when the distances are perturbed by multiplicative factors. They show that even when a problem is NP-hard, it is sometimes possible to obtain polynomial-time algorithms for instances resilient to large perturbations, e.g. on the order of $O(sqrt{n})$ for max-cut clustering. Awasthi et al. (2010) extend this line of work by considering center-based objectives, and Balcan and Liang (2011) consider the $k$-median and min-sum objectives, giving efficient algorithms for instances resilient to certain constant multiplicative perturbations.
Here, we are motivated by the question of to what extent these assumptions can be relaxed while allowing for efficient algorithms. We show there is little room to improve these results by giving NP-hardness lower bounds for both the $k$-median and min-sum objectives. On the other hand, we show that multiplicative resilience parameters, even only on the order of $Theta(1)$, can be so strong as to make the clustering problem trivial, and we exploit these assumptions to present a simple one pass streaming algorithm for the $k$-median objective. We also consider a model of additive perturbations and give a correspondence between additive and multiplicative notions of stability. Our results provide a close examination of the consequences of assuming, even constant, stability in data."
clustering
statistics
algorithms
robustness
nudge-targets
Here, we are motivated by the question of to what extent these assumptions can be relaxed while allowing for efficient algorithms. We show there is little room to improve these results by giving NP-hardness lower bounds for both the $k$-median and min-sum objectives. On the other hand, we show that multiplicative resilience parameters, even only on the order of $Theta(1)$, can be so strong as to make the clustering problem trivial, and we exploit these assumptions to present a simple one pass streaming algorithm for the $k$-median objective. We also consider a model of additive perturbations and give a correspondence between additive and multiplicative notions of stability. Our results provide a close examination of the consequences of assuming, even constant, stability in data."
december 2011 by Vaguery
[1110.1462] Dynamic Clustering of Histogram Data Based on Adaptive Squared Wasserstein Distances
october 2011 by Vaguery
"…To cluster sets of histogram data, we propose to use Dynamic Clustering Algorithm, (based on adaptive squared Wasserstein distances) that is a k-means-like algorithm for clustering a set of individuals into K classes that are apriori fixed. The main aim of this research is to provide a tool for clustering histograms, emphasizing the different contributions of the histogram variables, and their components, to the definition of the clusters. We demonstrate that this can be achieved using adaptive distances.
Two kind of adaptive distances are considered: the first takes into account the variability of each component of each descriptor for the whole set of individuals; the second takes into account the variability of each component of each descriptor in each cluster. We furnish interpretative tools of the obtained partition based on an extension of the classical measures (indexes) to the use of adaptive distances in the clustering criterion function. Applications on synthetic and real-world data corroborate the proposed procedure."
classification
statistics
histograms
metrics
clustering
Two kind of adaptive distances are considered: the first takes into account the variability of each component of each descriptor for the whole set of individuals; the second takes into account the variability of each component of each descriptor in each cluster. We furnish interpretative tools of the obtained partition based on an extension of the classical measures (indexes) to the use of adaptive distances in the clustering criterion function. Applications on synthetic and real-world data corroborate the proposed procedure."
october 2011 by Vaguery
[1008.1191] Improved Fast Similarity Search in Dictionaries
august 2010 by Vaguery
"We engineer an algorithm to solve the approximate dictionary matching problem. Given a list of words $\mathcal{W}$, maximum distance $d$ fixed at preprocessing time and a query word $q$, we would like to retrieve all words from $\mathcal{W}$ that can be transformed into $q$ with $d$ or less edit operations. We present data structures that support fault tolerant queries by generating an index. On top of that, we present a generalization of the method that eases memory consumption and preprocessing time significantly. At the same time, running times of queries are virtually unaffected. We are able to match in lists of hundreds of thousands of words and beyond within microseconds for reasonable distances."
nudge-targets
strings
search-engines
clustering
algorithms
heuristics
august 2010 by Vaguery
[1008.1758] Stochastic Data Clustering
august 2010 by Vaguery
"In 1961 Herbert Simon and Albert Ando published the theory behind the long-term behavior of a dynamical system that can be described by a nearly completely decomposable matrix. Over the past fifty years this theory has been used in a variety of contexts, including queueing theory, computer performance, and ecology. In all these applications, the structure of the system is known and the point of interest is the various states the system passes through on its way to some long-term equilibrium. This paper looks at this problem from the other direction. That is, we develop a technique for using the evolution of the system to tell us about its initial structure, and we use this technique to develop a new algorithm for data clustering."
clustering
data-analysis
exploratory-data-analysis
statistics
algorithms
august 2010 by Vaguery
[1006.4531] Generalised network clustering and its dynamical implications
august 2010 by Vaguery
"A parameterisation of generalised network clustering, in the form of four-motif prevalences, is presented. This involves three real parameters that are conditional on one- two- and three-motif prevalences. Interpretations of these real parameters are presented that motivate a set of rewiring schemes to create appropriately clustered networks. Finally, the dynamical implications of higher order structure, as parameterised, for a contact process are considered."
clustering
network-theory
complexology
nudge-targets
algorithms
data-analysis
comparison
august 2010 by Vaguery
[1007.1075] Clustering Stability: An Overview
august 2010 by Vaguery
"A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview about the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications."
statistics
data-analysis
clustering
nonparametric-statistics
exploratory-data-analysis
heuristics
august 2010 by Vaguery
Environment for DeveLoping KDD-Applications Supported by Index-Structures - Wikipedia, the free encyclopedia
july 2010 by Vaguery
"Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI) is a Knowledge Discovery in Databases (KDD, "data mining") software framework developed for use in research and teaching by the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures."
clustering
algorithms
libraries
data-analysis
exploratory-data-analysis
statistics
nudge
july 2010 by Vaguery
[0911.4729] Hearing the clusters in a graph: A distributed algorithm
june 2010 by Vaguery
"We propose a novel distributed algorithm to cluster graphs. The algorithm recovers the solution obtained from spectral clustering without the need for expensive eigenvalue/vector computations. We prove that, by propagating waves through the graph, a local fast Fourier transform yields the local component of every eigenvector of the Laplacian matrix, which are used to cluster graphs. For large graphs, the proposed algorithm is orders of magnitude faster than random walk based approaches. We prove the equivalence of the proposed algorithm to spectral clustering and derive convergence rates. We also demonstrate the benefit of using this decentralized clustering algorithm to accelerate distributed estimation for sensor networks and for efficient computation of distributed multi-agent search strategies."
network-theory
graph-theory
clustering
algorithms
numerical-methods
statistics
nudge-targets
june 2010 by Vaguery
[1006.1328] Uncovering the Riffled Independence Structure of Rankings
june 2010 by Vaguery
"… In this paper, we provide a formal introduction to riffled independence and present algorithms for using riffled independence within Fourier-theoretic frameworks which have been explored by a number of recent papers. Additionally, we propose an automated method for discovering sets of items which are riffle independent from a training set of rankings. We show that our clustering-like algorithms can be used to discover meaningful latent coalitions from real preference ranking datasets and to learn the structure of hierarchically decomposable models based on riffled independence."
statistics
ranking
clustering
data-envelopment-analysis
multiobjective-optimization
nudge
numerical-methods
june 2010 by Vaguery
[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
june 2010 by Vaguery
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and `generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hieirarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful.…Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
clustering
algorithms
statistics
models
classification
learning-from-data
june 2010 by Vaguery
Getting Started Guide - Google Prediction API - Google Code
may 2010 by Vaguery
"The Prediction API allows you to get more from your data and makes its patterns more accessible. Specifically, the Prediction API leverages Google's machine learning infrastructure to give you the tools to better analyze your data and reveal patterns that are often difficult to manually discover. The API also enables you to use those patterns to predict new outcomes, which facilitates the development of all types of software, from textual analysis systems to recommendation systems. Because the Prediction API is a RESTful HTTP service, you can easily access it from Google App Engine, Apps Script, and other Internet-connected desktop applications."
nudge
machine-learning
models
google
prediction
clustering
learning-from-data
AI
API
open-science
may 2010 by Vaguery
[1003.4002] Spectral Classification; Old and Contemporary
march 2010 by Vaguery
"Beginning with a historical account of the spectral classification, its refinement through additional criteria is presented. The line strengths and ratios used in two dimensional classifications of each spectral class are described. A parallel classification scheme for metal-poor stars and the standards used for classification are presented. The extension of spectral classification beyond M to L and T and spectroscopic classification criteria relevant to these classes are described. Contemporary methods of classifications based upon different automated approaches are introduced."
machine-learning
learning-from-data
science2.0
Nudge
clustering
statistics
astronomy
digitization
march 2010 by Vaguery
The Geomblog: Choosing the number of clusters I: The Elbow Method
march 2010 by Vaguery
"How do we choose k, the number of clusters ?
This topic is so annoying, that I'm going to devote more than one post to it. While choosing k has been the excuse for some of the most violent hacks in clustering, there are at least a few principled directions, and there's a lot of room for further development."
clustering
statistics
tacit-knowledge
visualization
exploratory-data-analysis
contingency-of-all-models
This topic is so annoying, that I'm going to devote more than one post to it. While choosing k has been the excuse for some of the most violent hacks in clustering, there are at least a few principled directions, and there's a lot of room for further development."
march 2010 by Vaguery
AI4R :: Artificial Intelligence for Ruby
march 2010 by Vaguery
"AI4R is a collection of ruby algorithms implementations, covering several Artificial intelligence fields, and simple practical examples using them. A Ruby playground for AI researchers. It implements:..."
artificial-intelligence
clustering
Ruby
AI
algorithms
library
open-source
languishing?
march 2010 by Vaguery
What Is Mathematica Personal Grid Edition
august 2007 by Vaguery
Frickin' expensive. But perhaps, maybe, worthwhile.
Mathematica
clustering
grid-computing
analytics
models
scientific-computing
engineering
business-plan
budgeting
august 2007 by Vaguery
PyGPU - Python for the GPU
july 2007 by Vaguery
"PyGPU is an embedded language in Python, that allow most of Python features (list-comprehensions, higher-order functions, iterators) to be used for constructing GPU algorithms. It uses a image abstraction to abstract away implementation details of the GPU, while still allowing translation to very efficient GPU native-code."
GPU
graphics-processing-unit
python
clustering
parallel
computing
programming
Moore's-Law
Nudge
july 2007 by Vaguery
Jaccard index
may 2007 by Vaguery
Consider using to cluster (or impose a metric upon) del.icio.us posts, blog posts, or other semantic tagged items. Taking into account time-density of topics and usage.
wikipedia
metrics
machine-learning
distance
clustering
statistics
del.icio.us
project
may 2007 by Vaguery
related tags
administration ⊕ AI ⊕ ain't-performance-space ⊕ algorithms ⊕ analytics ⊕ API ⊕ artificial-intelligence ⊕ astronomy ⊕ budgeting ⊕ business-plan ⊕ classification ⊕ clustering ⊖ comparison ⊕ complexology ⊕ computing ⊕ conferences ⊕ contingency-of-all-models ⊕ cure-for-dimensionality ⊕ data-analysis ⊕ data-centers ⊕ data-envelopment-analysis ⊕ database ⊕ del.icio.us ⊕ digitization ⊕ distance ⊕ distributed-processing ⊕ document-clustering ⊕ engineering ⊕ exploratory-data-analysis ⊕ feature-selection ⊕ google ⊕ GPU ⊕ graph-theory ⊕ graphics-processing-unit ⊕ grid-computing ⊕ heuristics ⊕ histograms ⊕ image-processing ⊕ imagemagick ⊕ languishing? ⊕ learning-from-data ⊕ libraries ⊕ library ⊕ machine-learning ⊕ Mathematica ⊕ metrics ⊕ models ⊕ Moore's-Law ⊕ multiobjective-optimization ⊕ music ⊕ mysql ⊕ natural-language-processing ⊕ network-theory ⊕ nonparametric-methods ⊕ nonparametric-statistics ⊕ nudge ⊕ nudge-targets ⊕ numerical-methods ⊕ open-science ⊕ open-source ⊕ papers ⊕ parallel ⊕ performance ⊕ prediction ⊕ preprint ⊕ proceedings ⊕ programming ⊕ project ⊕ python ⊕ ranking ⊕ representation ⊕ research ⊕ resilience ⊕ robustness ⊕ Ruby ⊕ science2.0 ⊕ scientific-computing ⊕ search-engines ⊕ signal-processing ⊕ statistics ⊕ storage ⊕ strings ⊕ tacit-knowledge ⊕ tutorial ⊕ via:arthegall ⊕ via:nelson ⊕ visualization ⊕ wikipedia ⊕Copy this bookmark: