Vaguery + clustering   27

[1203.1065] Subspace clustering of high-dimensional data: a predictive approach
"In several application domains, high-dimensional observations are collected and then analysed in search for naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC) partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying our simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets for which PSC often provides state-of-art results."
ain't-performance-space  statistics  clustering  cure-for-dimensionality  algorithms 
10 weeks ago by Vaguery
[1202.0077] An Interacting Particle Model for Clustering Euclidean Datasets
"In this paper we propose a method based on interacting particle physics, devised for clustering Euclidean datasets without initial constraints or conditions. We model any dataset as an interacting particle system, whose elements correspond to particles that interact through a simplified version of Lennard-Jones potentials. In so doing, mutual attractive interactions allow to identify groups of proximal particles. The main outcome of this modeling task is an adjacency matrix, taken as input by a community detection algorithm aimed to identify different partitions. The underlying conjecture is that, using a multiresolution analysis, the adopted model allows to find the right number of clusters for any given dataset. Experimental results, performed in comparison with a classical clustering algorithm, confirm this assumption."
clustering  data-analysis  algorithms  nudge-targets  distributed-processing 
february 2012 by Vaguery
[1112.6045] Comparing intermittency and network measurements of words and their dependency on authorship
Many features from texts and languages can now be inferred from statistical analyses using concepts from complex networks and dynamical systems. In this paper we quantify how topological properties of word co-occurrence networks and intermittency (or burstiness) in word distribution depend on the style of authors. Our database contains 40 books from 8 authors who lived in the 19th and 20th centuries, for which the following network measurements were obtained: clustering coefficient, average shortest path lengths, and betweenness. We found that the two factors with stronger dependency on the authors were the skewness in the distribution of word intermittency and the average shortest paths. Other factors such as the betweeness and the Zipf's law exponent show only weak dependency on authorship. Also assessed was the contribution from each measurement to authorship recognition using three machine learning methods. The best performance was a ca. 65 % accuracy upon combining complex network and intermittency features with the nearest neighbor algorithm. From a detailed analysis of the interdependence of the various metrics it is concluded that the methods used here are complementary for providing short- and long-scale perspectives of texts, which are useful for applications such as identification of topical words and information retrieval.
natural-language-processing  document-clustering  clustering  feature-selection  algorithms  nudge-targets 
january 2012 by Vaguery
[1112.2316] Complexity-entropy causality plane: a useful approach for distinguishing songs
Nowadays we are often faced with huge databases resulting from the rapid growth of data storage technologies. This is particularly true when dealing with music databases. In this context, it is essential to have techniques and tools able to discriminate properties from these massive sets. In this work, we report on a statistical analysis of more than ten thousand songs aiming to obtain a complexity hierarchy. Our approach is based on the estimation of the permutation entropy combined with an intensive complexity measure, building up the complexity-entropy causality plane. The results obtained indicate that this representation space is very promising to discriminate songs as well as to allow a relative quantitative comparison among songs. Additionally, we believe that the here-reported method may be applied in practical situations since it is simple, robust and has a fast numerical implementation.
signal-processing  classification  data-analysis  clustering  representation  music  nudge-targets 
january 2012 by Vaguery
[1112.0826] Clustering under Perturbation Resilience
Recently, Bilu and Linial formalized an implicit assumption often made when choosing a clustering objective: that the optimum clustering to the objective should be preserved under small multiplicative perturbations to distances between points. They showed that for max-cut clustering it is possible to circumvent NP-hardness and obtain polynomial-time algorithms for instances resilient to large (factor $O(sqrt{n})$) perturbations, and subsequently Awasthi et al. considered center-based objectives, giving algorithms for instances resilient to O(1) factor perturbations.
In this paper, we greatly advance this line of work. For center-based objectives, we present an algorithm that can optimally cluster instances resilient to $(1 + sqrt{2})$-factor perturbations, solving an open problem of Awasthi et al. For a commonly used center-based objective $k$-median, we additionally give algorithms for a more relaxed assumption in which we allow the optimal solution to change in a small $epsilon$ fraction of the points after perturbation. We give the first bounds known for this more realistic and more general setting. We also provide positive results for min-sum clustering which is a generally much harder objective than $k$-median (and also non-center-based). Our algorithms are based on new linkage criteria that may be of independent interest.
Additionally, we give sublinear-time algorithms, showing algorithms that can return an implicit clustering from only access to a small random sample.
clustering  statistics  nonparametric-methods  robustness  resilience  algorithms  nudge-targets 
january 2012 by Vaguery
[1109.5664] Deterministic Feature Selection for $k$-means Clustering
"We study feature selection for $k$-means clustering. Although the literature contains many methods with good empirical performance, algorithms with provable theoretical behavior have only recently been developed. Unfortunately, these algorithms are randomized and fail with, say, a constant probability. We address this issue by presenting a emph{deterministic} feature selection algorithm for $k$-means with theoretical guarantees. At the heart of our algorithm lies a deterministic method for decompositions of the identity."
clustering  statistics  algorithms  nudge-targets 
december 2011 by Vaguery
[1107.2379] Data Stability in Clustering: A Closer Look
"This paper considers the model introduced by Bilu and Linial (2010), who study problems for which the optimal clustering does not change when the distances are perturbed by multiplicative factors. They show that even when a problem is NP-hard, it is sometimes possible to obtain polynomial-time algorithms for instances resilient to large perturbations, e.g. on the order of $O(sqrt{n})$ for max-cut clustering. Awasthi et al. (2010) extend this line of work by considering center-based objectives, and Balcan and Liang (2011) consider the $k$-median and min-sum objectives, giving efficient algorithms for instances resilient to certain constant multiplicative perturbations.

Here, we are motivated by the question of to what extent these assumptions can be relaxed while allowing for efficient algorithms. We show there is little room to improve these results by giving NP-hardness lower bounds for both the $k$-median and min-sum objectives. On the other hand, we show that multiplicative resilience parameters, even only on the order of $Theta(1)$, can be so strong as to make the clustering problem trivial, and we exploit these assumptions to present a simple one pass streaming algorithm for the $k$-median objective. We also consider a model of additive perturbations and give a correspondence between additive and multiplicative notions of stability. Our results provide a close examination of the consequences of assuming, even constant, stability in data."
clustering  statistics  algorithms  robustness  nudge-targets 
december 2011 by Vaguery
[1110.1462] Dynamic Clustering of Histogram Data Based on Adaptive Squared Wasserstein Distances
"…To cluster sets of histogram data, we propose to use Dynamic Clustering Algorithm, (based on adaptive squared Wasserstein distances) that is a k-means-like algorithm for clustering a set of individuals into K classes that are apriori fixed. The main aim of this research is to provide a tool for clustering histograms, emphasizing the different contributions of the histogram variables, and their components, to the definition of the clusters. We demonstrate that this can be achieved using adaptive distances.

Two kind of adaptive distances are considered: the first takes into account the variability of each component of each descriptor for the whole set of individuals; the second takes into account the variability of each component of each descriptor in each cluster. We furnish interpretative tools of the obtained partition based on an extension of the classical measures (indexes) to the use of adaptive distances in the clustering criterion function. Applications on synthetic and real-world data corroborate the proposed procedure."
classification  statistics  histograms  metrics  clustering 
october 2011 by Vaguery
[1008.1191] Improved Fast Similarity Search in Dictionaries
"We engineer an algorithm to solve the approximate dictionary matching problem. Given a list of words $\mathcal{W}$, maximum distance $d$ fixed at preprocessing time and a query word $q$, we would like to retrieve all words from $\mathcal{W}$ that can be transformed into $q$ with $d$ or less edit operations. We present data structures that support fault tolerant queries by generating an index. On top of that, we present a generalization of the method that eases memory consumption and preprocessing time significantly. At the same time, running times of queries are virtually unaffected. We are able to match in lists of hundreds of thousands of words and beyond within microseconds for reasonable distances."
nudge-targets  strings  search-engines  clustering  algorithms  heuristics 
august 2010 by Vaguery
[1008.1758] Stochastic Data Clustering
"In 1961 Herbert Simon and Albert Ando published the theory behind the long-term behavior of a dynamical system that can be described by a nearly completely decomposable matrix. Over the past fifty years this theory has been used in a variety of contexts, including queueing theory, computer performance, and ecology. In all these applications, the structure of the system is known and the point of interest is the various states the system passes through on its way to some long-term equilibrium. This paper looks at this problem from the other direction. That is, we develop a technique for using the evolution of the system to tell us about its initial structure, and we use this technique to develop a new algorithm for data clustering."
clustering  data-analysis  exploratory-data-analysis  statistics  algorithms 
august 2010 by Vaguery
[1006.4531] Generalised network clustering and its dynamical implications
"A parameterisation of generalised network clustering, in the form of four-motif prevalences, is presented. This involves three real parameters that are conditional on one- two- and three-motif prevalences. Interpretations of these real parameters are presented that motivate a set of rewiring schemes to create appropriately clustered networks. Finally, the dynamical implications of higher order structure, as parameterised, for a contact process are considered."
clustering  network-theory  complexology  nudge-targets  algorithms  data-analysis  comparison 
august 2010 by Vaguery
[1007.1075] Clustering Stability: An Overview
"A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview about the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications."
statistics  data-analysis  clustering  nonparametric-statistics  exploratory-data-analysis  heuristics 
august 2010 by Vaguery
Environment for DeveLoping KDD-Applications Supported by Index-Structures - Wikipedia, the free encyclopedia
"Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI) is a Knowledge Discovery in Databases (KDD, "data mining") software framework developed for use in research and teaching by the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures."
clustering  algorithms  libraries  data-analysis  exploratory-data-analysis  statistics  nudge 
july 2010 by Vaguery
[0911.4729] Hearing the clusters in a graph: A distributed algorithm
"We propose a novel distributed algorithm to cluster graphs. The algorithm recovers the solution obtained from spectral clustering without the need for expensive eigenvalue/vector computations. We prove that, by propagating waves through the graph, a local fast Fourier transform yields the local component of every eigenvector of the Laplacian matrix, which are used to cluster graphs. For large graphs, the proposed algorithm is orders of magnitude faster than random walk based approaches. We prove the equivalence of the proposed algorithm to spectral clustering and derive convergence rates. We also demonstrate the benefit of using this decentralized clustering algorithm to accelerate distributed estimation for sensor networks and for efficient computation of distributed multi-agent search strategies."
network-theory  graph-theory  clustering  algorithms  numerical-methods  statistics  nudge-targets 
june 2010 by Vaguery
[1006.1328] Uncovering the Riffled Independence Structure of Rankings
"… In this paper, we provide a formal introduction to riffled independence and present algorithms for using riffled independence within Fourier-theoretic frameworks which have been explored by a number of recent papers. Additionally, we propose an automated method for discovering sets of items which are riffle independent from a training set of rankings. We show that our clustering-like algorithms can be used to discover meaningful latent coalitions from real preference ranking datasets and to learn the structure of hierarchically decomposable models based on riffled independence."
statistics  ranking  clustering  data-envelopment-analysis  multiobjective-optimization  nudge  numerical-methods 
june 2010 by Vaguery
[1006.1015] Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees
"Inferential summaries of tree estimates are useful in the setting of evolutionary biology, where phylogenetic trees have been built from DNA data since the 1960's. In bioinformatics, psychometrics and data mining, hierarchical clustering techniques output the same mathematical objects, and practitioners have similar questions about the stability and `generalizability' of these summaries. This paper provides an implementation of the geometric distance between trees developed by Billera, Holmes and Vogtmann (2001) [BHV] equally applicable to phylogenetic trees and hieirarchical clustering trees, and shows some of the applications in statistical inference for which this distance can be useful.…Our method gives a new way of evaluating the influence both of certain columns (positions, variables or genes) and of certain rows (whether species, observations or arrays)."
clustering  algorithms  statistics  models  classification  learning-from-data 
june 2010 by Vaguery
Getting Started Guide - Google Prediction API - Google Code
"The Prediction API allows you to get more from your data and makes its patterns more accessible. Specifically, the Prediction API leverages Google's machine learning infrastructure to give you the tools to better analyze your data and reveal patterns that are often difficult to manually discover. The API also enables you to use those patterns to predict new outcomes, which facilitates the development of all types of software, from textual analysis systems to recommendation systems. Because the Prediction API is a RESTful HTTP service, you can easily access it from Google App Engine, Apps Script, and other Internet-connected desktop applications."
nudge  machine-learning  models  google  prediction  clustering  learning-from-data  AI  API  open-science 
may 2010 by Vaguery
[1003.4002] Spectral Classification; Old and Contemporary
"Beginning with a historical account of the spectral classification, its refinement through additional criteria is presented. The line strengths and ratios used in two dimensional classifications of each spectral class are described. A parallel classification scheme for metal-poor stars and the standards used for classification are presented. The extension of spectral classification beyond M to L and T and spectroscopic classification criteria relevant to these classes are described. Contemporary methods of classifications based upon different automated approaches are introduced."
machine-learning  learning-from-data  science2.0  Nudge  clustering  statistics  astronomy  digitization 
march 2010 by Vaguery
The Geomblog: Choosing the number of clusters I: The Elbow Method
"How do we choose k, the number of clusters ?
This topic is so annoying, that I'm going to devote more than one post to it. While choosing k has been the excuse for some of the most violent hacks in clustering, there are at least a few principled directions, and there's a lot of room for further development."
clustering  statistics  tacit-knowledge  visualization  exploratory-data-analysis  contingency-of-all-models 
march 2010 by Vaguery
AI4R :: Artificial Intelligence for Ruby
"AI4R is a collection of ruby algorithms implementations, covering several Artificial intelligence fields, and simple practical examples using them. A Ruby playground for AI researchers. It implements:..."
artificial-intelligence  clustering  Ruby  AI  algorithms  library  open-source  languishing? 
march 2010 by Vaguery
PyGPU - Python for the GPU
"PyGPU is an embedded language in Python, that allow most of Python features (list-comprehensions, higher-order functions, iterators) to be used for constructing GPU algorithms. It uses a image abstraction to abstract away implementation details of the GPU, while still allowing translation to very efficient GPU native-code."
GPU  graphics-processing-unit  python  clustering  parallel  computing  programming  Moore's-Law  Nudge 
july 2007 by Vaguery
Jaccard index
Consider using to cluster (or impose a metric upon) del.icio.us posts, blog posts, or other semantic tagged items. Taking into account time-density of topics and usage.
wikipedia  metrics  machine-learning  distance  clustering  statistics  del.icio.us  project 
may 2007 by Vaguery

related tags

administration  AI  ain't-performance-space  algorithms  analytics  API  artificial-intelligence  astronomy  budgeting  business-plan  classification  clustering  comparison  complexology  computing  conferences  contingency-of-all-models  cure-for-dimensionality  data-analysis  data-centers  data-envelopment-analysis  database  del.icio.us  digitization  distance  distributed-processing  document-clustering  engineering  exploratory-data-analysis  feature-selection  google  GPU  graph-theory  graphics-processing-unit  grid-computing  heuristics  histograms  image-processing  imagemagick  languishing?  learning-from-data  libraries  library  machine-learning  Mathematica  metrics  models  Moore's-Law  multiobjective-optimization  music  mysql  natural-language-processing  network-theory  nonparametric-methods  nonparametric-statistics  nudge  nudge-targets  numerical-methods  open-science  open-source  papers  parallel  performance  prediction  preprint  proceedings  programming  project  python  ranking  representation  research  resilience  robustness  Ruby  science2.0  scientific-computing  search-engines  signal-processing  statistics  storage  strings  tacit-knowledge  tutorial  via:arthegall  via:nelson  visualization  wikipedia 

Copy this bookmark:



description:


tags: