rybesh + clustering 7
[1203.6402] Scalable K-Means++
9 weeks ago by rybesh
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
clustering
machinelearning
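The sequential bottleneck the abstract describes can be seen in a minimal sketch of k-means++ seeding (not the paper's k-means|| algorithm). The function names `sq_dist` and `kmeanspp_seed` are hypothetical, and points are assumed to be tuples of numbers. Each new center needs one full pass over the data to refresh the squared distances, which is where the k passes come from.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers by D^2 sampling (k-means++ style seeding).

    Assumption: `points` is a non-empty list of numeric tuples.
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Sample the next center with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(points[idx])
        # One full pass over the data per new center: the sequential cost.
        d2 = [min(d, sq_dist(p, centers[-1])) for d, p in zip(d2, points)]

    return centers
```

For example, `kmeanspp_seed([(0, 0), (0, 1), (10, 10), (10, 11)], 2, random.Random(0))` returns two distinct points from the list, because a point that is already a center has sampling weight zero.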
Maximum Margin Temporal Clustering
9 weeks ago by rybesh
Temporal Clustering (TC) refers to the factorization of multiple time series into a set of non-overlapping segments that belong to k temporal clusters. Existing methods based on extensions of generative models such as k-means or Switching Linear Dynamical Systems (SLDS) often lead to intractable inference and lack a mechanism for feature selection, critical when dealing with high dimensional data. To overcome these limitations, this paper proposes Maximum Margin Temporal Clustering (MMTC). MMTC simultaneously determines the start and the end of each segment, while learning a multi-class Support Vector Machine (SVM) to discriminate among temporal clusters. MMTC extends Maximum Margin Clustering in two ways: first, it incorporates the notion of TC, and second, it introduces additional constraints to achieve better balance between clusters. Experiments on clustering human actions and bee dancing motions illustrate the benefits of our approach compared to state-of-the-art methods.
temporality
actions
events
clustering
supervised
machinelearning
Blei - Introduction to Probabilistic Topic Models
10 weeks ago by rybesh
Probabilistic topic models are a suite of algorithms whose aim is to discover the hidden thematic structure in large archives of documents. In this article, we review the main ideas of this field, survey the current state-of-the-art, and describe some promising future directions. We first describe latent Dirichlet allocation (LDA) [8], which is the simplest kind of topic model. We discuss its connections to probabilistic modeling, and describe two kinds of algorithms for topic discovery. We then survey the growing body of research that extends and applies topic models in interesting ways. These extensions have been developed by relaxing some of the statistical assumptions of LDA, incorporating meta-data into the analysis of the documents, and using similar kinds of models on a diversity of data types such as social networks, images and genetics. Finally, we give our thoughts as to some of the important unexplored directions for topic modeling. These include rigorous methods for checking models built for data exploration, new approaches to visualizing text and other high dimensional data, and moving beyond traditional information engineering applications towards using topic models for more scientific ends.
topicmodels
unsupervised
machinelearning
clustering
Data Clustering Software | Karypis Lab
january 2012 by rybesh
CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters. CLUTO is well-suited for clustering data sets arising in many diverse application areas including information retrieval, customer purchasing transactions, web, GIS, science, and biology.
clustering
datamining
Sapping Attention: Fresh set of eyes
february 2011 by rybesh
If we treat each lettered heading in the Library of Congress Catalog as a single, long text, we can ask the computer to find similar genres based on word usage.
classification
clustering
inls520
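One simple way to compare texts "based on word usage" is cosine similarity over word counts. The sketch below is an illustration under that assumption and not necessarily the method used in the linked post; the function names and the sample headings are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, texts):
    """Rank every other heading by similarity to `target`, best first.

    Assumption: `texts` maps a heading label to the concatenated text
    of everything filed under that heading.
    """
    counts = {name: word_counts(body) for name, body in texts.items()}
    scored = [
        (cosine_similarity(counts[target], c), name)
        for name, c in counts.items()
        if name != target
    ]
    return sorted(scored, reverse=True)
```

With three made-up headings, `most_similar("A", {"A": "history of rome and greece", "B": "history of greece and egypt", "C": "chemistry of acids"})` ranks "B" ahead of "C", since "A" and "B" share four of their five words.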
lda: Collapsed Gibbs sampling methods for topic models
november 2009 by rybesh
This package implements latent Dirichlet allocation (LDA) and related models. This includes (but is not limited to) sLDA, corrLDA, and the mixed-membership stochastic blockmodel.
clustering
textanalysis
datamining
R
topicmodels
Apache Mahout
november 2009 by rybesh
Mahout's goal is to build scalable machine learning libraries.
machinelearning
opensource
hadoop
apache
recommendation
clustering
classification
datamining