rybesh + machinelearning   41

About Campaign 2012 in the Media | Project for Excellence in Journalism (PEJ)
To arrive at the results regarding the tone of coverage, PEJ employed computer coding software developed by Crimson Hexagon along with PEJ's traditional media research methods.

The technology for Crimson Hexagon is rooted in an algorithm created by Gary King, a professor at Harvard University's Institute for Quantitative Social Science. (Click here to view the study explaining the algorithm.)

According to Crimson Hexagon, the purpose of computer coding is to "take as data a potentially large set of text documents, of which a small subset is hand coded into an investigator-chosen set of mutually exclusive and exhaustive categories. As output, the methods give approximately unbiased and statistically consistent estimates of the proportion of all documents in each category."
news  textanalysis  sentiment  machinelearning  classification 
17 hours ago by rybesh
[1203.6402] Scalable K-Means++
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of the k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
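A minimal Python sketch of the sequential k-means++ seeding that the abstract refers to, not of the paper's parallel k-means|| variant. The names `kmeanspp_seed` and `sq_dist` are illustrative assumptions. Each iteration of the loop makes one full pass over the data, which is where the k passes mentioned above come from.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers by D^2 sampling (k-means++ style seeding)."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    # Squared distance from every point to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            raise ValueError("fewer than k distinct points")
        # Sample the next center with probability proportional to d2.
        nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        # One pass over the data to refresh the nearest-center distances.
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centers
```

Points already chosen have distance zero and therefore zero sampling weight, so the returned centers are distinct.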
clustering  machinelearning 
9 weeks ago by rybesh
Maximum Margin Temporal Clustering
Temporal Clustering (TC) refers to the factorization of multiple time series into a set of non-overlapping segments that belong to k temporal clusters. Existing methods based on extensions of generative models such as k-means or Switching Linear Dynamical Systems (SLDS) often lead to intractable inference and lack a mechanism for feature selection, critical when dealing with high dimensional data. To overcome these limitations, this paper proposes Maximum Margin Temporal Clustering (MMTC). MMTC simultaneously determines the start and the end of each segment, while learning a multi-class Support Vector Machine (SVM) to discriminate among temporal clusters. MMTC extends Maximum Margin Clustering in two ways: first, it incorporates the notion of TC, and second, it introduces additional constraints to achieve better balance between clusters. Experiments on clustering human actions and bee dancing motions illustrate the benefits of our approach compared to state-of-the-art methods.
temporality  actions  events  clustering  supervised  machinelearning 
9 weeks ago by rybesh
10 MILLION INTERNATIONAL DYADIC EVENTS
When the Palestinians launch a mortar attack into Israel, the Israeli army does not wait until the end of the calendar year to react. Yet, most modern data collections are aggregated to the month or year. The data available here include almost 10 million individual events, each coded to the exact day they occur or become known. Each event is summarized in the data as "Actor A does something to Actor B", with Actors A and B recording about 450 countries and other (within-country) actors and "does something to" coded in an ontology of about 200 types of actions. The data are coded by computer from millions of Reuters news reports. The software system (produced by VRA) that performs this task has been independently evaluated by King and Lowe (2003). This article found that for the numbers of events it was possible to convince humans (trained Harvard undergraduates) to code by hand, the machine did as well as the humans. For much larger numbers of events for which no expert coder could keep up, the machine dominates.
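A small Python sketch of the "Actor A does something to Actor B" record described above, plus the monthly aggregation the description contrasts it with. The class name, field names, and `monthly_counts` helper are assumptions made for illustration; they are not the dataset's actual schema or action ontology.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DyadicEvent:
    day: date     # exact day the event occurred or became known
    source: str   # Actor A
    action: str   # "does something to" (hypothetical label, not a real code)
    target: str   # Actor B


def monthly_counts(events):
    """Collapse day-level events into (year, month, source, target) counts."""
    return Counter((e.day.year, e.day.month, e.source, e.target) for e in events)
```

After aggregation only the counts survive; the day-level ordering of action and reaction is lost, which is the limitation the description points to.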
events  politicalscience  data  machinelearning  textanalysis 
10 weeks ago by rybesh
Blei - Introduction to Probabilistic Topic Models
Probabilistic topic models are a suite of algorithms whose aim is to discover the hidden thematic structure in large archives of documents. In this article, we review the main ideas of this field, survey the current state-of-the-art, and describe some promising future directions. We first describe latent Dirichlet allocation (LDA) [8], which is the simplest kind of topic model. We discuss its connections to probabilistic modeling, and describe two kinds of algorithms for topic discovery. We then survey the growing body of research that extends and applies topic models in interesting ways. These extensions have been developed by relaxing some of the statistical assumptions of LDA, incorporating meta-data into the analysis of the documents, and using similar kinds of models on a diversity of data types such as social networks, images and genetics. Finally, we give our thoughts as to some of the important unexplored directions for topic modeling. These include rigorous methods for checking models built for data exploration, new approaches to visualizing text and other high dimensional data, and moving beyond traditional information engineering applications towards using topic models for more scientific ends.
topicmodels  unsupervised  machinelearning  clustering 
10 weeks ago by rybesh
TinySVM: Support Vector Machines
TinySVM is an implementation of Support Vector Machines (SVMs) [Vapnik 95], [Vapnik 98] for the problem of pattern recognition.
svm  machinelearning 
12 weeks ago by rybesh
[1003.0783] Supervised Topic Models
We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive an approximate maximum-likelihood procedure for parameter estimation, which relies on variational methods to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and the political tone of amendments in the U.S. Senate based on the amendment text. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.
slda  classification  lda  topicmodels  textanalysis  machinelearning 
12 weeks ago by rybesh
Supervised latent Dirichlet allocation for classification
This is a C++ implementation of supervised latent Dirichlet allocation (sLDA) for classification.
c++  slda  classification  topicmodels  lda  machinelearning  textanalysis 
12 weeks ago by rybesh
Latent Dirichlet Allocation in C
This is a C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data. LDA allows you to analyze a corpus and extract the topics that combined to form its documents. For example, click here to see the topics estimated from a small corpus of Associated Press documents. LDA is fully described in Blei et al. (2003).

This code contains:

an implementation of variational inference for the per-document topic proportions and per-word topic assignments
a variational EM procedure for estimating the topics and exchangeable Dirichlet hyperparameter
lda  c  linguistics  machinelearning  textanalysis  textmining 
12 weeks ago by rybesh
Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition.
During the past decade there has been an explosion in computation and information technology. With it has come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It should be a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting--the first comprehensive treatment of this topic in any book.

This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie wrote much of the statistical modeling software in S-PLUS and invented principal curves and surfaces. Tibshirani proposed the Lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, and projection pursuit.
statistics  machinelearning  datamining 
12 weeks ago by rybesh
wcauchois / pysvmlight / overview — Bitbucket
A Python binding to the popular "SVM-Light" support vector machine library.
svm  machinelearning  python 
12 weeks ago by rybesh
Conditional Random Fields
Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences, trees and lattices. The underlying idea is that of defining a conditional probability distribution over label sequences given a particular observation sequence, rather than a joint distribution over both label and observation sequences. The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world tasks in many fields, including bioinformatics, computational linguistics and speech recognition.
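A brute-force Python sketch of the conditional distribution over label sequences that the description refers to, for a linear-chain model. The emission and transition weight tables and both function names are assumptions for illustration; practical CRF software computes the normalizer with dynamic programming rather than by enumerating every label sequence.

```python
import math
from itertools import product


def sequence_score(labels, obs, emit, trans):
    """Sum of emission and transition weights for one labeling of obs."""
    s = sum(emit.get((y, x), 0.0) for y, x in zip(labels, obs))
    s += sum(trans.get((a, b), 0.0) for a, b in zip(labels, labels[1:]))
    return s


def conditional_prob(labels, obs, label_set, emit, trans):
    """p(labels | obs): normalize over label sequences only, never over obs."""
    z = sum(
        math.exp(sequence_score(cand, obs, emit, trans))
        for cand in product(label_set, repeat=len(obs))
    )
    return math.exp(sequence_score(tuple(labels), obs, emit, trans)) / z
```

Because the normalizer depends on the observation sequence, the model never has to assign a probability to the observations themselves, which is the contrast with a joint model such as an HMM.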
machinelearning  nlp  crf  textmining  metadata 
february 2012 by rybesh
The Meaning and The Mining of Legal Texts
Positive law, inscribed in legal texts, entails an authority not inherent in literary texts, generating legal consequences that can have real effects on a person’s life and liberty. The interpretation of legal texts, necessarily a normative undertaking, resists the mechanical application of rules, though still requiring a measure of predictability, coherence with other relevant legal norms and compliance with constitutional safeguards. The present proliferation of legal texts on the internet (codes, statutes, judgments, treaties, doctrinal treatises) renders the selection of relevant texts and cases next to impossible. We may expect that systems to mine these texts to find arguments that support one’s case, as well as expert systems that support the decision-making process of courts, will end up doing much of the work.

This raises the question of the difference between human interpretation and computational pattern-recognition and the issue of whether this difference makes a difference for the meaning of law. Possibly, data mining will produce patterns that disclose habits of the minds of judges and legislators that would have otherwise gone unnoticed (reinforcing the argument of the ‘legal realists’ at the beginning of the 20th century). Also, after the data analysis it will still be up to the judge to decide how to interpret the results or up to the prosecution which patterns to engage in the construction of evidence (requiring a hermeneutics of computational patterns instead of texts). My focus in this paper regards the fact that the mining process necessarily disambiguates the legal texts in order to transform them into a machine-readable data set, while the algorithms used for the analysis embody a strategy that will co-determine the outcome of the patterns. There seems a major due process concern here to the extent that these patterns are invisible for the naked human eye and will not be contestable in a court of law, due to their hidden complexity and computational nature.

This position paper aims to explain what is at stake in the computational turn with regard to legal texts. This prepares for the question I want to put forward to those involved in distant reading and not-reading of texts: could a visualization of computational patterns constitute a new way of un-hiding the complexity involved, opening the results of computational ‘knowledge’ to citizens’ scrutiny?
textmining  machinelearning  visualization  digitalhumanities  law 
january 2012 by rybesh
Apache Mahout: Scalable machine learning and data mining
Currently Mahout supports mainly four use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
apache  hadoop  machinelearning  mapreduce  lda 
august 2011 by rybesh
MADlib
MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.
database  analytics  datamining  statistics  machinelearning  sql 
july 2011 by rybesh
Christopher M. Bishop: Pattern Recognition and Machine Learning
This leading textbook provides a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first-year PhD students, as well as researchers and practitioners. No previous knowledge of pattern recognition or machine learning concepts is assumed. This is the first machine learning textbook to include a comprehensive coverage of recent developments such as probabilistic graphical models and deterministic inference methods, and to emphasize a modern Bayesian perspective. It is suitable for courses on machine learning, statistics, computer science, signal processing, computer vision, data mining, and bioinformatics. This hard cover book has 738 pages in full colour, and there are 431 graded exercises (with solutions available below). Extensive support is provided for course instructors.
machinelearning  books  patterns  statistics  datamining 
june 2011 by rybesh
maui-indexer - Maui - Multi-purpose automatic topic indexing - Google Project Hosting
Maui automatically identifies main topics in text documents. Depending on the task, topics are tags, keywords, keyphrases, vocabulary terms, descriptors, index terms or titles of Wikipedia articles.

Maui performs the following tasks:

term assignment with a controlled vocabulary (or thesaurus)
subject indexing
topic indexing with terms from Wikipedia
keyphrase extraction
terminology extraction
automatic tagging
It can also be used for terminology extraction and semi-automatic topic indexing.
indexing  vocabulary  tools  nlp  machinelearning  java 
june 2011 by rybesh
Languages - Accentuate.us - Really Easy Computer Input
Accentuate.us uses statistics to predict where special characters are needed on a language-by-language basis.
language  input  python  tools  webservices  api  machinelearning 
april 2011 by rybesh
TiMBL: Tilburg Memory-Based Learner
TiMBL is an open source software package implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases.

For the past decade, TiMBL has been mostly used in natural language processing as a machine learning classifier component, but its use extends to virtually any supervised machine learning domain. Due to its particular decision-tree-based implementation, TiMBL is in many cases far more efficient in classification than a standard k-nearest neighbor algorithm would be.
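A minimal Python sketch of memory-based classification as described above: training examples are kept in memory and a new case takes the majority label of its most similar stored cases. It uses a plain overlap distance on symbolic features with optional per-feature weights. The function names are assumptions; this is not TiMBL's API, and it omits the information-gain weighting of IB1-IG and the IGTree approximation.

```python
from collections import Counter


def overlap_distance(a, b, weights=None):
    """Weighted count of positions where two symbolic feature vectors differ."""
    weights = weights or [1.0] * len(a)
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def knn_classify(memory, query, k=1, weights=None):
    """memory is a list of (features, label) pairs stored verbatim."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(ex[0], query, weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Sorting the whole memory for every query is the slow baseline that the description says TiMBL's tree-based implementation improves on.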
nlp  machinelearning  tools 
april 2011 by rybesh
Daisy Zhe Wang: BayesStore
BayesStore is a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system. BayesStore represents model and evidence data as relational tables; implements inference algorithms efficiently in SQL; adds probabilistic relational operators to the query engine; optimizes queries with both relational and inference operators. The design goals of BayesStore are: (1) to be able to support efficient query processing over different models compared to the off-the-shelf machine learning libraries; (2) to be able to support extensible API for plugging in new models and inference algorithms; and (3) to be able to scale up to very large data sets.
statistics  bayes  database  machinelearning 
january 2011 by rybesh
Modular toolkit for Data Processing (MDP)
Modular toolkit for Data Processing (MDP) is a Python data processing framework.

From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.
datamining  machinelearning  python  tools 
december 2010 by rybesh
MALLET homepage
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
datamining  java  machinelearning  nlp  tools 
october 2010 by rybesh
Google Prediction API - Google Code
The Prediction API enables access to Google's machine learning algorithms to analyze your historic data and predict likely future outcomes.
machinelearning  api  classification 
july 2010 by rybesh
Training Examples Q&A - machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization
Where data geeks ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization!
ai  machinelearning  nlp  textanalysis  ir  datamining  search  statistics  infoviz  reference 
june 2010 by rybesh
Apache Mahout
Mahout's goal is to build scalable machine learning libraries.
machinelearning  opensource  hadoop  apache  recommendation  clustering  classification  datamining 
november 2009 by rybesh
Maximum Entropy (GA) Model Optimization Package
Maximum entropy (aka logistic regression) models are very popular, especially in natural language processing. The software here is an implementation of maximum likelihood and maximum a posteriori optimization of the parameters of these models. The algorithms used are much more efficient than the iterative scaling techniques used in almost every other maxent package out there.
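A minimal Python sketch of the two estimation modes named above, for a binary logistic regression. It uses plain batch gradient ascent, which is a stand-in chosen for brevity and not the package's optimizer; the function names, the `l2` parameter, and the zero-mean Gaussian prior are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logreg(xs, ys, l2=0.0, lr=0.1, steps=2000):
    """Gradient ascent on the log-likelihood (l2 = 0, maximum likelihood)
    or on the log-posterior under a Gaussian prior (l2 > 0, MAP)."""
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        # Gradient of the log-prior term: -l2 * w.
        grad = [-l2 * wj for wj in w]
        for x, y in zip(xs, ys):
            err = y - sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w
```

On linearly separable data the maximum-likelihood weights keep growing with more steps, while the prior in the MAP variant keeps them bounded.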
research  tools  nlp  statistics  machinelearning  ocaml  logreg  maxent 
august 2009 by rybesh
ParsCit: An open-source CRF Reference String Parsing Package
It is architected as a supervised machine learning procedure that uses Conditional Random Fields as its learning mechanism.
nlp  machinelearning  opensource  perl  parsing  recognition  citations 
september 2008 by rybesh
Dawid Weiss
Text clustering, information retrieval, web mining, text processing, NLP.
people  academia  poland  search  datamining  nlp  machinelearning 
november 2007 by rybesh
Map-Reduce for Machine Learning on Multicore
In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms.
machinelearning  distributed  grid  research 
august 2007 by rybesh
gladwell.com: The Perfect and the Good
...one of the most important changes we're going to see in lots of professions over the next few years is the emergence of tools that close the gap between the middle and the top--that allow the decision-maker who is merely competent to avoid his errors a
machinelearning  decisionmaking  future  ideas  tools  expertise 
november 2006 by rybesh
Manifold - Wikipedia, the free encyclopedia
A manifold is an abstract mathematical space in which every point has a neighborhood which resembles Euclidean space, but in which the global structure may be more complicated.
math  machinelearning  statistics 
november 2006 by rybesh
Northrop
A genre categorizer that lets users narrow down searches to particular genres like editorials, financial reports or scientific writing or group search results according to genre.
genre  search  organization  nlp  classification  machinelearning 
october 2006 by rybesh
Statistical Data Mining Tutorials
A set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms.
machinelearning  reference  statistics  howto 
september 2006 by rybesh
Milind Naphade
Research interests in content analysis, information extraction, statistical machine learning and graphical modeling and detection and representation of semantic information.
multimedia  analysis  machinelearning  semweb  people  IBM  SSMS2006 
july 2006 by rybesh
Orange
Orange is a component-based data mining software. It includes a range of preprocessing, modelling and data exploration techniques.
machinelearning  classification  code  datamining  python  opensource  tools  nlp  statistics 
october 2005 by rybesh
Data Mining in Python
This is a collection of libraries useful for machine learning and data mining.
python  statistics  machinelearning  nlp  code  opensource  datamining 
october 2005 by rybesh
Divmod.org :: Reverend
Reverend is a general purpose Bayesian classifier. Use the Reverend to quickly add Bayesian smarts to your app.
machinelearning  bayes  classification  python  statistics  opensource  code 
october 2005 by rybesh
MusicStrands
MusicStrands uses statistical machine learning, collaborative filtering, and link-based analysis to provide independent music recommendations based on the listening behavior of individuals and social networks.
music  playlist  social  networking  tools  machinelearning  statistics 
july 2005 by rybesh
