rybesh + datamining   44

Twelve steps to running your Ruby code across five billion web pages | CommonCrawl
A starting point to write your own Ruby algorithms to analyse the wealth of information that’s buried in the Common Crawl web archive.
ec2  hadoop  web  datamining  textmining 
9 weeks ago by rybesh
Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition.
During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It should be a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting--the first comprehensive treatment of this topic in any book.

This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization and spectral clustering. There is also a chapter on methods for “wide” data (p bigger than n), including multiple testing and false discovery rates.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie wrote much of the statistical modeling software in S-PLUS and invented principal curves and surfaces. Tibshirani proposed the Lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, and projection pursuit.
statistics  machinelearning  datamining 
12 weeks ago by rybesh
Library Juice » Data Mining
Austin et al. point out that the statistical methods that are at the heart of data mining are not able to distinguish real from spurious associations. Data mining employs the automated examination of enormous bodies of data. Its usefulness is thought to be proportional to the size of the data set that it collates; however, as the data set becomes larger and as the number of attributes that serve as potential relata increases, the number of potential relationships increases exponentially. Importantly, the number of spurious associations also increases. With enough data, no significance test will be stringent enough to provide assurance against the kind of results found in Austin et al. What is needed, according to Austin et al., is a “pre-specified plausible hypothesis.” For statistical analysis to be useful, the researcher must begin with a hypothesis, preferably a plausible one, if the research is to be valuable.

What exactly is a pre-specified plausible hypothesis and how can we generate it if data mining can’t do that for us? The question was posed some sixty years ago by the philosopher Nelson Goodman using different terms: Goodman believed that a critical question for epistemology was to distinguish between “projectible and non-projectible hypotheses.” One can more or less replace “pre-specified plausible hypothesis” with Goodman’s term “projectible hypothesis.” According to Goodman, when we seek to understand what hypothesis is (or is not) projectible, we do not come to the problem “empty-headed but with some stock of knowledge” which we use to determine what is (or is not) projectible. Projectible hypotheses will be those which do not conflict with other hypotheses that have been supported in the past. They will commonly use the same terminology of previously supported hypotheses. The terminology appearing in the hypotheses will have become “entrenched” in the language. This goes a long distance toward explaining why we don’t find the link between one’s astrological sign and medical conditions plausible. Twenty-first century Western medicine is not accustomed to linking astrological signs to ailments and so must find any hypothesis that does so implausible.

If Goodman is correct, then data mining is of little use without an historical understanding of the field of science to which the data pertains.

...

Here, we have another argument for allocating library resources to pay for librarians with deep subject expertise. As e-science develops, vendors will make more and more data sets available, regardless of their actual worth to researchers. To effectively choose the data sets that are of value, librarians must have a thorough understanding of the research needs of their patrons. To do this, they must have a deep understanding of the field. Unfortunately, with the excitement swirling around e-science, the mere access to large data sets threatens to become the be-all and end-all in collection management. If we aren’t careful, we may find ourselves with mountains of data from which everything and nothing can be concluded.
datamining  statistics  knowledge  digitalhumanities  libraries  epistemology 
february 2012 by rybesh
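The Library Juice excerpt's point about spurious associations can be illustrated with a small simulation. This is a minimal sketch, not taken from the post or from Austin et al.: every column is independent random noise, so any pair that clears the threshold is spurious by construction. Counting only pairwise relationships, p attributes give p(p-1)/2 pairs; counting arbitrary subsets of attributes gives 2^p. The function names, the 50-row sample size, and the 0.28 cutoff (approximately the |r| needed for p < 0.05 with 50 rows) are assumptions chosen for illustration.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_pairs(n_attributes, n_rows=50, threshold=0.28, seed=0):
    """Return (number of attribute pairs, pairs with |r| above threshold).

    All columns are independent Gaussian noise (an assumption of this
    sketch), so every pair counted as a hit is a spurious association.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(n_rows)]
        for _ in range(n_attributes)
    ]
    pairs = list(itertools.combinations(range(n_attributes), 2))
    hits = sum(
        1 for i, j in pairs
        if abs(pearson(columns[i], columns[j])) > threshold
    )
    return len(pairs), hits


for p in (10, 40, 160):
    total, hits = count_spurious_pairs(p)
    print(f"{p:>4} attributes: {total:>6} pairs, {hits} above threshold")
```

With a fixed per-pair false-positive rate, the number of hits should grow roughly in step with the number of pairs, which is why a pre-specified hypothesis matters more as the attribute count rises.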
Historical Controversies Now
Instead of going to the library or the archive, we increasingly access history, the past, through the web. But what kind of history or histories, past or pasts are we accessing online? And what does this accessing entail? Following Leong et al., we approach temporality on the web “as a multiplicity of times derived from relations between different elements” (2009, 1279). This project is specifically focused on contentious historical moments, pasts that have had and potentially still have a major emotional impact, and which have been the subject of struggle. Moreover, we are not interested in sites specifically devoted to history, but in the major platforms on the web.

Confronting the historical events on the various platforms and opening up to a multiplicity of time, we immediately realized that the traditional linear conception of time does not work online. First, most platforms do not work in a chronological fashion, but with a reverse chronology. Second, because the platforms order sources according to ‘relevance’, the chronology of the sources as they are presented to us is radically mixed up. Third, sources do their own trick with time as well. Some focus on the historical event itself, while others rework the event. This reworking happens in a wide variety of ways, for example, by metaphorically invoking the event, by turning it into a historiographic debate, or by incorporating the event in a personal account (reading a history book, visiting a historical site, listening to a song). Crucially, in some of these reworkings, the event is actualized as controversial. These temporal complications directly informed our research, analysis, and visualization.

The above considerations translate into the following research questions:

Source time: Do we primarily find contemporary sources or historical sources in the various spheres? Does this vary across controversies?

Historical time: Do the sources on a platform focus on the historical moment itself, or a contemporary reworking of the moment? Does this vary across controversies?

Heat of the controversy: Is the controversy treated as settled, or is it actualized as still controversial? Does this vary across platforms and controversies?
history  datamining  web  publichistory 
january 2012 by rybesh
Data Clustering Software | Karypis Lab
CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters. CLUTO is well-suited for clustering data sets arising in many diverse application areas including information retrieval, customer purchasing transactions, web, GIS, science, and biology.
clustering  datamining 
january 2012 by rybesh
DDupe
Visualizing and analyzing social networks is a challenging problem that has been receiving growing attention. An important first step, before analysis can begin, is ensuring that the data is accurate. A common data quality problem is that the data may inadvertently contain several distinct references to the same underlying entity; the process of reconciling these references is called entity resolution. D-Dupe is an interactive tool that combines data mining algorithms for entity resolution with a task-specific network visualization. Users cope with complexity of cleaning large networks by focusing on a small subnetwork containing a potential duplicate pair. The subnetwork highlights relationships in the social network, making the common relationships easy to visually identify. D-Dupe users resolve ambiguities either by merging nodes or by marking them distinct. The entity resolution process is iterative: as pairs of nodes are resolved, additional duplicates may be revealed; therefore, resolution decisions are often chained together. We give examples of how users can flexibly apply sequences of actions to produce a high quality entity resolution result.
datamining  nlp  networks  visualization 
january 2012 by rybesh
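The iterative resolution loop described in the D-Dupe entry (surface a candidate pair, merge it or mark it distinct, then look again because the merge may reveal new duplicates) can be sketched as follows. This is not D-Dupe's algorithm: string similarity from `difflib` and Jaccard overlap of neighbours are stand-ins chosen for illustration, the `decide` callback stands in for the human judgement the tool supports, and the co-author network and all names in it are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def neighbor_overlap(graph, a, b):
    """Jaccard overlap of two nodes' neighbours, ignoring each other."""
    na = graph[a] - {b}
    nb = graph[b] - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


def best_candidate(graph, distinct, min_name=0.6):
    """Return the unresolved, similarly named pair sharing the most neighbours."""
    best, best_overlap = None, 0.0
    for a, b in combinations(sorted(graph), 2):
        if frozenset((a, b)) in distinct:
            continue  # the user already marked this pair as distinct
        if name_similarity(a, b) < min_name:
            continue
        overlap = neighbor_overlap(graph, a, b)
        if overlap > best_overlap:
            best, best_overlap = (a, b), overlap
    return best


def merge(graph, keep, drop):
    """Fold node `drop` into node `keep`, rewiring its edges."""
    for n in graph.pop(drop):
        graph[n].discard(drop)
        if n != keep:
            graph[n].add(keep)
            graph[keep].add(n)


def resolve(graph, decide):
    """Repeat candidate selection until no candidate pair remains."""
    distinct, merged = set(), []
    while True:
        pair = best_candidate(graph, distinct)
        if pair is None:
            return merged
        a, b = pair
        if decide(a, b):
            keep, drop = (a, b) if len(a) >= len(b) else (b, a)
            merge(graph, keep, drop)
            merged.append((drop, keep))
        else:
            distinct.add(frozenset(pair))


# Hypothetical undirected co-author network: node -> set of neighbours.
network = {
    "J. Smith": {"A. Lee", "B. Chen", "Carlos Diaz"},
    "John Smith": {"A. Lee", "B. Chen", "C. Diaz"},
    "A. Lee": {"J. Smith", "John Smith"},
    "B. Chen": {"J. Smith", "John Smith"},
    "C. Diaz": {"John Smith"},
    "Carlos Diaz": {"J. Smith"},
}

merges = resolve(network, decide=lambda a, b: True)  # accept every proposal
print(merges)
```

In this toy network the two Diaz nodes share no neighbour at first, so they only become a candidate pair after the two Smith nodes are merged. That is the chaining of resolution decisions the entry describes.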
Detecting Novel Associations in Large Data Sets
Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs—far too many to examine manually. If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones?
statistics  relationships  datamining 
december 2011 by rybesh
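The arithmetic behind "hundreds of variables" yielding "tens of thousands of variable pairs" is the count of unordered pairs, n(n-1)/2. A quick check, with the function name and the sample sizes chosen here purely for illustration:

```python
def variable_pairs(n_variables):
    """Number of unordered pairs among n_variables variables."""
    return n_variables * (n_variables - 1) // 2


for n in (100, 200, 300, 500):
    print(f"{n} variables -> {variable_pairs(n)} pairs")
```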
MADlib
MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.
database  analytics  datamining  statistics  machinelearning  sql 
july 2011 by rybesh
Christopher M. Bishop: Pattern Recognition and Machine Learning
This leading textbook provides a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first-year PhD students, as well as researchers and practitioners. No previous knowledge of pattern recognition or machine learning concepts is assumed. This is the first machine learning textbook to include a comprehensive coverage of recent developments such as probabilistic graphical models and deterministic inference methods, and to emphasize a modern Bayesian perspective. It is suitable for courses on machine learning, statistics, computer science, signal processing, computer vision, data mining, and bioinformatics. This hard cover book has 738 pages in full colour, and there are 431 graded exercises (with solutions available below). Extensive support is provided for course instructors.
machinelearning  books  patterns  statistics  datamining 
june 2011 by rybesh
PEGASUS: Peta-Scale Graph Mining System
PEGASUS is a Peta-scale graph mining system, fully written in Java. It runs in a parallel, distributed manner on top of Hadoop.
graph  datamining  hadoop 
june 2011 by rybesh
Wikipedia Miner - Home
Wikipedia Miner is a toolkit for navigating and making use of the structure and content of Wikipedia. It aims to make it easy for you to integrate Wikipedia's knowledge into your own applications, by:

providing simplified, object-oriented access to Wikipedia's structure and content.
measuring how terms and concepts in Wikipedia are connected to each other.
detecting and disambiguating Wikipedia topics when they are mentioned in documents.
wikipedia  datamining  api 
june 2011 by rybesh
Wikipedia Miner - Home
Wikipedia Miner is a toolkit for navigating and making use of the structure and content of Wikipedia. It aims to make it easy for you to integrate Wikipedia's knowledge into your own applications, by:

providing simplified, object-oriented access to Wikipedia's structure and content.
measuring how terms and concepts in Wikipedia are connected to each other.
detecting and disambiguating Wikipedia topics when they are mentioned in documents.
wikipedia  textmining  nlp  webservices  tools  datamining 
may 2011 by rybesh
Goose
Goose aims to create an easy to use, scalable extractor that can plug into any application that needs to extract structure from unstructured web pages.
datamining  text  extraction  java 
may 2011 by rybesh
List of resources: Article text extraction from HTML documents | My tech blog.
A list of research papers, articles, web APIs, libraries and other software for article text extraction.
datamining  extraction  html  scraping 
march 2011 by rybesh
Overview: Extracting article text from HTML documents | My tech blog.
In the world of web scraping, text mining and article reading utilities (readability bookmarklet) there is an ever-growing demand for utilities that are capable of distinguishing the parts of an HTML document which represent an article from other common website building blocks like menus, headers, footers, ads etc.
datamining  extraction  html  scraping 
march 2011 by rybesh
Modular toolkit for Data Processing (MDP)
Modular toolkit for Data Processing (MDP) is a Python data processing framework.

From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.
datamining  machinelearning  python  tools 
december 2010 by rybesh
tm - Text Mining Package
tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package has integrated database backend support to minimize memory demands. Advanced metadata management is implemented for collections of text documents to ease the use of large, metadata-enriched document sets.
R  textmining  datamining  nlp  tools  statistics 
october 2010 by rybesh
MALLET homepage
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
datamining  java  machinelearning  nlp  tools 
october 2010 by rybesh
ScraperWiki
Anyone can write a screen scraper using the online editor, and the code and data are shared with the world.
datamining  opendata  scraping 
july 2010 by rybesh
Chris Heathcote: anti-mega: griotism
Whilst we have the luxury of open APIs to services, it’s rarely rich enough data for interesting stories to be told. APIs tend to be locked in the present – as the present is what a lot of services are fixated on. Use, not stories. Some element of time is normally needed to pull out data that tells interesting stories, often long periods of time.
data  narrative  datamining  history  time 
july 2010 by rybesh
Training Examples Q&A - machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization
Where data geeks ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization!
ai  machinelearning  nlp  textanalysis  ir  datamining  search  statistics  infoviz  reference 
june 2010 by rybesh
DBpedia Mappings
This wiki contains the infobox-to-ontology and the table-to-ontology mappings which are used by the DBpedia extraction framework as well as the ontology definition itself. The framework collects the templates defined in this Wiki and extracts the Wikipedia content according to them.
wikipedia  ontology  semweb  datamining  extraction 
march 2010 by rybesh
lda: Collapsed Gibbs sampling methods for topic models
This package implements latent Dirichlet allocation (LDA) and related models. This includes (but is not limited to) sLDA, corrLDA, and the mixed-membership stochastic blockmodel.
clustering  textanalysis  datamining  R  topicmodels 
november 2009 by rybesh
Apache Mahout
Mahout's goal is to build scalable machine learning libraries.
machinelearning  opensource  hadoop  apache  recommendation  clustering  classification  datamining 
november 2009 by rybesh
LingPipe
LingPipe is a suite of Java libraries for the linguistic analysis of human language.
java  nlp  datamining  tools  entitydetection 
may 2009 by rybesh
Data Mining with R: learning by case studies
The main goal of this book is to introduce the reader to the use of R as a tool for performing data mining.
R  datamining  reference 
september 2008 by rybesh
Web of Fate | Share your future
Web of Fate is a social experiment that harnesses the collective intelligence of the web to visualize and uncover hidden relationships among future and historical events.
datamining  forecasting  future  collaboration  nlp  events  extraction  semweb  ontology  prediction 
july 2008 by rybesh
Apache UIMA - Apache UIMA
The Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies.
extraction  recognition  architecture  tools  java  datamining  search 
february 2008 by rybesh
Dawid Weiss
Text clustering, information retrieval, web mining, text processing, NLP.
people  academia  poland  search  datamining  nlp  machinelearning 
november 2007 by rybesh
OpenTextMining
Open Text Mining Interface (OTMI) is an initiative from Nature Publishing Group (NPG). It aims to enable scholarly publishers, among others, to disclose their full text for indexing and text-mining purposes but without giving it away in a form that is rea
academia  publishing  copyright  data  nlp  standards  datamining 
november 2007 by rybesh
//re:digg\\ » Blog Archive » *New* Sections & Data Mining
"...the notion of quantifying a community’s potential bias is nothing short of remarkable."
journalism  statistics  nlp  quantitative  methods  bias  datamining  election 
march 2007 by rybesh
Media @ LSE Group Weblog » Blog Archive » Dangerously overstating the significance of Web 2.0
Web 2.0 enthusiasts believe that the contents of user-content databases represent the preferences and interests of everyone instead of the somewhat self-reinforcing interest clusters of a technologically savvy elite.
web2.0  datamining  social  metadata  ideology  architecture  technology  bias 
february 2007 by rybesh
Topic Modeling Toolbox
Tools for entity recognition, extraction and linking.
nlp  tools  research  statistics  datamining  analysis  matlab 
july 2006 by rybesh
Online Maps: The Next Generation
Media systems scientists at USC rely on geospatial technology to integrate a wealth of information that is accurate and easily accessible for decision-makers in a wide range of fields.
locative  maps  infoviz  datamining  research 
december 2005 by rybesh
JUNG - Java Universal Network/Graph Framework
A software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network.
datamining  social  networking  tools  infoviz  java  opensource 
december 2005 by rybesh
Congress votes database | washingtonpost.com
This site lets you browse every vote in the U.S. Congress since 1991.
politics  database  datamining  government  usa 
december 2005 by rybesh
How News is Made, by Dale Dougherty
The Internet allows us to see how news is made, as though we were walking through a factory tour, and we can compare the very similar results of a mass production system.
internet  journalism  media  news  politics  web  datamining 
december 2005 by rybesh
WWW2006 Workshop - Logging Traces of Web Activity: The Mechanics of Data Collection
This one day workshop will examine the trade-offs and challenges inherent to the different logging approaches and provide workshop attendees the opportunity to discuss both previous data collection experiences and upcoming challenges.
web  conference  2006  workshop  statistics  datamining 
december 2005 by rybesh
Enthought Python
A Python distribution that comes with even more useful capabilities already installed and ready for use.
python  windows  science  math  statistics  datamining  tools  opensource  code 
october 2005 by rybesh
The R Project for Statistical Computing
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
code  datamining  language  math  opensource  statistics  tools 
october 2005 by rybesh
Orange
Orange is component-based data mining software. It includes a range of preprocessing, modelling and data exploration techniques.
machinelearning  classification  code  datamining  python  opensource  tools  nlp  statistics 
october 2005 by rybesh
Data Mining in Python
This is a collection of libraries useful for machine learning and data mining.
python  statistics  machinelearning  nlp  code  opensource  datamining 
october 2005 by rybesh
