rybesh + datamining 44
Twelve steps to running your Ruby code across five billion web pages | CommonCrawl
9 weeks ago by rybesh
A starting point to write your own Ruby algorithms to analyse the wealth of information that’s buried in the Common Crawl web archive.
ec2
hadoop
web
datamining
textmining
9 weeks ago by rybesh
Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition.
12 weeks ago by rybesh
During the past decade there has been an explosion in computation and information technology. With it has come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It should be a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting--the first comprehensive treatment of this topic in any book.
This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie wrote much of the statistical modeling software in S-PLUS and invented principal curves and surfaces. Tibshirani proposed the Lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, and projection pursuit.
statistics
machinelearning
datamining
12 weeks ago by rybesh
Library Juice » Data Mining
february 2012 by rybesh
Austin et al. point out that the statistical methods that are at the heart of data mining are not able to distinguish real from spurious associations. Data mining employs the automated examination of enormous bodies of data. Its usefulness is thought to be proportional to the size of the data set that it collates; however, as the data set becomes larger and as the number of attributes that serve as potential relata increases, the number of potential relationships increases exponentially. Importantly, the number of spurious associations also increases. With enough data, no significance test will be stringent enough to provide assurance against the kind of results found in Austin et al. What is needed, according to Austin et al., is a “pre-specified plausible hypothesis.” For statistical analysis to be useful, the researcher must begin with a hypothesis, preferably a plausible one, if the research is to be valuable.
What exactly is a pre-specified plausible hypothesis and how can we generate it if data mining can’t do that for us? The question was posed some sixty years ago by the philosopher Nelson Goodman using different terms: Goodman believed that a critical question for epistemology was to distinguish between “projectible and non-projectible hypotheses.” One can more or less replace “pre-specified plausible hypothesis” with Goodman’s term “projectible hypothesis.” According to Goodman, when we seek to understand what hypothesis is (or is not) projectible, we do not come to the problem “empty-headed but with some stock of knowledge” which we use to determine what is (or is not) projectible. Projectible hypotheses will be those which do not conflict with other hypotheses that have been supported in the past. They will commonly use the same terminology of previously supported hypotheses. The terminology appearing in the hypotheses will have become “entrenched” in the language. This goes a long distance toward explaining why we don’t find the link between one’s astrological sign and medical conditions plausible. Twenty-first century Western medicine is not accustomed to linking astrological signs to ailments and so must find any hypothesis that does so implausible.
If Goodman is correct, then data mining is of little use without an historical understanding of the field of science to which the data pertains.
...
Here, we have another argument for allocating library resources to pay for librarians with deep subject expertise. As e-science develops, vendors will make more and more data sets available, regardless of their actual worth to researchers. To effectively choose the data sets that are of value, librarians must have a thorough understanding of the research needs of their patrons. To do this, they must have a deep understanding of the field. Unfortunately, with the excitement swirling around e-science, the mere access to large data sets threatens to become the be-all and end-all in collection management. If we aren’t careful, we may find ourselves with mountains of data from which everything and nothing can be concluded.
datamining
statistics
knowledge
digitalhumanities
libraries
epistemology
february 2012 by rybesh
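The Library Juice excerpt's point about spurious associations can be illustrated with a small simulation. This is only a sketch under stated assumptions, not anything taken from Austin et al. or the blog post: the variables are independent Gaussian noise, only pairwise relationships are counted (these alone grow as n(n-1)/2), and the 5% cutoff uses a large-sample normal approximation. The function names `pearson` and `count_spurious` are made up for this illustration.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_vars, n_samples, seed=0):
    """Return (number of pairs, number of 'significant' pairs) for pure noise."""
    rng = random.Random(seed)
    # Every variable is independent noise, so any "significant" pair is spurious.
    data = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_vars)]
    # Assumed approximation: under independence r is roughly Normal(0, 1/n_samples),
    # so |r| > 1.96 / sqrt(n_samples) behaves like a two-sided 5% test.
    threshold = 1.96 / math.sqrt(n_samples)
    pairs = list(combinations(range(n_vars), 2))
    hits = sum(1 for i, j in pairs if abs(pearson(data[i], data[j])) > threshold)
    return len(pairs), hits


if __name__ == "__main__":
    for n_vars in (10, 20, 40):
        pairs, hits = count_spurious(n_vars, n_samples=100)
        print(f"{n_vars} variables: {pairs} pairs, {hits} spurious 'findings'")
```

With a fixed per-test error rate, the count of spurious hits should scale roughly with the number of pairs tested, which is why a hypothesis fixed in advance matters more as the attribute count grows.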
Historical Controversies Now
january 2012 by rybesh
Instead of going to the library or the archive, we increasingly access history, the past, through the web. But what kind of history or histories, past or pasts are we accessing online? And what does this accessing entail? Following Leong et al., we approach temporality on the web “as a multiplicity of times derived from relations between different elements” (2009, 1279). This project is specifically focused on contentious historical moments, pasts that have had and potentially still have a major emotional impact, and which have been the subject of struggle. Moreover, we are not interested in sites specifically devoted to history, but in the major platforms on the web.
Confronting the historical events on the various platforms and opening up to a multiplicity of time, we immediately realized that the traditional linear conception of time does not work online. First, most platforms do not work in a chronological fashion, but with a reverse chronology. Second, because the platforms order sources according to ‘relevance’, the chronology of the sources as they are presented to us is radically mixed up. Third, sources do their own trick with time as well. Some focus on the historical event itself, while others rework the event. This reworking happens in a wide variety of ways, for example, by metaphorically invoking the event, by turning it into a historiographic debate, or by incorporating the event in a personal account (reading a history book, visiting a historical site, listening to a song). Crucially, in some of these reworkings, the event is actualized as controversial. These temporal complications directly informed our research, analysis, and visualization.
The above considerations translate into the following research questions:
Source time: Do we primarily find contemporary sources or historical sources in the various spheres? Does this vary across controversies?
Historical time: Do the sources on a platform focus on the historical moment itself, or a contemporary reworking of the moment? Does this vary across controversies?
Heat of the controversy: Is the controversy treated as settled, or is it actualized as still controversial? Does this vary across platforms and controversies?
history
datamining
web
publichistory
january 2012 by rybesh
Data Clustering Software | Karypis Lab
january 2012 by rybesh
CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters. CLUTO is well-suited for clustering data sets arising in many diverse application areas including information retrieval, customer purchasing transactions, web, GIS, science, and biology.
clustering
datamining
january 2012 by rybesh
DDupe
january 2012 by rybesh
Visualizing and analyzing social networks is a challenging problem that has been receiving growing attention. An important first step, before analysis can begin, is ensuring that the data is accurate. A common data quality problem is that the data may inadvertently contain several distinct references to the same underlying entity; the process of reconciling these references is called entity resolution. D-Dupe is an interactive tool that combines data mining algorithms for entity resolution with a task-specific network visualization. Users cope with the complexity of cleaning large networks by focusing on a small subnetwork containing a potential duplicate pair. The subnetwork highlights relationships in the social network, making the common relationships easy to identify visually. D-Dupe users resolve ambiguities either by merging nodes or by marking them distinct. The entity resolution process is iterative: as pairs of nodes are resolved, additional duplicates may be revealed; therefore, resolution decisions are often chained together. We give examples of how users can flexibly apply sequences of actions to produce a high quality entity resolution result.
datamining
nlp
networks
visualization
january 2012 by rybesh
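As a rough illustration of the entity resolution step described in the D-Dupe entry, the sketch below flags potential duplicate pairs for a person to merge or mark distinct. It is not D-Dupe's algorithm: it assumes plain string similarity from Python's standard `difflib` module as a stand-in, ignores the network relationships the description mentions, and the `candidate_duplicates` name, the sample names, and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def candidate_duplicates(names, threshold=0.85):
    """Return (score, a, b) for name pairs whose string similarity meets the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        # ratio() is 2 * matches / total length, between 0.0 and 1.0.
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((score, a, b))
    # Most similar pairs first, so a reviewer sees the likeliest duplicates early.
    return sorted(pairs, reverse=True)


if __name__ == "__main__":
    people = ["Jon Smith", "John Smith", "Jane Doe"]  # made-up sample data
    for score, a, b in candidate_duplicates(people):
        print(f"{a!r} ~ {b!r} (similarity {score:.2f})")
```

Because each merge changes which references remain, a real workflow would rerun the candidate search after every decision, matching the iterative process the description outlines.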
Detecting Novel Associations in Large Data Sets
december 2011 by rybesh
Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs—far too many to examine manually. If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones?
statistics
relationships
datamining
december 2011 by rybesh
MADlib
july 2011 by rybesh
MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.
database
analytics
datamining
statistics
machinelearning
sql
july 2011 by rybesh
Christopher M. Bishop: Pattern Recognition and Machine Learning
june 2011 by rybesh
This leading textbook provides a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first-year PhD students, as well as researchers and practitioners. No previous knowledge of pattern recognition or machine learning concepts is assumed. This is the first machine learning textbook to include a comprehensive coverage of recent developments such as probabilistic graphical models and deterministic inference methods, and to emphasize a modern Bayesian perspective. It is suitable for courses on machine learning, statistics, computer science, signal processing, computer vision, data mining, and bioinformatics. This hard cover book has 738 pages in full colour, and there are 431 graded exercises (with solutions available below). Extensive support is provided for course instructors.
machinelearning
books
patterns
statistics
datamining
june 2011 by rybesh
PEGASUS: Peta-Scale Graph Mining System
june 2011 by rybesh
PEGASUS is a Peta-scale graph mining system, fully written in Java. It runs in a parallel, distributed manner on top of Hadoop.
graph
datamining
hadoop
june 2011 by rybesh
Wikipedia Miner - Home
june 2011 by rybesh
Wikipedia Miner is a toolkit for navigating and making use of the structure and content of Wikipedia. It aims to make it easy for you to integrate Wikipedia's knowledge into your own applications, by:
providing simplified, object-oriented access to Wikipedia's structure and content.
measuring how terms and concepts in Wikipedia are connected to each other.
detecting and disambiguating Wikipedia topics when they are mentioned in documents.
wikipedia
datamining
api
june 2011 by rybesh
Wikipedia Miner - Home
may 2011 by rybesh
Wikipedia Miner is a toolkit for navigating and making use of the structure and content of Wikipedia. It aims to make it easy for you to integrate Wikipedia's knowledge into your own applications, by:
providing simplified, object-oriented access to Wikipedia's structure and content.
measuring how terms and concepts in Wikipedia are connected to each other.
detecting and disambiguating Wikipedia topics when they are mentioned in documents.
wikipedia
textmining
nlp
webservices
tools
datamining
may 2011 by rybesh
Goose
may 2011 by rybesh
Goose aims to create an easy to use, scalable extractor that can plug into any application that needs to extract structure from unstructured web pages.
datamining
text
extraction
java
may 2011 by rybesh
List of resources: Article text extraction from HTML documents | My tech blog.
march 2011 by rybesh
A list of research papers, articles, web APIs, libraries and other software for article text extraction.
datamining
extraction
html
scraping
march 2011 by rybesh
Overview: Extracting article text from HTML documents | My tech blog.
march 2011 by rybesh
In the world of web scraping, text mining and article reading utilities (readability bookmarklet) there is an ever-growing demand for utilities that are capable of distinguishing the parts of an HTML document which represent an article from other common website building blocks like menus, headers, footers, ads etc.
datamining
extraction
html
scraping
march 2011 by rybesh
Modular toolkit for Data Processing (MDP)
december 2010 by rybesh
Modular toolkit for Data Processing (MDP) is a Python data processing framework.
From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.
datamining
machinelearning
python
tools
december 2010 by rybesh
tm - Text Mining Package
october 2010 by rybesh
tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.
The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package has integrated database backend support to minimize memory demands. Advanced metadata management is implemented for collections of text documents to ease the use of large, metadata-enriched document sets.
R
textmining
datamining
nlp
tools
statistics
october 2010 by rybesh
MALLET homepage
october 2010 by rybesh
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
datamining
java
machinelearning
nlp
tools
october 2010 by rybesh
ScraperWiki
july 2010 by rybesh
Anyone can write a screen scraper using the online editor, and the code and data are shared with the world.
datamining
opendata
scraping
july 2010 by rybesh
Chris Heathcote: anti-mega: griotism
july 2010 by rybesh
Whilst we have the luxury of open APIs to services, it’s rarely rich enough data for interesting stories to be told. APIs tend to be locked in the present – as the present is what a lot of services are fixated on. Use, not stories. Some element of time is normally needed to pull out data that tells interesting stories, often long periods of time.
data
narrative
datamining
history
time
july 2010 by rybesh
Training Examples Q&A - machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization
june 2010 by rybesh
Where data geeks ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization!
ai
machinelearning
nlp
textanalysis
ir
datamining
search
statistics
infoviz
reference
june 2010 by rybesh
DBpedia Mappings
march 2010 by rybesh
This wiki contains the infobox-to-ontology and the table-to-ontology mappings which are used by the DBpedia extraction framework as well as the ontology definition itself. The framework collects the templates defined in this Wiki and extracts the Wikipedia content according to them.
wikipedia
ontology
semweb
datamining
extraction
march 2010 by rybesh
lda: Collapsed Gibbs sampling methods for topic models
november 2009 by rybesh
This package implements latent Dirichlet allocation (LDA) and related models. This includes (but is not limited to) sLDA, corrLDA, and the mixed-membership stochastic blockmodel.
clustering
textanalysis
datamining
R
topicmodels
november 2009 by rybesh
Apache Mahout
november 2009 by rybesh
Mahout's goal is to build scalable machine learning libraries.
machinelearning
opensource
hadoop
apache
recommendation
clustering
classification
datamining
november 2009 by rybesh
LingPipe
may 2009 by rybesh
LingPipe is a suite of Java libraries for the linguistic analysis of human language.
java
nlp
datamining
tools
entitydetection
may 2009 by rybesh
Data Mining with R: learning by case studies
september 2008 by rybesh
The main goal of this book is to introduce the reader to the use of R as a tool for performing data mining.
R
datamining
reference
september 2008 by rybesh
Web of Fate | Share your future
july 2008 by rybesh
Web of Fate is a social experiment that harnesses the collective intelligence of the web to visualize and uncover hidden relationships among future and historical events.
datamining
forecasting
future
collaboration
nlp
events
extraction
semweb
ontology
prediction
july 2008 by rybesh
Apache UIMA - Apache UIMA
february 2008 by rybesh
The Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies.
extraction
recognition
architecture
tools
java
datamining
search
february 2008 by rybesh
Dawid Weiss
november 2007 by rybesh
Text clustering, information retrieval, web mining, text processing, NLP.
people
academia
poland
search
datamining
nlp
machinelearning
november 2007 by rybesh
OpenTextMining
november 2007 by rybesh
Open Text Mining Interface (OTMI) is an initiative from Nature Publishing Group (NPG). It aims to enable scholarly publishers, among others, to disclose their full text for indexing and text-mining purposes but without giving it away in a form that is rea
academia
publishing
copyright
data
nlp
standards
datamining
november 2007 by rybesh
//re:digg\\ » Blog Archive » *New* Sections & Data Mining
march 2007 by rybesh
"...the notion of quantifying a community’s potential bias is nothing short of remarkable."
journalism
statistics
nlp
quantitative
methods
bias
datamining
election
march 2007 by rybesh
Media @ LSE Group Weblog » Blog Archive » Dangerously overstating the significance of Web 2.0
february 2007 by rybesh
Web 2.0 enthusiasts believe that the contents of user-content databases represent the preferences and interests of everyone instead of the somewhat self-reinforcing interest clusters of a technologically savvy elite.
web2.0
datamining
social
metadata
ideology
architecture
technology
bias
february 2007 by rybesh
Topic Modeling Toolbox
july 2006 by rybesh
Tools for entity recognition, extraction and linking.
nlp
tools
research
statistics
datamining
analysis
matlab
july 2006 by rybesh
Online Maps: The Next Generation
december 2005 by rybesh
Media systems scientists at USC rely on geospatial technology to integrate a wealth of information that is accurate and easily accessible for decision-makers in a wide range of fields.
locative
maps
infoviz
datamining
research
december 2005 by rybesh
JUNG - Java Universal Network/Graph Framework
december 2005 by rybesh
A software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network.
datamining
social
networking
tools
infoviz
java
opensource
december 2005 by rybesh
Congress votes database | washingtonpost.com
december 2005 by rybesh
This site lets you browse every vote in the U.S. Congress since 1991.
politics
database
datamining
government
usa
december 2005 by rybesh
How News is Made, by Dale Dougherty
december 2005 by rybesh
The Internet allows us to see how news is made, as though we were walking through a factory tour, and we can compare the very similar results of a mass production system.
internet
journalism
media
news
politics
web
datamining
december 2005 by rybesh
WWW2006 Workshop - Logging Traces of Web Activity: The Mechanics of Data Collection
december 2005 by rybesh
This one day workshop will examine the trade-offs and challenges inherent to the different logging approaches and provide workshop attendees the opportunity to discuss both previous data collection experiences and upcoming challenges.
web
conference
2006
workshop
statistics
datamining
december 2005 by rybesh
Enthought Python
october 2005 by rybesh
A Python distribution that comes with even more useful capabilities already installed and ready for use.
python
windows
science
math
statistics
datamining
tools
opensource
code
october 2005 by rybesh
The R Project for Statistical Computing
october 2005 by rybesh
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
code
datamining
language
math
opensource
statistics
tools
october 2005 by rybesh
Orange
october 2005 by rybesh
Orange is component-based data mining software. It includes a range of preprocessing, modelling and data exploration techniques.
machinelearning
classification
code
datamining
python
opensource
tools
nlp
statistics
october 2005 by rybesh
Data Mining in Python
october 2005 by rybesh
This is a collection of libraries useful for machine learning and data mining.
python
statistics
machinelearning
nlp
code
opensource
datamining
october 2005 by rybesh