will.brien + datamining 6
boilerpipe - Boilerplate Removal and Fulltext Extraction from HTML pages - Google Project Hosting
12 weeks ago by will.brien
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
Extraction is very fast (milliseconds), needs only the input document (no global or site-level information), and is usually quite accurate.
Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.
datamining
java
html
rss
cli
linux
google
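To make the idea of boilerplate removal concrete, here is a minimal sketch in Python. It is not boilerpipe's algorithm or API (boilerpipe is a Java library). It only illustrates one common heuristic, under the assumption that navigation and footer blocks are short or consist mostly of link text. The names BlockCollector and extract_main_text, the tag lists and the thresholds are all hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption, not boilerpipe's list).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "h1", "h2", "h3"}
# Tags whose text is never content.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the words that sit inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, word_count, link_word_count)
        self._text = []
        self._words = 0
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block and start a new one.
        if self._text:
            self.blocks.append((" ".join(self._text), self._words, self._link_words))
        self._text, self._words, self._link_words = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._text.extend(words)
        self._words += len(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = [
        text
        for text, words, link_words in parser.blocks
        if words >= min_words and link_words / words <= max_link_density
    ]
    return "\n".join(kept)
```

Like the library described above, the sketch needs only the single input document. Real pages will need tuned thresholds and a richer set of features than word count and link density.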
List of resources: Article text extraction from HTML documents
12 weeks ago by will.brien
Following up on my overview of article text extractors, I’ll try to compile a list of research papers, articles, web APIs, libraries and other software that I encountered during my research.
datamining
python
html
rss
cli
linux
How Companies Learn Your Secrets
february 2012 by will.brien
Almost every major retailer, from grocery chains to investment banks to the U.S. Postal Service, has a “predictive analytics” department devoted to understanding not just consumers’ shopping habits but also their personal habits, so as to more efficiently market to them. “But Target has always been one of the smartest at this,” says Eric Siegel, a consultant and the chairman of a conference called Predictive Analytics World. “We’re living through a golden age of behavioral research. It’s amazing how much we can figure out about how people think now.”
marketing
psychology
shopping
datamining
database
research
The Whitburn Project: 120 Years of Music Chart History data set | Infochimps
october 2011 by will.brien
For the last ten years, obsessive record collectors in Usenet have been working on the Whitburn Project — a huge undertaking to preserve and share high-quality recordings of every popular song since the 1890s. To assist their efforts, they’ve created a spreadsheet of 37,000 songs and 112 columns of raw data, including each song’s duration, beats-per-minute, songwriters, label, and week-by-week chart position. It’s 25 megs of OCD, and it’s awesome.
As far as I know, this is the first time the project and its data have ever been discussed outside of Usenet. Despite its illegality, they’ve created a wonderful resource and you can do some fun things with the data. (from Andy Baio’s waxy.org)
music
database
datamining
reference
history
lists
ScraperWiki
july 2011 by will.brien
ScraperWiki is an online tool that makes screen scraping simpler and more collaborative. Anyone can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it's a wiki, other programmers can contribute to and improve the code.
datamining
reference
python
api
google
diy
Pattern | CLiPS
march 2011 by will.brien
Pattern is a web mining module for the Python programming language.
It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks).
The module is bundled with 30+ example scripts.
python
linux
cli
database
archives
datamining
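The tf-idf and cosine similarity metrics named in the Pattern description can be sketched in a few lines of plain Python. This is an illustration of the two metrics only. It does not use Pattern and does not mirror its API. The function names tf_idf_vectors and cosine_similarity are hypothetical, and whitespace tokenization with an unsmoothed log idf is an assumption made to keep the example short.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # term frequency times inverse document frequency
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With this weighting, a term that occurs in every document gets weight zero, so two documents only score as similar when they share terms that are rare in the rest of the collection.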