will.brien + datamining   6

boilerpipe - Boilerplate Removal and Fulltext Extraction from HTML pages - Google Project Hosting
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

Extraction is very fast (milliseconds), needs only the input document (no global or site-level information required), and is usually quite accurate.

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.
datamining  java  html  rss  cli  linux  google 
12 weeks ago by will.brien
List of resources: Article text extraction from HTML documents
Following up on my overview of article text extractors, I’ll try to compile a list of research papers, articles, web APIs, libraries and other software that I encountered during my research.
datamining  python  html  rss  cli  linux 
12 weeks ago by will.brien
How Companies Learn Your Secrets
Almost every major retailer, from grocery chains to investment banks to the U.S. Postal Service, has a “predictive analytics” department devoted to understanding not just consumers’ shopping habits but also their personal habits, so as to more efficiently market to them. “But Target has always been one of the smartest at this,” says Eric Siegel, a consultant and the chairman of a conference called Predictive Analytics World. “We’re living through a golden age of behavioral research. It’s amazing how much we can figure out about how people think now.”
marketing  psychology  shopping  datamining  database  research 
february 2012 by will.brien
The Whitburn Project: 120 Years of Music Chart History data set | Infochimps
For the last ten years, obsessive record collectors in Usenet have been working on the Whitburn Project — a huge undertaking to preserve and share high-quality recordings of every popular song since the 1890s. To assist their efforts, they’ve created a spreadsheet of 37,000 songs and 112 columns of raw data, including each song’s duration, beats-per-minute, songwriters, label, and week-by-week chart position. It’s 25 megs of OCD, and it’s awesome.

As far as I know, this is the first time the project and its data have ever been discussed outside of Usenet. Despite its illegality, they’ve created a wonderful resource and you can do some fun things with the data. (from Andy Baio’s waxy.org)
music  database  datamining  reference  history  lists 
october 2011 by will.brien
ScraperWiki
ScraperWiki is an online tool that makes screen scraping simpler and more collaborative. Anyone can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it's a wiki, other programmers can contribute to and improve the code.
datamining  reference  python  api  google  diy 
july 2011 by will.brien
Pattern | CLiPS
Pattern is a web mining module for the Python programming language.

It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks).

The module is bundled with 30+ example scripts.
python  linux  cli  database  archives  datamining 
march 2011 by will.brien
