martinkenny + text-extraction 2
boilerpipe - Project Hosting on Google Code
december 2010 by martinkenny
"The boilerpipe library provides algorithms to detect and remove the surplus 'clutter' (boilerplate, templates) around the main textual content of a web page."
html
boilerplate
text-extraction
Doc⚡split
august 2010 by martinkenny
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
pdf
text
documents
text-extraction