martinkenny + text 2
Doc⚡split
august 2010 by martinkenny
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
pdf
text
documents
text-extraction
Inside PDF: Text Content in PDF Files
october 2008 by martinkenny
Extracting text from PDF documents is a difficult and highly technical task, and I hope to explain here why that is the case.
pdf
adobe
informationretrieval
indexing
text