martinkenny + text   2

Doc⚡split
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
pdf  text  documents  text-extraction 
august 2010 by martinkenny
Inside PDF: Text Content in PDF Files
Extracting text from PDF documents is a rather difficult and highly technical task, and I hope to explain here why that is the case.
pdf  adobe  informationretrieval  indexing  text 
october 2008 by martinkenny
