anderser's pydocsplit at master - GitHub
january 2010 by bycoffe
Python implementation of DocumentCloud's Docsplit utility
pdf
data
documentcloud
docsplit
Doc⚡split
december 2009 by bycoffe
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
pdf
data
documents
Python Package Index : pdfminer 20090330
march 2009 by bycoffe
"PDFMiner is a suite of programs that aims to help extracting or analyzing text data from PDF documents. Unlike other PDF-related tools, it allows to obtain the exact location of texts in a page, as well as other layout information such as font size or font name, which could be useful for analyzing the document. It can be also used as a basis for a full-fledged PDF interpreter."
python
pdf
data