martinkenny + pdf 8
ogrisel's paper2ebook at master - GitHub
november 2010 by martinkenny
"Utility to re-structure research papers published in US Letter or A4 format PDF files to typically remove the 2 columns layout."
pdf
converter
papers
ebooks
kindle
Doc⚡split
august 2010 by martinkenny
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
pdf
text
documents
text-extraction
http://prawn.majesticseacreature.com/
january 2009 by martinkenny
"Building printable documents doesn't have to be hard
If you've ever needed to produce PDF documents before, in Ruby or another language, you probably know how much it can suck. Prawn takes the pain out of generating beautiful printable documents, while still remaining fast, tiny and nimble. It is also named after a majestic sea creature, and that has to count for something."
pdf
ruby
library
programming
Inside PDF: Text Content in PDF Files
october 2008 by martinkenny
Extracting text from PDF documents is a rather difficult and highly technical task, and I hope to explain here why that is the case.
pdf
adobe
informationretrieval
indexing
text
PDFMiner
august 2008 by martinkenny
PDFMiner is a suite of programs that aims to help analyze text data from PDF documents. It includes a PDF parser, a PDF renderer (though only text rendering is supported for now), and a couple of nice tools for extracting text. Unlike other PDF-related tools, it lets you obtain the exact location of text on a page, as well as other layout information such as font size or font name, which can be useful for analyzing the document. It also infers the flow of text within a page using a clustering technique.
pdf
python
library
open-source
parser
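A rough illustration of the idea in the PDFMiner note above, that knowing where each piece of text sits on the page lets a tool infer reading order by clustering. This is a minimal sketch and not PDFMiner's actual algorithm or API: the `Fragment` type, the `group_into_lines` function, and the `y_tolerance` parameter are all hypothetical names made up for this example, and the coordinates are invented sample data.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with its position on the page (hypothetical type)."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points (PDF origin is at the bottom-left)


def group_into_lines(fragments, y_tolerance=2.0):
    """Greedily cluster fragments into lines by baseline proximity.

    Fragments are visited from the top of the page downwards. A fragment
    joins the current line if its baseline is within y_tolerance of the
    line's first fragment; otherwise it starts a new line.
    """
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, read left to right.
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Invented sample data: fragments arrive out of reading order.
fragments = [
    Fragment("world", 60.0, 700.2),
    Fragment("Hello", 20.0, 700.0),
    Fragment("line", 62.0, 686.0),
    Fragment("Second", 20.0, 685.5),
]

print(group_into_lines(fragments))
```

Real layout analysis also has to cope with multiple columns, rotated text and varying font sizes, which a single fixed tolerance does not handle.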