martinkenny + pdf   8

ogrisel's paper2ebook at master - GitHub
"Utility to re-structure research papers published in US Letter or A4 format PDF files to typically remove the 2 columns layout."
pdf  converter  papers  ebooks  kindle 
november 2010 by martinkenny
Doc⚡split
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
pdf  text  documents  text-extraction 
august 2010 by martinkenny
http://prawn.majesticseacreature.com/
"Building printable documents doesn't have to be hard

If you've ever needed to produce PDF documents before, in Ruby or another language, you probably know how much it can suck. Prawn takes the pain out of generating beautiful printable documents, while still remaining fast, tiny and nimble. It is also named after a majestic sea creature, and that has to count for something."
pdf  ruby  library  programming 
january 2009 by martinkenny
Inside PDF: Text Content in PDF Files
Extracting text from PDF documents is a rather difficult and highly technical task, and I hope to explain here why that is the case.
pdf  adobe  informationretrieval  indexing  text 
october 2008 by martinkenny
PDFMiner
PDFMiner is a suite of programs that aims to help analyze text data from PDF documents. It includes a PDF parser, a PDF renderer (though only text rendering is supported for now), and a couple of nice tools for extracting text. Unlike other PDF-related tools, it lets you obtain the exact location of text on a page, as well as other layout information such as font size or font name, which can be useful for analyzing the document. It also infers how text runs within a page by using a clustering technique.
pdf  python  library  open-source  parser 
august 2008 by martinkenny
