Doc⚡split


67 bookmarks. First posted by dwillis december 2009.


Docsplit.extract_pdf
pdf  ruby  doc  docx  gem 
october 2011 by joren
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby  programming 
july 2011 by dlo
Converts documents to their text parts.
ruby  pdf 
april 2011 by buckett
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)

Docsplit is currently at version 0.5.0.

Docsplit is an open-source component of DocumentCloud.

Usage

The Docsplit gem includes both the docsplit command-line utility and a Ruby API. The available commands and options are identical in both.
--output or -o can be passed to any command in order to store the generated files in a directory of your choosing.

images  --size --format --pages  Ruby: extract_images
Generates an image for each page in the document at the specified resolution and format. Pass --pages or -p to choose the specific pages to image, --size or -s to set the image resolution, and --format or -f to select the format of the final images.

docsplit images example.pdf
docsplit images docs/*.pdf --size 700x,50x50 --format gif --pages 3,10-15,42
Docsplit.extract_images('example.doc', :size => '1000x', :format => [:png, :jpg])
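
Since the command-line options and the Ruby options are described as identical, one way to picture the correspondence is a small helper that turns a Ruby-style options hash into docsplit command-line arguments. This is only a sketch: docsplit_argv is a hypothetical helper, not part of Docsplit, and joining array values with commas is an assumption modelled on the --size 700x,50x50 example above.

```ruby
# Hypothetical helper (not part of Docsplit): build the argument list for the
# docsplit command-line utility from a Ruby-style options hash.
def docsplit_argv(command, files, options = {})
  argv = ['docsplit', command.to_s, *Array(files)]
  options.each do |name, value|
    flag_name = name.to_s.tr('_', '-')
    case value
    when true  then argv << "--#{flag_name}"        # e.g. :ocr => true  becomes --ocr
    when false then argv << "--no-#{flag_name}"     # e.g. :ocr => false becomes --no-ocr
    else            argv << "--#{flag_name}" << Array(value).join(',')
    end
  end
  argv
end

argv = docsplit_argv(:images, 'example.doc', :size => '1000x', :format => [:png, :jpg])
puts argv.join(' ')
```

The helper only builds the argument list; running it (for example with Kernel#system) is left to the caller.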
text  --pages --ocr --no-ocr --no-clean  Ruby: extract_text
Extracts the complete UTF-8-encoded plain text of a document to a single file. To extract the text of each page separately, pass --pages all. Use the --ocr and --no-ocr flags to force OCR or to disable it, respectively. By default (if Tesseract is installed), Docsplit will OCR each page for which it fails to extract text directly from the document. Docsplit will also attempt to clean up garbage characters in the OCR'd text; to disable this, pass the --no-clean flag.

docsplit text path/to/doc.pdf --pages all
docs = Dir['storage/originals/*.doc']
Docsplit.extract_text(docs, :ocr => false, :output => 'storage/text')
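
The OCR behaviour described above amounts to a per-page decision with three modes: forced, disabled, and the default fallback. The sketch below illustrates that decision only; ocr_page? is a hypothetical helper, not Docsplit's actual implementation, and treating "fails to extract text" as "the extracted text is blank" is an assumption.

```ruby
# Hypothetical sketch of the per-page OCR decision; not Docsplit's own code.
#
#   ocr: true  -> always OCR (like --ocr)
#   ocr: false -> never OCR (like --no-ocr)
#   ocr: nil   -> default: OCR only when direct extraction yielded no text
#                 and Tesseract is available
def ocr_page?(extracted_text, ocr: nil, tesseract_installed: true)
  return ocr unless ocr.nil?
  tesseract_installed && extracted_text.to_s.strip.empty?
end

# Sample per-page extraction results (made up for illustration).
pages = { 1 => "Disorder in the Court\n", 2 => "   \n", 3 => nil }
pages.each do |number, text|
  puts "page #{number}: #{ocr_page?(text) ? 'OCR' : 'use extracted text'}"
end
```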
pages  --pages  Ruby: extract_pages
Burst apart a document into single-page PDFs. Use --pages to specify the individual pages (or ranges of pages) you'd like to generate.

docsplit pages path/to/doc.pdf --pages 1-10
Docsplit.extract_pages('path/to/presentation.ppt')
Docsplit.extract_pages('doc.pdf', :pages => 1..10)
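
The examples above give pages either as a string such as 1-10 or 3,10-15,42, or as a Ruby range such as 1..10. As a rough illustration of how such specifications relate to concrete page numbers, here is a sketch of a normalizer. expand_pages is a hypothetical helper and does not reflect how Docsplit parses the option internally.

```ruby
# Hypothetical helper (not part of Docsplit): expand a page specification
# into a sorted list of unique page numbers.
def expand_pages(spec)
  case spec
  when Integer then [spec]
  when Range   then spec.to_a
  when Array   then spec.flat_map { |part| expand_pages(part) }.uniq.sort
  when String
    spec.split(',').flat_map do |part|
      # "10-15" becomes 10..15; a bare "3" becomes 3..3.
      first, last = part.strip.split('-', 2).map { |n| Integer(n) }
      (first..(last || first)).to_a
    end.uniq.sort
  else
    raise ArgumentError, "unsupported page specification: #{spec.inspect}"
  end
end

p expand_pages('3,10-15,42')
p expand_pages(1..10)
```

The special value all is deliberately not handled here, since expanding it would require knowing the document's page count; passing it raises an ArgumentError.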
pdf Ruby: extract_pdf
Converts documents into PDFs. Any type of document that OpenOffice can read may be converted. These include the Microsoft Office formats (doc, docx, ppt, xls and so on), as well as html, odf, rtf, swf, svg, and wpd. The first time you convert a new file type, OpenOffice lazy-loads the code that processes it, so subsequent conversions will be much faster.

docsplit pdf documentation/*.html
Docsplit.extract_pdf('expense_report.xls')
author, date, creator, keywords, producer, subject, title, length
Ruby: extract_...
Retrieves a piece of metadata about the document. The docsplit utility prints the value to stdout; the Ruby API returns it.

docsplit title path/to/stooges.pdf
=> Disorder in the Court
Docsplit.extract_length('path/to/stooges.pdf')
=> 36
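
Because each metadata field has its own extract_... method, gathering several fields at once can be sketched as a loop that builds the method name from the key. collect_metadata below is a hypothetical helper, not part of Docsplit, and the stand-in extractor exists only so the sketch runs without the gem installed; its return values mirror the example output above.

```ruby
# Hypothetical helper (not part of Docsplit): call extract_<key> on the
# extractor for each requested key and collect the results in a hash.
def collect_metadata(extractor, path, keys)
  keys.each_with_object({}) do |key, result|
    result[key] = extractor.public_send("extract_#{key}", path)
  end
end

# Stand-in extractor so the example runs without the docsplit gem.
fake = Object.new
def fake.extract_title(_path)
  'Disorder in the Court'
end

def fake.extract_length(_path)
  36
end

metadata = collect_metadata(fake, 'path/to/stooges.pdf', [:title, :length])
puts "#{metadata[:title]} (#{metadata[:length]} pages)"
```

With the gem installed, passing the Docsplit module as the extractor should dispatch to Docsplit.extract_title, Docsplit.extract_length and so on, assuming those methods accept a single path as in the example above.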
document  ocr  pdf  ruby  parsing  processing  tools  cli 
march 2011 by michaelfox
amazing looking document processing project
document  processing  library  split  ocr 
february 2011 by plhw
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
ruby  pdf  document  parsing  ocr  documents  data  processing  split  from delicious
december 2010 by jonty
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
from delicious
december 2010 by hubpin
command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby  tools  textproc 
december 2010 by olleolleolle
Looks great for a little project involving web comics that I've always wanted to do. Will have a look see :)

Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby  library  ocr  text  images  documents 
october 2010 by boywhoroared
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
pdf  ocr  ruby 
october 2010 by aheaume
feature extraction
pdf  ruby  extraction 
september 2010 by ithkuil
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby  ocr  library  document  image 
september 2010 by berberich
Doc⚡split - handy Ruby tool for deconstructing docs
from twitter
august 2010 by jleitess
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
pdf  text  documents  text-extraction 
august 2010 by martinkenny
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
pdf  ruby  tools  metadata 
august 2010 by alpyne
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby  ocr  from delicious
august 2010 by pjaspers
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts
text  tools  CLI  ruby 
august 2010 by seflaherty
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts
pdf  ruby 
august 2010 by tomd
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata"
pdf  ruby  data  gem  library  text  tool  parsing  imagery 
august 2010 by garrettc
Outstanding! RT : Just released the 0.3 version of Docsplit. Now with transparent OCR:
from twitter
august 2010 by brianboyer
Just released the 0.3 version of Docsplit, our pull-the-images-and-text-out-of-docs utility. Now with transparent OCR:
from twitter_favs
august 2010 by gkamp
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby  pdf  tools 
april 2010 by eby
A command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
pdf  ruby 
january 2010 by awstewart
Doc-Split: a command-line utility and Ruby library for splitting apart documents into their component parts http://bit.ly/72Yp0I
twitter_fav  @dcarli 
january 2010 by amy
Interesting ruby lib that breaks up docs into text, images and such.
ruby  railstips 
december 2009 by jnunemaker
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
pdf  ruby  images  gems 
december 2009 by harrylove
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
ruby  split  document  parse  search  utility  library  pdf  thumbnail  metadata  text 
december 2009 by sstrudeau
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
pdf  data  documents 
december 2009 by bycoffe