67 bookmarks. First posted by dwillis december 2009.
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby
programming
july 2011 by dlo
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
Docsplit is currently at version 0.5.0.
Docsplit is an open-source component of DocumentCloud.
Usage
The Docsplit gem includes both the docsplit command-line utility and a Ruby API. The available commands and options are identical in both.
--output or -o can be passed to any command in order to store the generated files in a directory of your choosing.
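Since the command-line options mirror the Ruby options, one way to picture the mapping is a small helper that turns Ruby-style options into a docsplit command line. This is only a sketch: docsplit_argv is a hypothetical name, not part of the gem, and it emits only flags named in this text (--output, --pages, --size, --format).

```ruby
# Hypothetical helper (not part of Docsplit): builds an argv array for the
# docsplit command-line utility from Ruby-style options.
def docsplit_argv(command, files, output: nil, pages: nil, size: nil, format: nil)
  argv = ['docsplit', command.to_s, *Array(files)]
  argv.push('--output', output.to_s) if output
  argv.push('--pages', pages.to_s) if pages
  # Several sizes are joined with commas, as in "--size 700x,50x50".
  argv.push('--size', Array(size).join(',')) if size
  argv.push('--format', format.to_s) if format
  argv
end

argv = docsplit_argv(:images, 'example.pdf',
                     output: 'storage/images', size: ['700x', '50x50'], format: :gif)
puts argv.join(' ')
```

Passing the array to system(*argv) would run the command, assuming docsplit is installed and on the PATH.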
images --size --format --pages Ruby: extract_images
Generates an image for each page in the document at the specified resolution and format. Pass --pages or -p to choose the specific pages to image. Passing --size or -s will specify the desired image resolution, and --format or -f will select the format of the final images.
docsplit images example.pdf
docsplit images docs/*.pdf --size 700x,50x50 --format gif --pages 3,10-15,42
Docsplit.extract_images('example.doc', :size => '1000x', :format => [:png, :jpg])
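The --pages value in the example above mixes single pages and ranges. As a rough illustration of how such a specification reads, here is a minimal sketch that expands it into page numbers. expand_pages is a hypothetical helper written for this example, not a Docsplit function, and it assumes the spec contains only comma-separated integers and dash ranges.

```ruby
# Hypothetical helper (not part of Docsplit): expands a --pages style
# specification such as "3,10-15,42" into a sorted list of page numbers.
def expand_pages(spec)
  pages = spec.split(',').flat_map do |part|
    part = part.strip
    if part.include?('-')
      # A range such as "10-15" covers both endpoints.
      first, last = part.split('-', 2).map { |n| Integer(n) }
      (first..last).to_a
    else
      [Integer(part)]
    end
  end
  pages.uniq.sort
end

p expand_pages('3,10-15,42')
```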
text --pages --ocr --no-ocr --no-clean Ruby: extract_text
Extract the complete UTF-8-encoded plain text of a document to a single file. If you'd like to extract the text for each page separately, pass --pages all. You can use the --ocr and --no-ocr flags to force OCR, or disable it, respectively. By default (if Tesseract is installed) Docsplit will OCR the text of each page for which it fails to extract text directly from the document. Docsplit will also attempt to clean up garbage characters in the OCR'd text — to disable this, pass the --no-clean flag.
docsplit text path/to/doc.pdf --pages all
docs = Dir['storage/originals/*.doc']
Docsplit.extract_text(docs, :ocr => false, :output => 'storage/text')
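The OCR rules described above (forced on, forced off, or fall back to OCR when direct extraction finds nothing) can be summarised as a small decision function. This is an illustration of the described behaviour only, not Docsplit's actual implementation; ocr_page? and its tesseract_installed parameter are assumptions made for the sketch.

```ruby
# Illustrative only (not Docsplit's source). Decides whether a page should
# be sent through OCR:
#   ocr: true  -> always OCR (as with --ocr)
#   ocr: false -> never OCR (as with --no-ocr)
#   ocr: nil   -> default: OCR only if Tesseract is available and direct
#                 extraction produced no text for the page
def ocr_page?(extracted_text, ocr: nil, tesseract_installed: true)
  return ocr unless ocr.nil?
  tesseract_installed && extracted_text.to_s.strip.empty?
end

puts ocr_page?("Some extracted text")   # default mode, text was found
puts ocr_page?("   \n")                 # default mode, nothing extracted
puts ocr_page?("", ocr: false)          # OCR explicitly disabled
```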
pages --pages Ruby: extract_pages
Burst apart a document into single-page PDFs. Use --pages to specify the individual pages (or ranges of pages) you'd like to generate.
docsplit pages path/to/doc.pdf --pages 1-10
Docsplit.extract_pages('path/to/presentation.ppt')
Docsplit.extract_pages('doc.pdf', :pages => 1..10)
pdf Ruby: extract_pdf
Convert documents into PDFs. Any type of document that OpenOffice can read may be converted. These include the Microsoft Office formats: doc, docx, ppt, xls and so on, as well as html, odf, rtf, swf, svg, and wpd. The first time that you convert a new file type, OpenOffice will lazy-load the code that processes it — subsequent conversions will be much faster.
docsplit pdf documentation/*.html
Docsplit.extract_pdf('expense_report.xls')
author, date, creator, keywords, producer, subject, title, length Ruby: extract_...
Retrieve a piece of metadata about the document. The docsplit utility prints the value to stdout; the Ruby API returns it.
docsplit title path/to/stooges.pdf
=> Disorder in the Court
Docsplit.extract_length('path/to/stooges.pdf')
=> 36
document
ocr
pdf
ruby
parsing
processing
tools
cli
march 2011 by michaelfox
amazing looking document processing project
document
processing
library
split
ocr
february 2011 by plhw
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
ruby
pdf
document
parsing
ocr
documents
data
processing
split
from delicious
december 2010 by jonty
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
from delicious
december 2010 by hubpin
command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby
tools
textproc
december 2010 by olleolleolle
Looks great for a little project involving web comics that I've always wanted to do. Will have a look see :)
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby
library
ocr
text
images
documents
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
october 2010 by boywhoroared
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
pdf
ocr
ruby
october 2010 by aheaume
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby
ocr
library
document
image
september 2010 by berberich
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
pdf
text
documents
text-extraction
august 2010 by martinkenny
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
pdf
ruby
tools
metadata
august 2010 by alpyne
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby
ocr
from delicious
august 2010 by pjaspers
august 2010 by joem
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component part
text
tools
CLI
ruby
august 2010 by seflaherty
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts
pdf
ruby
august 2010 by tomd
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata"
pdf
ruby
data
gem
library
text
tool
parsing
imagery
august 2010 by garrettc
Outstanding! RT @documentcloud: Just released the 0.3 version of Docsplit. Now with transparent OCR:
from twitter
august 2010 by brianboyer
Just released the 0.3 version of Docsplit, our pull-the-images-and-text-out-of-docs utility. Now with transparent OCR:
from twitter_favs
august 2010 by gkamp
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
ruby
pdf
tools
april 2010 by eby
A command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
pdf
ruby
january 2010 by awstewart
Doc-Split: a command-line utility and Ruby library for splitting apart documents into their component parts http://bit.ly/72Yp0I
twitter_fav
@dcarli
january 2010 by amy
Interesting ruby lib that breaks up docs into text, images and such.
ruby
railstips
december 2009 by jnunemaker
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
pdf
ruby
images
gems
december 2009 by harrylove
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
ruby
split
document
parse
search
utility
library
pdf
thumbnail
metadata
text
december 2009 by sstrudeau
"Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)"
pdf
data
documents
december 2009 by bycoffe
tags
@dcarli apra binary cli cloud clu code conversion cool data-mining data doc docsplit document-scanning document documentcloud documents docx extraction formats gem gems image imagery images indexing languages library metadata nlp ocr parse parser parsing pdf pdfs processing programming project rails railstips reader ruby search split text-extraction text textproc thumbnail tool tools twitter_fav utility via:ithkuil webdev