michaelfox + pdf 18
150 dpi Automator Service - All this
12 days ago by michaelfox
from And now it’s all this http://www.leancrew.com/all-this Author: Dr. Drang Date: May 18, 2012 at 12:48PM
ifttt
googlereader
automator
osx
automation
pdf
compression
optimization
dpi
colorsync
Pandoc - About pandoc
april 2011 by michaelfox
If you need to convert files from one markup format into another, pandoc is your Swiss Army knife. Need to generate a man page from a markdown file? No problem. LaTeX to Docbook? Sure. HTML to MediaWiki? Yes, that too. Pandoc can read markdown and (subsets of) reStructuredText, textile, HTML, and LaTeX, and it can write plain text, markdown, reStructuredText, HTML, LaTeX, ConTeXt, PDF, RTF, DocBook XML, OpenDocument XML, ODT, GNU Texinfo, MediaWiki markup, textile, groff man pages, Emacs org-mode, EPUB ebooks, and S5 and Slidy HTML slide shows. PDF output (via LaTeX) is also supported with the included markdown2pdf wrapper script.
Pandoc understands a number of useful markdown syntax extensions, including document metadata (title, author, date); footnotes; tables; definition lists; superscript and subscript; strikeout; enhanced ordered lists (start number and numbering style are significant); delimited code blocks; markdown inside HTML blocks; and TeX math. Other options include “smart” punctuation, syntax highlighting, automatically generated tables of contents, and automatically generated citations (using citeproc-hs). If strict markdown compatibility is desired, all of these extensions can be turned off with a command-line flag.
Pandoc includes a Haskell library and a standalone executable. The library includes separate modules for each input and output format, so adding a new input or output format just requires adding a new module.
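As a rough illustration of the markdown-to-man-page conversion mentioned above, the sketch below assembles a pandoc command line from Ruby. The helper name pandoc_command and the file names are hypothetical; -f, -t, -s and -o are the standard pandoc options for input format, output format, standalone output and output file. The helper only builds the argument list and does not run pandoc.

```ruby
# Hypothetical helper (not part of pandoc): builds an argument list for the
# pandoc executable without running it.
def pandoc_command(input, output, from:, to:, standalone: false)
  cmd = ['pandoc', '-f', from, '-t', to]
  cmd << '-s' if standalone # -s asks pandoc for a complete, standalone document
  cmd + ['-o', output, input]
end

# Markdown to a groff man page; README.md and README.1 are example file names.
cmd = pandoc_command('README.md', 'README.1', from: 'markdown', to: 'man', standalone: true)
puts cmd.join(' ')

# To actually run the conversion (requires pandoc on the PATH):
#   system(*cmd)
```

Passing the arguments to system as a list, rather than as one string, avoids shell quoting problems with file names that contain spaces.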
html
latex
markdown
markup
pdf
textile
text
convert
docbook
Doc⚡split
march 2011 by michaelfox
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
Docsplit is currently at version 0.5.0.
Docsplit is an open-source component of DocumentCloud.
Usage
The Docsplit gem includes both the docsplit command-line utility as well as a Ruby API. The available commands and options are identical in both.
--output or -o can be passed to any command in order to store the generated files in a directory of your choosing.
images --size --format --pages Ruby: extract_images
Generates an image for each page in the document at the specified resolution and format. Pass --pages or -p to choose the specific pages to image. Passing --size or -s will specify the desired image resolution, and --format or -f will select the format of the final images.
docsplit images example.pdf
docsplit images docs/*.pdf --size 700x,50x50 --format gif --pages 3,10-15,42
Docsplit.extract_images('example.doc', :size => '1000x', :format => [:png, :jpg])
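The --pages value in the example above mixes single pages and ranges (3,10-15,42). The sketch below shows one way such a spec could be expanded into individual page numbers. The helper expand_pages is hypothetical and is not Docsplit's own parser; it only illustrates how the syntax reads.

```ruby
# Hypothetical helper, not part of Docsplit: expands a --pages style spec
# such as "3,10-15,42" into a sorted list of unique page numbers.
def expand_pages(spec)
  spec.split(',').flat_map do |part|
    # "10-15" becomes [10, 15]; a single page such as "3" becomes [3].
    first, last = part.strip.split('-', 2).map { |n| Integer(n, 10) }
    last ||= first
    raise ArgumentError, "descending range: #{part}" if last < first
    (first..last).to_a
  end.uniq.sort
end

p expand_pages('3,10-15,42')
# prints [3, 10, 11, 12, 13, 14, 15, 42]
```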
text --pages --ocr --no-ocr --no-clean Ruby: extract_text
Extract the complete UTF-8-encoded plain text of a document to a single file. If you'd like to extract the text for each page separately, pass --pages all. You can use the --ocr and --no-ocr flags to force OCR, or disable it, respectively. By default (if Tesseract is installed) Docsplit will OCR the text of each page for which it fails to extract text directly from the document. Docsplit will also attempt to clean up garbage characters in the OCR'd text — to disable this, pass the --no-clean flag.
docsplit text path/to/doc.pdf --pages all
docs = Dir['storage/originals/*.doc']
Docsplit.extract_text(docs, :ocr => false, :output => 'storage/text')
pages --pages Ruby: extract_pages
Burst apart a document into single-page PDFs. Use --pages to specify the individual pages (or ranges of pages) you'd like to generate.
docsplit pages path/to/doc.pdf --pages 1-10
Docsplit.extract_pages('path/to/presentation.ppt')
Docsplit.extract_pages('doc.pdf', :pages => 1..10)
pdf Ruby: extract_pdf
Convert documents into PDFs. Any type of document that OpenOffice can read may be converted. These include the Microsoft Office formats: doc, docx, ppt, xls and so on, as well as html, odf, rtf, swf, svg, and wpd. The first time that you convert a new file type, OpenOffice will lazy-load the code that processes it — subsequent conversions will be much faster.
docsplit pdf documentation/*.html
Docsplit.extract_pdf('expense_report.xls')
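When converting a mixed directory, it can help to pick out the files whose extensions are named in the description above before handing them to extract_pdf. The sketch below does that with plain Ruby. The helper names are hypothetical, and the extension list covers only the formats spelled out above; OpenOffice may read others.

```ruby
# Extensions named in the description above; the list is not exhaustive.
def convertible_extensions
  %w[doc docx ppt xls html odf rtf swf svg wpd]
end

# Hypothetical helper: true when the path's extension is in the list above.
def convertible?(path)
  convertible_extensions.include?(File.extname(path).delete_prefix('.').downcase)
end

# Example paths only; none of these files need to exist.
paths = ['expense_report.xls', 'notes.txt', 'documentation/index.HTML']
p paths.select { |path| convertible?(path) }
# prints ["expense_report.xls", "documentation/index.HTML"]
```

The selected paths could then be passed to Docsplit.extract_pdf as in the example above.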
author, date, creator, keywords, producer, subject, title, length Ruby: extract_...
Retrieve a piece of metadata about the document. The docsplit utility prints the value to stdout; the Ruby API returns it.
docsplit title path/to/stooges.pdf
=> Disorder in the Court
Docsplit.extract_length('path/to/stooges.pdf')
=> 36
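To gather all of the metadata fields in one pass, the extract_ methods can be called in a loop. The sketch below assumes each key listed above maps to a method named extract_ plus the key; only extract_length appears in the examples here, so the other names are inferred from that pattern. A stand-in object is used so the code runs without the gem installed; with Docsplit loaded, the Docsplit module would be passed in its place.

```ruby
# Keys listed in the section above.
def metadata_keys
  %w[author date creator keywords producer subject title length]
end

# Hypothetical helper: calls extract_<key> on `extractor` for every key and
# returns the results as a hash. Method names other than extract_length are
# assumed from the "extract_..." pattern, not confirmed by the text above.
def collect_metadata(extractor, path)
  metadata_keys.each_with_object({}) do |key, result|
    result[key.to_sym] = extractor.public_send("extract_#{key}", path)
  end
end

# Stand-in extractor that returns placeholder strings instead of real metadata.
fake = Object.new
metadata_keys.each do |key|
  fake.define_singleton_method("extract_#{key}") { |path| "#{key} of #{path}" }
end

p collect_metadata(fake, 'path/to/stooges.pdf')[:title]
# prints "title of path/to/stooges.pdf"
```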
document
ocr
pdf
ruby
parsing
processing
tools
cli
Ruby Best Practices - Full Book Now Available For Free!
may 2010 by michaelfox
# Chapter 1: Driving Code Through Tests
# Chapter 2: Designing Beautiful APIs
# Chapter 3: Mastering the Dynamic Toolkit
# Chapter 4: Text Processing and File Management
# Chapter 5: Functional Programming Techniques
# Chapter 6: When Things Go Wrong
# Chapter 7: Reducing Cultural Barriers
book
pdf
programming
ruby
download
ebooks