michaelfox + document 13
DocMgr | Download DocMgr software for free at SourceForge.net
may 2011 by michaelfox
PHP/Postgresql based document management system (DMS) with pdf and ocr-based indexing, and optional tsearch2 support. It also has access control lists, user permissions assignment, file discussion board, and multi-level file grouping.
document
organization
documentmanagement
Doc⚡split
march 2011 by michaelfox
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
Docsplit is currently at version 0.5.0.
Docsplit is an open-source component of DocumentCloud.
Usage
The Docsplit gem includes both the docsplit command-line utility and a Ruby API. The available commands and options are identical in both.
--output or -o can be passed to any command in order to store the generated files in a directory of your choosing.
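Because the same options apply to every command, a script that shells out to the utility can build its argument list in one place. The following is a minimal sketch, not part of Docsplit: `docsplit_argv` is a hypothetical helper, and it only uses the flags named in this description (`--size`, `--format`, `--pages`, `--output` and so on). It builds the argument array without running anything.

```ruby
# Hypothetical helper (not part of the Docsplit gem): assembles an argument
# list for the docsplit command-line utility. Nothing is executed here.
def docsplit_argv(command, files, options = {})
  argv = ['docsplit', command.to_s, *files]
  options.each do |name, value|
    next if value.nil? || value == false
    # :no_clean becomes --no-clean, :size becomes --size, and so on.
    argv << "--#{name.to_s.tr('_', '-')}"
    # Value-less flags are passed as `true`; lists are joined with commas.
    argv << Array(value).join(',') unless value == true
  end
  argv
end

argv = docsplit_argv(:images, ['example.pdf'],
                     size: '700x', format: 'gif', output: 'storage/images')
puts argv.join(' ')
# system(*argv) would run the command, assuming docsplit is installed and on the PATH.
```

Passing the array to `system` element by element avoids shell quoting problems with file names that contain spaces.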
images --size --format --pages Ruby: extract_images
Generates an image for each page in the document at the specified resolution and format. Pass --pages or -p to choose the specific pages to image. Passing --size or -s will specify the desired image resolution, and --format or -f will select the format of the final images.
docsplit images example.pdf
docsplit images docs/*.pdf --size 700x,50x50 --format gif --pages 3,10-15,42
Docsplit.extract_images('example.doc', :size => '1000x', :format => [:png, :jpg])
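The `--pages 3,10-15,42` value above mixes single pages and ranges. To make that notation concrete, here is a small sketch of how such a spec could be expanded into individual page numbers. `expand_pages` is a hypothetical helper written for illustration; it is not Docsplit's own parser and may differ from it in edge cases.

```ruby
# Hypothetical helper (not Docsplit's implementation): expands a page spec
# such as "3,10-15,42" into a sorted list of unique page numbers.
def expand_pages(spec)
  pages = spec.split(',').flat_map do |part|
    first, last = part.strip.split('-', 2).map { |n| Integer(n, 10) }
    last ||= first # a single page is a range of length one
    raise ArgumentError, "descending range: #{part}" if last < first
    (first..last).to_a
  end
  pages.uniq.sort
end

p expand_pages('3,10-15,42')
# prints [3, 10, 11, 12, 13, 14, 15, 42]
```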
text --pages --ocr --no-ocr --no-clean Ruby: extract_text
Extract the complete UTF-8-encoded plain text of a document to a single file. If you'd like to extract the text for each page separately, pass --pages all. You can use the --ocr and --no-ocr flags to force OCR, or disable it, respectively. By default (if Tesseract is installed) Docsplit will OCR the text of each page for which it fails to extract text directly from the document. Docsplit will also attempt to clean up garbage characters in the OCR'd text — to disable this, pass the --no-clean flag.
docsplit text path/to/doc.pdf --pages all
docs = Dir['storage/originals/*.doc']
Docsplit.extract_text(docs, :ocr => false, :output => 'storage/text')
pages --pages Ruby: extract_pages
Burst apart a document into single-page PDFs. Use --pages to specify the individual pages (or ranges of pages) you'd like to generate.
docsplit pages path/to/doc.pdf --pages 1-10
Docsplit.extract_pages('path/to/presentation.ppt')
Docsplit.extract_pages('doc.pdf', :pages => 1..10)
pdf Ruby: extract_pdf
Convert documents into PDFs. Any type of document that OpenOffice can read may be converted. These include the Microsoft Office formats: doc, docx, ppt, xls and so on, as well as html, odf, rtf, swf, svg, and wpd. The first time that you convert a new file type, OpenOffice will lazy-load the code that processes it — subsequent conversions will be much faster.
docsplit pdf documentation/*.html
Docsplit.extract_pdf('expense_report.xls')
author, date, creator, keywords, producer, subject, title, length Ruby: extract_...
Retrieve a piece of metadata about the document. The docsplit utility prints the value to stdout; the Ruby API returns it.
docsplit title path/to/stooges.pdf
=> Disorder in the Court
Docsplit.extract_length('path/to/stooges.pdf')
=> 36
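To gather every metadata field at once, the per-key calls can be looped. This sketch assumes that each key listed above has a matching `extract_<key>` method, as the `extract_...` pattern and the `extract_length` example suggest. `collect_metadata` and the stand-in extractor are hypothetical names used for illustration, so the code runs without the gem installed.

```ruby
# Metadata keys as listed in the description above.
def metadata_keys
  %w[author date creator keywords producer subject title length]
end

# Hypothetical helper: calls extract_<key> on the extractor for each key and
# returns the results as a hash. In real use the extractor would be the
# Docsplit module (assumption: one extract_<key> method exists per key).
def collect_metadata(path, extractor)
  metadata_keys.each_with_object({}) do |key, info|
    info[key.to_sym] = extractor.public_send("extract_#{key}", path)
  end
end

# Stand-in extractor for demonstration only; it returns made-up values.
fake = Object.new
metadata_keys.each do |key|
  fake.define_singleton_method("extract_#{key}") do |path|
    key == 'length' ? 36 : "#{key} of #{File.basename(path)}"
  end
end

info = collect_metadata('path/to/stooges.pdf', fake)
puts info[:length]
# With the gem loaded: collect_metadata('path/to/stooges.pdf', Docsplit)
```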
document
ocr
pdf
ruby
parsing
processing
tools
cli
Plurk Open Source - LightCloud - Distributed and persistent key value database
april 2010 by michaelfox
# Built on Tokyo Tyrant. One of the fastest key-value databases [benchmark]. Tokyo Tyrant has been in development for many years and is used in production by Plurk.com, mixi.jp and scribd.com (to name a few)...
# Great performance (comparable to memcached!)
# Can store millions of keys on very few servers - tested in production
# Scale out by just adding nodes
# Nodes are replicated via master-master replication. Automatic failover and load balancing are supported from the start
# Ability to script and extend using Lua. Included extensions are incr and a fixed list
# Hot backups and restore: Take backups and restore servers without shutting them down
# LightCloud manager can control nodes, take backups and give you a status on how your nodes are doing
# Very small footprint (the LightCloud client is around 500 lines and the manager about 400)
# Python only, but LightCloud should be easy to port to other languages.
# Ruby port under development!
database
keyvalue
nosql
document
memcache