michaelfox + parsing   24

chriso/node.io - GitHub
node.io is a distributed data scraping and processing framework

Jobs are written in JavaScript or CoffeeScript and run in Node.js - jobs are concise, asynchronous and FAST
Includes a robust framework for scraping, selecting and traversing data from the web (choose between jQuery or SoupSelect)
Includes a data validation and sanitization framework
Easily handle a variety of input / output - files, databases, streams, stdin/stdout, etc.
Speed up execution by distributing work across multiple processes and (soon) other servers
Manage & run jobs through a web interface
Follow @nodeio or visit http://node.io/ for updates.

Scrape example

Let's pull the front page stories from reddit

require('node.io').scrape(function() {
    this.getHtml('http://www.reddit.com/', function(err, $) {
        var stories = [];
        $('a.title').each(function(title) {
            stories.push(title.text);
        });
        this.emit(stories);
    });
});
If you want to incorporate timeouts, retries, batch-type jobs, etc., head over to the wiki for documentation.

Built-in modules

node.io comes with some built-in scraping modules.

Find the pagerank of a domain

$ echo "mastercard.com" | node.io pagerank
=> mastercard.com,7
..or a list of URLs

$ node.io pagerank < urls.txt
Quickly check the http code for each URL in a list

$ node.io statuscode < urls.txt
Grab the front page stories from reddit

$ node.io query "http://www.reddit.com/" a.title
javascript  node.js  nodejs  scraping  parsing  spider  scraper  parser 
april 2011 by michaelfox
Doc⚡split
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)

Docsplit is currently at version 0.5.0.

Docsplit is an open-source component of DocumentCloud.

Usage

The Docsplit gem includes both the docsplit command-line utility as well as a Ruby API. The available commands and options are identical in both.
--output or -o can be passed to any command in order to store the generated files in a directory of your choosing.

images --size --format --pages Ruby: extract_images
Generates an image for each page in the document at the specified resolution and format. Pass --pages or -p to choose the specific pages to image. Passing --size or -s will specify the desired image resolution, and --format or -f will select the format of the final images.

docsplit images example.pdf
docsplit images docs/*.pdf --size 700x,50x50 --format gif --pages 3,10-15,42
Docsplit.extract_images('example.doc', :size => '1000x', :format => [:png, :jpg])
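
The --pages value in the example above mixes single pages and ranges ("3,10-15,42"). As a rough illustration of how such a spec maps to page numbers, here is a minimal sketch. The helper name expand_pages is hypothetical and is not part of Docsplit; how Docsplit itself parses the option is not shown in the source.

```ruby
# Hypothetical helper, not part of Docsplit: expand a page spec such as
# "3,10-15,42" into a sorted list of unique page numbers.
def expand_pages(spec)
  pages = spec.split(',').flat_map do |part|
    if part.include?('-')
      # A range like "10-15": parse both ends as base-10 integers.
      first, last = part.split('-', 2).map { |n| Integer(n.strip, 10) }
      (first..last).to_a
    else
      # A single page like "3".
      [Integer(part.strip, 10)]
    end
  end
  pages.uniq.sort
end

p expand_pages('3,10-15,42')
```

Malformed input such as "3,a" raises ArgumentError here, which is one possible design choice for a sketch like this.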
text --pages --ocr --no-ocr --no-clean Ruby: extract_text
Extract the complete UTF-8-encoded plain text of a document to a single file. If you'd like to extract the text for each page separately, pass --pages all. You can use the --ocr and --no-ocr flags to force OCR, or disable it, respectively. By default (if Tesseract is installed) Docsplit will OCR the text of each page for which it fails to extract text directly from the document. Docsplit will also attempt to clean up garbage characters in the OCR'd text — to disable this, pass the --no-clean flag.

docsplit text path/to/doc.pdf --pages all
docs = Dir['storage/originals/*.doc']
Docsplit.extract_text(docs, :ocr => false, :output => 'storage/text')
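
The OCR behaviour described above (OCR only the pages where direct extraction yields nothing, unless --ocr or --no-ocr overrides that) can be pictured as a small per-page decision. The sketch below is an assumption-laden illustration, not Docsplit's implementation: page_text, the mode values, and the two stand-in extractors are all hypothetical.

```ruby
# Hypothetical sketch of the per-page fallback; `direct` and `ocr` are
# callables standing in for real text extraction and OCR steps.
# mode is :auto (fall back to OCR), :force_ocr, or :no_ocr.
def page_text(page, direct:, ocr:, mode: :auto)
  return ocr.call(page) if mode == :force_ocr

  text = direct.call(page).to_s
  # Keep the direct result if OCR is disabled or any real text was found.
  return text if mode == :no_ocr || !text.strip.empty?

  ocr.call(page)
end

# Stand-in extractors for illustration only.
direct = ->(page) { page[:embedded_text] }
ocr    = ->(page) { "ocr:#{page[:id]}" }

scanned = { id: 2, embedded_text: '' }
puts page_text(scanned, direct: direct, ocr: ocr)
```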
pages --pages Ruby: extract_pages
Burst apart a document into single-page PDFs. Use --pages to specify the individual pages (or ranges of pages) you'd like to generate.

docsplit pages path/to/doc.pdf --pages 1-10
Docsplit.extract_pages('path/to/presentation.ppt')
Docsplit.extract_pages('doc.pdf', :pages => 1..10)
pdf Ruby: extract_pdf
Convert documents into PDFs. Any type of document that OpenOffice can read may be converted. These include the Microsoft Office formats: doc, docx, ppt, xls and so on, as well as html, odf, rtf, swf, svg, and wpd. The first time that you convert a new file type, OpenOffice will lazy-load the code that processes it — subsequent conversions will be much faster.

docsplit pdf documentation/*.html
Docsplit.extract_pdf('expense_report.xls')
author, date, creator, keywords, producer, subject, title, length
Ruby: extract_...
Retrieve a piece of metadata about the document. The docsplit utility will print to stdout, the Ruby API will return the value.

docsplit title path/to/stooges.pdf
=> Disorder in the Court
Docsplit.extract_length('path/to/stooges.pdf')
=> 36
document  ocr  pdf  ruby  parsing  processing  tools  cli 
march 2011 by michaelfox
HTML Parsing and Screen Scraping with the Simple HTML DOM Library | Nettuts+
If you need to parse HTML, regular expressions aren’t the way to go. In this tutorial, you’ll learn how to use an open source, easily learned parser, to read, modify, and spit back out HTML from external sources. Using nettuts as an example, you’ll learn how to get a list of all the articles published on the site and display them.
php  dom  scrape  scraping  screenscraping  parsing  html  library  webdev  simplehtmldom 
may 2010 by michaelfox
