jpfinley + scripting   8

Overview: Extracting article text from HTML documents | My tech blog.
In the world of web scraping, text mining, and article-reading utilities (such as the Readability bookmarklet), there is an ever-growing demand for tools that can distinguish the parts of an HTML document that represent an article from other common website building blocks such as menus, headers, footers, and ads.

In the following chapters I’ll try to review some article text extraction methods that are applicable to today’s websites. They mostly rely on machine learning, statistics, and a wide range of heuristics.
html  scrape  scraping  extraction  text  scripting 
march 2011 by jpfinley
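
The bookmark above mentions heuristics for separating article text from menus, headers, footers, and ads. As a rough illustration only, here is a minimal sketch of one such heuristic, link density filtering, using Python's standard `html.parser`. It is not taken from the linked article: the class and function names, the set of block tags, and the `min_chars` and `max_link_density` thresholds are all assumptions chosen for the example.

```python
from html.parser import HTMLParser

# Assumed set of tags treated as block boundaries (a simplification).
BLOCK_TAGS = {"p", "div", "li", "td", "ul", "ol", "h1", "h2", "h3",
              "nav", "header", "footer", "section", "article"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []        # finished (text, link_chars) pairs
        self._text = []         # text pieces of the block being built
        self._link_chars = 0    # characters seen inside <a> in this block
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, collapsing runs of whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_article_text(html, min_chars=80, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = [text for text, link_chars in collector.blocks
            if len(text) >= min_chars
            and link_chars / len(text) <= max_link_density]
    return "\n\n".join(kept)
```

Short blocks and blocks made up mostly of link text (typical of navigation and footers) are dropped, and the remaining blocks are joined as the article body. Real pages usually need more signals than this, so the thresholds should be treated as starting points to tune.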
Chicago Deep Dish
For those who couldn’t be there, and for those who were there and seek to savor the memories, here is An Event Apart Chicago, all wrapped up in a pretty bow:

AEA Chicago – official photo set
By John Morrison, subism studios llc. See also (and contribute to) An Event Apart Chicago 2009 Pool, a user group on Flickr.
A Feed Apart Chicago
Live tweeting from the show, captured forever and still being updated. Includes complete blow-by-blow from Whitney Hess.
Luke W’s Notes on the Show
Smart note-taking by Luke Wroblewski, design lead for Yahoo!, frequent AEA speaker, and author of Web Form Design: Filling in the Blanks (Rosenfeld Media, 2008):

Jeffrey Zeldman: A Site Redesign
Jason Santa Maria: Thinking Small
Kristina Halvorson: Content First
Dan Brown: Concept Models - A Tool for Planning Websites
Whitney Hess: DIY UX - Give Your Users an Upgrade
Andy Clarke: Walls Come Tumbling Down
Eric Meyer: JavaScript Will Save Us All (not captured)
Aaron Gustafson: Using CSS3 Today with eCSStender (not captured)
Simon Willison: Building Things Fast
Luke Wroblewski: Web Form Design in Action (download slides)
Dan Rubin: Designing Virtual Realism
Dan Cederholm: Progressive Enrichment With CSS3 (not captured)
Three years of An Event Apart Presentations

Note: Comment posting here was a bit wonky for a while, but normal commenting has been restored. Thank you, Noel Jackson.

Short URL: zeldman.com/?p=2695
A_List_Apart  An_Event_Apart  Appearances  Authoring  Browsers  CSS  Career  Chicago  Code  Community  Compatibility  DOM  Design  Education  Fonts  Formats  HTML  HTML5  Happy_Cog™  Information_architecture  Jason_Santa_Maria  Markup  Real_type_on_the_web  Scripting  Search  Standards  State_of_the_Web  architecture  art_direction  bugs  cities  conferences  content  content_strategy  creativity  development  downloads  editorial  engagement  eric_meyer  events  flickr  glamorous  industry  javascript  photography  social_networking  speaking  spec  from google
october 2009 by jpfinley
