Welcome to the Scrapes project!

Scrapes is a framework for crawling and scraping multi-page web sites.

Unlike other scraping frameworks, Scrapes is designed to work with “dirty” web sites, that is, web sites that were not designed to have their data extracted programmatically.

It includes features for both the initial development of a scraper and the continued maintenance of that scraper.

Feature Highlights

  • Rule-based selection and extraction of data that can use CSS selectors or pseudo-XPath expressions
  • Caching system, so that during development you don’t have to repeatedly download pages from a web server while you experiment with your selectors and extractors
  • Validation system that helps detect web site changes that would otherwise invalidate your extraction rules
  • Support for initiating a session with the web server, and passing session cookies back to the web server
  • When all else fails, you can run a web page through the xsltproc XSLT processor to generate an XML document that can then be run through your rule-based parser
  • Useful set of post-processing methods such as normalize_name
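
The caching idea from the list above can be illustrated with a small, language-neutral sketch. The Python below is not the Scrapes API: the PageCache class, its get method, and the fake_fetch stand-in are hypothetical names made up for this example. It only shows the general pattern of storing each downloaded page on disk under a key derived from its URL, so that repeated runs during development read the local copy instead of contacting the web server.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class PageCache:
    """Hypothetical disk cache mapping a URL to the page body fetched for it."""

    def __init__(self, directory: Path, fetch: Callable[[str], str]) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # called only on a cache miss

    def _path_for(self, url: str) -> Path:
        # Hash the URL so that any URL maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.html"

    def get(self, url: str) -> str:
        path = self._path_for(url)
        if path.exists():
            # Cache hit: reuse the stored copy, no request is made.
            return path.read_text(encoding="utf-8")
        # Cache miss: fetch once, store the body, then return it.
        body = self.fetch(url)
        path.write_text(body, encoding="utf-8")
        return body


# Stand-in for a real HTTP request; it records every call it receives.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html><body>page for {url}</body></html>"


cache = PageCache(Path(tempfile.mkdtemp()), fake_fetch)
first = cache.get("http://example.com/list?page=1")
second = cache.get("http://example.com/list?page=1")  # served from disk
```

In this sketch the fetch function is passed in rather than hard-coded, so a real downloader can replace fake_fetch without changing the cache, and selectors and extractors can be tried repeatedly against the stored pages.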

Updated Jan 30, 2007 by Peter Jones
