Quick Start
Start by writing a class for each page you want to parse:
    # process the Google.com index.html page
    class GoogleMain < Scrapes::Page
      # make sure that the :about_link rule matched the web page
      validates_presence_of(:about_link)

      # extract the link to the about page
      rule(:about_link, 'a[@href*="about"]', '@href', 1)
    end

    # process the Google.com about page
    class GoogleAbout < Scrapes::Page
      # ensure the :title rule below matches the web page
      validates_presence_of(:title)

      # extract the text inside the <title></title> tag
      rule(:title, 'title', 'text()', 1)
    end
Then you start a scraping session and use those classes to process the web site:
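    Scrapes::Session.start do |session|
      # process the Google home page with GoogleMain
      session.page(GoogleMain, 'http://google.com') do |main_page|
        # follow the extracted about link and process it with GoogleAbout
        session.page(GoogleAbout, main_page.about_link) do |about_page|
          # print the about page title and the absolute URL of the link
          puts about_page.title + ': ' + session.absolute_uri(main_page.about_link)
        end
      end
    end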
On my machine, this code produces:
    About Google: http://www.google.com/intl/en/about.html

For more information, please review the following classes:
- Scrapes::Session
- Scrapes::Page
- Scrapes::RuleParser
- Scrapes::Hpricot::Extractors
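
The session example above prints an absolute URL for the about link, even though a link scraped from a page may be relative. As a rough illustration of that kind of resolution, here is a minimal sketch using only Ruby's standard URI library. It is not Scrapes' own implementation, and the helper name resolve_link is hypothetical, introduced only for this example:

```ruby
require 'uri'

# Hypothetical helper, NOT part of Scrapes: resolve a link found on a page
# against the URL of the page it was found on.
def resolve_link(page_url, href)
  # URI.join applies the standard relative-reference resolution rules;
  # an href that is already absolute is returned unchanged.
  URI.join(page_url, href).to_s
end

puts resolve_link('http://www.google.com/', '/intl/en/about.html')
```

Under these assumptions, a relative href such as /intl/en/about.html resolves to the same address shown in the output above. Inside a real session, session.absolute_uri remains the call to use.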
Updated Feb 21, 2007 by Peter Jones