Quick Start

You start by writing a class for parsing a single page:

 # process the Google.com index.html page
 class GoogleMain < Scrapes::Page
   # make sure that the :about_link rule matched the web page
   validates_presence_of(:about_link)

   # extract the link to the about page
   rule(:about_link, 'a[@href*="about"]', '@href', 1)
 end

You then write a similar class for the about page it links to:

 # process the Google.com about page
 class GoogleAbout < Scrapes::Page
   # ensure the :title rule below matches the web page
   validates_presence_of(:title)

   # extract the text inside the <title></title> tag
   rule(:title, 'title', 'text()', 1)
 end
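
To make the pattern in these two classes concrete, here is a minimal plain-Ruby sketch of how class-level `rule` and `validates_presence_of` declarations could work. It is an illustration only, not the Scrapes implementation: `SketchPage` and `SketchAbout` are hypothetical names, and the extractor is a simple regular-expression block rather than the selector and extractor arguments that Scrapes accepts.

```ruby
# Hypothetical stand-in for the rule/validation pattern; not the Scrapes code.
class SketchPage
  class << self
    # Extractor blocks declared on this class, keyed by rule name.
    def rules
      @rules ||= {}
    end

    # Names of rules that must produce a value for the page to be valid.
    def required
      @required ||= []
    end

    # Declare a named rule; the block receives the raw HTML and returns a value or nil.
    def rule(name, &extractor)
      rules[name] = extractor
      define_method(name) { @values[name] }
    end

    # Mark rules whose result must be present.
    def validates_presence_of(*names)
      required.concat(names)
    end
  end

  # Run every declared rule against the HTML and remember the results.
  def initialize(html)
    @values = self.class.rules.transform_values { |extractor| extractor.call(html) }
  end

  # A page is valid when every required rule produced a non-empty value.
  def valid?
    self.class.required.all? { |name| !@values[name].to_s.empty? }
  end
end

class SketchAbout < SketchPage
  validates_presence_of :title
  rule(:title) { |html| html[%r{<title>(.*?)</title>}m, 1] }
end

page = SketchAbout.new('<html><head><title>About Google</title></head></html>')
puts page.title   # prints About Google
puts page.valid?  # prints true
```

A page whose HTML has no `<title>` tag would return `nil` from `title` and `false` from `valid?`, which is the kind of mismatch the presence validation is meant to catch.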

Then you start a scraping session and use those classes to process the web site:

 Scrapes::Session.start do |session|
   session.page(GoogleMain, 'http://google.com') do |main_page|
     session.page(GoogleAbout, main_page.about_link) do |about_page|
     puts "#{about_page.title}: #{session.absolute_uri(main_page.about_link)}"
     end
   end
 end

On my machine, this code produces:

 About Google: http://www.google.com/intl/en/about.html

For more information, please review the following classes:
  • Scrapes::Session
  • Scrapes::Page
  • Scrapes::RuleParser
  • Scrapes::Hpricot::Extractors
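
The session example passes the extracted `about_link` through `session.absolute_uri` before printing it, which suggests the link on the page may be relative. How Scrapes resolves it internally is not shown here. As a rough sketch of the same idea using only Ruby's standard library, `URI.join` resolves a link against the URL of the page it was found on (`resolve_link` is a hypothetical helper, not part of Scrapes):

```ruby
require 'uri'

# Hypothetical helper: resolve a link found on a page against that page's URL.
def resolve_link(page_url, href)
  URI.join(page_url, href).to_s
end

# A root-relative link is resolved against the host.
puts resolve_link('http://www.google.com/', '/intl/en/about.html')
# prints http://www.google.com/intl/en/about.html

# A document-relative link is resolved against the page's directory.
puts resolve_link('http://www.google.com/intl/en/', 'about.html')
# prints http://www.google.com/intl/en/about.html
```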

Updated Feb 21, 2007 by Peter Jones
