Development Tips

Add something like this to your .irbrc:

  require 'rubygems'
  require 'yaml'
  require 'open-uri'
  require 'hpricot'
  require 'scrapes'
  def h(url) Hpricot(open(url)) end

Then use it like this in irb to understand how Hpricot selectors work:

  doc = h 'http://www.foobar.com/'
  links = doc.search('table/a[@href]')  # for example

To understand the text extractors:

  texts(links)
  word(links.first)  # etc.

Converting normal XPath to Hpricot XPath, sort of:

There are various add-ons (for Firefox, for example) that display the XPath to a selected node. Hpricot uses a different syntax, however (http://code.whytheluckystiff.net/hpricot/wiki/SupportedXpathExpressions). The following method is a first try at the conversion:

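  def xpath_to_hpricot(path)
    # Drop empty segments and the html/tbody elements.
    path.split('/').reject { |e| e.empty? || e =~ /^(html|tbody)$/ }.map do |e|
      # Rewrite a 1-based XPath position [n] as :eq(n - 1). Only digits inside
      # the brackets are touched, so tag names such as h1 stay intact.
      e.sub(/\[(\d+)\]/) { ":eq(#{$1.to_i - 1})" }
    end.join('//')
  end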
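
To see what the conversion does, it can help to trace one path by hand. The sketch below is a standalone variant that uses only core Ruby and returns each intermediate stage. The name trace_xpath_conversion and the sample path are made up for illustration. It assumes, as the method above does, that :eq() counts from zero while XPath positions count from one.

```ruby
# Hypothetical helper: shows each stage of the XPath-to-Hpricot conversion.
def trace_xpath_conversion(path)
  segments = path.split('/')
  # Drop empty segments plus the html and tbody elements.
  kept = segments.reject { |e| e.empty? || e =~ /\A(html|tbody)\z/ }
  # Rewrite a 1-based XPath position [n] as :eq(n - 1).
  rewritten = kept.map { |e| e.sub(/\[(\d+)\]/) { ":eq(#{$1.to_i - 1})" } }
  { :segments => segments, :kept => kept, :rewritten => rewritten,
    :result => rewritten.join('//') }
end

# Sample path of the kind a browser add-on might report (an assumption).
trace = trace_xpath_conversion('/html/body/table/tbody/tr[2]/td[1]')
puts trace[:result]  # prints body//table//tr:eq(1)//td:eq(0)
```

Looking at trace[:kept] and trace[:rewritten] in irb shows which segments were dropped and how each position was shifted.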

Hpricot bugs

  • This selector will hang: 'a[href="this"]'. This one won't: 'a[@href="this"]'. Just make sure you have the '@' in front of the attribute name.
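
One way to avoid tripping over this is to check a selector before handing it to Hpricot. The following is only a rough sketch: missing_at_sign? is a hypothetical helper, not part of Hpricot or scrapes, and its regular expression only looks for a bracketed attribute name followed by a comparison, so it may miss other problem cases.

```ruby
# Hypothetical guard: true when an attribute comparison lacks the leading '@'.
def missing_at_sign?(selector)
  # Matches "[name=" (optionally with an operator prefix such as ~ or ^)
  # where the name directly follows the opening bracket, with no '@'.
  !!(selector =~ /\[\s*[A-Za-z_][\w\-]*\s*[~|^$*!]?=/)
end

puts missing_at_sign?('a[href="this"]')   # prints true
puts missing_at_sign?('a[@href="this"]')  # prints false
```

Calling this in irb before doc.search gives a quick warning instead of a hung session.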

Updated Jan 30, 2007 by Peter Jones
