Development Tips
Add something like this to your .irbrc:
require 'rubygems'
require 'yaml'
require 'open-uri'
require 'hpricot'
require 'scrapes'

def h(url)
  Hpricot(open(url))
end
Then use like this in irb to understand how Hpricot selectors work:
doc = h 'http://www.foobar.com/'
links = doc.search('table/a[@href]') # for example
To understand the text extractors:
texts(links)
word(links.first) # etc.
Converting normal Xpath to Hpricot Xpath, sort of:
There are various add-ons, for Firefox for example, that display the XPath to a selected node. Hpricot uses a different syntax, however (see http://code.whytheluckystiff.net/hpricot/wiki/SupportedXpathExpressions). The following method is a first try at the conversion:
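def xpath_to_hpricot(path)
  # Skip empty steps and the html and tbody elements.
  steps = path.split('/').reject { |e| e.empty? || e =~ /\A(html|tbody)\z/ }
  # Turn a 1-based XPath index such as [2] into a 0-based :eq(1).
  # Only the bracketed index is rewritten, so names like h1 are left alone.
  steps.map { |e| e.sub(/\[(\d+)\]/) { ":eq(#{$1.to_i - 1})" } }.join('//')
end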
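To see what the conversion is meant to produce, here is a minimal sketch that walks one path through the same steps by hand. The path is made up for illustration and does not come from a real page, and the sketch rewrites only the bracketed numeric index.

```ruby
# A made-up XPath of the kind a browser add-on might report (assumption).
path = '/html/body/div[2]/table/tbody/tr[3]/td[1]'

# Step 1: split into steps and drop the empty, html and tbody ones.
steps = path.split('/').reject { |e| e.empty? || e =~ /\A(html|tbody)\z/ }

# Step 2: rewrite each 1-based [n] as a 0-based :eq(n-1).
converted = steps.map { |e| e.sub(/\[(\d+)\]/) { ":eq(#{$1.to_i - 1})" } }

# Step 3: join the steps with '//'.
selector = converted.join('//')

puts selector
# prints: body//div:eq(1)//table//tr:eq(2)//td:eq(0)
```

The resulting string can then be passed to doc.search in irb to check that it selects the node you expected.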
Hpricot bugs
- This selector will hang: 'a[href="this"]', and this one won't: 'a[@href="this"]'. Just make sure you have the '@' in front of the attribute name.
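One way to guard against this is to normalize selectors before handing them to Hpricot. The helper below is a hypothetical sketch, not part of Hpricot or scrapes. It assumes that attribute tests look like [name], [name=...] or a prefixed operator such as [name*=...], and it adds the '@' only where it is missing.

```ruby
# Hypothetical helper (not part of Hpricot or scrapes): put '@' in front of
# bare attribute names inside [...] so the selector does not trigger the hang.
def ensure_at_prefix(selector)
  # Match '[' followed by a bare name that is then followed by ']' or by an
  # operator ending in '='. Numeric indexes and names that already start
  # with '@' do not match and are left unchanged.
  selector.gsub(/\[\s*([A-Za-z_][\w\-]*)(?=\s*(?:[~|^$*!]?=|\]))/) { "[@#{$1}" }
end

fixed = ensure_at_prefix('a[href="this"]')
already_ok = ensure_at_prefix('a[@href="this"]')
puts fixed
puts already_ok
```

Both calls print a[@href="this"], so the helper is safe to apply to selectors that are already correct.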
Updated Jan 30, 2007 by Peter Jones