
Help on Hpricot parsing html

Original post: 2011-09-12 05:30:56 · Tags: ruby-on-rails, ruby, ruby-on-rails-3

I am trying to use Hpricot to parse the Amazon mobile website, but I found that the source code I get from a browser (IE, Firefox, Chrome, and Opera) is different from what Hpricot parses.

For example: http://www.amazon.com/gp/aw/d/0534243126

I am trying to extract the after-discount price. Looking at the source in any browser, this should be easy: doc.at("span[@class='dpOurPrice']").inner_text

However, it turns out that open-uri/Hpricot receives completely different source code, in which the price is not wrapped in any HTML tag. Could anyone tell me what is going on here?

2 answers

The Amazon mobile website server probably uses the User-Agent string to determine what type of content to return, and Ruby's default User-Agent is not what the server expects. Set it with:

open("http://xxx.com", 'User-Agent'=>'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2') { |f| Hpricot(f) }

to be the same as a browser's, so that the server returns the same source it would send to a browser.
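
A minimal sketch of the same idea, using Net::HTTP from Ruby's standard library instead of open-uri so that the header handling is explicit. The helper names build_request and fetch_as_browser are hypothetical, and the User-Agent string is simply the one quoted above. Whether the server actually returns the browser markup for this string is an assumption that has to be checked against the live site.

```ruby
require 'net/http'
require 'uri'

# User-Agent string copied from the answer above.
BROWSER_UA = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; ' \
             'rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2'

# Build a GET request that identifies itself as a browser.
# (Hypothetical helper; performs no network I/O.)
def build_request(url, user_agent = BROWSER_UA)
  request = Net::HTTP::Get.new(URI.parse(url))
  request['User-Agent'] = user_agent
  request
end

# Send the request and return the response body as a string.
# (Hypothetical helper; redirects are not followed.)
def fetch_as_browser(url, user_agent = BROWSER_UA)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_request(url, user_agent)).body
  end
end
```

The string returned by fetch_as_browser can then be handed to the HTML parser in place of the open-uri file handle.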

You're probably getting a different page because you're not passing a User-Agent HTTP header along with your request. Try passing one and see if that alters what you get back. (Hint: "steal" Chrome's.)

Next, I would not use Hpricot anymore, as it is unmaintained and has been superseded by Nokogiri. With Nokogiri you would find this element using doc.css("span.dpOurPrice"), for instance.




solution1: 0, ACCEPTED, 2011-09-12 05:36:45

solution2: 0, 2011-09-12 05:38:29
