Help on Hpricot parsing HTML

I am trying to use Hpricot to parse the Amazon mobile website, but I found that the source code I get from a browser (IE, FF, Chrome and Opera) is different from the source that Hpricot parses.

for example: http://www.amazon.com/gp/aw/d/0534243126

I am trying to extract the after-discount price. Looking at the source code from any browser, this is a very easy job: doc.at("span[@class='dpOurPrice']").inner_text

However, it turns out that open-uri/Hpricot gets completely different source code, and the price has no HTML tag around it. Could anyone tell me what is going on here?

The Amazon mobile website server probably uses the User-Agent string to decide what type of content to return, and Ruby's default user agent is not one the server expects. Set it with:

open("http://xxx.com", 'User-Agent'=>'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2') { |f| Hpricot(f) }

to be the same as the browser's, so that the server returns the same source it would send to a browser.
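For anyone who would rather not go through open-uri, here is a minimal sketch of the same idea using only Net::HTTP from the standard library. The names BROWSER_UA, build_request and fetch are made up for this illustration, and the User-Agent value is just the Firefox string from the line above. Whether Amazon still serves different markup for it is an assumption, not something this snippet checks.

```ruby
require "net/http"
require "uri"

# Hypothetical constant: the browser User-Agent string quoted in the answer.
BROWSER_UA = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.2) " \
             "Gecko/20090729 Firefox/3.5.2"

# Build a GET request that carries a browser-like User-Agent header.
# Nothing is sent over the network here.
def build_request(url, user_agent = BROWSER_UA)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request["User-Agent"] = user_agent # replaces Ruby's default value
  request
end

# Send the request and return the response body as a String.
def fetch(url, user_agent = BROWSER_UA)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(build_request(url, user_agent)).body
  end
end
```

Calling fetch("http://www.amazon.com/gp/aw/d/0534243126") should then return the HTML as a string, which can be handed to the parser. Comparing it with the output of a second call that passes a different user_agent is a quick way to confirm that the server really varies its response by this header.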

You're probably getting a different page because you're not passing a User-Agent HTTP header along with your request. Try passing one along and see if that alters what you get back. (Hint: "steal" Chrome's.)

Next, I would not use Hpricot anymore, as it is unmaintained and has been superseded by Nokogiri. With Nokogiri you would find this element using doc.css("span.dpOurPrice"), for instance.
