
open-uri + hpricot & nokogiri don't parse html correctly

I'm trying to parse a webpage using open-uri + hpricot, but there seems to be a problem in the parsing process: the gems don't return what I want.

Specifically, I want to get this div (whose id is 'pasajes') at this URL:

http://www.despegar.com.ar

I wrote this code:

require 'nokogiri'
require 'hpricot'
require 'open-uri'

document = Hpricot(open('http://www.despegar.com.ar/')) # WITH HPRICOT
document2 = Nokogiri::HTML(open('http://www.despegar.com.ar/')) # WITH NOKOGIRI

pasajes = document.search("//div[@id='pasajes']")
pasajes2 = document2.xpath("//div[@id='pasajes']")

But it returns NOTHING! I've tried a lot of things in both hpricot and nokogiri:

  1. I tried giving the absolute path to that div
  2. I tried a CSS path with selectors
  3. I tried the hpricot search shortcut (doc//"div#pasajes")
  4. Almost every possible relative path to reach the 'pasajes' div

Finally I found a horrible solution. I used the watir library: after opening a web browser, I passed the HTML to hpricot. This way hpricot DOES RECOGNIZE the 'pasajes' div. But I don't want to open a web browser just for parsing purposes...

What am I doing wrong? Is open-uri working badly? Is hpricot?

There's no DIV with the id pasajes in the static HTML page. If you are running *nix you can see that by doing:

curl http://www.despegar.com.ar/ | grep pasajes

My guess is that it's JavaScript-generated.
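The same check can be made from Ruby. This is a minimal sketch using only the standard library; `static_id?` and `fetch_static_html` are hypothetical helper names, and the match is a plain regex over the raw markup rather than a real HTML parse:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: true if the raw (unrendered) markup contains an
# element whose id attribute equals `id`. Regex check only, not a parse.
def static_id?(html, id)
  pattern = /(?<![\w-])id\s*=\s*["']?#{Regexp.escape(id)}["'\s>]/i
  !!(html =~ pattern)
end

# Hypothetical helper: fetch the page exactly as the server sends it.
# No JavaScript is executed and redirects are not followed.
def fetch_static_html(url)
  Net::HTTP.get(URI(url))
end

# Usage (needs network access):
#   html = fetch_static_html('http://www.despegar.com.ar/')
#   puts static_id?(html, 'pasajes')
```

If this prints false while the browser shows the div, the element is most likely being added by a script after the page loads.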

If you are using MacRuby you could try Lyndon.

There's no div with id 'pasajes' in that page. That's the problem.

This fits more as an additional comment on Jonas' answer above than as an answer in itself... But I am new to SO and do not have the "commenting powers" yet :)

You can use Selenium RC to download the full HTML and then use nokogiri on the downloaded file. Note that this will work only if the content is being generated/modified by JavaScript. If the webpage depends on cookies to set up the content, your options would be Selenium (in the browser) or watir, as you have noted.
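A rough sketch of that idea follows. It assumes the selenium-webdriver gem rather than Selenium RC itself, so treat it as an illustration and not as the setup described above. `rendered_html` is a hypothetical helper that relies only on the driver's `get` and `page_source` methods:

```ruby
# Hypothetical helper: load the page in a real browser so its JavaScript
# runs, then return the resulting markup as a String.
# `driver` must respond to #get and #page_source, as a
# Selenium::WebDriver driver object does.
def rendered_html(driver, url)
  driver.get(url)
  driver.page_source
end

# Usage (assumes the selenium-webdriver and nokogiri gems, plus an
# installed browser and its driver):
#   require 'selenium-webdriver'
#   require 'nokogiri'
#   driver = Selenium::WebDriver.for(:firefox)
#   begin
#     doc = Nokogiri::HTML(rendered_html(driver, 'http://www.despegar.com.ar/'))
#     pasajes = doc.at_css('div#pasajes')
#   ensure
#     driver.quit
#   end
```

Content that a script inserts some time after the page loads may still be missing from `page_source`, so an explicit wait before reading it may be needed.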

I would love to hear a better solution to this (I want to parse a webpage with nokogiri, but the page is modified by JS).

I ran into a similar issue with Nokogiri, but on OS X 10.5. However, I first tried open-uri to open the pages in question, which have lots of HTML (div, p, whatever). I found that by using:

urldoc = open('http://hivelogic.com/articles/using_usr_local')
urldoc.each_line { |line| puts line } # readlines ignores a block; each_line yields every line

I would see lots of wonderful HTML. I also found that by reading the "file" into a string and passing that to Nokogiri I could get it to work fine. I even had to modify the very demo they use on rubyforge to teach you about Nokogiri.

Using their own example I get this:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove'))
=> <!DOCTYPE html>

>> doc.children
=> 

YUCK!

If I tweak it to read the URL into a string, I get good stuff:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove').read)
=> <!DOCTYPE html>
<html>
<head>
..... TONS OF HTML HERE ........
</div>
</body>
</html>
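The workaround above comes down to handing Nokogiri a String instead of the IO-like object that open-uri returns. Here is a small sketch of that step; `html_string` is a hypothetical helper name, and the usage note calls `URI.open` because Ruby 3.0 and later no longer accept URLs in bare `open`:

```ruby
require 'open-uri'
require 'stringio'

# Hypothetical helper: return the markup as a String whether the caller
# passes a String or an IO-like object (Tempfile, StringIO, File).
def html_string(source)
  return source unless source.respond_to?(:read)

  source.rewind if source.respond_to?(:rewind) # in case it was already read
  source.read
end

# Usage (needs network access and the nokogiri gem):
#   require 'nokogiri'
#   io  = URI.open('http://www.google.com/search?q=tenderlove')
#   doc = Nokogiri::HTML(html_string(io))
```

The rewind matters if the same handle was already consumed, for example by printing its lines first; without it, `read` would return an empty string.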

Note that I do see this lovely warning when I use irb to play:

HI. You're using libxml2 version 2.6.16 which is over 4 years old and has plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you upgrade your version of libxml2 and re-install nokogiri. If you like using libxml2 version 2.6.16, but don't like this warning, please define the constant I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring nokogiri.

But I am not in the mood to deal with the horrors and the various expert but contradictory advice on fixing libxml in /usr/local blah blah. A post on link text has a great explanation of it, but then another *nix wizard attacks the very concept with some sound warnings and concerns. So I say, "no way".

Why do I write this? Because IMO there might be a link between my Nokogiri blues and the libxml warning. OS X 10.5 is on old stuff, and they may have issues with that.

QUESTION

Do any other OS X 10.5 users have this issue with Nokogiri?
