Scraping with Nokogiri::HTML - Can't get text from XPath

I'm trying to scrape HTML with Nokogiri. This is the HTML source:

<span id="J_WlAreaInfo" class="wl-areacon">
    <span id="J-From">山东济南</span>
    至
    <span id="J-To">
        <span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
            全国
            <s></s>
        </span>
    </span>
</span> 

I need to get the following text: 山东济南

I checked the shortest XPath with Firebug:

//*[@id="J-From"]

Here is my Ruby code:

doc = Nokogiri::HTML(open("http://foo.html"), "UTF-8")
area = doc.xpath('//*[@id="J-From"]')
puts area.text

However, it returns nothing. What am I doing wrong?

xpath() returns an array-like collection containing the matches (it's actually called a NodeSet):

require 'nokogiri'


html = %q{
<span id="J_WlAreaInfo" class="wl-areacon">
    <span id="J-From">山东济南</span>
    至
    <span id="J-To">
        <span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
            全国
            <s></s>
        </span>
    </span>
</span> 
}

doc = Nokogiri::HTML(html)
target_tags = doc.xpath('//*[@id="J-From"]')

target_tags.each do |target_tag|
  puts target_tag.text
end

--output:--
山东济南

Edit: You can actually call text() on the NodeSet, but it returns the concatenated text of every match in the set, which is not something I've ever found useful. Because there is only one match here, you should have gotten the result 山东济南. There is nothing in your post that indicates why you didn't get that result.

If you only want a single result from your XPath, i.e. the first match, then you can use at_xpath():

target_tag = doc.at_xpath('//*[@id="J-From"]')
puts target_tag.text

--output:--
山东济南
