Scraping with Nokogiri::HTML: can't get text from XPath

Question

I'm trying to scrape HTML with Nokogiri. This is the HTML source:

<span id="J_WlAreaInfo" class="wl-areacon">
    <span id="J-From">山东济南</span>
    至
    <span id="J-To">
        <span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
            全国
            <s></s>
        </span>
    </span>
</span>

I need to get the following text: 山东济南

I checked the shortest XPath with Firebug:

//*[@id="J-From"]

Here is my Ruby code:

doc = Nokogiri::HTML(open("http://foo.html"), "UTF-8")
area = doc.xpath('//*[@id="J-From"]')
puts area.text

However, it returns nothing. What am I doing wrong?

Answer 1

However, it returns nothing. What am I doing wrong?

xpath() returns an array-like collection of the matches (it's actually a Nokogiri::XML::NodeSet), so you can iterate over it:

require 'nokogiri'


html = %q{
<span id="J_WlAreaInfo" class="wl-areacon">
    <span id="J-From">山东济南</span>
    至
    <span id="J-To">
        <span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
            全国
            <s></s>
        </span>
    </span>
</span> 
}

doc = Nokogiri::HTML(html)
target_tags = doc.xpath('//*[@id="J-From"]')

target_tags.each do |target_tag|
  puts target_tag.text
end

--output:--
山东济南

Edit: You can actually call text() on the NodeSet itself, but it returns the concatenated text of every match in the set, which is not something I've ever found useful. Because there is only one match here, you should have gotten the result 山东济南. There is nothing in your post that indicates why you didn't get that result.

If you only want a single result from your XPath, i.e. the first match, then you can use at_xpath():

target_tag = doc.at_xpath('//*[@id="J-From"]')
puts target_tag.text

--output:--
山东济南

Answer 1: score 2, accepted, 2015-06-07 04:32:17
