How do I get all the text under a tag using Nokogiri?

Question

In this example I am trying to get the text from within the <td> tags of a table. First, the HTML:

<table>
  <tbody>
  <tr>
    <td>Single line of text</td>
  </tr>
  <tr>
    <td>Text here<p>First line</p><p>Second line</p></td>
  </tr>
  </tbody>
</table>

Then the Ruby code:

require 'nokogiri'
require 'pp'

html = File.read('test.html')
doc = Nokogiri::HTML(html)
rows = doc.xpath('//table[1]/tbody/tr')

data = rows.collect do |row|
  row.at_xpath('td[1]/text()').to_s
end

pp data

The result I get is:

["Single line of text", "Text here"]

How can I get all of the text in the second <td> tag?

Answer 1

There are two changes you will need to make to get all the text nodes. First, at_xpath only ever returns a single node, so to get multiple nodes you'll need to use xpath.

Second, to get all descendant nodes, not just child nodes, use // instead of /.

Combining these, the line of code would be:

row.xpath('td[1]//text()').to_s

This will concatenate all the text nodes together, giving the result:

["Single line of text", "Text hereFirst lineSecond line"]

which may not be what you want. Rather than just calling to_s on the resulting node set, you will need to process it to fit your needs.

Answer 2

How about this?

pp doc.search("//tr[2]//td//text()").map { |item| item.text }

As matt says, you can get all descendants using //.

You can also index the second tr if you want that one specifically. Just leave out the index to get all the tr elements.

The td step in the path filters the resulting text nodes, keeping only those that have a td as an ancestor.

Finally, map over each Nokogiri object, plucking out the text into the final array, which looks like this:

["Text here", "First line", "Second line"]

Answer 3

You want the text method of Nokogiri::XML::Node if you want to get all the text for any element:

p doc.xpath('//table[1]/tbody/tr').map{ |tr| tr.text.strip }
#=> ["Single line of text", "Text hereFirst lineSecond line"]

(The strip method just gets rid of leading and trailing whitespace.)
