How do I get all the text under a tag with Nokogiri?

In this example I am trying to get the text from within the <td> tags of a table. First, the HTML:

<table>
  <tbody>
  <tr>
    <td>Single line of text</td>
  </tr>
  <tr>
    <td>Text here<p>First line</p><p>Second line</p></td>
  </tr>
  </tbody>
</table>

Then the Ruby code:

require 'nokogiri'
require 'pp'

html = File.read('test.html')
doc = Nokogiri::HTML(html)
rows = doc.xpath('//table[1]/tbody/tr')

data = rows.collect do |row|
  row.at_xpath('td[1]/text()').to_s
end

pp data

And the result that I get is:

["Single line of text", "Text here"]

How can I get all of the text in the second <td> tag?

There are two changes you will need to make to get all the text nodes. First, at_xpath only ever returns a single node, so to get multiple nodes you'll need to use xpath.

Second, to get all descendant nodes, not just child nodes, use // instead of /.

Combining these, the line of code would be:

row.xpath('td[1]//text()').to_s

This will concatenate all the text nodes together, giving the result:

["Single line of text", "Text hereFirst lineSecond line"]

which may not be what you want. Rather than just calling to_s on the resulting node set, you will need to process it to fit your needs.

How about this?

pp doc.search("//tr[2]//td//text()").map { |item| item.text }

As matt says, you can get all descendants using //.

You can also index the second tr if you want that one specifically. Just leave out the index to get all the tr elements.

The //td step then restricts the resulting text nodes to those that have a td ancestor.

Finally, map over each Nokogiri object, plucking out the text into the final array, which looks like this:

["Text here", "First line", "Second line"]

If you want all the text for any element, use the text method of Nokogiri::XML::Node:

p doc.xpath('//table[1]/tbody/tr').map{ |tr| tr.text.strip }
#=> ["Single line of text", "Text hereFirst lineSecond line"]

(The strip method just gets rid of leading and trailing whitespace.)
