How do I get all the text under a tag with Nokogiri?
In this example I am trying to get the text from within the <td> tags of a table. First, the HTML:
<table>
<tbody>
<tr>
<td>Single line of text</td>
</tr>
<tr>
<td>Text here<p>First line</p><p>Second line</p></td>
</tr>
</tbody>
</table>
Then the Ruby code:
require 'nokogiri'
require 'pp'
html = File.open('test.html').read
doc = Nokogiri::HTML(html)
rows = doc.xpath('//table[1]/tbody/tr')
data = rows.collect do |row|
  row.at_xpath('td[1]/text()').to_s
end
pp data
And the result that I get is:
["Single line of text", "Text here"]
How can I get all of the text in the second <td> tag?
There are two changes you will need to make to get all the text nodes. First, at_xpath only ever returns a single node, so to get multiple nodes you'll need to use xpath.
Second, to get all descendant nodes, not just child nodes, use // instead of /.
Combining these, the line of code would be:
row.xpath('td[1]//text()').to_s
This will concatenate all the text nodes together, giving the result:
["Single line of text", "Text hereFirst lineSecond line"]
which may not be what you want. Rather than just calling to_s on the resulting node set, you will need to process it to fit your needs.
How about this?
pp doc.search("//tr[2]//td//text()").map { |item| item.text }
As matt says, you can get all descendants using //.
You can also index the second tr if you want that one specifically. Just leave out the index to get all the trs.
And you can filter the resulting text nodes to get only those that have a td as an ancestor.
Finally, map over each Nokogiri object, plucking out the text into the final array, which looks like this:
["Text here", "First line", "Second line"]
You want the text method of Nokogiri::XML::Node if you want to get all the text for any element:
p doc.xpath('//table[1]/tbody/tr').map{ |tr| tr.text.strip }
#=> ["Single line of text", "Text hereFirst lineSecond line"]
(The strip method just gets rid of leading and trailing whitespace.)