Parse table using Nokogiri
I would like to parse a table using Nokogiri. I'm doing it this way:
def parse_table_nokogiri(html)
  doc = Nokogiri::HTML(html)
  doc.search('table > tr').each do |row|
    row.search('td/font/text()').each do |col|
      p col.to_s
    end
  end
end
Some of the tables that I have contain rows like this:
<tr>
<td>
Some text
</td>
</tr>
...and some have this:
<tr>
<td>
<font> Some text </font>
</td>
</tr>
My XPath expression works for the second scenario but not the first. Is there an XPath expression that would give me the text from the innermost node of the cell, so that I can handle both scenarios?
I've incorporated the changes into my snippet:
require 'cgi'
require 'nokogiri'

def parse_table_nokogiri(html)
  doc = Nokogiri::HTML(html)
  # Pick the table with the most rows.
  table = doc.xpath('//table').max_by { |t| t.xpath('.//tr').length }
  # Skip the first (header) row.
  rows = table.search('tr')[1..-1]
  rows.each do |row|
    cells = row.search('td//text()').collect { |text| CGI.unescapeHTML(text.to_s.strip) }
    cells.each do |col|
      puts col
      puts "_____________"
    end
  end
end
Use:
td//text()[normalize-space()]
This selects all non-white-space-only text node descendants of any td child of the current node (the tr already selected in your code).
Or, if you want to select all text-node descendants, regardless of whether they are white-space-only or not:
td//text()
UPDATE:
The OP has signaled in a comment that he is getting an unwanted td whose content is just '&nbsp;' (a non-breaking space, U+00A0).
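To also exclude td elements whose content consists only of (one or more) non-breaking space characters, use the expression below, in which the character between the first pair of quotes must be a literal non-breaking space (U+00A0):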
td//text()[translate(normalize-space(), ' ', '')]
Simple:
doc.search('//td').each do |cell|
  puts cell.content
end
Simple (but not DRY) way of using alternation:
require 'nokogiri'
doc = Nokogiri::HTML <<ENDHTML
<body><table><thead><tr><td>NOT THIS</td></tr></thead><tr>
<td>foo</td>
<td><font>bar</font></td>
</tr></table></body>
ENDHTML
p doc.xpath( '//table/tr/td/text()|//table/tr/td/font/text()' )
#=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
#=> #<Nokogiri::XML::Text:0x804286fc "bar">]
See "XPath with optional element in hierarchy" for a more DRY answer.
In this case, however, you can simply do:
p doc.xpath( '//table/tr/td//text()' )
#=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
#=> #<Nokogiri::XML::Text:0x804286fc "bar">]
Note that your table structure (and mine above), which does not have an explicit tbody element, is invalid for XHTML. Given your explicit table > tr above, however, I assume that you have a reason for this.