
Parse table using Nokogiri

I would like to parse a table using Nokogiri. I'm doing it this way:

def parse_table_nokogiri(html)

    doc = Nokogiri::HTML(html)

    doc.search('table > tr').each do |row|
        row.search('td/font/text()').each do |col|
            p col.to_s
        end
    end

end

Some of the tables I have contain rows like this:

<tr>
  <td>
     Some text
  </td>
</tr>

...and some have rows like this:

<tr>
  <td>
     <font> Some text </font>
  </td>
</tr>

My XPath expression works for the second scenario but not the first. Is there an XPath expression that would give me the text from the innermost node of the cell, so that I can handle both scenarios?


I've incorporated the changes into my snippet:

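require 'cgi'
require 'nokogiri'

def parse_table_nokogiri(html)
    doc = Nokogiri::HTML(html)

    # Pick the table with the most rows; do nothing if the page has no table.
    table = doc.xpath('//table').max_by { |t| t.xpath('.//tr').length }
    return if table.nil?

    # Skip the first row (presumably a header row).
    rows = table.search('tr')[1..-1] || []
    rows.each do |row|
        # to_s returns the serialized text node, so escaped entities are decoded here.
        cells = row.search('td//text()').collect { |text| CGI.unescapeHTML(text.to_s.strip) }
        cells.each do |col|
            puts col
            puts "_____________"
        end
    end
end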

Use:

td//text()[normalize-space()]

This selects every text-node descendant that is not white-space-only, for any td child of the current node (the tr already selected in your code).

Or, if you want to select all text-node descendants, regardless of whether they are white-space-only or not:

td//text()

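UPDATE:

The OP has signaled in a comment that he is getting an unwanted td whose content is just a '&#160;' (a non-breaking space).

To also exclude text nodes that consist only of (one or more) nbsp characters, use the expression below. Here '&#160;' stands for the U+00A0 character itself: the entity is resolved only when the expression lives in an XML file such as an XSLT stylesheet, so in a Ruby string put the literal character (for example "\u00A0") in its place: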

td//text()[translate(normalize-space(), '&#160;', '')]

Simple:

doc.search('//td').each do |cell|
  puts cell.content
end

Simple (but not DRY) way of using alternation:

require 'nokogiri'

doc = Nokogiri::HTML <<ENDHTML
<body><table><thead><tr><td>NOT THIS</td></tr></thead><tr>
  <td>foo</td>
  <td><font>bar</font></td>
</tr></table></body>
ENDHTML

p doc.xpath( '//table/tr/td/text()|//table/tr/td/font/text()' )
#=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
#=>  #<Nokogiri::XML::Text:0x804286fc "bar">]

See "XPath with optional element in hierarchy" for a more DRY answer.

In this case, however, you can simply do:

p doc.xpath( '//table/tr/td//text()' )
#=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
#=>  #<Nokogiri::XML::Text:0x804286fc "bar">]

Note that your table structure (and mine above), which does not have an explicit tbody element, is invalid for XHTML. Given your explicit table > tr above, however, I assume that you have a reason for this.
