Nokogiri: parsing rows of an HTML table that have no opening tag

Question

I need to parse an HTML table with the following format:

require 'nokogiri'

html_table = '<table>
    <tbody>
        <tr>
            <td>Some text in the first row!</td>
            <td>More text in the first row!</td>
        </tr>
        <td>Some text in the second row!</td>
        <td>More text in the second row!</td> </tr>
        <td>Some text in the third row!</td>
        <td>More text in the third row!</td>  </tr>
    </tbody>
</table>'

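As you can see, the last two rows have no opening <tr> tag. When I try to get all three rows with puts Nokogiri::HTML(html_table).css('table tr'), the markup is cleaned up and the last two rows become bare td nodes, so only the first row is returned: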

<tr>
    <td>Some text in the first row!</td>
    <td>More text in the first row!</td>
</tr>

I have found some solutions on the web for the case where the closing </tr> tag is missing, but none for the reverse. Is there a simple way to fix this with Nokogiri?

Answer 1

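I think this is caused by the way Nokogiri parses the malformed markup. One possible solution is the Nokogumbo gem, which extends Nokogiri's parsing capabilities. Install it with: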

gem install nokogumbo

Then, instead of using plain Nokogiri, you can use:

require 'nokogumbo' # Nokogumbo also loads Nokogiri, so a separate require 'nokogiri' is not needed

# source_code holds the HTML to parse and must be defined before this point
Nokogiri::HTML5(source_code).css('table tr').each do |row|
  p row
end

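Note that you have to use the source code of the website itself, which properly has tags everywhere. You can fetch the website's source as follows, though of course this only works as-is if the table you need is the only one on the page.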

require 'open-uri'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer opens URLs
source_code = URI.open('http://www.url_to_website_I_want_to_parse.com')

Make sure to declare the variable source_code at the start of the program, before the parsing code uses it.
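
If adding another parser is not an option, one alternative is to repair the markup before handing it to Nokogiri. The following is only a rough sketch under stated assumptions: insert_missing_tr is a hypothetical helper name, it uses a regular expression rather than a real parser, and it assumes a simple table with no nested tables and no tag-like text inside comments or scripts.

```ruby
# Hypothetical helper: inserts an opening <tr> before any cell that
# appears while no row is open. Assumes no nested tables.
def insert_missing_tr(html)
  in_row = false
  html.gsub(%r{<(/?)(tr|td|th)\b[^>]*>}i) do |tag|
    closing = !Regexp.last_match(1).empty?
    name = Regexp.last_match(2).downcase

    if name == 'tr'
      # An opening <tr> starts a row, a closing </tr> ends it.
      in_row = !closing
      tag
    elsif !closing && !in_row
      # A cell opened outside of any row: open the missing row first.
      in_row = true
      "<tr>#{tag}"
    else
      tag
    end
  end
end

html_table = <<~HTML
  <table>
      <tbody>
          <tr>
              <td>Some text in the first row!</td>
              <td>More text in the first row!</td>
          </tr>
          <td>Some text in the second row!</td>
          <td>More text in the second row!</td> </tr>
          <td>Some text in the third row!</td>
          <td>More text in the third row!</td>  </tr>
      </tbody>
  </table>
HTML

fixed_html = insert_missing_tr(html_table)
puts fixed_html
```

The repaired string can then be passed to Nokogiri::HTML(fixed_html).css('table tr') as in the question; since every row now has both tags, the default parser should return all three rows.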

Answer 1: 1 accepted, 2014-09-27 15:43:50
