Nokogiri: Parsing HTML table rows with no opening tag

I need to parse an HTML table with a format like this:

require 'nokogiri'

html_table = '<table>
    <tbody>
        <tr>
            <td>Some text in the first row!</td>
            <td>More text in the first row!</td>
        </tr>
        <td>Some text in the second row!</td>
        <td>More text in the second row!</td> </tr>
        <td>Some text in the third row!</td>
        <td>More text in the third row!</td>  </tr>
    </tbody>
</table>'

As you can see, the last two rows do not have the opening <tr> tag. When I try to get all three rows using puts Nokogiri::HTML(html_table).css('table tr'), the markup is cleaned up and the last two rows are left as bare td nodes, so only the first row is returned:

<tr>
    <td>Some text in the first row!</td>
    <td>More text in the first row!</td>
</tr>

I have found some ways on the web to fix this when the closing </tr> tag is missing, but not the reverse. Is there a simple way to fix this using Nokogiri?

I think this is caused by the way Nokogiri parses the markup. A possible solution is to use the Nokogumbo gem, which extends Nokogiri with an HTML5 parser that handles this kind of markup more correctly. Install it with:

gem install nokogumbo

Then, instead of using Nokogiri directly, use:

# nokogumbo also loads Nokogiri, so there is no need for: require 'nokogiri'
require 'nokogumbo'
Nokogiri::HTML5(source_code).css('table tr').each do |row|
  p row
end

Note that you have to use the source code from the website, which does correctly have the tags everywhere. You can fetch the source code of the website as follows, but this of course requires that there is only one table on the page.

require 'open-uri'
# URI.open is needed on Ruby 3.0+, where plain open no longer accepts URLs
source_code = URI.open('http://www.url_to_website_I_want_to_parse.com').read

Of course, make sure you define the variable source_code before the parsing code that uses it.
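
An alternative that does not depend on a different parser is to restore the missing <tr> tags before handing the markup to Nokogiri. The following is only a minimal sketch of that idea, not a Nokogiri feature: insert_missing_tr is a hypothetical helper name, and it assumes a flat table (no nested tables) with no tag-like text inside comments or scripts.

```ruby
# Hypothetical helper (not part of Nokogiri): adds a <tr> in front of any
# <td>/<th> that shows up while no row is currently open.
# Assumption: the table is flat, i.e. there are no nested tables.
def insert_missing_tr(html)
  in_row = false
  # Visit only the tags that open a row, close a row, or open a cell.
  html.gsub(%r{<tr\b[^>]*>|</tr\s*>|<t[dh]\b[^>]*>}i) do |tag|
    case tag
    when /\A<tr/i     # explicit row start
      in_row = true
      tag
    when %r{\A</tr}i  # row end
      in_row = false
      tag
    else              # a cell: open a row first if none is open
      if in_row
        tag
      else
        in_row = true
        "<tr>#{tag}"
      end
    end
  end
end

# Shortened version of the table from the question.
sample = <<~HTML
  <table>
    <tbody>
      <tr>
        <td>first row, cell 1</td>
      </tr>
      <td>second row, cell 1</td> </tr>
      <td>third row, cell 1</td> </tr>
    </tbody>
  </table>
HTML

puts insert_missing_tr(sample)
```

The repaired string can then be passed to Nokogiri::HTML(...).css('table tr') as in the question, which should find all three rows because each one now has both its opening and closing tag.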
