
Nokogiri: Parsing HTML table rows with no opening tag

I need to parse an html table with a format like this:

require 'nokogiri'

html_table = '<table>
    <tbody>
        <tr>
            <td>Some text in the first row!</td>
            <td>More text in the first row!</td>
        </tr>
        <td>Some text in the second row!</td>
        <td>More text in the second row!</td> </tr>
        <td>Some text in the third row!</td>
        <td>More text in the third row!</td>  </tr>
    </tbody>
</table>'

As you can see, the last two rows are missing the opening <tr> tag. When I try to get all three rows using puts Nokogiri::HTML(html_table).css('table tr') , the markup is cleaned up, the last two rows are left as bare td nodes, and only the first row is returned:

<tr>
    <td>Some text in the first row!</td>
    <td>More text in the first row!</td>
</tr>

I have found some ways on the web to fix this when the closing </tr> tag is missing, but not the reverse. Is there a simple way to fix this using Nokogiri?

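I think this happens because Nokogiri's default HTML parser does not repair this kind of broken markup the way a browser would. A possible solution is the Nokogumbo gem, which adds an HTML5-compliant parser to Nokogiri. (Since Nokogiri 1.12 that parser is bundled with Nokogiri itself, so on recent versions Nokogiri::HTML5 should be available without the extra gem.) Install it with: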

gem install nokogumbo

Then, instead of Nokogiri::HTML, use:

require 'nokogumbo' # nokogumbo also loads Nokogiri, so a separate require 'nokogiri' is not needed
Nokogiri::HTML5(source_code).css('table tr').each do |row|
  p row
end

Note that you have to parse the original source code of the website, not markup that has already been cleaned up by another parser. You can fetch the source code of the page as follows. Because the selector 'table tr' matches rows in every table, this of course requires that there is only one table on the page.

require 'open-uri'
# URI.open is used because a bare open() no longer opens URLs on Ruby 3.0+
source_code = URI.open('http://www.url_to_website_I_want_to_parse.com').read

Of course, make sure the variable source_code is assigned before the Nokogiri::HTML5 call that uses it.

