I need to parse an HTML table with a format like this:
require 'nokogiri'
html_table = '<table>
<tbody>
<tr>
<td>Some text in the first row!</td>
<td>More text in the first row!</td>
</tr>
<td>Some text in the second row!</td>
<td>More text in the second row!</td> </tr>
<td>Some text in the third row!</td>
<td>More text in the third row!</td> </tr>
</tbody>
</table>'
As you can see, the last two rows do not have the opening <tr> tag. When I try to get all three rows using puts Nokogiri::HTML(html_table).css('table tr'), the markup is cleaned up and the last two rows become bare td nodes, so only the first row is returned:
<tr>
<td>Some text in the first row!</td>
<td>More text in the first row!</td>
</tr>
I have found some ways on the web to fix this when the closing </tr> tag is missing, but not the reverse. Is there a simple way to fix this using Nokogiri?
I think this is due to the way Nokogiri's default HTML parser handles the malformed markup. A possible solution is the Nokogumbo gem, which adds an HTML5 parser to Nokogiri that recovers from this kind of error more gracefully. Install it with:
gem install nokogumbo
Then, instead of requiring nokogiri directly, use:
require 'nokogumbo' # nokogumbo also loads Nokogiri, so a separate require 'nokogiri' is not needed
# Parse with the HTML5 parser and print every table row
Nokogiri::HTML5(source_code).css('table tr').each do |row|
  p row
end
Note that you have to use the source code from the website itself, which does have the tags everywhere they belong. You can fetch the page source as follows, but this of course assumes that there is only one table on the page.
require 'open-uri'
# URI.open replaces the bare open(url) call, which no longer accepts URLs on Ruby 3.0+
source_code = URI.open('http://www.url_to_website_I_want_to_parse.com').read
Of course, make sure the variable source_code is defined before you pass it to Nokogiri::HTML5.
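If switching parsers is not an option, a rough alternative (not part of the approach above) is to patch the string before handing it to Nokogiri. The sketch below assumes that a missing opening tag always shows up as a `<td>` or `<th>` directly after `</tr>` or `<tbody>`; the helper name `add_missing_tr` is made up for illustration, and a regex like this will not cope with every kind of broken markup.

```ruby
# Hypothetical helper: insert a <tr> wherever a cell directly follows
# </tr> or <tbody>, i.e. where the opening row tag is missing.
def add_missing_tr(html)
  html.gsub(%r{(</tr>|<tbody[^>]*>)(\s*)(?=<t[dh]\b)}i) { "#{$1}#{$2}<tr>" }
end

html_table = '<table>
<tbody>
<tr>
<td>Some text in the first row!</td>
<td>More text in the first row!</td>
</tr>
<td>Some text in the second row!</td>
<td>More text in the second row!</td> </tr>
<td>Some text in the third row!</td>
<td>More text in the third row!</td> </tr>
</tbody>
</table>'

fixed = add_missing_tr(html_table)
puts fixed
```

The repaired string can then be passed to Nokogiri::HTML as usual, since every row now has both its opening and closing tag.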