ruby nokogiri HTML table scraping using xpath

I am trying to get the "cell4" value from an HTML table like the following, using Ruby, XPath and Nokogiri:

<html>
<body>

<h1>Heading</h1>

<p>paragraph.</p>

<h4>Two rows and three columns:</h4>
<table border="0">
<tr>
  <td>cell1</td>
  <td>cell2</td>
</tr>
<tr>
  <td>cell3</td>
  <td>cell4</td>
</tr>

</table>

</body>
</html>

I have the following simple code, but it returns []. This question must be simple enough, but I couldn't find anything on Google that hits the spot.

#!/usr/bin/ruby -w

require 'rubygems'
require 'nokogiri'

page1 = Nokogiri::HTML('test_simple.html')

a = page1.xpath("//html/body/table/tr[2]/td[2]")
p a

The XPath works as intended with REXML, so it is correct, but it does not work with Nokogiri. Since this is going to be used on larger HTML files, REXML cannot be used. The problem does not seem to be limited to tables: the contents of other tags cannot be scraped either.

IMHO it is a lot easier to work with the CSS API in Nokogiri (XPath always gives me headaches):

page.css('td') # should return a NodeSet (array-like) of the 4 table cell nodes
page.css('td')[3] # returns the 4th 'td' node; counting starts at 0

Thanks to taro's comment, I was able to solve the issue with a little effort.

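Here is the corrected code. The original passed the file name string to Nokogiri::HTML, which parsed that string as markup instead of reading the file; passing the opened file fixes it: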

#!/usr/bin/ruby -w
require 'rubygems'
require 'nokogiri'
page1 = Nokogiri::HTML(open('test_simple.html'))
a = page1.xpath("/html/body/table/tr[2]/td[2]").text
p a

