
Ruby Nokogiri HTML table scraping using XPath

I am trying to get the "cell4" value from an HTML table like the following, using Ruby, XPath, and Nokogiri:

<html>
<body>

<h1>Heading</h1>

<p>paragraph.</p>

<h4>Two rows and three columns:</h4>
<table border="0">
<tr>
  <td>cell1</td>
  <td>cell2</td>
</tr>
<tr>
  <td>cell3</td>
  <td>cell4</td>
</tr>

</table>

</body>
</html>

I have the following simple code, but it returns []. This question must be simple enough, but I couldn't find anything on Google that hits the spot.

#!/usr/bin/ruby -w

require 'rubygems'
require 'nokogiri'

page1 = Nokogiri::HTML('test_simple.html')

a = page1.xpath("//html/body/table/tr[2]/td[2]")
p a

The XPath works as intended in REXML, so the expression itself is correct, but it does not work in Nokogiri. Since this is going to be used for larger HTML files, REXML cannot be used. The problem does not seem to be limited to tables: the contents of other tags cannot be scraped either.

IMHO it is a lot easier to work with the CSS API in Nokogiri (XPath always gives me headaches):

page.css('td')    # should return a NodeSet (array-like) of the 4 table cell nodes
page.css('td')[3] # returns the 4th 'td' node; indexing starts at 0

Thanks to taro's comment, I was able to solve the issue with a little effort.

Here is the corrected code:

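#!/usr/bin/ruby -w
require 'rubygems'
require 'nokogiri'

# Pass the opened file (its contents), not the file name, to the parser.
page1 = Nokogiri::HTML(File.open('test_simple.html'))
# .text extracts the string content of the matched cell.
a = page1.xpath("/html/body/table/tr[2]/td[2]").text
p a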
