Extract single string from HTML using Ruby/Mechanize (and Nokogiri)

I am extracting data from a forum. My script is working fine. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post, and I cannot get it to work. I used FireXPath to determine the XPath.

Sample code:

require 'rubygems'
require 'mechanize'

post_agent = WWW::Mechanize.new
post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.xpath('//[@id="post1960370"]/tbody/tr[1]/td/div[2]/text()')

All my attempts end with an empty string or an error.


I cannot find any documentation on using Nokogiri within Mechanize. The Mechanize documentation says at the bottom of the page:

After you have used Mechanize to navigate to the page that you need to scrape, then scrape it using Nokogiri methods.

But what methods? Where can I read about them, with samples and the syntax explained? I did not find anything on Nokogiri's site either.

Radek, I'm going to show you how to fish.

When you call Mechanize::Page#parser, it gives you the Nokogiri document, so your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your XPaths. In general, start out with the most general XPath you can get to work, and then narrow it down. So, for example, instead of this:

puts  post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip

start with this:

puts post_page.parser.xpath('//table').to_html

This gets every table, anywhere in the document, and prints them as HTML. Examine the HTML to see which tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has the CSS class "userdata", then try this:

puts post_page.parser.xpath("//table[@class='userdata']").to_html

Any time you get back an empty result instead of an array of nodes, you goofed up the XPath, so fix it before proceeding. Once you're getting the table you want, try to get the rows:

puts post_page.parser.xpath("//table[@class='userdata']//tr").to_html

If that worked, take off the "to_html" and you now have an array of Nokogiri nodes (a NodeSet), each one a table row.

And that's how you do it.

I think you copied this XPath from Firebug. Firebug shows an extra tbody, which might not be there in the actual HTML, so my suggestion is to remove that tbody and try again. If it still doesn't work, follow Wayne Conrad's process; that's the best!
