
Nokogiri and xpath parsing an HTML table

I can set up the parse and connect to a site, but when I run the script it returns an empty NodeSet:

require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'ap'

time = Time.new

url = <<-EOS
'http://www.events.psu.edu/cgi-bin/cal/webevent.cgi?cmd=listday&y=%d&m=%d&d=%d&cat=&sib=1&sort=m,e,t&ws=0&cf=list&set=1&swe=1&sa=1&de=1&tf=0&sb=1&stz=Default&cal=cal299' % [time.year, time.month, time.day]
EOS

page = Nokogiri::HTML(url)

rows = page.xpath('/html/body/p/table/tbody/tr/td[3]/p/table/tbody/tr[2]')
details = rows.collect do |row|
  detail = {}
  [
    [:time, 'td[3]/p/text()'],
    [:name, 'td[4]/div/a/b/font/text()'],
    [:location, 'td[4]/div[2]/text()'],
    [:details, 'td[4]/div[4]/text()'],
  ].collect do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
ap details

The return value is "[]".

This is the HTML for the table found at /html/body/p/table/tbody/tr/td[3]/p:

<TABLE BORDER=0 CELLPADDING=3 WIDTH="100%">

<!--Begin Event-->
<TR>
  <TD WIDTH="2%">
    <P></P>
  </TD>
  <TD WIDTH="10%">
    <P></P>
  </TD>
  <TD WIDTH="19%">

    <P></P>
  </TD>
  <TD WIDTH="60%">
    <P></P>
  </TD>
</TR>
<TR>
<!--Icon Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="2%">
    <P CLASS="listeventicon">&nbsp;</P>

  </TD>
<!--Date Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="10%">
    <P CLASS="listeventdate">Mar 14</P>
  </TD>
<!--Time Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="19%">
    <P CLASS="listeventtime">8:30 a.m. - 4:30 p.m.<BR>
</P>

  </TD>
<!--Main Event Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="60%">
<div class=listeventtitlelarge><A HREF="http://www.pennstatehershey.org/web/diabetesresearch/home">
<B><font color="#0000CC">2011 Diabetes and Obesity Research Spring Summit</FONT></B></A>
</div>
<div class=listeventtitle><B>Calendar:</B> HHD Seminars<BR>
<B>Posted by:</B> <A HREF="mailto:luk10%40psu.edu">Lauren Kipp</A><BR><B>Location:</B> The Nittany Lion Inn<BR>

</div>
<DIV CLASS="listeventspacer"> </DIV>
<DIV CLASS="listeventdetails">
<B>Details:</B><BR>Registration and Abstract Deadline: February 15, 2011<BR>        <BR>Registration: Please follow the link for more details and access to on line registration. Space is limited, so please register early to ensure your seat at the conference.<BR><BR>The Keynote Speaker for this year’s event is <b>Dr. Robert Sherwin, from the Yale School of Medicine.</b>  Dr. Sherwin is known for his research in the effect of insulin on brain function and immune mechanisms leading to type 1 diabetes.  The topic of his presentation is <i>Pathophysiological Mechanisms in Diabetes, from Laboratory to Bedside.</i><BR><BR>A welcome to the University Park campus will be offered by <b>Eugene Marsh, MD</b>, Senior Associate Dean for the Penn State College of Medicine Regional Medical Campus and Associate Director of the Penn State Hershey Medical Group in State College<BR><BR>Abstract Submission<BR>Please follow the link for formatting details and to register your intent to submit an abstract using the on-line form.<BR>    <BR>You will receive a confirmation immediately upon submission of your on-line form. Subsequently, the final formatted abstract must be sent directly to Continuing Ed by email attachment (see website instructions). Within 48 hours of sending your abstract in final format, you will receive an email confirmation from ContinuingEd@hmc.psu.edu indicating that both your form & the abstract attachment have been received.<BR><BR><i>All abstracts will be considered for poster presentations. A subset of these abstracts will be selected and invited for brief oral presentations during the “Poster Headlines” plenary sessions. To be considered for an oral presentation, please be sure to meet the submission deadline for submission of your final abstract. Prizes will be awarded for the top three posters from by post-doc/fellow/student presenters.</i>

</div>
</TD>
<!--EndEvent-->
......Followed by more of the same format

I want to get each event's name, time, location, and description.

Here is a simplified version of how I'd go about it.

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'ap'

time = Time.new

url = 'http://www.events.psu.edu/cgi-bin/cal/webevent.cgi?cmd=listday&y=%d&m=%d&d=%d&cat=&sib=1&sort=m,e,t&ws=0&cf=list&set=1&swe=1&sa=1&de=1&tf=0&sb=1&stz=Default&cal=cal299' % [time.year, time.month, time.day]

page = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer accepts URLs on Ruby 3.0+

details = page.search('//tr/td[@class="listeventbg"]/..').map do |row|
  time     = row.at( 'p.listeventtime'         ).text.strip rescue ''
  name     = row.at( 'div.listeventtitlelarge' ).text.strip rescue ''
  location = row.at( 'div.listeventtitle'      ).text.strip rescue ''
  details  = row.at( 'div.listeventdetails'    ).text.strip rescue ''

  {
    :time     => time,
    :name     => name,
    :location => location,
    :details  => details
  }
end

ap details

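Rather than relying on long XPath accessors, it is often easier to break the search down. This loops over the rows and then, for each row, does simple lookups for the cells.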

Normally I wouldn't use rescue '', but for quick and dirty work it's fine. For production I'd set up real exception handling.

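Your sample code required Mechanize but never used it, so I removed it from this example. It also didn't include any way for Nokogiri to retrieve the HTML, so I added Open-URI.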

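Nokogiri allows both CSS and XPath accessors. Many times CSS results in a simpler search. XPath is more powerful, but that power can come at the cost of complexity. //tr/td[@class="listeventbg"]/.. finds the rows that contain the matching cells by locating the cells first and then stepping back up to the row level.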

You can use XPath in place of the CSS accessors, like this:

//div[@class='listeventtitlelarge']

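Keep in mind, however, that this compares the full text of the class attribute: an element with several classes, such as class="listeventtitlelarge foobar", is not matched, while a substring test such as contains(@class, 'listeventtitlelarge') would also capture similarly named classes such as listeventtitlelargefoobar. Either way, you can refine the match with some simple string functions or a custom regex function, or just avoid class names that are too similar. Alternatively, you can use the "XPath CSS class matching" technique described by Pivotal Labs.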

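It looks like you're parsing by walking the whole structure rather than using the classes given in the document. I'd use the CSS classes the document's author put in, like this: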

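require 'open-uri'

# Fetch the page first; Nokogiri::HTML(url) would parse the URL string itself.
page = Nokogiri::HTML(URI.open(url))
eventdate = page.at_css("p.listeventdate").content
eventtime = page.at_css("p.listeventtime").content
details   = page.at_css("div.listeventdetails").content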

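If you're doing this on a larger document that will return multiple results, use css and iterate over the results instead of using at_css. The latter returns only the first instance of that tag and class.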

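It looks like everything you want has a selector that is more meaningful than the direct path. Using those selectors also makes the code more resilient: if the site changes the structure but keeps the same classes, your parsing still works.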

