Nokogiri and XPath parsing an HTML table

I can set up parsing and connect to a site, but when I run the script it returns an empty NodeSet:

require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'ap'

time = Time.new

url = <<-EOS
'http://www.events.psu.edu/cgi-bin/cal/webevent.cgi?cmd=listday&y=%d&m=%d&d=%d&cat=&sib=1&sort=m,e,t&ws=0&cf=list&set=1&swe=1&sa=1&de=1&tf=0&sb=1&stz=Default&cal=cal299' % [time.year, time.month, time.day]
EOS

page = Nokogiri::HTML(url)

rows =  page.xpath('/html/body/p/table/tbody/tr/td[3]/p/table/tbody/tr[2]')
details = rows.collect do |row|
detail = {}
[
 [:time, 'td[3]/p/text()'],
 [:name, 'td[4]/div/a/b/font/text()'],
 [:location, 'td[4]/div[2]/text()'],
 [:details, 'td[4]/div[4]/text()'],
].collect do |name, xpath|

detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
ap details

The returned value is "[]".

This is the HTML of the table at /html/body/p/table/tbody/tr/td[3]/p:

<TABLE BORDER=0 CELLPADDING=3 WIDTH="100%">

<!--Begin Event-->
<TR>
  <TD WIDTH="2%">
    <P></P>
  </TD>
  <TD WIDTH="10%">
    <P></P>
  </TD>
  <TD WIDTH="19%">

    <P></P>
  </TD>
  <TD WIDTH="60%">
    <P></P>
  </TD>
</TR>
<TR>
<!--Icon Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="2%">
    <P CLASS="listeventicon">&nbsp;</P>

  </TD>
<!--Date Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="10%">
    <P CLASS="listeventdate">Mar 14</P>
  </TD>
<!--Time Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="19%">
    <P CLASS="listeventtime">8:30 a.m. - 4:30 p.m.<BR>
</P>

  </TD>
<!--Main Event Section-->
  <TD CLASS="listeventbg" VALIGN=top WIDTH="60%">
<div class=listeventtitlelarge><A HREF="http://www.pennstatehershey.org/web/diabetesresearch/home">
<B><font color="#0000CC">2011 Diabetes and Obesity Research Spring Summit</FONT></B></A>
</div>
<div class=listeventtitle><B>Calendar:</B> HHD Seminars<BR>
<B>Posted by:</B> <A HREF="mailto:luk10%40psu.edu">Lauren Kipp</A><BR><B>Location:</B> The Nittany Lion Inn<BR>

</div>
<DIV CLASS="listeventspacer"> </DIV>
<DIV CLASS="listeventdetails">
<B>Details:</B><BR>Registration and Abstract Deadline: February 15, 2011<BR>        <BR>Registration: Please follow the link for more details and access to on line registration. Space is limited, so please register early to ensure your seat at the conference.<BR><BR>The Keynote Speaker for this year’s event is <b>Dr. Robert Sherwin, from the Yale School of Medicine.</b>  Dr. Sherwin is known for his research in the effect of insulin on brain function and immune mechanisms leading to type 1 diabetes.  The topic of his presentation is <i>Pathophysiological Mechanisms in Diabetes, from Laboratory to Bedside.</i><BR><BR>A welcome to the University Park campus will be offered by <b>Eugene Marsh, MD</b>, Senior Associate Dean for the Penn State College of Medicine Regional Medical Campus and Associate Director of the Penn State Hershey Medical Group in State College<BR><BR>Abstract Submission<BR>Please follow the link for formatting details and to register your intent to submit an abstract using the on-line form.<BR>    <BR>You will receive a confirmation immediately upon submission of your on-line form. Subsequently, the final formatted abstract must be sent directly to Continuing Ed by email attachment (see website instructions). Within 48 hours of sending your abstract in final format, you will receive an email confirmation from ContinuingEd@hmc.psu.edu indicating that both your form & the abstract attachment have been received.<BR><BR><i>All abstracts will be considered for poster presentations. A subset of these abstracts will be selected and invited for brief oral presentations during the “Poster Headlines” plenary sessions. To be considered for an oral presentation, please be sure to meet the submission deadline for submission of your final abstract. Prizes will be awarded for the top three posters from by post-doc/fellow/student presenters.</i>

</div>
</TD>
<!--EndEvent-->
......Followed by more of the same format

I am trying to get the name of the event, the time, the location and the description of the event.

This is a simplified version of how I'd go about it:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'ap'

time = Time.new

url = 'http://www.events.psu.edu/cgi-bin/cal/webevent.cgi?cmd=listday&y=%d&m=%d&d=%d&cat=&sib=1&sort=m,e,t&ws=0&cf=list&set=1&swe=1&sa=1&de=1&tf=0&sb=1&stz=Default&cal=cal299' % [time.year, time.month, time.day]

# URI.open comes from open-uri; on Ruby 3.0+ a bare open(url) no longer fetches URLs.
page = Nokogiri::HTML(URI.open(url))

details = page.search('//tr/td[@class="listeventbg"]/..').map do |row|
  time     = row.at( 'p.listeventtime'         ).text.strip rescue ''
  name     = row.at( 'div.listeventtitlelarge' ).text.strip rescue ''
  location = row.at( 'div.listeventtitle'      ).text.strip rescue ''
  details  = row.at( 'div.listeventdetails'    ).text.strip rescue ''

  {
    :time     => time,
    :name     => name,
    :location => location,
    :details  => details
  }
end

ap details

Rather than relying on long XPath accessors, it's often easier to break down the search. This loops over the rows, then, for each row, does a simple lookup for the cells.

Normally I wouldn't use rescue '', but for quick and dirty code it's OK. For production I'd set up real exception handling.

Your sample code required Mechanize but didn't use it, so I removed it for this example. It also didn't include a way for Nokogiri to retrieve the HTML, so I added Open-URI.

Nokogiri allows use of both CSS and XPath accessors. A lot of the time CSS will result in a simpler search. XPath has more power, but that can come at the price of complexity. //tr/td[@class="listeventbg"]/.. looks for rows containing cells with that class, then steps back up to the row level.

You can use XPath instead of the CSS accessor like so:

//div[@class='listeventtitlelarge']

but remember that this compares the whole class attribute as one string, so an element carrying an additional class, such as class="listeventtitlelarge foobar", will not match, while a contains() test goes the other way and also catches a name like listeventtitlelargefoobar. In any case, you can adjust it with a few simple string or regex functions, or just avoid class names that are too similar. Or you could go with "XPATH CSS CLASS MATCHING" from the Pivotal guys.

It looks like you are navigating by the document's structure instead of using the classes that are given in the document. I would use the CSS classes the document's creator put in, like this:

# Needs require 'open-uri'; Nokogiri::HTML(url) would parse the URL string itself.
page = Nokogiri::HTML(URI.open(url))
eventdate = page.at_css("p.listeventdate").content
eventtime = page.at_css("p.listeventtime").content
details =   page.at_css("div.listeventdetails").content

If you are doing this on a larger document, where multiple results will be returned, use css and iterate through the results instead of using at_css. The latter only finds the first instance of the tag and class.

It looks like everything you want has a selector that makes more sense than the direct path. Using the selectors also makes the code more resilient to change: if they change the structure but keep the same classes, your parsing still works.
