How to parse a date using Nokogiri in Ruby

I am trying to parse this page and pull the date that follows

<p>From Date:

I get the error:

Invalid predicate: //b[text() = '<p>From Date: ' (Nokogiri::XML::XPath::SyntaxError)

The XPath from "inspect element" is:

/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p

This is an example of the code:

#/usr/bin/ruby

require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
noko.xpath("//b[text() = '<p>From Date: ").each do |b|
puts b.next_sibling.content.strip
end

This is file://china.html:


  
 
  
  
  
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

    <html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        
        <title>File </title>

    
      </head>
      <body>
        
            <div id ="timelineItems">
    <H2 id="telegram1"> Title </H2>
            <p><table cellspacing="0">
    <tr>
    <td width="2%">&nbsp;</td>
    <td width="75%">
    <table cellspacing="0" cellpadding="0" class="resultsTypes">
    <tr>
    <td width="5%" class="hide">&nbsp;</td>
    <td width="70%">
    <p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
    <p>Title: <a href="http://www.bing.com" title=""><span class="bidi">Meeting in China</span></a></p>

    <p>recipient: David Ben Gurion</p>
    <p>sender: Prime Minister of Union of Burma, Rangoon</p>
    <p>  Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
    <p>From Date: 02/14/1936</p>
    <p>Link to file: <span class="bidi">תיק התכתבות  1956 ינואר</span></p>
    </td>
    </tr>
    <tr>
    <td colspan="2">
    </td>
    </tr>
    </table></td>
    <td class="actions">&nbsp;</td>
    </tr>
    </table>

    </p>
          </div>
          
    
    </body></html>

Amadan's answer, saved as original.rb:

#/usr/bin/ruby

require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
date = noko.at_xpath("//p[starts-with(text(),'From Date: ')]").text()
puts date
formatted = date[/From Date: (.*)/, 1]
puts formatted

gives an error:

original.rb:5:in '<main>': undefined method 'text' for nil:NilClass (NoMethodError)

You can't use

noko = Nokogiri::HTML('china.html')

Nokogiri::HTML is a shortcut to Nokogiri::HTML::Document.parse. The documentation says:

.parse(string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML) {|options| ... } ⇒ Object

... string_or_io may be a String, or any object that responds to read and close such as an IO, or StringIO. ...

While 'china.html' is a String, it's not HTML. It appears you're thinking that a filename will suffice; however, Nokogiri doesn't open anything. It only understands strings containing markup, either HTML or XML, or an IO-type object that responds to the read method. Compare these:

require 'nokogiri'

doc = Nokogiri::HTML('china.html')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>china.html</p></body></html>\n"

versus:

doc = Nokogiri::HTML('<html><body><p>foo</p></body></html>')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"

and:

require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://www.example.org'))
doc.to_html[0..99]
# => "<!DOCTYPE html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset=\"utf-8\">\n    <met"

The last works because OpenURI provides URI.open, which fetches the URL and returns an object that responds to read:

URI.open('http://www.example.org').respond_to?(:read) # => true

Moving on to the question:

require 'nokogiri'
require 'open-uri'

html = <<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    <title>File </title>


  </head>
  <body>

        <div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
        <p><table cellspacing="0">
<tr>
<td width="2%">&nbsp;</td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide">&nbsp;</td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <a href="http://www.bing.com" title=""><span class="bidi">Meeting in China</span></a></p>

<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p>  Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות  1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions">&nbsp;</td>
</tr>
</table>

</p>
      </div>


</body></html>
EOT

doc = Nokogiri::HTML(html)

Once the document is parsed, it's easy to find a particular <p> tag using the

<table cellspacing="0" cellpadding="0" class="resultsTypes">

as a placemarker:

from_date = doc.at('table.resultsTypes p[6]').text
# => "From Date: 02/14/1936"

It looks like it's going to be tougher pulling the title = "Meeting in China" and link = "bing.com", since they are on the same line.

I'm using CSS selectors to define the path to the desired text. CSS is more easily read than XPath, though XPath is more powerful and descriptive. Nokogiri allows us to use either, and lets us use search or at with either. at is equivalent to search('some selector').first. There are also CSS- and XPath-specific versions of search and at, described in Nokogiri::XML::Node.

title_link = doc.at('table.resultsTypes p[2] a')['href'] # => "http://www.bing.com"
title = doc.at('table.resultsTypes p[2] span').text # => "Meeting in China"

You're trying to use the XPath:

/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p

however, it's not valid for the HTML you're working with.

Notice tbody in the selector. Look at the HTML: neither <table> tag is immediately followed by a <tbody> tag, so the XPath is wrong. I suspect it was generated by your browser, which fixes up the HTML by adding <tbody> according to the specification. Nokogiri doesn't do that fix-up, so the selector doesn't match the HTML and the search fails. So, don't rely on the selector defined by the browser, nor should you trust the browser's idea of the actual HTML source.


Instead of using an explicit selector, it's better, easier, and smarter to look for specific way-points in the markup, and use those to navigate to the node(s) you want. Here's an example of doing everything above, only using a placeholder, and a mix of XPath and CSS:

doc.at('//p[starts-with(., "Title:")]').text  # => "Title: Meeting in China"
title_node = doc.at('//p[starts-with(., "Title:")]')
title_url = title_node.at('a')['href'] # => "http://www.bing.com"
title = title_node.at('span').text # => "Meeting in China"

So, it's fine to mix and match CSS and XPath.

from_date = noko.at_xpath('//p[starts-with(text(), "From Date:")]').text()
date = from_date[/From Date: (.*)/, 1]
# => "02/14/1936"
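
If an actual date object is wanted rather than a string, the extracted text can be handed to Ruby's standard Date class. This is a sketch that starts from the paragraph text as a literal; the %m/%d/%Y format is an assumption inferred from the sample value, where 14 cannot be a month:

```ruby
require 'date'

# Text of the matched <p>, written out literally so the sketch runs on its own.
from_date_text = 'From Date: 02/14/1936'

# Same capture-group extraction as above.
date_string = from_date_text[/From Date: (.*)/, 1]

# Assumed format: month/day/four-digit year.
parsed_date = Date.strptime(date_string, '%m/%d/%Y')

puts parsed_date.iso8601 # prints 1936-02-14
```

Date.strptime raises an error on input that does not fit the format, which is usually preferable to silently guessing the field order.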

EDIT:

Explanation: Get the first node (#at_xpath) anywhere in the document (//) such that ([...]) its text content (text()) starts with (starts-with(string, stringStart)) "From Date:", and take its text content (#text()), storing it (=) into the variable from_date. Then, extract the first group (#[regexp, 1]) from that text (from_date) by using the regular expression (/.../) that matches the literal characters "From Date: ", followed by any number (*) of any characters (.), which are captured ((...)) in the first capture group to be extracted by #[regexp, 1].

Also,

Amadan's answer [...] gives an error

I did not notice that your Nokogiri construction is broken, as explained by the Tin Man. The line noko = Nokogiri::HTML('china.html') (which was not a part of my answer) will give you a document whose only content is the text "china.html", with no <p> node that starts with "From Date:". at_xpath therefore returns nil, and calling text on nil raises the NoMethodError.
