
How to parse a date using Nokogiri in Ruby

I am trying to parse this page and pull the date that begins after

<p>From Date:

I get the error

Invalid predicate: //b[text() = '<p>From Date: ' (Nokogiri::XML::XPath::SyntaxError)

The xpath from "inspect element" is

/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p

This is an example of the code:

#/usr/bin/ruby

require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
noko.xpath("//b[text() = '<p>From Date: ").each do |b|
puts b.next_sibling.content.strip
end

This is file://china.html


  
 
  
  
  
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

    <html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        
        <title>File </title>

    
      </head>
      <body>
        
            <div id ="timelineItems">
    <H2 id="telegram1"> Title </H2>
            <p><table cellspacing="0">
    <tr>
    <td width="2%">&nbsp;</td>
    <td width="75%">
    <table cellspacing="0" cellpadding="0" class="resultsTypes">
    <tr>
    <td width="5%" class="hide">&nbsp;</td>
    <td width="70%">
    <p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
    <p>Title: <a href="http://www.bing.com" title=""><span class="bidi">Meeting in China</span></a></p>

    <p>recipient: David Ben Gurion</p>
    <p>sender: Prime Minister of Union of Burma, Rangoon</p>
    <p>  Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
    <p>From Date: 02/14/1936</p>
    <p>Link to file: <span class="bidi">תיק התכתבות  1956 ינואר</span></p>
    </td>
    </tr>
    <tr>
    <td colspan="2">
    </td>
    </tr>
    </table></td>
    <td class="actions">&nbsp;</td>
    </tr>
    </table>

    </p>
          </div>
          
    
    </body></html>

Amadan's answer, saved as original.rb:

#/usr/bin/ruby
require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
date = noko.at_xpath("//p[starts-with(text(),'From Date: ')]").text()
puts date
formatted = date[/From Date: (.*)/, 1]
puts formatted

gives an error:

original.rb:5:in '<main>': undefined method 'text' for nil:NilClass (NoMethodError)

You can't use

noko = Nokogiri::HTML('china.html')

Nokogiri::HTML is a shortcut to Nokogiri::HTML::Document.parse . The documentation says:

.parse(string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML) {|options| ... } ⇒ Object

... string_or_io may be a String, or any object that responds to read and close such as an IO, or StringIO. ...

While 'china.html' is a String, it's not HTML. It looks like you expected a filename to suffice, but Nokogiri doesn't open files for you. It only understands strings containing markup (HTML or XML) or an IO-type object that responds to the read method. Compare these:

require 'nokogiri'

doc = Nokogiri::HTML('china.html')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>china.html</p></body></html>\n"

versus:

doc = Nokogiri::HTML('<html><body><p>foo</p></body></html>')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"

and:

require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://www.example.org'))
doc.to_html[0..99]
# => "<!DOCTYPE html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset=\"utf-8\">\n    <met"

The last works because OpenURI provides URI.open, which fetches the URL and returns an object that responds to read (on Ruby versions before 3.0, a bare open did the same once open-uri was loaded):

URI.open('http://www.example.org').respond_to?(:read) # => true

Moving on to the question:

require 'nokogiri'
require 'open-uri'

html = <<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    <title>File </title>


  </head>
  <body>

        <div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
        <p><table cellspacing="0">
<tr>
<td width="2%">&nbsp;</td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide">&nbsp;</td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <a href="http://www.bing.com" title=""><span class="bidi">Meeting in China</span></a></p>

<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p>  Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות  1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions">&nbsp;</td>
</tr>
</table>

</p>
      </div>


</body></html>
EOT

doc = Nokogiri::HTML(html)

Once the document is parsed, it's easy to find a particular <p> tag using the

<table cellspacing="0" cellpadding="0" class="resultsTypes">

as a placemarker:

from_date = doc.at('table.resultsTypes p[6]').text
# => "From Date: 02/14/1936"

It looks like it's going to be tougher to pull the title ("Meeting in China") and the link ("bing.com"), since they are on the same line.

I'm using CSS selectors to define the path to the desired text. CSS is more easily read than XPath, though XPath is more powerful and descriptive. Nokogiri allows us to use either, and lets us use search or at with either. at is equivalent to search('some selector').first. There are also CSS- and XPath-specific versions of search and at, described in Nokogiri::XML::Node.

title_link = doc.at('table.resultsTypes p[2] a')['href'] # => "http://www.bing.com"
title = doc.at('table.resultsTypes p[2] span').text # => "Meeting in China"

You're trying to use the XPath:

/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p

however, it's not valid for the HTML you're working with.

Notice tbody in the selector. Look at the HTML: neither <table> tag is immediately followed by a <tbody> tag, so the XPath is wrong. I suspect it was generated by your browser, which fixes up the HTML by adding <tbody> according to the specification. Nokogiri doesn't add <tbody>, so the path doesn't match the document and the search fails. So don't rely on the selector defined by the browser, and don't trust the browser's idea of the actual HTML source.


Instead of using an explicit selector, it's better, easier, and smarter to look for specific way-points in the markup and use those to navigate to the node(s) you want. Here's an example of doing everything above using only a placemarker and a mix of XPath and CSS:

doc.at('//p[starts-with(., "Title:")]').text  # => "Title: Meeting in China"
title_node = doc.at('//p[starts-with(., "Title:")]')
title_url = title_node.at('a')['href'] # => "http://www.bing.com"
title = title_node.at('span').text # => "Meeting in China"

So, it's fine to mix and match CSS and XPath.

from_date = noko.at_xpath('//p[starts-with(text(), "From Date:")]').text()
date = from_date[/From Date: (.*)/, 1]
# => "02/14/1936"

EDIT:

Explanation: Get the first node (#at_xpath) anywhere in the document (//) such that ([...]) its text content (text()) starts with (starts-with(string, stringStart)) "From Date:", and take its text content (#text()), storing it (=) in the variable from_date. Then extract the first group (#[regexp, 1]) from that text (from_date) using the regular expression (/.../) that matches the literal characters "From Date: ", followed by any number (*) of any characters (.), which are captured ((...)) in the first capture group to be extracted by #[regexp, 1].

Also,

Amadan's answer [...] gives an error

I did not notice that your Nokogiri construction is broken, as explained by the Tin Man. The line noko = Nokogiri::HTML('china.html') (which was not a part of my answer) will give you a document that contains only the text "china.html" (wrapped in a single <p>, as the Tin Man's output shows). No <p> in it starts with "From Date:", so at_xpath returns nil.
