I am trying to parse this page and pull the date that begins after
<p>From Date:
I get the error
Invalid predicate: //b[text() = '<p>From Date: ' (Nokogiri::XML::XPath::SyntaxError)
The xpath from "inspect element" is
/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p
This is an example of the code:
#/usr/bin/ruby
require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
noko.xpath("//b[text() = '<p>From Date: ").each do |b|
puts b.next_sibling.content.strip
end
This is file://china.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>File </title>
</head>
<body>
<div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
<p><table cellspacing="0">
<tr>
<td width="2%"> </td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide"> </td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <a href="http://www.bing.com" title=""><span class="bidi">Meeting in China</span></a></p>
<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p> Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות 1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions"> </td>
</tr>
</table>
</p>
</div>
</body></html>
Amadan's answer original.rb
#/usr/bin/ruby
require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
date = noko.at_xpath("//p[starts-with(text(),'From Date: ')]").text()
puts date
formatted = date[/From Date: (.*)/, 1]
puts formatted
original.rb:5:in '<main>': undefined method 'text' for nil:NilClass (NoMethodError)
You can't use
noko = Nokogiri::HTML('china.html')
Nokogiri::HTML is a shortcut to Nokogiri::HTML::Document.parse. The documentation says:
.parse(string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML) {|options| ... } ⇒ Object
...
string_or_io may be a String, or any object that responds to read and close such as an IO, or StringIO. ...
While 'china.html' is a String, it's not HTML. It appears you're thinking that a filename will suffice; however, Nokogiri doesn't open anything. It only understands strings containing markup, either HTML or XML, or an IO-type object that responds to the read method. Compare these:
require 'nokogiri'
doc = Nokogiri::HTML('china.html')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>china.html</p></body></html>\n"
versus:
doc = Nokogiri::HTML('<html><body><p>foo</p></body></html>')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"
and:
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.example.org'))
doc.to_html[0..99]
# => "<!DOCTYPE html>\n<html>\n<head>\n <title>Example Domain</title>\n\n <meta charset=\"utf-8\">\n <met"
The last works because OpenURI provides URI.open (plain open in Ruby versions before 3.0), which can read URLs and returns an object that responds to read:
URI.open('http://www.example.org').respond_to?(:read) # => true
Moving on to the question:
require 'nokogiri'
require 'open-uri'
html = <<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>File </title>
</head>
<body>
<div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
<p><table cellspacing="0">
<tr>
<td width="2%"> </td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide"> </td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <a href="http://www.bing.com" title=""><span class="bidi">Meeting in China</span></a></p>
<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p> Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות 1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions"> </td>
</tr>
</table>
</p>
</div>
</body></html>
EOT
doc = Nokogiri::HTML(html)
Once the document is parsed, it's easy to find a particular <p> tag using the
<table cellspacing="0" cellpadding="0" class="resultsTypes">
as a placemarker:
from_date = doc.at('table.resultsTypes p[6]').text
# => "From Date: 02/14/1936"
It looks like it's going to be tougher pulling the title = "Meeting in China" and link = "bing.com", since they are on the same line.
I'm using CSS selectors to define the path to the desired text. CSS is more easily read than XPath, though XPath is more powerful and descriptive. Nokogiri allows us to use either, and lets us use search or at with either. at is equivalent to search('some selector').first. There are also CSS- and XPath-specific versions of search and at, described in Nokogiri::XML::Node.
title_link = doc.at('table.resultsTypes p[2] a')['href'] # => "http://www.bing.com"
title = doc.at('table.resultsTypes p[2] span').text # => "Meeting in China"
You're trying to use the XPath:
/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p
However, it's not valid for the HTML you're working with.
Notice tbody in the selector. Look at the HTML immediately after either of the <table> tags: neither occurrence has a <tbody> tag, so the XPath is wrong. I suspect it was generated by your browser, which does a fix-up of the HTML to add <tbody> according to the specification. Nokogiri doesn't do that fix-up, so the HTML doesn't match and the search fails. So, don't rely on the selector defined by the browser, nor should you trust the browser's idea of the actual HTML source.
Instead of using an explicit selector, it's better, easier, and smarter to look for specific way-points in the markup and use those to navigate to the node(s) you want. Here's an example of doing everything above, only using a placeholder and a mix of XPath and CSS:
doc.at('//p[starts-with(., "Title:")]').text # => "Title: Meeting in China"
title_node = doc.at('//p[starts-with(., "Title:")]')
title_url = title_node.at('a')['href'] # => "http://www.bing.com"
title = title_node.at('span').text # => "Meeting in China"
So, it's fine to mix and match CSS and XPath.
from_date = noko.at_xpath('//p[starts-with(text(), "From Date:")]').text()
date = from_date[/From Date: (.*)/, 1]
# => "02/14/1936"
EDIT:
Explanation: Get the first node (#at_xpath) anywhere in the document (//) such that ([...]) text content (text()) starts with (starts-with(string, stringStart)) "From Date" ("From Date:"), and take its text content (#text()), storing it (=) into the variable from_date (from_date). Then, extract the first group (#[regexp, 1]) from that text (from_date) by using the regular expression (/.../) that matches the literal characters "From Date: ", followed by any number (*) of any characters (.), that will be captured ((...)) in the first capture group to be extracted by #[regexp, 1].
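The regular-expression step can be tried on its own, without Nokogiri. This is a minimal sketch that starts from the string the XPath lookup is expected to return; converting the result to a Date is an optional extra step that is not part of the original answer, and it assumes the month/day/year layout seen in the sample:

```ruby
require 'date'

# The text the XPath lookup is expected to return for the sample page.
from_date = 'From Date: 02/14/1936'

# String#[] with a regexp and a group index returns that capture group.
date_text = from_date[/From Date: (.*)/, 1]

# When the pattern does not match, the same call returns nil.
no_match = 'recipient: David Ben Gurion'[/From Date: (.*)/, 1]

# Optional: turn the captured text into a Date (assumes month/day/year).
parsed = Date.strptime(date_text, '%m/%d/%Y')

puts date_text
puts no_match.inspect
puts parsed.iso8601
```

Because a failed match yields nil rather than raising, it is worth checking the result before passing it on to Date.strptime.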
Also,
Amadan's answer [...] gives an error
I did not notice that your Nokogiri construction is broken, as explained by the Tin Man. The line noko = Nokogiri::HTML('china.html') (which was not a part of my answer) will give you a single-node document that only has the text "china.html" in it, and no <p> nodes at all.