I am parsing a document with Nokogiri, using XPath. I am interested in the contents of a list whose structure is:
<ul>
  <li>
    <div>
      <!-- Some data I'm not interested in -->
    </div>
    <span>
      <a href="some_url">A name I already got easily</a>
      <br>
      Some text I need to get but just can't
    </span>
  </li>
  <li>
    <div>
      <!-- Some data I'm not interested in again -->
    </div>
    <span>
      <a href="some_other_url">Another name I already got easily</a>
      <br>
      Some other text I need to get but just can't
    </span>
  </li>
  .
  .
  .
</ul>
I'm doing this using:
require 'ostruct' # needed for OpenStruct

politicians = Array.new
rows = doc.xpath('//ul/li')
rows.each do |row|
  politician = OpenStruct.new
  politician.name = row.at_xpath('span/a/text()').to_s.strip.upcase
  politician.url = row.at_xpath('span/a/@href').to_s.strip
  politician.party = row.at_xpath('span').to_s.strip
  politicians.push(politician)
end
This works fine for politician.name and politician.url, but when it comes to politician.party, which is the text after the <br> tag, I can't isolate the text. Using row.at_xpath('span').to_s.strip gives me all the contents of the <span> tag, including the other HTML elements.
Any suggestions about how to get this text?
span/text() comes back empty because the first text node within the <span> is whitespace (a newline and spaces) located between the opening <span> tag and the <a> element. Try using the following XPath instead:
span/text()[normalize-space()]
This XPath returns only the non-empty text nodes that are direct children of the <span>, so at_xpath will give you the text after the <br>.
I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span>
<a href="some_other_url">Another name I already got easily</a>
<br>
Some other text I need to get but just can't
</span>
EOT
doc.at('span br').next.text # => "\n Some other text I need to get but just can't\n"
or
doc.at('//span/br').next.text # => "\n Some other text I need to get but just can't\n"
Cleaning that resulting string is easy:
"\n Some other text I need to get but just can't\n".strip # => "Some other text I need to get but just can't"
The problem with your code is that you're not looking deeply enough into the DOM to get what you want, and you're calling the wrong method on the node you do find:
doc.at_xpath('//span').to_s # => "<span>\n <a href=\"some_other_url\">Another name I already got easily</a>\n <br>\n Some other text I need to get but just can't\n</span>"
to_s is the same as to_html and returns the node as it was in the original markup. Using text will get rid of the tags, which gets you closer, but, again, you're standing too far back:
doc.at_xpath('//span').text # => "\n Another name I already got easily\n \n Some other text I need to get but just can't\n"
Because <br> isn't a container you can't get its text, but you can still use it to navigate, then get the next node, which is the Text node, and retrieve it:
doc.at('span br').next.class # => Nokogiri::XML::Text
When parsing XML/HTML, it's really important to point to the actual node you want, and then use the appropriate method. Failing to do that will force you to jump through hoops trying to get the actual data you want.
Putting that all together, I'd do something like:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span>
<a href="some_other_url">Another name I already got easily</a>
<br>
Some other text I need to get but just can't
</span>
EOT
data = doc.search('span').map{ |span|
  name = span.at('a').text
  url = span.at('a')['href']
  party = span.at('br').next.text.strip
  {
    name: name,
    url: url,
    party: party
  }
}
# => [{:name=>"Another name I already got easily", :url=>"some_other_url", :party=>"Some other text I need to get but just can't"}]
You can fold/spindle/mutilate to bend it to your will.
Finally, don't do search('//path/to/some/node/text()').text. You're wasting keypresses and CPU:
doc = Nokogiri::HTML(<<EOT)
<p>
Some other text I need to get but just can't
</p>
EOT
doc.at('//p') # => #<Nokogiri::XML::Element:0x3fed0841edf0 name="p" children=[#<Nokogiri::XML::Text:0x3fed0841e918 "\n Some other text I need to get but just can't\n">]>
doc.at('//p/text()') # => #<Nokogiri::XML::Text:0x3fed0841e918 "\n Some other text I need to get but just can't\n">
text() returns a text node, but it doesn't return the text. As a result you're forced to do:
doc.at('//p/text()').text # => "\n Some other text I need to get but just can't\n"
Instead, point at what you want and tell Nokogiri to get it:
doc.at('//p').text # => "\n Some other text I need to get but just can't\n"
XPath can point to the node, but that doesn't help when we want the text, so simplify the selector.