How to get only the text of an element which contains other elements with XPath?

Question

I am parsing a document with Nokogiri, using XPath. I am interested in the contents of a list whose structure is:

<ul>
  <li>
    <div>
      <!-- Some data I'm not interested in -->
    </div>
    <span>
      <a href="some_url">A name I already got easily</a>
      <br>
      Some text I need to get but just can't
    </span>
  </li>
  <li>
    <div>
      <!-- Some data I'm not interested in again -->
    </div>
    <span>
      <a href="some_other_url">Another name I already got easily</a>
      <br>
      Some other text I need to get but just can't
    </span>
  </li>
  .
  .
  .
</ul>

I'm doing this using:

politicians = Array.new
rows = doc.xpath('//ul/li')
rows.each do |row|
  politician = OpenStruct.new
  politician.name = row.at_xpath('span/a/text()').to_s.strip.upcase
  politician.url = row.at_xpath('span/a/@href').to_s.strip
  politician.party = row.at_xpath('span').to_s.strip
  politicians.push(politician)
end

This works fine for politician.name and politician.url, but when it comes to politician.party, which is the text after the <br> tag, I can't isolate the text. Using

row.at_xpath('span').to_s.strip

gives me all the contents of the <span> tag, including the other HTML elements.

Any suggestions about how to get this text?

Answer 1

span/text() returns empty because the first text node within the <span> is whitespace (a newline and spaces) located between the opening <span> tag and the <a> element. Try using the following XPath instead:

span/text()[normalize-space()]

This XPath should return only the non-empty text nodes that are direct children of the <span>.

Answer 2

I'd do it like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<span>
  <a href="some_other_url">Another name I already got easily</a>
  <br>
  Some other text I need to get but just can't
</span>
EOT

doc.at('span br').next.text # => "\n  Some other text I need to get but just can't\n"

or

doc.at('//span/br').next.text # => "\n  Some other text I need to get but just can't\n"

Cleaning that resulting string is easy:

"\n  Some other text I need to get but just can't\n".strip # => "Some other text I need to get but just can't"

The problem your code has is you're not looking deeply enough into the DOM to get what you want, plus you're doing the wrong thing:

doc.at_xpath('//span').to_s # => "<span>\n  <a href=\"some_other_url\">Another name I already got easily</a>\n  <br>\n  Some other text I need to get but just can't\n</span>"

to_s is the same as to_html and returns the node as it was in the original markup. Using text will get rid of the tags, which gets you closer, but, again, you're standing too far back:

doc.at_xpath('//span').text # => "\n  Another name I already got easily\n  \n  Some other text I need to get but just can't\n"

Because <br> isn't a container you can't get its text, but you can still use it to navigate: get the next node, which is the Text node, and retrieve that:

doc.at('span br').next.class # => Nokogiri::XML::Text

When parsing XML/HTML, it's really important to point to the actual node you want, and then use the appropriate method. Failing to do that will force you to jump through hoops trying to get the actual data you want.

Putting that all together, I'd do something like:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<span>
  <a href="some_other_url">Another name I already got easily</a>
  <br>
  Some other text I need to get but just can't
</span>
EOT

data = doc.search('span').map{ |span|
  name = span.at('a').text
  url = span.at('a')['href']
  party = span.at('br').next.text.strip

  {
    name: name,
    url: url,
    party: party
  }
}
# => [{:name=>"Another name I already got easily", :url=>"some_other_url", :party=>"Some other text I need to get but just can't"}]

You can fold/spindle/mutilate to bend it to your will.

Finally, don't do search('//path/to/some/node/text()').text. You're wasting keypresses and CPU:

doc = Nokogiri::HTML(<<EOT)
<p>
  Some other text I need to get but just can't
</p>
EOT

doc.at('//p')        # => #<Nokogiri::XML::Element:0x3fed0841edf0 name="p" children=[#<Nokogiri::XML::Text:0x3fed0841e918 "\n  Some other text I need to get but just can't\n">]>
doc.at('//p/text()') # => #<Nokogiri::XML::Text:0x3fed0841e918 "\n  Some other text I need to get but just can't\n">

text() returns a text node, but it doesn't return the text.

As a result you're forced to do:

doc.at('//p/text()').text # => "\n  Some other text I need to get but just can't\n"

Instead, point at what you want and tell Nokogiri to get it:

doc.at('//p').text  # => "\n  Some other text I need to get but just can't\n"

XPath can point to the node, but that doesn't help when we want the text, so simplify the selector.


2 answers

solution1
4 ACCEPTED 2016-05-07 23:03:48

solution2
1 2016-05-09 17:43:04
