
Parse HTML with Nokogiri and get text with the closest “<div>”

I have HTML containing:

<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
   </div>
</div> 

I need to get all the text whose closest <div> has the class "s".

For example, I'm trying this:

array = []
html.css(".s").each do |element|
  array << element.text.strip
end

This works, except that "text2" also appears in my array, and I don't want it. The closest <div> to "text2" has class "i", so it should not end up in my array.

How can I resolve this? There can be different class names, and deeper nesting, for example:

<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
      <div class = "s">
         <p> text3 </p>
         <div class = "p">
           <p> text4 </p>
         </div>
      </div> 
   </div>
</div> 

From this, I want to get an array with: ["text1", "text3"]

This is a better XPath-only answer. My original answer is below.

# Given a Nokogiri::HTML document in the `html` variable:
html.xpath("//text()[normalize-space() and ancestor::div[1][@class='s']]").map(&:text).map(&:strip)

This just finds all non-blank text nodes whose nearest div ancestor has a class of s. It's the same idea as my original answer, except it's done entirely in XPath.

<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
   </div>
</div>
# => ["text1"]

<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
      <div class = "s">
         <p> text3 </p>
         <div class = "p">
           <p> text4 </p>
         </div>
      </div>
   </div>
</div>
# => ["text1", "text3"]

<div class = "s">
  <div class='p'>
    text 1
  </div>
  text 2
</div>
# => ["text 2"]

Original answer:

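# Bare String#squeeze collapses every repeated character ("book" -> "bok"),
# so restrict it to spaces and newlines.
html.search("//div[@class='s']//text()").
  select {|t| t.ancestors("div").first.attr("class") == "s" }.
  map(&:text).join.squeeze(" \n").strip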
# => "text1"

The basic idea here is to find all text nodes that descend from div.s, look up the nearest div ancestor of each text node, and keep only the nodes whose nearest div ancestor has a class of s.

It's a bit CPU-intensive, since it looks up the ancestors of every candidate text node in Ruby, but it fulfills the strict requirements.

You can do:

html.css('.s > p').map {|node| node.text.strip }

I'd start with this:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT, &:noblanks)
<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
      <div class = "s">
         <p> text3 </p>
         <div class = "p">
           <p> text4 </p>
         </div>
      </div> 
   </div>
</div> 
EOT

doc.search('.s').map{ |div| div.child.text.strip } 
# => ["text1", "text3"]

I think what makes it difficult to find the appropriate nodes, the children of the ".s" nodes, is that the next node is a Text node containing the "\n" that comes from the HTML formatting. Ignoring such nodes is difficult because the next node might not be text; it might be a node you want returned.

The trick is to tell Nokogiri to strip out blank nodes as it parses the document. That effectively flattens the HTML by removing all indentation, which makes it possible to trust that the next node after a target is one that is wanted.


<div class='s'><div class='i'>foobar</div></div> would cause this technique to fail.

Yep, it would, and would require additional logic to weed those out:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT, &:noblanks)
<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
      <div class = "s">
         <p> text3 </p>
         <div class = "p">
           <p> text4 </p>
         </div>
      </div> 
      <div class='s'><div class='i'>foobar</div></div>
   </div>
</div> 
EOT

Here's the old logic:

doc.search('.s').map{ |div| div.child.text.strip } 
# => ["text1", "text3", "foobar"]

And a quick test to weed out the unwanted:

doc.search('.s').reject{ |div| div.child['class'] == 'i' }.map{ |div| div.child.text.strip } 
# => ["text1", "text3"]
