
Parse HTML with Nokogiri and get text with the closest "<div>"

I have HTML containing:

<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
   </div>
</div> 

I need to get all the text whose closest <div> has the class "s".

For example, I'm trying this:

array = []
html.css(".s").each do |element|
  array << element.text.strip
end

This works, except that "text2" also appears in my array, and I don't want it. The closest <div> to "text2" has the class "i", so it should not be in my array.

How can I resolve this? There can be different class names and deeper nesting, for example:

<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
      <div class = "s">
         <p> text3 </p>
         <div class = "p">
           <p> text4 </p>
         </div>
      </div> 
   </div>
</div> 

From this, I want to get the array ["text1", "text3"].

This is a better, XPath-only answer. My original answer is below.

# Given a Nokogiri::HTML document in the `html` variable:
html.xpath("//text()[normalize-space() and ancestor::div[1][@class='s']]").map(&:text).map(&:strip)

This finds all non-blank text nodes whose nearest div ancestor has a class of s. It does the same thing as my original answer, except that it's done entirely in XPath.

<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
   </div>
</div>
# => ["text1"]

<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
      <div class = "s">
         <p> text3 </p>
         <div class = "p">
           <p> text4 </p>
         </div>
      </div>
   </div>
</div>
# => ["text1", "text3"]

<div class = "s">
  <div class='p'>
    text 1
  </div>
  text 2
</div>
# => ["text 2"]

Original answer:

html.search("//div[@class='s']//text()").
  select {|t| t.ancestors("div").first.attr("class") == "s" }.
  map(&:text).join.squeeze(" \n").strip # squeeze only runs of spaces and newlines
# => "text1"

The basic idea is to find all text nodes that descend from div.s, then find the nearest div ancestor of each text node, and keep only the nodes whose nearest div ancestor has a class of s.

It's a bit CPU-intensive, but it fulfills the strict requirements.

You can do:

html.css('.s > p').map {|node| node.text.strip }

I'd start with this:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT, &:noblanks)
<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
      <div class = "s">
         <p> text3 </p>
         <div class = "p">
           <p> text4 </p>
         </div>
      </div> 
   </div>
</div> 
EOT

doc.search('.s').map{ |div| div.child.text.strip } 
# => ["text1", "text3"]

I think what makes it difficult to find the appropriate nodes (the child of each ".s" node) is that the first child is a Text node containing "\n", because of the HTML formatting. Ignoring those nodes is difficult because they might not be text; they might be a node you want returned.

The trick is to tell Nokogiri to strip out blank nodes as it parses the document. This effectively flattens the HTML by removing all indentation, which makes it possible to trust that the next node after a target is one that is wanted.

<div class='s'><div class='i'>foobar</div></div> would cause this technique to fail.

Yep, it would, and it would require additional logic to weed those out:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT, &:noblanks)
<div class = "s">
   <p> text1 </p>
   <div class = "i">
      <p> text2 </p>
      <div class = "s">
         <p> text3 </p>
         <div class = "p">
           <p> text4 </p>
         </div>
      </div> 
      <div class='s'><div class='i'>foobar</div></div>
   </div>
</div> 
EOT

Here's the old logic:

doc.search('.s').map{ |div| div.child.text.strip } 
# => ["text1", "text3", "foobar"]

And a quick test to weed out the unwanted nodes:

doc.search('.s').reject{ |div| div.child['class'] == 'i' }.map{ |div| div.child.text.strip } 
# => ["text1", "text3"]

