
Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (spans) nested under other elements (divs and spans). Here's an example:

html = <<~HTML
  <html>
    <body>
      <div class="item">
           <div class="profile">
        <span class="itemize">
           <div class="r12321">Plains</div>
            <div class="as124223">Trains</div>
             <div class="qwss12311232">Automobiles</div>
        </span>
        </div>
        <div class="profile">
          <span class="itemize">
            <div class="lknoijojkljl98799999">Love</div>
             <div class="vssdfsd0809809">First</div>
              <div class="awefsaf98098">Sight</div>
          </span>
        </div>
      </div>
    </body>
  </html>
HTML

Notice that the class names are random. Notice also that the HTML contains irregular whitespace and tabs.

I want to extract the children and end up with a hash. This is what I have so far:

require 'nokogiri'

page = Nokogiri::HTML(html)
itemhash = {}
page.css('div.item div.profile span').each do |span|
  span.children.each do |child|
    # Key each child by its class attribute, with its text as the value.
    itemhash[child['class']] = child.text
  end
end

The result should be similar to:

 {"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}

But I'm ending up with a mess like this:

 {nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}

This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated, so I'm trying to work around the issue. I've tried noblanks, but that's not working. I've also tried gsub, but that only destroys my markup.

How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?

P.S. I'm not hung up on Nokogiri, so if another gem can do it better, I'm game.

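The children method returns all child nodes, including text nodes, even when those contain nothing but whitespace. Text nodes have no class attribute, so child['class'] is nil for them, which is where the nil keys in your hash come from.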

To get only the child elements, you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:

children = divs.xpath('./div')

You could also use the element_children method, which is closer to what you are already doing and returns only the children that are elements:

children = divs.element_children
