Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (spans) nested under other elements (divs and spans). Here's an example:

html = <<HTML
<html>
  <body>
    <div class="item">
         <div class="profile">
      <span class="itemize">
         <div class="r12321">Plains</div>
          <div class="as124223">Trains</div>
           <div class="qwss12311232">Automobiles</div>
      </span>
      </div>
      <div class="profile">
        <span class="itemize">
          <div class="lknoijojkljl98799999">Love</div>
           <div class="vssdfsd0809809">First</div>
            <div class="awefsaf98098">Sight</div>
        </span>
      </div>
    </div>
  </body>
</html>
HTML

Notice that the class names are random. Notice also that the HTML contains stray whitespace and tabs.

I want to extract the class and text of each child into a hash. This is what I have so far:

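require 'nokogiri'

page = Nokogiri::HTML(html)
itemhash = Hash.new
# Visit every span under a profile div and store class => text for each child node.
page.css('div.item div.profile span').each do |divs|
  children = divs.children
  children.each do |child|
    itemhash[child['class']] = child.text
  end
end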

Result should be similar to:

{"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}

But I'm ending up with a mess like this:

{nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}

This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated, so I'm trying to work around the issue. I've tried noblanks, but that's not working. I've also tried gsub, but that only destroys my markup.

How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?

PS I'm not hung up on Nokogiri - so if another gem can do it better I'm game.

The children method returns all child nodes, including text nodes, even ones that contain nothing but whitespace.

To get only the child elements, you could run an explicit XPath query (or possibly the equivalent CSS), e.g.:

children = divs.xpath('./div')

You could also use the element_children method, which is closer to what you are already doing and returns only the children that are elements:

children = divs.element_children
