
Parse HTML (without HTML semantics being followed) using Nokogiri

I have an HTML document containing data:

<div>
    <p class="someclass">
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
        </ul>
    </p>
</div>

While parsing, I use:

div_node.children.each do |child|
  if child.node_name == 'p'
    # store it as an HTML string in the DB
    store(child.to_html)
  end
end

When I check the database, I get only the outer <p> tag:

<p class="someclass">
</p>

None of the inner <ul> content is stored or retrieved.

I know that a <p> tag cannot contain a <ul> tag, but the documents we got from the client are structured this way, and there are about 1000 of them, so I cannot edit them manually.

Try using the Nokogiri::XML parser instead of the Nokogiri::HTML one. It shouldn't care about tag semantics, but I'm not sure how it will handle the parts of HTML5 that are not valid XML (for example, unclosed void elements such as <br>).

I ended up using the Nokogiri::XML parser to parse the HTML documents.

I had to change my script in numerous places.

Parsing code

@xml_doc = Nokogiri::XML.parse(file) { |cfg| cfg.noblanks }
@xml_doc.remove_namespaces!

Changes made

  • switched from the attribute method to attr
  • chaining attr with the text method is not needed here
  • still need to check how invalid HTML5 tags are handled, though
  • some other parsing logic changes were needed
  • node.to_html works like a charm here, so I was able to store the complete HTML in the DB

