Parse HTML (without HTML semantics being followed) using Nokogiri

Question

I have an HTML document containing data:

<div>
    <p class="someclass">
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
        </ul>
    </p>
</div>

While parsing, I use:

div_node.children.each do |child|
  if child.node_name == 'p'
    # Store it as an HTML string in the DB
    store(child.to_html)
  end
end

When I check the database, I get only the outer <p> tag:

<p class="someclass">
</p>

No inner <ul> tag content is stored or retrieved.

I know that a <p> tag cannot contain a <ul> tag, but the documents we received from the client contain this markup, and there are about 1000 of them, so I cannot edit them manually.

Answer 1

Try using the Nokogiri::XML parser instead of the Nokogiri::HTML one. It shouldn't care about tag semantics, but I'm not sure how it will handle the parts of HTML5 that are not valid XML.

Answer 2

I ended up using the Nokogiri::XML parser to parse the HTML documents.

I had to change my script in numerous places.

Parsing code

@xml_doc = Nokogiri::XML.parse(file) { |cfg| cfg.noblanks }
@xml_doc.remove_namespaces!

Changes made

Use attr instead of the attribute method.
Chaining attr with the text method is not needed here.
I still need to check how invalid HTML5 tags are handled, though.
Some more parsing logic changes were needed.
node.to_html works like a charm here, so I was able to store the complete HTML in the db.


Answer 1
1 2015-11-19 13:56:31

Answer 2
1 ACCEPTED 2015-11-20 09:37:01
