简体   繁体   中英

How to prevent deletion of the <html> tag in Nokogiri?

I have code like this:

doc = Nokogiri::HTML.fragment(html)
doc.to_html

and an HTML fragment which will be parsed:

<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    <html>
        <p>
            qwerty
        </p>
    </html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>

Nokogiri deletes the <html> </html> tags in the <code> block. How can I prevent this behavior?

UPDATE:

the Tin Man proposed solution, pre parse fragment of html and escape all html in code block

Here some code, it's not beautiful so if you want suggest another solution please post a comment

html.gsub!(/<code\b[^>]*>(.*?)<\/code>/m) do |x|
  "<code>#{CGI.escapeHTML($1)}</code>"
end

Thanks the Tin Man

The problem is that the HTML is invalid. I used this to test it:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    <html>
        <p>
            qwerty
        </p>
    </html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

puts doc.errors

After parsing a document, Nokogiri will populate the errors array with a list of errors it found during parsing. In the case of your HTML, doc.errors contains:

htmlParseStartTag: misplaced <html> tag

The reason is that, inside the <code> block, the tags are not HTML encoded as they should be.

Convert it using HTML entities to:

&lt;html&gt;
    &lt;p&gt;
        qwerty
    &lt;/p&gt;
&lt;/html&gt;

And it will work.

Nokogiri is a XML/HTML parser, and it attempts to fix errors in the markup to allow you, the programmer, to have a good chance of using the document. In this case, because the <html> block is in the wrong place, it removes the tags. Nokogiri wouldn't care if the tags were encoded, because, at that point, they're simply text, not tags.


EDIT:

I'll try pre parse with gsub and convert html in code block

require 'nokogiri'

html = <<EOT
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    <html>
        <p>
            qwerty
        </p>
    </html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(%r[<(/?)html>], '&lt;\1html&gt;'))

puts doc.to_html

Which outputs:

<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    &lt;html&gt;
        <p>
            qwerty
        </p>
    &lt;/html&gt;
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>

EDIT:

This will process the <html> tag prior to parsing, so Nokogiri can load the <code> block unscathed. It then finds the <code> block, unescapes the encoded <html> start and end tags, then inserts the resulting text into the <code> block as its content. Because it is inserted as content, when Nokogiri renders the DOM as HTML the text is reencoded as entities where necessary:

require 'cgi'
require 'nokogiri'

html = <<EOT
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    <html>
        <p>
            qwerty
        </p>
    </html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(%r[<(/?)html>], '&lt;\1html&gt;'))

code = doc.at('code')
code.content = CGI::unescapeHTML(code.inner_html)

puts doc.to_html

Which outputs:

<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    &lt;html&gt;
        &lt;p&gt;
            qwerty
        &lt;/p&gt;
    &lt;/html&gt;
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM