How to prevent deletion of the <html> tag in Nokogiri?

I have code like this:

doc = Nokogiri::HTML.fragment(html)
doc.to_html

and an HTML fragment to be parsed:

<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    <html>
        <p>
            qwerty
        </p>
    </html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>

Nokogiri deletes the <html> </html> tags in the <code> block. How can I prevent this behavior?

UPDATE:

The Tin Man proposed a solution: pre-process the HTML fragment and escape all HTML inside the code block before parsing.

Here is some code. It's not pretty, so if you can suggest another solution, please post a comment.

require 'cgi'

# Escape everything inside each <code> block so the parser treats it as text.
html.gsub!(/<code\b[^>]*>(.*?)<\/code>/m) do
  "<code>#{CGI.escapeHTML($1)}</code>"
end
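
A slightly more general sketch of the same pre-escaping idea, using only the standard library. It keeps any attributes on the opening <code> tag, which the gsub! call above drops. The helper name escape_code_blocks is hypothetical, and the sketch assumes <code> blocks are not nested and that their contents are not already entity-encoded (otherwise they would be escaped twice).

```ruby
require 'cgi'

# Hypothetical helper: entity-encode everything between <code ...> and </code>
# so that an HTML parser sees the contents as text, not as markup.
# Assumes <code> blocks are not nested and not already escaped.
def escape_code_blocks(html)
  html.gsub(%r{(<code\b[^>]*>)(.*?)(</code>)}m) do
    open_tag, body, close_tag = $1, $2, $3
    "#{open_tag}#{CGI.escapeHTML(body)}#{close_tag}"
  end
end

sample = %(<p>intro</p><code class="x"><html><p>qwerty</p></html></code>)
escaped = escape_code_blocks(sample)
puts escaped
```

The result can then be passed to Nokogiri::HTML.fragment as before. Because the block returns a new string instead of mutating its argument, the original html stays available for comparison.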

Thanks, the Tin Man.

The problem is that the HTML is invalid. I used this to test it:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    <html>
        <p>
            qwerty
        </p>
    </html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

puts doc.errors

After parsing a document, Nokogiri populates the errors array with the errors it found during parsing. In the case of your HTML, doc.errors contains:

htmlParseStartTag: misplaced <html> tag

The reason is that, inside the <code> block, the tags are not HTML-encoded as they should be.

Convert it using HTML entities to:

&lt;html&gt;
    &lt;p&gt;
        qwerty
    &lt;/p&gt;
&lt;/html&gt;

And it will work.

Nokogiri is an XML/HTML parser, and it attempts to fix errors in the markup so that you, the programmer, have a good chance of using the document. In this case, because the <html> block is in the wrong place, it removes the tags. Nokogiri wouldn't care if the tags were encoded because, at that point, they're simply text, not tags.


EDIT:

I'll try pre-parsing with gsub and converting the HTML in the code block

require 'nokogiri'

html = <<EOT
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    <html>
        <p>
            qwerty
        </p>
    </html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(%r[<(/?)html>], '&lt;\1html&gt;'))

puts doc.to_html

Which outputs:

<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    &lt;html&gt;
        <p>
            qwerty
        </p>
    &lt;/html&gt;
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>

EDIT:

This will process the <html> tag prior to parsing, so Nokogiri can load the <code> block unscathed. It then finds the <code> block, unescapes the encoded <html> start and end tags, and inserts the resulting text into the <code> block as its content. Because it is inserted as content, when Nokogiri renders the DOM as HTML the text is re-encoded as entities where necessary:

require 'cgi'
require 'nokogiri'

html = <<EOT
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    <html>
        <p>
            qwerty
        </p>
    </html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT

doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(%r[<(/?)html>], '&lt;\1html&gt;'))

code = doc.at('code')
code.content = CGI::unescapeHTML(code.inner_html)

puts doc.to_html

Which outputs:

<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
    &lt;html&gt;
        &lt;p&gt;
            qwerty
        &lt;/p&gt;
    &lt;/html&gt;
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
