
How to use Nokogiri to replace "inner_html" of text string

I want to take an HTML string and return a mutated version that retains the HTML structure but has its text/inner HTML obfuscated.

For example:

string = "<div><p><h1>this is some sensitive text</h1><br></p><p>more text</p></div>"
obfuscate_html_string(string)
=> "<div><p><h1>**** **** **** **** ****</h1><br></p><p>**** ****</p></div>"

I experimented. Assigning to content collapses the whole fragment into a single "***", and while the inner_html= method seems like it could be useful, it raises an ArgumentError:

Nokogiri::HTML.fragment(string).traverse { |node| node.content = '***' if node.inner_html }.to_s
=> "***"

Nokogiri::HTML.fragment(string).traverse { |node| node.content ? node.content = '***' : node.to_html }.to_s
=> "***"

Nokogiri::HTML.fragment(string).traverse { |node| node.inner_html = '***' if node.inner_html }.to_s
=> ArgumentError: cannot reparent Nokogiri::XML::Text there

This should help, but the documentation covers this in more detail.

Your HTML is invalid, which forces Nokogiri to do a fix-up, and that fix-up changes the HTML:

require 'nokogiri'

doc = Nokogiri::HTML("<div><p><h1>this is some sensitive text</h1><br></p><p>more text</p></div>")
doc.errors # => [#<Nokogiri::XML::SyntaxError: 1:53: ERROR: Unexpected end tag : p>]
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
#    "<html><body><div>\n" +
#    "<p></p>\n" +
#    "<h1>this is some sensitive text</h1>\n" +
#    "<br><p>more text</p>\n" +
#    "</div></body></html>\n"

Nokogiri reports that there's an error in the HTML because you can't nest an h1 tag inside a p:

ERROR: Unexpected end tag : p>

That means it couldn't make sense of the HTML and did its best to recover by supplying or changing end tags until the markup made sense to it. That doesn't mean the resulting HTML is what you, or the author, wanted it to be.

From that point, your attempts to find nodes are likely to fail because the DOM has changed.

ALWAYS check errors, and if it is not empty, be very careful. The best solution is to run that HTML through Tidy or something similar, and then work on its output.

With that caveat in mind, this should work:

node = doc.at('div h1')
node.inner_html = node.inner_html.tr('a-z', '*')

node = doc.search('div p')[1]
node.inner_html = node.inner_html.tr('a-z', '*')

puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><div>
# >> <p></p>
# >> <h1>**** ** **** ********* ****</h1>
# >> <br><p>**** ****</p>
# >> </div></body></html>
