
How to use Nokogiri to replace "inner_html" of text string

I want to take an HTML string and return a mutated version that retains the HTML structure but obfuscates the text/inner HTML.

For example:

string = "<div><p><h1>this is some sensitive text</h1><br></p><p>more text</p></div>"
obfuscate_html_string(string)
=> "<div><p><h1>**** **** **** **** ****</h1><br></p><p>**** ****</p></div>"

I experimented and, while the inner_html= method seems like it could be useful, it raises an ArgumentError:

Nokogiri::HTML.fragment(string).traverse { |node| node.content = '***' if node.inner_html }.to_s
=> "***"

Nokogiri::HTML.fragment(string).traverse { |node| node.content ? node.content = '***' : node.to_html }.to_s
=> "***"

Nokogiri::HTML.fragment(string).traverse { |node| node.inner_html = '***' if node.inner_html }.to_s
=> ArgumentError: cannot reparent Nokogiri::XML::Text there

This should help, but the documentation covers this in more detail.

Your HTML is invalid, which forces Nokogiri to fix it up during parsing, and that fix-up changes the HTML:

require 'nokogiri'

doc = Nokogiri::HTML("<div><p><h1>this is some sensitive text</h1><br></p><p>more text</p></div>")
doc.errors # => [#<Nokogiri::XML::SyntaxError: 1:53: ERROR: Unexpected end tag : p>]
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
#    "<html><body><div>\n" +
#    "<p></p>\n" +
#    "<h1>this is some sensitive text</h1>\n" +
#    "<br><p>more text</p>\n" +
#    "</div></body></html>\n"

Nokogiri reports an error in the HTML because you can't nest an h1 tag inside a p:

ERROR: Unexpected end tag : p>

That means it couldn't make sense of the HTML and did its best to recover by supplying or changing end tags until the markup made sense to it. That doesn't mean the result is what you, or the author, wanted it to be.

From that point, your attempts to find nodes are likely to fail because the DOM has changed.

ALWAYS check errors, and if it is not empty, be very careful. The best solution is to run that HTML through Tidy or something similar, and then work on its output.

From that point though, this should work:

node = doc.at('div h1')
node.inner_html = node.inner_html.tr('a-z', '*')

node = doc.search('div p')[1]
node.inner_html = node.inner_html.tr('a-z', '*')

puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><div>
# >> <p></p>
# >> <h1>**** ** **** ********* ****</h1>
# >> <br><p>**** ****</p>
# >> </div></body></html>
