
Why does JSoup remove element IDs?

I'm using JSoup to sanitize some untrusted HTML. I discovered that if I call

String html = "<div id='foo'><script type='text/javascript'>alert('hello');</script></div>";
String cleanedHtml = Jsoup.clean(html, Whitelist.relaxed());

At this point cleanedHtml is

<div></div>

So the <script> tag has correctly been removed, but so has the id attribute of the <div>. Is there a good reason for removing it, or is this a bug?

Whitelist.relaxed() does not list id among the permitted attributes for any tag, so the cleaner strips it. Add it as an allowed attribute:

Whitelist whitelist = Whitelist.relaxed().addAttributes("div", "id");
System.out.println(Jsoup.clean(html, whitelist));

=> <div id="foo"></div>

Is it a bug? Not as far as I'm concerned; the behavior follows directly from the source. In my opinion the documentation is lacking on this point, though.

Is there "any good reason" why this should be removed? I'm not sure, but attributes like id aren't structural: removing one doesn't change the element tree. That's the thing about whitelists: they permit only what they explicitly list, so they must be curated to match your precise needs.
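To make the allow-list idea concrete, here is a minimal sketch using only the Java standard library. It is a toy model, not jsoup's implementation: the class AttributeAllowList and its filter method are hypothetical names invented for illustration. It shows that an attribute survives only if it is explicitly listed for its tag, which is why id disappears until you add it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model of an attribute allow-list (hypothetical; not jsoup's actual code).
final class AttributeAllowList {
    // Maps a tag name to the attribute names permitted on that tag.
    private final Map<String, Set<String>> allowed;

    AttributeAllowList(Map<String, Set<String>> allowed) {
        this.allowed = allowed;
    }

    // Keeps only attributes explicitly permitted for the tag, preserving their order.
    Map<String, String> filter(String tag, Map<String, String> attributes) {
        Set<String> permitted = allowed.getOrDefault(tag, Set.of());
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            if (permitted.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> divAttrs = new LinkedHashMap<>();
        divAttrs.put("id", "foo");
        divAttrs.put("onclick", "alert('hello')");

        // "div" is an allowed tag, but no attributes are listed for it.
        AttributeAllowList strict = new AttributeAllowList(Map.of("div", Set.of()));
        // Same list with "id" explicitly permitted on "div".
        AttributeAllowList withId = new AttributeAllowList(Map.of("div", Set.of("id")));

        System.out.println(strict.filter("div", divAttrs)); // {}
        System.out.println(withId.filter("div", divAttrs)); // {id=foo}
    }
}
```

Anything not named in the list is dropped, harmless or not, so the default errs on the side of removing too much rather than too little.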

