
How to keep link title attribute with jsoup?

When I use Jsoup.clean(), jsoup turns the title attribute of an HTML link from:

<a href="" title="test &lt;br /&gt;">TEST</a>

into:

<a href="" title="test <br />">TEST</a>

This is the demo application:

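import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Allow only <a> elements carrying href and title attributes
Whitelist whitelist = new Whitelist();
whitelist.addTags("a");
whitelist.addAttributes("a", "href", "title");

// The title value contains the escaped entities &lt; and &gt;
String input = "<a href=\"\" title=\"test &lt;br /&gt;\">TEST</a>";
System.out.println("input: " + input);
String output = Jsoup.clean(input, whitelist);
System.out.println("output: " + output);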

which prints:

input: <a href="" title="test &lt;br /&gt;">TEST</a>
output: <a href="" title="test <br />">TEST</a>

I tried adding OutputSettings with an EscapeMode:

OutputSettings outputSettings = new OutputSettings();
outputSettings.escapeMode(EscapeMode.xhtml);

EscapeMode.base and EscapeMode.extended have no effect. EscapeMode.xhtml prints the following:

input: <a href="" title="test &lt;br /&gt;">TEST</a>
output: <a href="" title="test &lt;br />">TEST</a>

Any idea how to stop jsoup from modifying the title attribute?

This is a known issue/behavior: https://github.com/jhy/jsoup/issues/684 (marked as "won't fix" by the jsoup team).

There's not a bug here.

When serializing (i.e., in your example, when you're printing out XML/HTML), we escape as few characters as necessary. That is why the > is not escaped to &gt;: it is inside a quoted attribute, so there is no ambiguity about it closing a tag, and it doesn't get escaped.
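
To make the "escape as few characters as necessary" idea concrete, here is a minimal sketch in plain Java. It is not jsoup's serializer. The class MinimalAttributeEscape and the method escapeAttribute are hypothetical names made up for this illustration. The sketch assumes a double-quoted attribute value, where only & and " could be misread, so < and > are written out unchanged.

```java
// Illustrative only: a hand-written sketch of minimal attribute escaping.
// This is NOT jsoup's implementation; the names are hypothetical.
final class MinimalAttributeEscape {

    // Escapes a value for use inside a double-quoted HTML attribute.
    // Only '&' and '"' are ambiguous in that position, so only they are replaced.
    static String escapeAttribute(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '"':
                    sb.append("&quot;");
                    break;
                default:
                    // '<' and '>' land here and are emitted unchanged
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The decoded value a parser would hold for title="test &lt;br /&gt;"
        String decodedTitle = "test <br />";
        System.out.println(
            "<a href=\"\" title=\"" + escapeAttribute(decodedTitle) + "\">TEST</a>");
    }
}
```

Under these assumptions, both title="test &lt;br /&gt;" and title="test <br />" decode to the same attribute value, test <br />, so a browser or parser reading the cleaned output sees the same title text as it would in the input.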
