
Keeping HTML boolean attributes in their original form when parsing with Jsoup

When parsing the following HTML with Jsoup:

String html = "<iframe allowfullscreen></iframe>";
Document doc = Jsoup.parseBodyFragment(html);
System.out.println(doc.body().html());

I get the following output:

<iframe allowfullscreen=""></iframe>

Even though both forms have the same meaning (source), is there any way to tell Jsoup to keep boolean attributes in their original form (i.e. the one in the input: allowfullscreen instead of allowfullscreen="" in the example)?

Unfortunately, I don't think there's a simple setting to control this. If there were, you'd expect to find it in Document.OutputSettings.

The good news, as I said in the comment, is that the original form of the attribute is retained and available via attr, with one exception: you can't tell the difference between allowfullscreen on its own and allowfullscreen="".

So you could serialize the document yourself, apart from not being able to tell that one difference. Alternatively, as Jsoup is open source, you could add a Document.OutputSettings setting for this (and possibly a change in the parser that lets you tell the two cases above apart) and update the logic in html and the related methods to respect the setting. Ideally you would do that by forking the project, making the change, writing tests for it, and opening a pull request. :-) Probably not the answer one would have liked, but the great thing about open source is that you can scratch your own itch and improve the project in the process.
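The "serialize it yourself" idea can be sketched as follows. This is a minimal illustration that uses only the Java standard library: the class MinimizedAttributeWriter and its methods are hypothetical names, not part of Jsoup, and the attribute map stands in for the key/value pairs you would read from Jsoup's Element.attributes(). It assumes that every attribute with an empty value should be written in the minimized form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Jsoup: writes an opening tag and collapses
// attributes whose value is empty, so allowfullscreen="" becomes allowfullscreen.
class MinimizedAttributeWriter {

    // Escapes the characters that are unsafe inside a double-quoted attribute value.
    static String escapeValue(String value) {
        return value.replace("&", "&amp;").replace("\"", "&quot;");
    }

    // Renders <tag attr attr="value"> from an ordered attribute map.
    static String openTag(String tagName, Map<String, String> attributes) {
        StringBuilder out = new StringBuilder("<").append(tagName);
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            out.append(' ').append(attr.getKey());
            String value = attr.getValue();
            // An empty (or missing) value is written in the minimized form.
            if (value != null && !value.isEmpty()) {
                out.append("=\"").append(escapeValue(value)).append('"');
            }
        }
        return out.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("src", "https://example.com/embed"); // placeholder URL
        attributes.put("allowfullscreen", ""); // the empty value reported for the bare attribute
        System.out.println(openTag("iframe", attributes) + "</iframe>");
    }
}
```

Because the bare and the ="" spellings cannot be told apart after parsing, this sketch minimizes every empty-valued attribute, including ones such as alt="" that were written with an explicit empty value in the input. An HTML parser reads both spellings as the same empty value, so the meaning is unchanged, but the output is not guaranteed to match the input byte for byte.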

From reading the javadoc of Document.OutputSettings, I think the output syntax setting is what controls this. With Document.OutputSettings.Syntax.xml, boolean attributes are written with an explicit empty value:

Document doc = Jsoup.parseBodyFragment("<ol reversed><li>one</li></ol>");
doc.outputSettings().syntax(Syntax.xml);

System.out.println(doc.body().html());

Prints:

<ol reversed="">
 <li>one</li>
</ol>

With the default syntax, Document.OutputSettings.Syntax.html, the attribute is written in its minimized form instead:

<ol reversed>
 <li>one</li>
</ol>

As explained in the question "What does it mean in HTML 5 when an attribute is a boolean attribute?", these three forms of an HTML boolean attribute are semantically the same:

<ol reversed>
<ol reversed="">
<ol reversed="reversed">
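The equivalence of these three spellings can be expressed as a small check. This is only a sketch: BooleanAttributeForms and isEquivalentBooleanForm are hypothetical names rather than Jsoup API, and a null value is assumed to represent the bare attribute with no value at all.

```java
// Hypothetical helper: recognises the interchangeable spellings of a boolean attribute.
class BooleanAttributeForms {

    // True when the value is absent, empty, or repeats the attribute name
    // (compared case-insensitively), i.e. reversed, reversed="" or reversed="reversed".
    static boolean isEquivalentBooleanForm(String name, String value) {
        return value == null || value.isEmpty() || value.equalsIgnoreCase(name);
    }

    public static void main(String[] args) {
        System.out.println(isEquivalentBooleanForm("reversed", null));       // bare attribute
        System.out.println(isEquivalentBooleanForm("reversed", ""));         // reversed=""
        System.out.println(isEquivalentBooleanForm("reversed", "reversed")); // reversed="reversed"
        System.out.println(isEquivalentBooleanForm("reversed", "yes"));      // not one of the three forms
    }
}
```

A serializer that wants to minimize only these spellings could call such a check before deciding whether to drop the value.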

(Tested with Jsoup version 1.10.2.)
