
Keeping HTML boolean attributes in their original form when parsing with Jsoup

When parsing the following HTML with Jsoup:

String html = "<iframe allowfullscreen></iframe>";
Document doc = Jsoup.parseBodyFragment(html);
System.out.println(doc.body().html());

I get the following output:

<iframe allowfullscreen=""></iframe>

Even if it must have the same meaning (source), is there any way to tell Jsoup to keep boolean attributes in their original form (i.e. as written in the input: allowfullscreen rather than allowfullscreen="" in this example)?

Unfortunately, I don't think there's a simple setting to control this. If there were, you'd expect to find it in Document.OutputSettings.

The good news, as I said in the comment, is that the original form of the attribute is retained and available via attr, with one exception: you can't tell the difference between allowfullscreen on its own and allowfullscreen="".

So you could serialize the document yourself, apart from not being able to tell that one difference. Alternatively, as Jsoup is open source, you could add a Document.OutputSettings setting for this (and possibly a change in the parser that lets you tell the two cases above apart) and update the logic in html and the related methods to respect the setting, ideally by forking the project, making the change, adding tests for it, and opening a pull request. :-) Probably not the answer one would have liked, but the great thing about open source is that you can scratch your own itch and improve the project in the process.
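To make the "serialize it yourself" idea concrete, here is a minimal sketch in plain Java of the per-attribute decision such a serializer has to make. It uses no Jsoup calls: the class name, the method names, the sample src value, and the short list of boolean attributes are all hypothetical, and in a real serializer the name/value pairs would come from the parsed elements. It has the limitation described above: an empty value on a boolean attribute is always printed in the bare form.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class BareBooleanAttributes {
    // Hypothetical, deliberately short list; extend it to cover your documents.
    private static final Set<String> BOOLEAN_ATTRIBUTES = Set.of(
            "allowfullscreen", "async", "checked", "disabled",
            "hidden", "reversed", "selected");

    // Renders one attribute: bare name for an empty boolean attribute,
    // name="value" for everything else.
    public static String renderAttribute(String name, String value) {
        String v = (value == null) ? "" : value;
        boolean isBoolean = BOOLEAN_ATTRIBUTES.contains(name.toLowerCase(Locale.ROOT));
        if (isBoolean && v.isEmpty()) {
            return name;
        }
        return name + "=\"" + escape(v) + "\"";
    }

    // Minimal escaping for a double-quoted attribute value.
    public static String escape(String value) {
        return value.replace("&", "&amp;").replace("\"", "&quot;");
    }

    // Renders a start tag from a tag name and its attributes, in insertion order.
    public static String renderStartTag(String tagName, Map<String, String> attributes) {
        StringBuilder sb = new StringBuilder("<").append(tagName);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append(' ').append(renderAttribute(e.getKey(), e.getValue()));
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("src", "player.html");   // hypothetical value
        attrs.put("allowfullscreen", "");  // what the parser reports for a bare attribute
        System.out.println(renderStartTag("iframe", attrs));
        // prints: <iframe src="player.html" allowfullscreen>
    }
}
```

Non-boolean attributes with an empty value, such as alt="", keep their quotes, because only names in the list are collapsed.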

Having read the Javadoc of Document.OutputSettings, I think that Document.OutputSettings.Syntax.xml is what you need:

Document doc = Jsoup.parseBodyFragment("<ol reversed><li>one</li></ol>");
// Syntax here is org.jsoup.nodes.Document.OutputSettings.Syntax
doc.outputSettings().syntax(Syntax.xml);

System.out.println(doc.body().html());

Prints:

<ol reversed="">
 <li>one</li>
</ol>

The default (I think it is Document.OutputSettings.Syntax.html) would be:

<ol reversed>
 <li>one</li>
</ol>

As mentioned in the question "What does it mean in HTML 5 when an attribute is a boolean attribute?", these three forms of an HTML boolean attribute are semantically the same:

<ol reversed>
<ol reversed="">
<ol reversed="reversed">
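The equivalence of the three forms can be written down as a small check. The sketch below is plain Java with hypothetical class and method names, and it assumes a bare attribute is represented by a null value. It accepts exactly the forms listed above: no value, an empty value, or a value that repeats the attribute name.

```java
public class BooleanAttributeForms {
    // True if the value is one of the conforming spellings of a boolean attribute.
    // A null value stands for the bare form (assumption of this sketch).
    // Note: equalsIgnoreCase is Unicode-aware, which is slightly broader than
    // the ASCII case-insensitive match HTML asks for.
    public static boolean isConformingBooleanForm(String name, String value) {
        return value == null || value.isEmpty() || value.equalsIgnoreCase(name);
    }

    public static void main(String[] args) {
        System.out.println(isConformingBooleanForm("reversed", null));       // <ol reversed>
        System.out.println(isConformingBooleanForm("reversed", ""));         // <ol reversed="">
        System.out.println(isConformingBooleanForm("reversed", "reversed")); // <ol reversed="reversed">
        System.out.println(isConformingBooleanForm("reversed", "false"));    // not a conforming value
    }
}
```

Keep in mind that for a boolean attribute it is presence that counts: reversed="false" is not a conforming value, but it still switches the attribute on.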

(Tested with Jsoup version 1.10.2)
