Escaping disallowed tags with jsoup

Question

I am evaluating jsoup for functionality that would sanitize (but not remove!) non-whitelisted tags. Let's say only the <b> tag is allowed, so the following input

foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>

has to yield the following:

foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;

I see the following problems/questions with jsoup:

document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements(), but the point is that I don't know whether my source is a full HTML document or just the body, and I want the result in the same shape and form as it came in;
How do I replace <script>...</script> with &lt;script&gt;...&lt;/script&gt;? I only want to replace the angle brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like overkill for this.
Is it possible to completely switch off pretty printing (e.g. insertion of new lines, etc.)?

Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don't suggest that my desired functionality is supported.

Answer 1

How do you load / parse your Document with Jsoup? If you use parse() or connect().get(), jsoup will automatically normalize your HTML (inserting html, body and head tags). This ensures you always have a complete HTML document, even if the input isn't complete.

If you only want to clean an input (no further processing), you should use clean() instead of the previously listed methods.

Example 1 - Using parse()

final String html = "<b>a</b>";

System.out.println(Jsoup.parse(html));

Output:

<html>
 <head></head>
 <body>
  <b>a</b>
 </body>
</html>

The input HTML is completed to ensure you have a complete document.

Example 2 - Using clean()

final String html = "<b>a</b>";

System.out.println(Jsoup.clean(html, Whitelist.relaxed()));

Output:

<b>a</b>

The input HTML is cleaned, nothing more. Note: newer jsoup releases rename Whitelist to Safelist.

Documentation:

Jsoup

Answer 2

The method replaceWith() does exactly what you need:

Example:

final String html = "<b><script>your script here</script></b>";
Document doc = Jsoup.parse(html);

for( Element element : doc.select("script") )
{
    element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
}

System.out.println(doc);

Output:

<html>
 <head></head>
 <body>
  <b>&lt;script&gt;your script here&lt;/script&gt;</b>
 </body>
</html>

Or body only:

System.out.println(doc.body().html());

Output:

<b>&lt;script&gt;your script here&lt;/script&gt;</b>

Documentation:

Node.replaceWith(Node in)
TextNode
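
The example in Answer 2 targets a single tag name. As a library-free illustration of the allow-list behaviour the question describes, here is a minimal sketch that uses only java.util.regex. It is not jsoup and not part of the original answer: the class name TagEscaper and the method escapeDisallowed are hypothetical, and a regular expression is not an HTML parser, so this only pins down the expected input/output shape and should not be relied on as a security filter.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (requires Java 9+ for Set.of).
final class TagEscaper {
    // Matches an opening or closing tag and captures the tag name in group 1.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Leaves allowed tags untouched and escapes only the angle brackets of all others.
    static String escapeDisallowed(String html, Set<String> allowed) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());          // text before the tag
            String tag = m.group();
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (allowed.contains(name)) {
                out.append(tag);                        // allowed: copy verbatim (attributes included)
            } else {
                // disallowed: keep name and attributes, escape the brackets
                out.append("&lt;").append(tag, 1, tag.length() - 1).append("&gt;");
            }
            last = m.end();
        }
        out.append(html, last, html.length());          // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>";
        System.out.println(escapeDisallowed(input, Set.of("b")));
    }
}
```

For the input from the question and an allow-list containing only b, this prints the line the question asks for. Because the input is never parsed into a document, no html, head or body wrapper is added and attributes are left as written. The trade-off is that allowed tags pass through with any attributes they carry, and malformed markup is not handled.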

Answer 3

Yes, the prettyPrint() method of Document.OutputSettings does this.

Example:

final String html = "<p>your html here</p>";

Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);

System.out.println(doc);

Note: if the outputSettings() method is not available, please update Jsoup.

Output:

<html><head></head><body><p>your html here</p></body></html>

Documentation:

Document.OutputSettings.prettyPrint(boolean pretty)

Answer 4 (no bullet)

No! Jsoup is one of the best and most capable HTML libraries out there!
