Using jsoup to escape disallowed tags
I am evaluating jsoup for functionality that would sanitize (but not remove!) non-whitelisted tags. Let's say only the <b> tag is allowed, so the following input
foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>
has to yield the following:
foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;
I see the following problems/questions with jsoup:
document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements(), but the point is that I don't know whether my source is a full HTML document or just the body, and I want the result in the same shape and form as it came in.
How do I replace <script>...</script> with &lt;script&gt;...&lt;/script&gt;? I only want to replace the brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like overkill for this.
Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don't suggest that my desired functionality is supported.
How do you load / parse your Document with Jsoup? If you use parse() or connect().get(), jsoup will automatically normalize your HTML (inserting html, body and head tags). This ensures you always have a complete HTML document, even if the input isn't complete.
If you only want to clean an input (with no further processing), you should use clean() instead of the methods listed above.
Example 1 - Using parse()
final String html = "<b>a</b>";
System.out.println(Jsoup.parse(html));
Output:
<html>
<head></head>
<body>
<b>a</b>
</body>
</html>
The input HTML is completed to ensure you have a complete document.
Example 2 - Using clean()
final String html = "<b>a</b>";
// Newer jsoup releases name this class Safelist instead of Whitelist.
System.out.println(Jsoup.clean(html, Whitelist.relaxed()));
Output:
<b>a</b>
The input HTML is cleaned, nothing more.
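Neither call tells you whether the input itself was a full document or only a fragment. One way to return the result "in the same shape as it came in" is to decide that from the raw string up front, and then print either the whole Document or only doc.body().html(). The following is a minimal sketch in plain Java (no jsoup needed for the check itself). The class InputShape and the helper looksLikeFullDocument are hypothetical names, and the heuristic (the input starts with a doctype or an <html> tag) is an assumption that will not cover every input, for example a document that begins with a comment.

```java
import java.util.regex.Pattern;

class InputShape {
    // Heuristic (an assumption, not jsoup behavior): the input counts as a
    // full document if, after optional whitespace, it starts with a doctype
    // declaration or an <html> tag.
    private static final Pattern FULL_DOCUMENT = Pattern.compile(
            "^\\s*(<!doctype\\b|<html[\\s>])", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the raw input looks like a whole document.
    static boolean looksLikeFullDocument(String html) {
        return FULL_DOCUMENT.matcher(html).find();
    }

    public static void main(String[] args) {
        // A bare fragment: print only the body content after processing.
        System.out.println(looksLikeFullDocument("<b>a</b>")); // false
        // A complete document: print the whole Document after processing.
        System.out.println(looksLikeFullDocument(
                "<!DOCTYPE html>\n<html><body><b>a</b></body></html>")); // true
    }
}
```

With this check, the caller can parse every input the same way and choose the output form at the end: the full document when the check returns true, the body content otherwise.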
The method replaceWith() does exactly what you need:
Example:
final String html = "<b><script>your script here</script></b>";
Document doc = Jsoup.parse(html);
for( Element element : doc.select("script") )
{
element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
}
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<b>&lt;script&gt;your script here&lt;/script&gt;</b>
</body>
</html>
Or body only:
System.out.println(doc.body().html());
Output:
<b>&lt;script&gt;your script here&lt;/script&gt;</b>
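The question also asks that only the brackets be replaced, with attributes left untouched. Going through a parser means the element is re-serialized, which may change details such as attribute quoting. If exact fidelity matters, a string-level pass is one alternative. The following is a minimal sketch in plain Java (Java 9 or later, no jsoup). The class and the helper escapeDisallowedTags are hypothetical names, and the tag pattern is a deliberately naive assumption: it is not an HTML parser and can be fooled by a ">" inside an attribute value or by comments, so treat it as an illustration rather than a sanitizer.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDisallowedTags {
    // Naive pattern for an opening or closing tag; group 1 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Hypothetical helper (not part of jsoup): escapes the angle brackets of
    // every tag whose name is not in the allowed set and copies everything
    // else, including attributes, through unchanged.
    static String escapeDisallowedTags(String html, Set<String> allowed) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String tag = m.group();
            if (!allowed.contains(m.group(1).toLowerCase(Locale.ROOT))) {
                tag = tag.replace("<", "&lt;").replace(">", "&gt;");
            }
            // quoteReplacement keeps '$' and '\' in the tag text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input =
                "foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>";
        // prints: foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;
        System.out.println(escapeDisallowedTags(input, Set.of("b")));
    }
}
```

The trade-off is robustness: the replaceWith() approach above relies on a real parser and copes with malformed markup, while this sketch preserves the input byte for byte outside the escaped brackets but trusts a simple regular expression.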
Yes, the prettyPrint() method of Document.OutputSettings controls whether jsoup pretty-prints (reformats) its output; passing false turns it off.
Example:
final String html = "<p>your html here</p>";
Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);
System.out.println(doc);
Note: if the outputSettings() method is not available, please update Jsoup.
Output:
<html><head></head><body><p>your html here</p></body></html>
No! Jsoup is one of the best and most capable HTML libraries out there!