How do I allow specific characters with the OWASP HTML Sanitizer?

Question

I am using the OWASP HTML Sanitizer to prevent XSS attacks on my web app. For many fields that should be plain text, the sanitizer does more than I expect.

For example:

import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

// A policy that allows no elements strips every tag.
PolicyFactory stripAllTagsPolicy = new HtmlPolicyBuilder().toFactory();
stripAllTagsPolicy.sanitize("a+b"); // returns a&#43;b
stripAllTagsPolicy.sanitize("foo@example.com"); // returns foo&#64;example.com

When a field such as an email address contains a +, as in foo+bar@gmail.com, I end up with the wrong data in the database. So, two questions:

1. Are characters such as + - @ dangerous on their own, and do they really need to be encoded?
2. How do I configure the OWASP HTML Sanitizer to allow specific characters such as + - @?

Question 2 is the more important one for me.

Answer 1

You may want to use the ESAPI API to filter specific characters. If you want to allow specific HTML elements or attributes, however, you can use allowElements and allowAttributes as follows.

// Define the policy.
Function<HtmlStreamEventReceiver, HtmlSanitizer.Policy> policy
    = new HtmlPolicyBuilder()
        .allowElements("a", "p")
        .allowAttributes("href").onElements("a")
        .toFactory();

// Sanitize your output.
HtmlSanitizer.sanitize(myHtml, policy.apply(myHtmlStreamRenderer));

Answer 2

The danger in XSS is that one user may insert HTML code into input data that you later insert into a web page sent to another user.

In principle, there are two strategies for protecting against this. You can either remove all dangerous characters from user input when it enters your system, or HTML-encode the dangerous characters later, when you write them back to the browser.

Example of the first strategy:

User enters data (with HTML code)
Server removes all dangerous characters
Modified data is stored in the database
Some time later, the server reads the modified data from the database
Server inserts the modified data into a web page sent to another user

Example of the second strategy:

User enters data (with HTML code)
Unmodified data, including dangerous characters, is stored in the database
Some time later, the server reads the unmodified data from the database
Server HTML-encodes the dangerous data and inserts it into a web page sent to another user

The first strategy is simpler, since data is usually written less often than it is read. However, it is also more difficult, because it potentially destroys the data. It is particularly difficult if you need the data for something other than sending it back to the browser later on (such as using an email address to actually send an email). It becomes harder to, for example, search the database, include the data in a PDF report, or insert the data in an email.

The other strategy has the advantage of not destroying the input data, so you have greater freedom in how you use the data later on. However, it may be more difficult to verify that you HTML-encode all user-submitted data that is sent to the browser. A solution to your particular problem would be to HTML-encode the email address when (or if) you ever put it on a web page.
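As a minimal sketch of the second strategy, the helper below encodes a value only at the moment it is written into HTML, so the stored email address stays untouched. The class name HtmlEscaper and the method escapeHtml are hypothetical names chosen for illustration; in a real application a maintained encoder (for example the OWASP Java Encoder) is likely a better choice than hand-rolled code.

```java
// Hypothetical helper for illustration only; not part of the OWASP HTML Sanitizer.
final class HtmlEscaper {
    private HtmlEscaper() {}

    /** Encodes the characters that are significant in HTML text and quoted attribute values. */
    static String escapeHtml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c); // '+', '-', '@' and other plain characters pass through
            }
        }
        return out.toString();
    }
}
```

With this approach, foo+bar@gmail.com is stored and displayed unchanged, while markup such as a script tag is rendered as inert text.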

The XSS problem is an example of a more general problem that arises when you mix user-submitted data and control code. SQL injection is another example of the same problem: the user-submitted data is interpreted as instructions rather than data. A third, less well-known example arises when you include user-submitted data in an email. That data may contain strings that the email server interprets as instructions. The "dangerous character" in this scenario is a line break followed by "From:".
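For the email case, one possible guard is to refuse any header value that contains a line break at the point where the message is built. The sketch below assumes this reject-on-use policy; MailHeaderGuard and requireSingleLine are hypothetical names, not part of any mail library.

```java
// Hypothetical guard for illustration; applied when a user-supplied value
// is about to be placed in an email header such as Subject or To.
final class MailHeaderGuard {
    private MailHeaderGuard() {}

    /** Returns the value unchanged, or throws if it contains CR or LF. */
    static String requireSingleLine(String value) {
        if (value.indexOf('\r') >= 0 || value.indexOf('\n') >= 0) {
            // A line break would let the value start a new header, e.g. "From:".
            throw new IllegalArgumentException("Header value must not contain line breaks");
        }
        return value;
    }
}
```

The stored data is not modified; the check runs only in the context where a line break is dangerous.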

It would be impossible to validate all input data against every control character or character sequence that might be interpreted as instructions by some potential future application. The only permanent solution is to sanitize all potentially unsafe data at the point where you actually use it.

Answer 3

To be honest, you should really be whitelisting all user-supplied input. If it's an email address, just use the OWASP ESAPI or something similar to validate the input against its Validator and email regular expressions.

If the input passes the whitelist, go ahead and store it in the DB. When displaying the text back to a user, always HTML-encode it.

Your blacklist approach is not recommended by OWASP and could be bypassed by someone who is determined to attack your users.
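The whitelist check described in this answer could look roughly like the sketch below. It uses only java.util.regex rather than ESAPI, and the pattern is a deliberately simple assumption for illustration, not ESAPI's email expression; EmailWhitelist and isValidEmail are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical validator for illustration; the pattern is an assumption,
// not the regular expression shipped with OWASP ESAPI.
final class EmailWhitelist {
    private EmailWhitelist() {}

    // Allows letters, digits and . _ % + - in the local part, then a dotted domain.
    private static final Pattern EMAIL =
        Pattern.compile("^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$");

    /** Returns true only if the whole input matches the allowed shape. */
    static boolean isValidEmail(String input) {
        return input != null && input.length() <= 254 && EMAIL.matcher(input).matches();
    }
}
```

Input that passes is stored as-is, so foo+bar@gmail.com keeps its + and @; input containing markup never matches and is rejected before it reaches the database.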

Answer 4

I know I am answering this question after 7 years, but maybe it will be useful to someone. Basically, I agree with you: we should not allow specific characters, for security reasons (you covered this topic, thanks). However, I was working on a legacy internal project that required escaping HTML characters except "@", for a reason I cannot go into (but it does not matter). My workaround was simple:

// A policy that allows no elements, so all markup is stripped.
private static final PolicyFactory PLAIN_TEXT_SANITIZER_POLICY = new HtmlPolicyBuilder().toFactory();

public static String toString(Object stringValue) {
    if (stringValue instanceof String) {
        // Sanitize, then restore the "@" that the sanitizer encoded as &#64;.
        return HTMLSanitizerUtils.PLAIN_TEXT_SANITIZER_POLICY.sanitize((String) stringValue).replace("&#64;", "@");
    }
    return null;
}

I know it is not clean and it creates an additional String, but we badly needed this. So, if you need to allow specific characters, you can use this workaround. But if you need to do this, your application is probably incorrectly designed.
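If more than one character has to be restored, the same replace-after-sanitize idea can be driven by a small table. This is only a sketch of the workaround above applied to an already sanitized string: EntityRestorer and restore are hypothetical names, and the two entries are the encodings shown in the question (&#43; and &#64;).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; operates on output that has already
// been passed through the sanitizer.
final class EntityRestorer {
    private EntityRestorer() {}

    // Entities to turn back into literal characters. Add entries only for
    // characters that are safe in the context where the result is used.
    private static final Map<String, String> RESTORE = new LinkedHashMap<>();
    static {
        RESTORE.put("&#43;", "+");
        RESTORE.put("&#64;", "@");
    }

    /** Replaces each listed entity with its literal character; other entities stay encoded. */
    static String restore(String sanitized) {
        if (sanitized == null) {
            return null;
        }
        String result = sanitized;
        for (Map.Entry<String, String> entry : RESTORE.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

Entities that are not in the table, such as &lt;, are left encoded, so the markup-related encoding done by the sanitizer is preserved.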

4 answers

Answer 1: 3, 2014-11-17 03:02:26
Answer 2: 1, accepted, 2012-09-26 21:31:16
Answer 3: 1, 2012-09-27 12:19:12
Answer 4: 0, 2019-03-19 05:18:59
