
Java: Matcher.find causing high CPU utilization

I am using the mod_security rules from https://github.com/SpiderLabs/owasp-modsecurity-crs to sanitize user input data. I am seeing CPU spikes and delays while matching user input against the mod_security regular expressions. In total there are 500+ regular expressions checking for different types of attacks (XSS, bad robots, generic attacks, and SQL injection). For each request, I go through all parameters and check them against all 500 regular expressions, using Matcher.find. For some parameters the matching appears to loop forever; I tackled that using the technique from:

Cancelling a long running regex match?

Sanitizing a single user request takes around 500 ms, and the CPU usage spikes. I analyzed this with VisualVM (visualvm.java.net) running my test suite.

CPU profile output (profiler screenshot omitted)

How can I reduce the CPU usage and load average?

If possible, compile your regexes once and keep them around, rather than repeatedly (and implicitly) recompiling them, especially inside a loop.
See java.util.regex - importance of Pattern.compile()? for more info.
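
For example, a minimal sketch of that approach (the rule literals here are placeholders, not actual CRS rules):

    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;

    // Compile every rule exactly once at startup; Pattern objects are immutable
    // and thread-safe, so they can be shared across all requests.
    final class CompiledRules {
        static final List<Pattern> RULES = Arrays.asList(
                Pattern.compile("<script", Pattern.CASE_INSENSITIVE),
                Pattern.compile("union\\s+select", Pattern.CASE_INSENSITIVE)
                // ... load the remaining rules once, not per request ...
        );

        static boolean matchesAny(String input) {
            for (Pattern rule : RULES) {
                // creating a Matcher is cheap; it is Pattern.compile that is expensive
                if (rule.matcher(input).find()) {
                    return true;
                }
            }
            return false;
        }
    }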

I suggest you look at this paper: "Towards Faster String Matching for Intrusion Detection or Exceeding the Speed of Snort"

There are better ways to do the matching you describe. Essentially, you take the 500 patterns you want to match and compile them into a single suffix tree, which can very efficiently match an input against all the rules at once.

The paper explains that this approach was described by Dan Gusfield as the "Boyer-Moore Approach to Exact Set Matching".

Boyer-Moore is a well-known algorithm for string matching; the paper describes a variation of Boyer-Moore for set matching.
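
The paper's Boyer-Moore set-matching variant is more involved, but the classic Aho-Corasick automaton illustrates the same core idea: build all keywords into one trie with failure links, then match an input against every keyword in a single pass. A minimal sketch for literal keywords (my illustration, not the paper's algorithm; full regex features would still need separate handling):

    import java.util.*;

    // Aho-Corasick: one pass over the input reports every keyword it contains.
    final class AhoCorasick {
        private static final class Node {
            final Map<Character, Node> next = new HashMap<>();
            Node fail;
            final List<String> hits = new ArrayList<>();
        }

        private final Node root = new Node();

        void add(String keyword) {
            Node n = root;
            for (char c : keyword.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.hits.add(keyword);
        }

        // BFS to wire up the failure links after all keywords are added
        void build() {
            Deque<Node> queue = new ArrayDeque<>();
            for (Node child : root.next.values()) {
                child.fail = root;
                queue.add(child);
            }
            while (!queue.isEmpty()) {
                Node cur = queue.poll();
                for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                    Node child = e.getValue();
                    Node f = cur.fail;
                    while (f != null && !f.next.containsKey(e.getKey())) {
                        f = f.fail;
                    }
                    child.fail = (f == null) ? root : f.next.get(e.getKey());
                    child.hits.addAll(child.fail.hits);
                    queue.add(child);
                }
            }
        }

        List<String> search(String text) {
            List<String> found = new ArrayList<>();
            Node n = root;
            for (char c : text.toCharArray()) {
                while (n != root && !n.next.containsKey(c)) {
                    n = n.fail;
                }
                n = n.next.getOrDefault(c, root);
                found.addAll(n.hits);
            }
            return found;
        }
    }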

I think this is the root of your problem, not regex performance per se:

For each request , I go through all parameters and check against all these 500 regular expressions

No matter how fast your regexes are, that is still plenty of work. I don't know how many parameters you have, but even if there are only a few, checking them all means thousands of regular expression evaluations per request. That can kill your CPU.

Apart from the obvious things like improving regex performance by precompiling and/or simplifying your expressions, you can reduce the amount of regex checking in the following ways:

  1. Use positive validation of user input based on the parameter type. E.g., if some parameter must be a simple number, don't waste time checking whether it contains a malicious XML script; just check whether it matches [0-9]+ (or something similarly simple). If it does, it is OK, and you can skip checking all 500 regexps.

  2. Try to find simple regexps that can eliminate whole classes of attacks by looking for what your regexps have in common. E.g., if you have 100 regexps checking for the existence of certain HTML tags, first check whether the content contains any HTML tag at all. If it doesn't, you immediately save checking those 100 regexps.

  3. Cache results. Many parameters generated in webapps repeat themselves, so don't check the same content over and over again; just remember the final validation result. Be sure to limit the maximum size of the cache to avoid DoS attacks. (All three points are sketched in the code after this list.)
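
A combined sketch of all three points (the patterns, cache size, and class names are my assumptions, not CRS rules):

    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Pattern;

    final class FastSanitizer {
        // Point 1: positive validation; purely numeric parameters skip all rules.
        private static final Pattern DIGITS_ONLY = Pattern.compile("[0-9]+");

        // Point 2: cheap pre-filter; only run HTML-tag rules if a '<' is present.
        private static final Pattern LOOKS_LIKE_HTML = Pattern.compile("<[a-zA-Z!/]");

        // Point 3: bounded LRU cache so repeated parameter values are checked once.
        private static final int MAX_CACHE_ENTRIES = 10_000;
        private final Map<String, Boolean> cache = Collections.synchronizedMap(
                new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                        return size() > MAX_CACHE_ENTRIES;  // bound the size against cache-filling DoS
                    }
                });

        boolean isMalicious(String value, List<Pattern> htmlRules, List<Pattern> otherRules) {
            if (DIGITS_ONLY.matcher(value).matches()) {
                return false;  // point 1: whitelisted shape, no rules needed
            }
            return cache.computeIfAbsent(value, v -> {  // point 3: remember the verdict
                if (LOOKS_LIKE_HTML.matcher(v).find()   // point 2: skip HTML rules otherwise
                        && htmlRules.stream().anyMatch(p -> p.matcher(v).find())) {
                    return true;
                }
                return otherRules.stream().anyMatch(p -> p.matcher(v).find());
            });
        }
    }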

Also note that negative validation is usually easy to bypass: an attacker changes a few characters in the malicious code and your regexps no longer match, so you have to keep growing your "database" of regexps to protect against new attacks. Positive validation (whitelisting) doesn't have this disadvantage and is much more effective.

Avoid expensive pattern flags where you can:

  • MULTILINE
  • CASE_INSENSITIVE
  • etc.

Perhaps you can also consider grouping your regular expressions and applying only a given group of regular expressions depending on the user input.
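
A sketch of that grouping idea (the group names and the mapping from parameters to groups are assumptions about your rule set):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.EnumMap;
    import java.util.EnumSet;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Pattern;

    // Partition the rules by attack class so each parameter is only checked
    // against the classes that can apply to where its value is used.
    enum RuleGroup { XSS, SQLI, BAD_ROBOTS, GENERIC }

    final class GroupedRules {
        private final Map<RuleGroup, List<Pattern>> groups = new EnumMap<>(RuleGroup.class);

        void register(RuleGroup group, Pattern rule) {
            groups.computeIfAbsent(group, g -> new ArrayList<>()).add(rule);
        }

        // e.g. a parameter echoed into HTML needs the XSS group, while one
        // used in a database query needs the SQLI group
        boolean matchesAny(String input, EnumSet<RuleGroup> applicable) {
            for (RuleGroup group : applicable) {
                for (Pattern rule : groups.getOrDefault(group, Collections.emptyList())) {
                    if (rule.matcher(input).find()) {
                        return true;
                    }
                }
            }
            return false;
        }
    }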

If you have such a big number of regexes, you could group (at least some of) them using a trie algorithm ( http://en.wikipedia.org/wiki/Trie ).
The idea is that if you have, for example, regexes like /abc[0-9-]/, /abde/, /another example/, /.something else/ and /.I run out of ideas/, you can combine them into the single regex

 /a(?:b(?:c[0-9-]|de)|nother example)|.(?:I run out of ideas|something else)/

This way the matcher has to run only once instead of five times, and you avoid a lot of backtracking, because the common starting parts have been factored out in the combined regex above.
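
In Java the combined pattern would be compiled once and scanned in a single pass, for example:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CombinedRegexDemo {
        // one alternation with the common prefixes factored out, compiled once
        private static final Pattern COMBINED = Pattern.compile(
                "a(?:b(?:c[0-9-]|de)|nother example)|.(?:I run out of ideas|something else)");

        public static void main(String[] args) {
            Matcher m = COMBINED.matcher("input containing another example somewhere");
            while (m.find()) {  // one scan instead of five separate scans
                System.out.println("matched: " + m.group());
            }
        }
    }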

There must be a small subset of problematic regexes among these 500 that exhibit catastrophic backtracking. E.g., a match like

    String s = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB";

    Pattern.compile("(A+)+").matcher(s).matches();

will effectively never complete, due to exponential backtracking.

So in your case I would log all the problematic regexes together with their problematic inputs (one way to do this is sketched below). Once those are found, you can manually rewrite the few offending regexes and test the rewrites against the originals. A regex can always be rewritten as a simpler, more readable Java function.
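
A sketch of that logging, building on the interruption technique linked in the question (class and method names here are illustrative): wrap the input in a CharSequence that checks a deadline on every charAt() call, abort the match when the budget is exceeded, and log the offending rule.

    import java.util.regex.Pattern;

    final class TimeBoxedCharSequence implements CharSequence {
        static final class RegexTimeoutException extends RuntimeException {}

        private final CharSequence inner;
        private final long deadlineNanos;

        TimeBoxedCharSequence(CharSequence inner, long timeoutMillis) {
            this(inner, System.nanoTime() + timeoutMillis * 1_000_000L, true);
        }

        private TimeBoxedCharSequence(CharSequence inner, long deadlineNanos, boolean marker) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            if (System.nanoTime() > deadlineNanos) {
                throw new RegexTimeoutException();  // aborts the backtracking match
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new TimeBoxedCharSequence(inner.subSequence(start, end), deadlineNanos, true);
        }

        // run one rule with a time budget; log the rule/input pair on timeout
        static boolean findWithLogging(Pattern rule, String input, long timeoutMillis) {
            try {
                return rule.matcher(new TimeBoxedCharSequence(input, timeoutMillis)).find();
            } catch (RegexTimeoutException e) {
                System.err.println("slow rule: " + rule.pattern() + " on input: " + input);
                return false;  // treat as a rewrite candidate; decide blocking policy separately
            }
        }
    }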

Another option, though it would not resolve the problem above, is to use a faster (20x in some cases) but more limited regex library. It is available in Maven Central.
