
High CPU utilization on Regex pattern matching

I'm matching strings against regex patterns in Java. The problem is that CPU usage spikes and the program appears to hang while trying to match. I have hundreds of strings, and each one needs to be checked against two patterns.

Below is the sample code I'm using. It hangs with the CPU at 100% on the first string in patternList when matching it against the second pattern, i.e. patternMatch[1]. How can I improve this?

String[] patternMatch = {
        "([\\w\\s]+)+([+\\-/*])+([\\w\\s]+)",
        "([\\w\\s]+)+([+\\-/*])+([\\w\\s]+)+([+\\-/*])+([\\w\\s]+)"
};
List<String> patternList = new ArrayList<String>();

patternList.add("Avg Volume Units product A + Volume Units product A");
patternList.add("Avg Volume Units /  Volume Units product A");
patternList.add("Avg retailer On Hand / Volume Units Plan / Store Count");
patternList.add("Avg Hand Volume Units Plan Store Count");
patternList.add("1 - Avg merchant Volume Units");
patternList.add("Total retailer shipment Count");

for (String s : patternList) {
    for (int i = 0; i < patternMatch.length; i++) {
        Pattern pattern = Pattern.compile(patternMatch[i]);
        Matcher matcher = pattern.matcher(s);
        System.out.println(s);
        if (matcher.matches()) {
            System.out.println("Passed");
        } else {
            System.out.println("Failed;");
        }
    }
}

It looks like you are facing a variation of catastrophic backtracking, probably caused by ([\\w\\s]+)+ . Try using ([\\w\\s]+) instead:

String[] patternMatch = {
        "([\\w\\s]+)([+\\-/*])+([\\w\\s]+)",
        "([\\w\\s]+)([+\\-/*])+([\\w\\s]+)([+\\-/*])+([\\w\\s]+)"
};
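
As a rough sketch of how the corrected patterns could be used, the snippet below compiles each pattern once (rather than inside the loop) and checks a few of the sample strings from the question. The class name FixedPatternDemo and the helper matchesEither are made up for illustration, and treating a string as "Passed" when it matches either pattern is an assumption about the intent.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class FixedPatternDemo {
    // The two patterns from the answer above, compiled once and reused.
    static final Pattern ONE_OPERATOR =
            Pattern.compile("([\\w\\s]+)([+\\-/*])+([\\w\\s]+)");
    static final Pattern TWO_OPERATORS =
            Pattern.compile("([\\w\\s]+)([+\\-/*])+([\\w\\s]+)([+\\-/*])+([\\w\\s]+)");

    // Hypothetical helper: true if the whole string matches either pattern.
    static boolean matchesEither(String s) {
        return ONE_OPERATOR.matcher(s).matches()
                || TWO_OPERATORS.matcher(s).matches();
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
                "Avg Volume Units product A + Volume Units product A",
                "Avg retailer On Hand / Volume Units Plan / Store Count",
                "Total retailer shipment Count");
        for (String s : samples) {
            System.out.println(s + " -> " + (matchesEither(s) ? "Passed" : "Failed"));
        }
    }
}
```

With the nested quantifier gone, a string that contains no operator fails quickly, because each possible length of the first group is rejected as soon as the operator class fails to match.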

@Pshemo is probably right about the catastrophic backtracking. However, I would suggest a completely different approach: use String.split() with zero-width lookbehind and lookahead, so that the string is split just before and just after each operator ( + - * / ).

String[] x = s.split("((?<=[\\-\\+\\*/])|(?=[\\-\\+\\*/]))");
if (x.length == 3 || x.length== 5)
    System.out.println("Passed");
else
    System.out.println("Failed");

The split returns an array containing the operators at odd offsets (1,3) and the strings between the operators at even offsets (0, 2 and 4). This should be much faster than a regex with backtracking.
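
Here is a minimal, self-contained sketch of the split approach, assuming the same operator set. The names SplitCheckDemo, tokenize and passes are hypothetical. Note that this only counts tokens: unlike the original regexes it does not check that the operands consist of word characters and whitespace, and two adjacent operators such as "+-" are split into separate tokens.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitCheckDemo {
    // Zero-width positions right after or right before an operator.
    static final Pattern AROUND_OPERATOR =
            Pattern.compile("(?<=[-+*/])|(?=[-+*/])");

    // Splits into operands (even offsets) and operators (odd offsets).
    static String[] tokenize(String s) {
        return AROUND_OPERATOR.split(s);
    }

    // 3 tokens = one operator, 5 tokens = two operators.
    static boolean passes(String s) {
        int n = tokenize(s).length;
        return n == 3 || n == 5;
    }

    public static void main(String[] args) {
        String[] samples = {
            "1 - Avg merchant Volume Units",
            "Avg retailer On Hand / Volume Units Plan / Store Count",
            "Avg Hand Volume Units Plan Store Count"
        };
        for (String s : samples) {
            System.out.println(Arrays.toString(tokenize(s))
                    + " -> " + (passes(s) ? "Passed" : "Failed"));
        }
    }
}
```

The operands keep their surrounding spaces, so call trim() on them if you need the bare names.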

I don't think there is any need to quantify an already quantified unitary group.
For example, (?:(?:X)+)* is simply equivalent to X*

Quantifying a quantified group this way is what causes the exponential backtracking.
As a model, (?:(?:X))* would be better, since it won't by itself
cause catastrophic backtracking.
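
To illustrate the equivalence claimed above, here is a small sketch that runs the nested form and the flat form against a few short inputs and checks that they accept and reject the same strings. The class name NestedQuantifierDemo and the helper agree are made up for this example, and the inputs are kept short on purpose so the nested form cannot blow up.

```java
import java.util.regex.Pattern;

class NestedQuantifierDemo {
    static final Pattern NESTED = Pattern.compile("(?:(?:X)+)*"); // quantified quantified group
    static final Pattern FLAT = Pattern.compile("X*");            // same language, one quantifier

    // Hypothetical helper: true if both patterns give the same full-match result.
    static boolean agree(String s) {
        return NESTED.matcher(s).matches() == FLAT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"", "X", "XXXX", "XXY", "Y"};
        for (String s : inputs) {
            System.out.println("\"" + s + "\""
                    + " nested=" + NESTED.matcher(s).matches()
                    + " flat=" + FLAT.matcher(s).matches());
        }
    }
}
```

Both patterns accept exactly the strings made only of X (including the empty string). The difference is that the nested form can divide a run of X's between the inner and outer loops in many different ways, and a backtracking engine may try all of them before reporting a failure.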

The other point is that you should try to refrain from grouping unitary
constructs altogether.

In your sample, each character class is an example of a unitary (base) construct.

Also, use clustering (?:..) instead of capturing (..) if you can.
A construct like ([+\\-/*])+ will match one or more of the characters
in that class, but will only capture the last one.
So the capture group is of no real use, either as a grouping or as a capture.
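
The capture behaviour is easy to see with a quick sketch. The class name CaptureDemo, the helper operatorGroup and the input "a +- b" are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureDemo {
    // Hypothetical helper: group 2 of a full match, or null if there is no match.
    static String operatorGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "a +- b";

        // Quantifier outside the group: each repetition overwrites the capture.
        System.out.println(operatorGroup("([\\w\\s]+)([+\\-/*])+([\\w\\s]+)", input)); // prints "-"

        // Quantifier inside the group: the whole run of operators is captured.
        System.out.println(operatorGroup("([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)", input)); // prints "+-"
    }
}
```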

So, if you follow these rules and keep the capture groups, the new regexes
would look like this:

 # "([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)"

 ( [\w\s]+ )                   # (1)
 ( [+\-/*]+ )                  # (2)
 ( [\w\s]+ )                   # (3)

and

 # "([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)"

 ( [\w\s]+ )                   # (1)
 ( [+\-/*]+ )                  # (2)
 ( [\w\s]+ )                   # (3)
 ( [+\-/*]+ )                  # (4)
 ( [\w\s]+ )                   # (5)
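
Since every group now captures something meaningful, the groups can be read back directly. The following is only a sketch under that assumption. ExpressionPartsDemo and parts are hypothetical names, and trimming the operands is my own addition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpressionPartsDemo {
    // The two rewritten regexes from above, compiled once.
    static final Pattern ONE_OPERATOR =
            Pattern.compile("([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)");
    static final Pattern TWO_OPERATORS =
            Pattern.compile("([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)");

    // Hypothetical helper: trimmed capture groups of the first pattern that
    // matches the whole string, or an empty list if neither pattern matches.
    static List<String> parts(String s) {
        for (Pattern p : new Pattern[] {ONE_OPERATOR, TWO_OPERATORS}) {
            Matcher m = p.matcher(s);
            if (m.matches()) {
                List<String> out = new ArrayList<>();
                for (int g = 1; g <= m.groupCount(); g++) {
                    out.add(m.group(g).trim());
                }
                return out;
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        System.out.println(parts("1 - Avg merchant Volume Units"));
        System.out.println(parts("Avg retailer On Hand / Volume Units Plan / Store Count"));
        System.out.println(parts("Total retailer shipment Count"));
    }
}
```

For the second sample string this yields the three operands and the two "/" operators in order, and for a string without any operator it yields an empty list.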
