简体   繁体   中英

java.lang.StackOverflowError at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)

I am getting StackOverflowError when I use following Reg Ex :

"([A-Z][A-Z]\\d\\d[A-Z]\\[(\\*|(((\\d|\\d\\d)-(\\d|\\d\\d))|(\\d|\\d\\d)))\\](,|$))+";

to match something like this String :

RA01D[1-1],RA01D[17-17],RA01D[2-2],RA01D[18-18]

What stribizhev's answer has pointed out and fixed is an inefficiency in the regex. There is no catastrophic backtracking here. The change only slightly delays the StackOverflowError without resolving it (see Appendix ).

In the original regex, if the first branch (\\d|\\d\\d)-(\\d|\\d\\d) fails, the second branch will do extra work matching (\\d|\\d\\d) again, which is the prefix of the first branch.

(
  (
    (\d|\d\d)-(\d|\d\d)
  )
  |
  (\d|\d\d)
)

When re-written (as shown in his answer), the prefix (\\d|\\d\\d) will only be matched once, and the engine only needs to check the 2 different sequels (matching -(\\d|\\d\\d) or just an empty string).

(\d|\d\d)(?:-(\d|\d\d))?

This is how his answer improves on the efficiency of the regex. The same method is applied to \\d|\\d\\d .


Back to the problem of StackOverflowError . If you run the regex on a string with 1000 elements, any of the regex above will cause StackOverflowError . It is due to the implementation of Pattern class in Sun/Oracle/OpenJDK, which uses recursion for greedy and lazy quantifier.

Since the regex is non-ambiguous , you can fix the problem by making the quantifier on the outer most group possessive. The regex is copied from stribizhev's answer with some modifications:

"(?:[A-Z][A-Z]\\d\\d[A-Z]\\[(?:\\*|\\d{1,2}(?:-\\d{1,2})?)\\](?:,|$))++"
                                                                     ^^

Since the implementation uses a loop to implement possessive quantifier (as there is no need to backtrack), StackOverflowError cannot occur, regardless of how long the input string is. The stack usage is only as much as a single repetition, as opposed to the case in the question, where it grows linearly to the number of elements in the string.

Appendix

Test program

Below is a test program showing the number of elements that the regex can handle. On my system (Oracle JRE, version 1.8.0_25), the regex in the question only manages to reach 104 * 4 = 416 elements before crashing, stribizhev's answer manages to reach 137 * 4 = 548, stribizhev's answer modified to remove unnecessary group manages to reach 150 * 4 = 600, and the version with possessive quantifier passes all tests (500 * 4 = 2000 elements).

public class SO29758814 {
    public static void main(String[] args) {
        String s = "";
        for (int i = 1; i <= 500; i++) {

            s += "RA01D[1-1],RA01D[17-17],RA01D[2-2],RA01D[18-18],";

            System.out.print(i);

            try {
                // Question
                System.out.print(" 1 " + s.matches("([A-Z][A-Z]\\d\\d[A-Z]\\[(\\*|(((\\d|\\d\\d)-(\\d|\\d\\d))|(\\d|\\d\\d)))\\](,|$))+"));
            } catch (Throwable e) { }

            try {
                // stribizhev's answer
                System.out.print(" 2 " + s.matches("([A-Z]{2}\\d{2}[A-Z]\\[(\\*|((\\d{1,2})(?:-(\\d{1,2}))?))\\](?:,|$))+"));
            } catch (Throwable e) { }

            try {
                // stribizhev's answer, remove unnecessary groups
                System.out.print(" 3 " + s.matches("(?:[A-Z][A-Z]\\d\\d[A-Z]\\[(?:\\*|\\d{1,2}(?:-\\d{1,2})?)\\](?:,|$))+"));
            } catch (Throwable e) { }

            try {
                // stribizhev's answer, remove unnecessary groups, use possessive quantifier
                System.out.print(" 4 " + s.matches("(?:[A-Z][A-Z]\\d\\d[A-Z]\\[(?:\\*|\\d{1,2}(?:-\\d{1,2})?)\\](?:,|$))++"));
            } catch (Throwable e) { }

            System.out.println();

        }
    }
}

Your regex contains alternative lists having similar patterns that often lead to catastrophic backtracking and may impact performance. Have a look at this pattern:

 (
   (
    (\d|\d\d)-(\d|\d\d)
   )
   |
   (\d|\d\d)
 )

It is equal to

(
   (
    (\d|\d\d)(?:-(\d|\d\d))?
   )
)

Also, you'd better use quantifiers, (\\d|\\d\\d) is equal to \\d{1,2} . I also doubt you need to capture a comma or end-of-string, so add a non-capturing group (?:,|$) .

So, try using this regex ( see demo here )

([A-Z]{2}\d{2}[A-Z]\[(\*|((\d{1,2})(?:-(\d{1,2}))?))\](?:,|$))+

Or as a Java string:

String pattern = "([A-Z]{2}\\d{2}[A-Z]\\[(\\*|((\\d{1,2})(?:-(\\d{1,2}))?))\\](?:,|$))+";

You can also adjust the number of capturing groups.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM