
How to find the last occurrence of any of a set of characters in a string using regex in Java?

I need to find the last index of any of a set of characters in a string. Say the set of characters is x, y, z and the string is Vereador Luiz Pauly Home; then I need the index 18.

To find the index I created a pattern with the DOTALL flag and a greedy quantifier: (?s).*(x|y|z). When the pattern is applied to that (multiline) string, I can read the index from the start of group 1. The code:

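import java.util.regex.Matcher;
import java.util.regex.Pattern;

int findIndex(String str){
  int index = -1;
  Pattern p = Pattern.compile("(?s).*(x|y|z)");
  Matcher m = p.matcher(str);
  if(m.find()){
    // Group 1 holds the last x, y or z because .* is greedy.
    index = m.start(1);
  }
  return index;
}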

As expected, it returns the correct index when there is a match.

But when there is no match, it takes far too long (17 minutes for 600000 characters) because of the greedy match.

I tried other quantifiers but couldn't get the desired output. Can anyone suggest a better regex?

PS: I could also traverse the content from the end to find the index, but I hope there is a better way in regex that does the job quickly.

The performance issues with the (?s).*(x|y|z) regex come from the fact that .* is the first subpattern: it grabs the whole string first, and then the engine backtracks one character at a time looking for x, y or z. If there is no match, find() repeats that work from every start position, so a failed search costs time quadratic in the length of the string. With no match, or a match only near the start, and a very large string, this can take a really long time.

The ([xyz])(?=[^xyz]*$) pattern seems a little better: it captures x, y or z and asserts that there is no other x, y or z up to the end of the string. It is still somewhat resource-consuming, though, because the lookahead is checked after every candidate match.
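As a rough illustration of the lookahead variant described above, here is a minimal sketch. The class name LookaheadLastIndex and the method name findIndex are chosen only for this example, and no timings are claimed for it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class LookaheadLastIndex {
    // Capture x, y or z, then assert that no further x, y or z
    // occurs between this character and the end of the input.
    private static final Pattern LAST = Pattern.compile("([xyz])(?=[^xyz]*$)");

    static int findIndex(String str) {
        Matcher m = LAST.matcher(str);
        // start(1) is the index of the captured character; -1 means no match.
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        System.out.println(findIndex("Vereador Luiz Pauly Home"));
        System.out.println(findIndex("no match here"));
    }
}
```

Every x, y or z that is not the last one triggers a lookahead scan that ends in failure, which is where the extra cost mentioned above comes from.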

The fastest regex to get your job done is

^(?:[^xyz]*+([xyz]))+

It matches

  • ^ - start of string
  • (?:[^xyz]*+([xyz]))+ - 1 or more repetitions of
    • [^xyz]*+ - any 0 or more chars other than x, y and z, matched possessively (no backtracking into the pattern is allowed)
    • ([xyz]) - Group 1: x, y or z.

The Group 1 value and position belong to the last iteration of the repeated group, because each subsequent iteration overwrites what the previous one captured.
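Plugged into the question's method, the possessive pattern above might be used as follows. This is only a sketch: the class name PossessiveLastIndex is made up for the example, and behaviour on very large inputs has not been measured here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class PossessiveLastIndex {
    // ^ anchors the match at the start of the input; each iteration skips
    // non-xyz characters possessively and then captures one x, y or z.
    private static final Pattern LAST = Pattern.compile("^(?:[^xyz]*+([xyz]))+");

    static int findIndex(String str) {
        Matcher m = LAST.matcher(str);
        // Group 1 keeps the capture from the last successful iteration.
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        System.out.println(findIndex("Vereador Luiz Pauly Home"));
        System.out.println(findIndex("no match here"));
    }
}
```

Because of the ^ anchor, a string without x, y or z fails after a single scan instead of being retried from every start position.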

There are a few ways to solve the problem, and the best one will depend on the size of the input and the complexity of the pattern:

  1. Reverse the input string and possibly the pattern; this might work for non-complex patterns. Unfortunately, java.util.regex doesn't allow matching a pattern from right to left.

  2. Instead of using a greedy quantifier, simply match the pattern and loop over Matcher.find() until the last occurrence is found.

  3. Use a different regex engine with better performance, e.g. RE2/J: linear time regular expression matching in Java.
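Option 2 from the list above could be sketched like this. The class name LoopLastIndex is an assumption made for the example; the pattern [xyz] stands in for whatever pattern is actually needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class LoopLastIndex {
    private static final Pattern ANY = Pattern.compile("[xyz]");

    static int findIndex(String str) {
        Matcher m = ANY.matcher(str);
        int index = -1;
        // Each find() resumes after the previous match, so the input is
        // scanned once; the last start() seen is the last occurrence.
        while (m.find()) {
            index = m.start();
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(findIndex("Vereador Luiz Pauly Home"));
        System.out.println(findIndex("no match here"));
    }
}
```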

If option 2 is not efficient enough for your case, I'd suggest trying RE2/J. From its documentation:

Java's standard regular expression package, java.util.regex, and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as a|b , the engine will try to match subpattern a first, and if that yields no match, it will reset the input stream and try to match b instead.

If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.

In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.

StringBuilder has a reverse() method and is also a CharSequence, so a Matcher can search the reversed text directly.

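int findIndex(String str) {
  Pattern p = Pattern.compile("[xyz]");
  StringBuilder sb = new StringBuilder(str).reverse();
  Matcher m = p.matcher(sb);
  // The first match in the reversed text is the last one in the original.
  return m.find() ? sb.length() - m.end() : -1;
}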

Unfortunately reversal is costly.

A solution without regex is probably faster.

(BTW surrogate pairs are handled correctly by the reversal.)
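For comparison, the regex-free approach mentioned above (and in the question's PS) could look roughly like this. It is a sketch with a hard-coded character set, and the class name BackwardScanLastIndex is invented for the example.

```java
// Hypothetical helper class, named for this example only.
class BackwardScanLastIndex {
    static int findIndex(String str) {
        // Walk from the end and stop at the first x, y or z encountered,
        // which is the last occurrence in reading order.
        for (int i = str.length() - 1; i >= 0; i--) {
            char c = str.charAt(i);
            if (c == 'x' || c == 'y' || c == 'z') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findIndex("Vereador Luiz Pauly Home"));
        System.out.println(findIndex("no match here"));
    }
}
```

This needs no copy of the input and stops as soon as the last occurrence is found.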
