
Java regex for matching multiple keys in a string

Consider an input string like

Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5

and the regular expression

\b(TWO|FOUR)=([^ ]*)\b

Using this regular expression, the following code extracts 2 specific key-value pairs out of the 5 present (i.e., only some predefined key-value pairs should be extracted).

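  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class KeyValueExtractor {
    public static void main(String[] args) {
      String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
      // Group 1 captures the key, group 2 the value (everything up to the next space).
      String regex = "\\b(TWO|FOUR)=([^ ]*)\\b";
      Pattern pattern = Pattern.compile(regex);
      Matcher matcher = pattern.matcher(input);
      // Each find() call resumes the search after the previous match.
      while (matcher.find()) {
        System.out.println("\t" + matcher.group(1) + " = " + matcher.group(2));
      }
    }
  }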

More specifically, the main() method above prints

TWO = 2
FOUR = 4

However, every time find() is invoked, the whole regular expression is evaluated again, left to right, against the part of the string that remains after the previous match.
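To make that resumption behaviour visible, here is a minimal sketch that records the start and end offsets of each match. The class name FindPositionsDemo and the method matchSpans are made up for illustration; only the standard java.util.regex calls are assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class FindPositionsDemo {

    // Returns one entry per successful find() call, formatted as
    // "<matched text> [<start>,<end>)".
    static List<String> matchSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        // Each find() starts searching at the end offset of the previous match.
        while (matcher.find()) {
            spans.add(matcher.group() + " [" + matcher.start() + "," + matcher.end() + ")");
        }
        return spans;
    }

    public static void main(String[] args) {
        String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
        for (String span : matchSpans("\\b(TWO|FOUR)=([^ ]*)\\b", input)) {
            System.out.println(span);
        }
    }
}
```

For the input above this should report TWO=2 [28,33) and FOUR=4 [59,65): the second search begins at offset 33, so the full alternation (TWO|FOUR) is tried again at every position from there on.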

Also, if the keys are not mutually distinct (or if each key is replaced by a regular expression that can match more than one key), the same alternative can match several times. For instance, if the regex becomes

\b(O.*?|T.*?)=([^ ]*)\b

the above method yields

ONE = 1
TWO = 2
THREE = 3

If the regex were not fully re-evaluated, and each alternative were instead somehow examined only once (or if an appropriately modified regex were used), the output would have been

ONE = 1
TWO = 2

So, two questions:

  1. Is there a more efficient way of extracting a selected set of unique keys and their values, compared to the original regular expression?
  2. Is there a regular expression that can match every alternative part of the OR ( | ) sub-expression exactly once and not evaluate it again?

Java Returns a Match Position: You can Use Dynamically-Generated Regex on Remaining Substrings

With the understanding that it can be generalized to a more complex and useful scenario, let's take a variation on your first example: \\b(TWO|FOUR|SEVEN)=([^ ]*)\\b

You can use it like this:

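// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern regex = Pattern.compile("\\b(TWO|FOUR|SEVEN)=([^ ]*)\\b");
Matcher regexMatcher = regex.matcher(yourString);
if (regexMatcher.find()) {
    String theMatch = regexMatcher.group();      // the whole match, e.g. "TWO=2"
    String FoundToken = regexMatcher.group(1);   // the key that matched
    int EndPosition = regexMatcher.end();        // offset just past the match
}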

You could then:

  • Test the value contained by FoundToken
  • Depending on that value, dynamically generate a regex testing for the remaining possible tokens. For instance, if you found FOUR, your new regex would be \\b(TWO|SEVEN)=([^ ]*)\\b
  • Using EndPosition, apply that regex to the rest of the string, that is, everything after the previous match.
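The steps above could be sketched roughly as follows. This is only one way to do it, under a few assumptions: the class RemainingKeysExtractor and its extract method are hypothetical names, keys are treated as literal strings (hence Pattern.quote), and Matcher.find(int) is used instead of cutting a substring so that \b can still see the character before the restart offset.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class illustrating the "regenerate the regex" idea.
class RemainingKeysExtractor {

    // Extracts the first occurrence of each key, rebuilding the alternation
    // after every match so that already-found keys are no longer tested.
    static Map<String, String> extract(String input, Set<String> keys) {
        Set<String> remaining = new LinkedHashSet<>(keys);
        Map<String, String> found = new LinkedHashMap<>();
        int position = 0; // plays the role of EndPosition in the snippet above

        while (!remaining.isEmpty()) {
            // Build e.g. \b(\QTWO\E|\QSEVEN\E)=([^ ]*)\b from the keys still wanted.
            String alternation = remaining.stream()
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|"));
            Pattern pattern = Pattern.compile("\\b(" + alternation + ")=([^ ]*)\\b");
            Matcher matcher = pattern.matcher(input);

            // find(int) resets the matcher and searches from the given offset.
            if (!matcher.find(position)) {
                break; // none of the remaining keys occurs in the rest of the input
            }
            found.put(matcher.group(1), matcher.group(2));
            remaining.remove(matcher.group(1));
            position = matcher.end();
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
        Set<String> keys = new LinkedHashSet<>(List.of("TWO", "FOUR", "SEVEN"));
        extract(input, keys).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Note that this recompiles a pattern after every match, which is the cost hinted at in the discussion below: for a handful of keys it is unlikely to beat the single precompiled regex.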

Discussion

  • This approach would serve your goal of not re-evaluating parts of the OR that have already matched.
  • It also serves your goal of avoiding duplicates.
  • Would that be faster? Not in this simple case. But you said you are dealing with a real problem, and it will be a valid approach in some cases.

