
Regular expression with variable number of groups?

Is it possible to create a regular expression with a variable number of groups?

After running this for instance...

Pattern p = Pattern.compile("ab([cd])*ef");
Matcher m = p.matcher("abcddcef");
m.matches();

... I would like to have something like

  • m.group(1) = "c"
  • m.group(2) = "d"
  • m.group(3) = "d"
  • m.group(4) = "c".

(Background: I'm parsing some lines of data, and one of the "fields" is repeating. I would like to avoid a matcher.find loop for these fields.)


As pointed out by @Tim Pietzcker in the comments, Perl 6 and .NET have this feature.

According to the documentation, Java regular expressions can't do this:

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.

(emphasis added)
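
A minimal sketch of what the quoted rule means in practice for the pattern from the question. The class name LastCaptureDemo and the helper lastCapture are made up for illustration; only java.util.regex calls are used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // Returns the content of group 1 after a full match, or null if the
    // input does not match or the group never took part in the match.
    static String lastCapture(String input) {
        Matcher m = Pattern.compile("ab([cd])*ef").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("ab([cd])*ef").matcher("abcddcef");
        System.out.println(m.matches());    // true
        System.out.println(m.groupCount()); // 1, no matter how often the group repeats
        // Only the most recent iteration of the group is kept.
        System.out.println(lastCapture("abcdcdef")); // d
    }
}
```

The group count is fixed when the pattern is compiled, so the earlier iterations are simply overwritten.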

You can use split to get the fields you need into an array and loop through that.

http://download.oracle.com/javase/1,5.0/docs/api/java/lang/String.html#split(java.lang.String)
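
A rough sketch of the split approach, assuming a hypothetical line format such as "id=42;tags=red,green,blue;end" in which the repeating field is comma-separated. The format, the class name SplitRepeatingField and the method tags are assumptions for illustration, not something given in the question.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitRepeatingField {
    // Hypothetical line format: id=<digits>;tags=<comma-separated values>;end
    private static final Pattern LINE = Pattern.compile("id=(\\d+);tags=([^;]*);end");

    // Captures the whole repeating field as one group, then splits it.
    static List<String> tags(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        String field = m.group(2);
        if (field.isEmpty()) {
            // "".split(",") would yield one empty string, so handle this case explicitly.
            return Collections.emptyList();
        }
        return Arrays.asList(field.split(","));
    }

    public static void main(String[] args) {
        System.out.println(tags("id=42;tags=red,green,blue;end")); // [red, green, blue]
    }
}
```

The regex validates the line and isolates the repeating part in a single group; split then does the per-item work.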

I have not used Java regex, but for many languages the answer is: No.

Capturing groups seem to be created when the regex is parsed, and filled when it matches the string. The expression (a)|(b)(c) has three capturing groups, but any given match fills only one or two of them. (a)* has just one group; the engine leaves the last match in the group after matching.
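
For Java specifically, this can be checked with a small sketch like the one below. GroupCountDemo and its helper groupCount are illustrative names; Matcher.groupCount() reports the number of groups in the pattern, independent of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // The number of capturing groups depends only on the pattern text.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(a)|(b)(c)")); // 3
        System.out.println(groupCount("(a)*"));       // 1

        Matcher m = Pattern.compile("(a)|(b)(c)").matcher("bc");
        if (m.matches()) {
            System.out.println(m.group(1)); // null, this alternative did not take part
            System.out.println(m.group(2)); // b
            System.out.println(m.group(3)); // c
        }
    }
}
```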

Pattern p = Pattern.compile("ab(?:(c)|(d))*ef");
Matcher m = p.matcher("abcdef");
m.matches();

should do what you want.

EDIT:

@aioobe, I understand now. You want to be able to do something like the grammar

A    ::== <Foo> <Bars> <Baz>
Foo  ::== "foo"
Baz  ::== "baz"
Bars ::== <Bar> <Bars>
        | ε
Bar  ::== "A"
        | "B"

and pull out all the individual matches of Bar.

No, there is no way to do that using java.util.regex. You can recurse and use a regex on the match of Bars, or use a parser generator like ANTLR and attach a side-effect to Bar.

I have just had a very similar problem, and managed to get a "variable number of groups" by combining a while loop with resetting the matcher.

    int i=0;
    String m1=null, m2=null;

    while(matcher.find(i) && (m1=matcher.group(1))!=null && (m2=matcher.group(2))!=null)
    {
        // do work on two found groups
        i=matcher.end();
    }

But this is for my problem (with two repeating

    Pattern pattern = Pattern.compile("(?<=^ab[cd]{0,100})[cd](?=[cd]{0,100}ef$)");
    Matcher matcher = pattern.matcher("abcddcef");
    int i=0;
    String res=null;

    while(matcher.find(i) && (res=matcher.group())!=null)
    {
        System.out.println(res);
        i=matcher.end();
    }

You lose the ability to specify an arbitrary number of repetitions with * or + in the look-behind, because look-behind must have a bounded maximum length; hence the {0,100}. (Look-ahead has no such restriction in java.util.regex.)

I would think that backtracking gets in the way of this behavior; consider the effect of /([\\S\\s])/ accumulating every capture on something like the Bible. Even if it could be done, the output would be hard to interpret because the groups would lose their positional meaning. It's better to run a separate regex globally for each kind of item and collect the matches into an array.

If there is a reasonable max number of matching groups you would encounter:

"ab([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?ef"

This example will work for 0 - 8 matches. I admit this is ugly and not humanly readable.
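
One way to consume such a pattern is to walk over all groups and keep the ones that took part in the match. This is only a sketch; OptionalGroupsDemo and fields are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupsDemo {
    private static final Pattern P = Pattern.compile(
        "ab([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?ef");

    // Collects the groups that actually matched; unused groups are null.
    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = P.matcher(input);
        if (m.matches()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                String value = m.group(g);
                if (value != null) {
                    result.add(value);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("abcddcef")); // [c, d, d, c]
        System.out.println(fields("abef"));     // []
    }
}
```

Inputs with more than eight repetitions do not match at all, so the cap has to be chosen with the data in mind.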

I would like to avoid a matcher.find loop for these fields.

As stated in other answers, that cannot be avoided. For completeness, here is how to do it using a second Pattern to go over the individual matches. Note that the * sits inside the parentheses rather than after them, so the group captures the whole repeated run.

Pattern subPattern = Pattern.compile("[cd]");
Pattern pattern = Pattern.compile("ab(" + subPattern.pattern() + "*)ef"); // DRY, but probably safer ways to do it for the case that subPattern needs to be changed.
Matcher matcher = pattern.matcher("abccdcddef is great and all, but have you heard about abef and abddcef?");
List<String> letterSequence = new ArrayList<>();
while (matcher.find()) {
    String letters = matcher.group(1);
    Matcher subMatcher = subPattern.matcher(letters);
    while (subMatcher.find()) {
        String letter = subMatcher.group();
        letterSequence.add(letter);
    }
}
System.out.println(letterSequence);

Output:

[c, c, d, c, d, d, d, d, c]
