
Regular expression with variable number of groups?

Is it possible to create a regular expression with a variable number of groups?

After running this, for instance:

Pattern p = Pattern.compile("ab([cd])*ef");
Matcher m = p.matcher("abcddcef");
m.matches();

... I would like to have something like:

  • m.group(1) = "c"
  • m.group(2) = "d"
  • m.group(3) = "d"
  • m.group(4) = "c"

(Background: I'm parsing some lines of data, and one of the "fields" is repeating. I would like to avoid a matcher.find loop for these fields.)


As pointed out by @Tim Pietzcker in the comments, perl6 and .NET have this feature.

According to the documentation, Java regular expressions can't do this:

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.

(emphasis added)
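The quoted behavior can be checked with a minimal sketch that reuses the pattern and input from the question. The class name `LastCaptureDemo` and its helper method are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastCaptureDemo {
    // Returns the text held by the given group after a full match, or null if
    // the input does not match or the group did not take part in the match.
    static String groupAfterMatch(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("ab([cd])*ef").matcher("abcddcef");
        System.out.println(m.matches());    // true
        System.out.println(m.groupCount()); // 1: the quantifier adds no groups
        System.out.println(m.group(1));     // c: only the last iteration is kept

        // The example from the documentation quote.
        System.out.println(groupAfterMatch("(a(b)?)+", "aba", 2)); // b
    }
}
```

The group iterates over c, d, d, c, but only the final "c" is still available once the match is complete.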

You can use split to get the fields you need into an array and loop through that.

http://download.oracle.com/javase/1,5.0/docs/api/java/lang/String.html#split(java.lang.String)
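As a rough sketch of the split approach applied to the question's input: capture the whole repeated run in one group, then split it. This assumes each "field" is a single character, so splitting on the empty string is enough (Java 8 or later); real data would more likely be split on its delimiter. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitRepeatedField {
    // One group around the whole repetition: the * sits inside the parentheses.
    private static final Pattern LINE = Pattern.compile("ab([cd]*)ef");

    static List<String> letters(String input) {
        Matcher m = LINE.matcher(input);
        // "".split("") would return a single empty string, so handle the
        // empty run separately.
        if (!m.matches() || m.group(1).isEmpty()) {
            return Collections.emptyList();
        }
        // Splitting on the empty string yields one element per character.
        return Arrays.asList(m.group(1).split(""));
    }

    public static void main(String[] args) {
        System.out.println(letters("abcddcef")); // [c, d, d, c]
        System.out.println(letters("abef"));     // []
    }
}
```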

I have not used java regex, but for many languages the answer is: No.

Capturing groups seem to be created when the regex is parsed, and filled when it matches the string. The expression (a)|(b)(c) has three capturing groups, but any given match fills either one or two of them. (a)* has just one group; after matching, that group holds only the last iteration's match.
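A small sketch to illustrate this in java.util.regex: the number of groups is a property of the pattern, not of the input. `GroupCountDemo` is a made-up name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupCountDemo {
    // The group count is fixed when the pattern is compiled; the input
    // passed to the matcher does not affect it.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(a)|(b)(c)")); // 3
        System.out.println(groupCount("(a)*"));       // 1

        Matcher m = Pattern.compile("(a)|(b)(c)").matcher("bc");
        if (m.matches()) {
            System.out.println(m.group(1)); // null: this alternative was not used
            System.out.println(m.group(2)); // b
            System.out.println(m.group(3)); // c
        }
    }
}
```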

Pattern p = Pattern.compile("ab(?:(c)|(d))*ef");
Matcher m = p.matcher("abcdef");
m.matches();

should do what you want.

EDIT:

@aioobe, I understand now. You want to be able to do something like the grammar

A    ::== <Foo> <Bars> <Baz>
Foo  ::== "foo"
Baz  ::== "baz"
Bars ::== <Bar> <Bars>
        | ε
Bar  ::== "A"
        | "B"

and pull out all the individual matches of Bar.

No, there is no way to do that using java.util.regex. You can recurse and use a regex on the match of Bars, or use a parser generator like ANTLR and attach a side-effect to Bar.

I have just had a very similar problem, and managed to do "variable number of groups" with a combination of a while loop and resetting the matcher.

    int i=0;
    String m1=null, m2=null;

    while(matcher.find(i) && (m1=matcher.group(1))!=null && (m2=matcher.group(2))!=null)
    {
        // do work on two found groups
        i=matcher.end();
    }

But this is for my problem (with two repeating

    Pattern pattern = Pattern.compile("(?<=^ab[cd]{0,100})[cd](?=[cd]{0,100}ef$)");
    Matcher matcher = pattern.matcher("abcddcef");
    int i=0;
    String res=null;

    while(matcher.find(i) && (res=matcher.group())!=null)
    {
        System.out.println(res);
        i=matcher.end();
    }

You lose the ability to specify an arbitrary length of repetition with * or + because the look-behind must have a bounded, predictable length in java.util.regex (the look-ahead is bounded here only for symmetry).

I would think that backtracking inhibits this behavior; consider the effect of /([\\S\\s])/ accumulating its groups over something as large as the Bible. Even if it could be done, the output would be unknowable, because the groups would lose their positional meaning. It's better to run a separate regex for each kind of item in a global sense and deposit the matches into an array.

If there is a reasonable max number of matching groups you would encounter:

"ab([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?ef"

This example will work for 0 - 8 matches. I admit this is ugly and not humanly readable.

I would like to avoid a matcher.find loop for these fields.

As stated in other answers, that cannot be avoided. For completeness, here is how to do it using a second Pattern to go over the individual matches. Note that the * is inside the round brackets rather than after them.
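With such a pattern, the groups that did not take part in the match come back as null, so they have to be skipped when collecting results. A minimal sketch, with the hypothetical class name `BoundedGroups`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundedGroups {
    // Eight optional groups, as in the pattern above.
    private static final Pattern P = Pattern.compile(
        "ab([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?ef");

    static List<String> captured(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        if (m.matches()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                // Optional groups that matched nothing are null.
                if (m.group(g) != null) {
                    out.add(m.group(g));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(captured("abcddcef")); // [c, d, d, c]
        System.out.println(captured("abef"));     // []
    }
}
```

An input with more than eight repetitions simply fails to match, so the bound has to be chosen with the real data in mind.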

Pattern subPattern = Pattern.compile("[cd]");
Pattern pattern = Pattern.compile("ab(" + subPattern.pattern() + "*)ef"); // DRY, but probably safer ways to do it for the case that subPattern needs to be changed.
Matcher matcher = pattern.matcher("abccdcddef is great and all, but have you heard about abef and abddcef?");
List<String> letterSequence = new ArrayList<>();
while (matcher.find()) {
    String letters = matcher.group(1);
    Matcher subMatcher = subPattern.matcher(letters);
    while (subMatcher.find()) {
        String letter = subMatcher.group();
        letterSequence.add(letter);
    }
}
System.out.println(letterSequence);

Output:

[c, c, d, c, d, d, d, d, c]
