
Regular Expression in Java. Unexpected behaviour

I am trying to match mostly numbers, but I need to treat them differently depending on the words that follow the expression.

I match every number that is not followed by a temperature term like °C or by a time specification. My regular expression looks like this:

(((\d+?)(\s*)(\-)(\s*))?(\d+)(\s*))++(?!minuten|Minuten|min|Min|Stunden|stunden|std|Std|°C| °C)

Here is an example: http://regexr.com?33jeg

While that behavior is what I expected, Java does the following for the match "4" (each index is the number of the corresponding group):

0: "4 "
1: "4 "
2: "0 - "
3: "0"
4: " "
5: "-"
6: " "
7: "4"
8: " "
9: "°C"

You need to know that I match every string separately. So the match for the 5 looks like this:

0: "5 "
1: "5 "
2: "null"
3: "null"
4: "null"
5: "null"
6: "null"
7: "5"
8: " "
9: "null"

This is how I'd like the other match to look. The unpleasant behavior only occurs when a "-" appears somewhere in the string before the match.

My Java code is the following:

public static void adaptPortionDetails(EList<Step> steps, double multiplicator){
    
    String portionMatcher = "(((\\d+?)(\\s*)(\\-)(\\s*))?(\\d+)(\\s*))++(?!°C|Grad|minuten|Minuten|min|Min|Stunden|stunden|std|Std)";
    
    for (int i = 0; i < steps.size(); i++) {
        Matcher matcher = Pattern.compile(portionMatcher).matcher(
                steps.get(i).getDescription());
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            printGroups(matcher);
            String newValue1Str;
            if (matcher.group(3) == null){
                newValue1Str = "";
                System.out.println("test");
            }else{
                double newValue1 = Integer.parseInt(matcher.group(3)) * multiplicator;
                newValue1Str = Fraction.getFraction(newValue1).toProperString();
            }
            double newValue2 = Integer.parseInt(matcher.group(7)) * multiplicator;
            String newValue2Str = Fraction.getFraction(newValue2).toProperString();
            
            
            matcher.appendReplacement(sb, newValue1Str + "$4$5$6" + newValue2Str + "$8");
        }
        matcher.appendTail(sb);
        steps.get(i).setDescription(sb.toString());
    }
}

Hope you can tell what I'm missing.

This seems to be a bug (or feature?) in Java's implementation. It doesn't seem to reset the text captured by a capturing group when a match attempt fails and matching has to be redone from the next index.

This test reveals the discrepancy in behavior between Java's regex engine and PHP's PCRE.

  • Regex: (\\d+(-\\d+)?){1}+(?!x)
  • Input: 34 34-43x 78 90
  • Java result: 3 matches (34, 78, 90). The 2nd capturing group of the 2nd match is -43. The 2nd capturing group captures nothing for the 1st and 3rd matches.
  • PHP result: the same 3 matches, but the 2nd capturing group captures nothing for any of them. In PHP's PCRE implementation, the captured text of the capturing groups is reset when the match has to be redone.

This was tested on JRE 6 Update 37 and JRE 7 Update 11.

The following test gives the same result, just to prove the point that captured text is not reset when matching has to be redone:

  • Regex: a(\\d+(-\\d+)?){1}+(?!x)
  • Input: a34 a34-43x a78 a90
  • PHP result
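The two tests above can be reproduced with a small sketch like the following. The class and method names (StaleGroupDemo, describe) are made up for illustration. What group 2 reports for the second match depends on the JRE: on the versions listed above it is the stale -43, and other versions may reset it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StaleGroupDemo {
    // Returns one "match -> group 2" line per match found in the input.
    static List<String> describe(String regex, String input) {
        List<String> lines = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // group(2) is null when the optional "-digits" part took no part in the match
            lines.add(m.group() + " -> " + m.group(2));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe("(\\d+(-\\d+)?){1}+(?!x)", "34 34-43x 78 90")) {
            System.out.println(line);
        }
        for (String line : describe("a(\\d+(-\\d+)?){1}+(?!x)", "a34 a34-43x a78 a90")) {
            System.out.println(line);
        }
    }
}
```

A line such as "78 -> -43" in the output means the engine kept the text captured during the failed attempt on 34-43x.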

Some comments about your regex

I think the ++ should be {1}+, since it seems that you want to modify one number or one range of numbers at a time, while keeping the match possessive so that unwanted numbers are discarded.

Workaround
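To see the difference between the two quantifiers, here is a minimal sketch. It uses a shortened version of your pattern (only min and Min in the lookahead, fewer groups) and an input string made up for this example. The class and method names are hypothetical as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects the full text of every match of regex in input.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "2 3 Eier 10 min kochen";
        // "++" repeats the group, so adjacent numbers end up in a single match.
        System.out.println(matches("((\\d+\\s*-\\s*)?\\d+\\s*)++(?!min|Min)", input));
        // "{1}+" takes exactly one number or range per match and stays possessive.
        System.out.println(matches("((\\d+\\s*-\\s*)?\\d+\\s*){1}+(?!min|Min)", input));
    }
}
```

With ++ the numbers 2 and 3 are consumed as one match, so the capturing groups only describe the last repetition. With {1}+ each number is its own match, and 10 is still rejected because the possessive group cannot give back the trailing space or digits to escape the lookahead.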

The first group (the outermost capturing group), which captures everything (one number or a range of numbers), is always overwritten when a match is found. Hence you can rely on it. You can check whether group 1 contains a - (with the contains method). If it does, capturing group 2 holds text captured during the current match, and you can use it. If it does not, ignore all the text captured by group 2 and its nested capturing groups.
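A minimal sketch of that workaround, assuming your pattern with ++ changed to {1}+. To keep it self-contained, plain integer multiplication stands in for your Fraction formatting, and the method works on a single description string instead of your EList<Step>. The class and method names are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PortionScaler {
    // The question's pattern with "++" replaced by "{1}+"; \u00B0 is the degree sign.
    private static final Pattern PORTION = Pattern.compile(
            "(((\\d+?)(\\s*)(\\-)(\\s*))?(\\d+)(\\s*)){1}+"
            + "(?!\u00B0C|Grad|minuten|Minuten|min|Min|Stunden|stunden|std|Std)");

    static String scale(String description, int factor) {
        Matcher m = PORTION.matcher(description);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Group 1 is always rewritten by a successful match, so it can be trusted.
            boolean isRange = m.group(1).contains("-");
            StringBuilder replacement = new StringBuilder();
            if (isRange) {
                // Groups 3 to 6 belong to the current match only when it is a range.
                replacement.append(Integer.parseInt(m.group(3)) * factor)
                        .append(m.group(4)).append(m.group(5)).append(m.group(6));
            }
            replacement.append(Integer.parseInt(m.group(7)) * factor).append(m.group(8));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement.toString()));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(scale(
                "Ofen auf 160 - 180 Grad vorheizen, dann 4 Tomaten und 2 - 3 Eier", 2));
    }
}
```

The replacement is built from the group values directly rather than from "$4$5$6" references, so stale captures left over from the failed attempt on "160 - 180 Grad" never reach the output.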
