Emoji Unicode in a regular expression with a variable number of groups

Question

I am aware that this is a corner case, but I have come across code that uses a regular expression with a variable number of groups.

According to the docs, this is legal:

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
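The quoted behavior can be checked directly. Below is a minimal sketch; the class name `RepeatedGroupDemo` and the helper `lastCaptures` are made up for illustration, and only standard `java.util.regex` calls are used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {

    // Returns the captures left after a full match; index 0 is the whole match.
    // Returns null when the input does not match the pattern.
    static String[] lastCaptures(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.matches()) {
            return null;
        }
        String[] groups = new String[matcher.groupCount() + 1];
        for (int i = 0; i <= matcher.groupCount(); i++) {
            groups[i] = matcher.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = lastCaptures("(a(b)?)+", "aba");
        // groupCount() is fixed by the pattern (2 here), not by the number of repetitions.
        System.out.println(groups.length - 1); // 2
        System.out.println(groups[1]);         // a  (the most recent repetition)
        System.out.println(groups[2]);         // b  (retained from the first repetition)
    }
}
```

Note that `groupCount()` depends only on the pattern, so a quantified group exposes just its last capture, never one capture per repetition.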

However, when I try to use that with the Unicode character 'GRINNING FACE WITH SMILING EYES' (U+1F601), I get a StringIndexOutOfBoundsException.

Is that expected according to the spec, or is it a bug?

Here is the test code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestEmoji {
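    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(A.)* EEE");

        // ASCII input: matches, and group 1 holds the last repetition ("AB").
        testGroups(pattern, "ACAB EEE");
        // Input ending in U+1F601 (a surrogate pair): throws StringIndexOutOfBoundsException.
        testGroups(pattern, "ABACA\uD83D\uDE01");
    }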

    public static void testGroups(Pattern pattern, String s) {
        Matcher matcher = pattern.matcher(s);
        if (matcher.matches()) {
            System.out.println("matches");
            System.out.println(matcher.groupCount());
            for (int i = 1; i <= matcher.groupCount(); ++i) {
                System.out.println(matcher.group(i));
            }
        }
    }
}

And the output, ending with the exception:

matches
1
AB
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -2
        at java.lang.String.charAt(String.java:658)
        at java.util.regex.Pattern$Slice.match(Pattern.java:3867)
        at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4382)
        at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4354)
        at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4304)
        at java.util.regex.Matcher.match(Matcher.java:1221)
        at java.util.regex.Matcher.matches(Matcher.java:559)
        at TestEmoji.testGroups(TestEmoji.java:19)
        at TestEmoji.main(TestEmoji.java:12)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

Answer 1

After some digging in the Java Bugs database, I found it:

http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8007395

JDK-8007395 : StringIndexOutofBoundsException in Match.find() when input String contains surrogate UTF-16 characters
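For context on what "surrogate UTF-16 characters" means in that bug title: U+1F601 lies outside the Basic Multilingual Plane, so a Java String stores it as two char values (a high and a low surrogate). The sketch below only inspects the string and does not exercise the regex engine; the class name `SurrogatePairDemo` and the constant `GRIN` are made up for illustration.

```java
public class SurrogatePairDemo {

    // The same escape sequence used in the test code from the question.
    static final String GRIN = "\uD83D\uDE01";

    public static void main(String[] args) {
        // One code point, but two UTF-16 code units.
        System.out.println(GRIN.length());                         // 2
        System.out.println(GRIN.codePointCount(0, GRIN.length())); // 1

        // The pair decodes to the single code point U+1F601.
        System.out.println(Integer.toHexString(GRIN.codePointAt(0)).toUpperCase()); // 1F601

        // Each half is a surrogate, not a character on its own.
        System.out.println(Character.isHighSurrogate(GRIN.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(GRIN.charAt(1)));  // true
    }
}
```

Any code that walks a string one char at a time has to account for such pairs, which is the kind of index arithmetic the bug report concerns.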

3 · Accepted · 2014-02-07 17:00:07
