Strange Java Unicode Regular Expression StringIndexOutOfBoundsException

My question is simple yet puzzling. There may be a simple switch that fixes this, but I don't have much experience with Java regexes...

String line = "💕💕💕";
line.replaceAll("(?i)(.)\\1{2,}", "$1");

This crashes. If I remove the (?i) switch, it works. The three Unicode characters are not random; they were found in a large Korean text, but I don't know whether they are valid.

The strange thing is that the regex works for all the other text except this. Why do I get the error?

This is the exception I get:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 6
    at java.lang.String.charAt(String.java:658)
    at java.lang.Character.codePointAt(Character.java:4668)
    at java.util.regex.Pattern$CIBackRef.match(Pattern.java:4846)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4125)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Start.match(Pattern.java:3408)
    at java.util.regex.Matcher.search(Matcher.java:1199)
    at java.util.regex.Matcher.find(Matcher.java:592)
    at java.util.regex.Matcher.replaceAll(Matcher.java:902)
    at java.lang.String.replaceAll(String.java:2162)
    at tokenizer.Test.main(Test.java:51)

The characters you mentioned are actually "double-byte characters", which means that two bytes form one character. For Java to interpret them, the encoding information needs to be passed explicitly when it differs from the default platform encoding; otherwise the default platform encoding is used.

To prove this, consider the following:

String line = "💕💕💕";
System.out.println(line.length());

This prints the length as 6, whereas we only have three characters.

Now the following code

String line1 = new String("💕💕💕".getBytes(),"UTF-8");
System.out.println(line1.length());

prints the length as 3, which is what was intended.

If you replace the line

String line = "💕💕💕";

with

 String line1 = new String("💕💕💕".getBytes(),"UTF-8");

it works and the regex does not fail. I have used UTF-8 here; please use the appropriate encoding for your intended platform.

Java's regex libraries depend heavily on Character Sequence, which in turn depends on the encoding scheme. For strings whose character encoding differs from the default encoding, the characters cannot be decoded correctly (it showed 6 chars instead of 3!), and hence the regex fails.

What Santosh explains in this answer is incorrect. This can be demonstrated by running

String str = "💕💕💕";
System.out.println("code point: " + str.codePointAt(0));

which outputs (at least for me) the value 128149, which this page confirms as correct. So Java does not misinterpret the string. It was misinterpreted when the getBytes() method was used.
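As a rough illustration of the point above, the sketch below separates UTF-16 code units from code points. The class and method names (SurrogateDemo, codeUnits, codePoints, asciiRoundTrip) are made up for this example, and the US-ASCII round trip is only an assumption standing in for "a platform default charset that cannot represent the character", which would be one plausible reason the getBytes() version reported a length of 3.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // U+1F495 written as its UTF-16 surrogate pair, repeated three times.
    static final String LINE = "\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95";

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Encode to US-ASCII and decode again (assumed stand-in for a lossy
    // default charset); unmappable characters are replaced by '?'.
    static String asciiRoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println("code units:  " + codeUnits(LINE));
        System.out.println("code points: " + codePoints(LINE));
        System.out.println("first code point: " + LINE.codePointAt(0));
        System.out.println("after lossy round trip: " + asciiRoundTrip(LINE));
    }
}
```

So a length of 6 is expected for three supplementary characters, and a length of 3 after a lossy round trip would mean the original characters are gone, not that they were decoded correctly.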

However, as the OP explained, the regular expression does crash on it. I have no other explanation than that it is a bug in Java. Either that, or it does not fully support UTF-16 by design.

Edit:

Based on this answer:

the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java's Unicode-in-source-code troubles by compiling with java -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!

It would seem that this is a limitation of regular expressions in Java.


Since you commented that

it would be best if I could simply ignore the UTF-16 characters and apply the regex rather than throw an exception.

This can certainly be done. A straightforward way is to apply your regex only to a certain range. Filtering Unicode character ranges is explained in this answer. Based on that answer, here is an example that doesn't seem to choke but simply leaves the problem characters alone:

line.replaceAll("(?Ui)([\\u0000-\\uffff])\\1{2,}", "$1")    

// "💕💕💕" -> "💕💕💕"
// "foo 💕💕💕 foo" -> "foo 💕💕💕 foo"
// "foo aAa foo" -> "foo a foo"
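For anyone who wants to try this, here is a minimal runnable wrapper around the same pattern. The class and method names (BmpRunCollapser, collapse) are hypothetical; the only thing taken from the answer is the regex itself, which restricts the captured character to the BMP range so that supplementary characters are never captured as the start of a run.

```java
public class BmpRunCollapser {
    // Same pattern as in the answer: (?U) implies Unicode case folding,
    // (?i) makes the match case-insensitive, and the character class
    // only admits code points up to U+FFFF.
    static final String BMP_RUNS = "(?Ui)([\\u0000-\\uffff])\\1{2,}";

    // Collapse runs of three or more equal BMP characters to one.
    static String collapse(String line) {
        return line.replaceAll(BMP_RUNS, "$1");
    }

    public static void main(String[] args) {
        // The heart symbol is written as a surrogate pair to keep the
        // source file ASCII-only.
        String hearts = "\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95";
        System.out.println(collapse("foo aAa foo"));
        System.out.println(collapse("foo " + hearts + " foo").equals("foo " + hearts + " foo"));
    }
}
```

Note that replaceAll returns a new string, so the result has to be assigned or used; calling it and discarding the result leaves line unchanged.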

Actually, it's just a bug.

This is what stack traces and open source are for.

When CIBackRef (for case-insensitive back reference) compares with the group, it doesn't bump the loop index correctly. This shows the fix:

        // Check each new char to make sure it matches what the group
        // referenced matched last time around
        int x = i;
        for (int index=0; index<groupSize; ) {
            int c1 = Character.codePointAt(seq, x);
            int c2 = Character.codePointAt(seq, j);
            if (c1 != c2) {
                if (doUnicodeCase) {
                    int cc1 = Character.toUpperCase(c1);
                    int cc2 = Character.toUpperCase(c2);
                    if (cc1 != cc2 &&
                        Character.toLowerCase(cc1) !=
                        Character.toLowerCase(cc2))
                        return false;
                } else {
                    if (ASCII.toLower(c1) != ASCII.toLower(c2))
                        return false;
                }
            }
            int n = Character.charCount(c1);
            x += n;
            index += n;  // was index++
            j += Character.charCount(c2);
        }

groupSize is the total charCount of the group. j is the index for the referenced group.
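To make the off-by-one concrete, here is a standalone sketch of that comparison loop. It is not the JDK source: the class and method names (BackRefCompare, matchesGroup) and the fixed parameter are invented for illustration, and only the Unicode-case branch is reproduced. With fixed set to false the counter advances by 1 per code point, as in the buggy version, so a group made of one surrogate pair (groupSize 2) gets compared twice and the second comparison can run past the end of the input.

```java
public class BackRefCompare {
    // Compare the text at position i with the group that starts at
    // groupStart and spans groupSize UTF-16 code units.
    static boolean matchesGroup(CharSequence seq, int groupStart,
                                int groupSize, int i, boolean fixed) {
        if (i + groupSize > seq.length()) {
            return false; // not enough input left for the group
        }
        int x = i;
        int j = groupStart;
        for (int index = 0; index < groupSize; ) {
            int c1 = Character.codePointAt(seq, x);
            int c2 = Character.codePointAt(seq, j);
            if (c1 != c2) {
                int u1 = Character.toUpperCase(c1);
                int u2 = Character.toUpperCase(c2);
                if (u1 != u2
                        && Character.toLowerCase(u1) != Character.toLowerCase(u2)) {
                    return false;
                }
            }
            int n = Character.charCount(c1);
            x += n;
            j += Character.charCount(c2);
            // The fix: count consumed code units, not code points.
            index += fixed ? n : 1;
        }
        return true;
    }

    public static void main(String[] args) {
        String line = "\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95";
        System.out.println(matchesGroup(line, 0, 2, 4, true));
        try {
            matchesGroup(line, 0, 2, 4, false);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("buggy variant read past the end: " + e);
        }
    }
}
```

Position 4 is where the second repetition of the back reference starts in the original input, which matches the "index out of range: 6" in the stack trace.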

The test

  //9ff0 9592 9ff0 9592 9ff0 9592
  val line = "\ud83d\udc95\ud83d\udc95\ud83d\udc95"
  Console println Try(line.replaceAll("(?ui)(.)\\1{2,}", "$1"))

fails normally

apm@mara:~/tmp$ skalac kcharex.scala ; skala kcharex.Test
Failure(java.lang.StringIndexOutOfBoundsException: String index out of range: 6)

but succeeds with the fix

apm@mara:~/tmp$ skala -J-Xbootclasspath/p:../bootfix kcharex.Test
Success(💕)

The other bug in the original sample code is that the inline flags should include ?ui. The javadoc on Pattern.CASE_INSENSITIVE says:

By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.

As you can see from the code snippet, without u, it will fail only if ASCII.toLower doesn't compare equal, which is not intended. I'm not sophisticated enough to know of a supplementary character that would fail that test without writing code to figure it out.
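The difference between (?i) and (?ui) is easier to see with a BMP letter than with a supplementary one. The sketch below uses e-acute (U+00E9) and its uppercase form (U+00C9); the class and method names (UnicodeCaseFlag, asciiOnly, unicodeAware) are made up for the example.

```java
import java.util.regex.Pattern;

public class UnicodeCaseFlag {
    // Lowercase e-acute followed by uppercase E-acute.
    static final String PAIR = "\u00e9\u00c9";

    // CASE_INSENSITIVE alone: only US-ASCII letters are folded.
    static boolean asciiOnly(String s) {
        return Pattern.matches("(?i)(.)\\1", s);
    }

    // CASE_INSENSITIVE plus UNICODE_CASE: Unicode-aware folding.
    static boolean unicodeAware(String s) {
        return Pattern.matches("(?ui)(.)\\1", s);
    }

    public static void main(String[] args) {
        System.out.println(asciiOnly("aA"));
        System.out.println(asciiOnly(PAIR));
        System.out.println(unicodeAware(PAIR));
    }
}
```

With plain (?i) the back reference treats the two accented letters as different characters, so for non-ASCII text the u flag is needed for the pattern to do what the question intended.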
