
Cannot identify surrogate characters in Java string

I am having trouble identifying surrogate characters in strings like devā́n. I read the relevant questions on this topic here on SO, but something is still going wrong.
As you can see, the "natural" length (I just made up that expression) of this string is 5, but "devā́n".length() gives me 6.
That is fine, because ā́ consists of two characters internally (it's not within the UTF-16 code range). But I would like to get the length of the string as you'd read it or as it's printed, so 5 in this case.

I tried identifying the weirdo chars with the tricks found here and here, but they don't work and I always get 6. Just have a look at this:

//string containing surrogate pair
String s = "devā́n";

//prints the string properly
System.out.println("String: " + s);

//prints "Length: 6"
System.out.println("Length: " + s.length());

//prints "Codepoints: 6"
System.out.println("Codepoints: " + s.codePointCount(0, s.length()));

//false
System.out.println(
        Character.isSurrogate(s.charAt(3)));

//false
System.out.println(
        Character.isSurrogate(s.charAt(4)));

//six code points
System.out.println("\n");
for (int i = 0; i < s.length(); i++) {
    System.out.println(s.charAt(i) + ": " + s.codePointAt(i));
}

Is it possible that ā́ is not a valid pair of surrogate chars? How can I identify such a compound char and count it as only one?

BTW, the output of the above code is:

String: devā́n
Length: 6
Codepoints: 6
false
false


d: 100
e: 101
v: 118
ā: 257
́: 769
n: 110

First of all, the reason that 769 (U+0301) does not test as a surrogate character is that it is NOT a surrogate character. Surrogates are used when a Unicode code point outside of plane 0 is represented in UTF-16. (Surrogates are code units in the range U+D800 through U+DFFF.) U+0301 is a combining acute accent, which is an ordinary plane 0 code point.
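To make the distinction concrete, here is a minimal sketch that contrasts a real surrogate pair with the combining mark from the question. The class and method names are made up for illustration, and U+1F600 is just an arbitrary code point outside plane 0.

```java
public class SurrogateVsCombining {
    // True if the UTF-16 code unit at index i is one half of a surrogate pair.
    public static boolean isSurrogateAt(String s, int i) {
        return Character.isSurrogate(s.charAt(i));
    }

    // True if the code unit at index i is a non-spacing combining mark
    // (Unicode general category Mn), such as U+0301.
    public static boolean isCombiningMarkAt(String s, int i) {
        return Character.getType(s.charAt(i)) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        String supplementary = "\uD83D\uDE00"; // U+1F600, stored as two surrogates
        String accented = "\u0101\u0301";      // U+0101 followed by U+0301

        // Two code units, but only one code point.
        System.out.println(supplementary.length());                                  // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
        System.out.println(isSurrogateAt(supplementary, 0));                         // true

        // Two code units AND two code points: no surrogates involved.
        System.out.println(accented.codePointCount(0, accented.length())); // 2
        System.out.println(isSurrogateAt(accented, 1));                    // false
        System.out.println(isCombiningMarkAt(accented, 1));                // true
    }
}
```

So codePointCount only collapses surrogate pairs; it does nothing for a base character followed by a combining mark.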

So what you are really trying to do here is to figure out how many "ordinary" characters there are in a UTF-16 string. This is done in two steps:

  • First, normalize the string to NFC form (see Normalizing Text) using the Normalizer API.
  • Then use the String API to find the number of code points in the string; e.g. use String.codePointCount (javadoc).
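The two steps above can be sketched roughly as follows. The class and method names are hypothetical; only java.text.Normalizer and String.codePointCount come from the standard library.

```java
import java.text.Normalizer;

public class NfcCodePointCount {
    // Step 1: normalize to NFC. Step 2: count code points.
    public static int countAfterNfc(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // "e" + U+0301 composes to the single code point U+00E9.
        System.out.println(countAfterNfc("e\u0301"));          // 1

        // The string from the question: U+0101 + U+0301 has no
        // precomposed form, so NFC leaves it as two code points.
        System.out.println(countAfterNfc("dev\u0101\u0301n")); // 6
    }
}
```

This works whenever Unicode defines a precomposed character for the sequence, which is the common case for single accents on Latin letters.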

In this case, this still fails. The reason is that the code point sequence

ā: 257
́: 769

actually represents an "a" character with two diacritical marks (a macron and an acute accent). This cannot be represented as a single Unicode code point, so its NFC form is still two code points.
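One way to see the two diacritical marks is to decompose the sequence fully with NFD and list the resulting code points. This is only an illustrative sketch; the class and method names are invented here.

```java
import java.text.Normalizer;

public class NfdDecomposition {
    // Returns the NFD form of s as a space-separated list of U+XXXX values.
    public static String toCodePointList(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+0101 decomposes to "a" (U+0061) plus a combining macron (U+0304);
        // the combining acute accent (U+0301) follows unchanged.
        System.out.println(toCodePointList("\u0101\u0301"));
        // prints "U+0061 U+0304 U+0301"
    }
}
```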

What confuses this even further is that some renderers display the acute accent over the following character, so it can look as if your example contains an "n acute". (A combining mark formally applies to the base character that precedes it.)

It is going to be very difficult to deal with pathological examples like this, where base characters carry multiple diacritics that might render strangely. Maybe you need to convert to NFD and then count the code points that are not diacritics.
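A rough sketch of that NFD idea is shown below. It assumes that "diacritic" can be approximated by the three Unicode mark categories (Mn, Mc, Me); that assumption is mine, and it will not be right for every script. The class and method names are hypothetical.

```java
import java.text.Normalizer;

public class VisibleLength {
    // Treats any code point in a Unicode mark category as a diacritic.
    // This is an approximation, not a full grapheme-cluster algorithm.
    public static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Decomposes to NFD, then counts only the code points that are not marks.
    public static int countBaseCharacters(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return (int) nfd.codePoints().filter(cp -> !isMark(cp)).count();
    }

    public static void main(String[] args) {
        // The string from the question: d, e, v, a (+ two marks), n.
        System.out.println(countBaseCharacters("dev\u0101\u0301n")); // 5
    }
}
```

For text beyond accented Latin letters, java.text.BreakIterator.getCharacterInstance() is another standard-library option worth trying, since it is designed to find user-perceived character boundaries.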
