
How to compare Chinese characters in Java using 'equals()'

I want to compare a portion of a string (i.e. a single character) against a Chinese character. I assumed that, because of the Unicode encoding, it counts as two characters, so I'm looping through the string in increments of two. Now I've hit a roadblock: I'm trying to detect the '兒' character, but equals() doesn't match it. What am I missing? This is the code snippet:

for (int CharIndex = 0; CharIndex < tmpChar.length(); CharIndex=CharIndex+2) {

   // Account for 'r' like in dianr/huir
   if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {

Also, feel free to suggest a more elegant way to parse this ...

[UPDATE] Some pics from the debugger, showing that it doesn't match even though it should. I pasted the Chinese character from the spreadsheet I use as input, so I don't think it's a copy-and-paste issue (unless the Unicode gets lost along the way).

[image]

[image]

Oh, dang, apparently simply copying and pasting does not work:

[image]

Use CharSequence.codePoints(), which returns a stream of the code points, so you don't have to deal with chars:

tmpChar.codePoints().forEach(c -> {
  if (c == '兒') {
    // ...
  }
});

(Of course, you could have used tmpChar.codePoints().filter(c -> c == '兒').forEach(c -> { /* ... */ }) ).
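As a minimal, self-contained sketch of the codePoints() approach: the class name CodePointMatch, the helper countOccurrences, and the sample input are all made up for illustration. The escape '\u5152' is the code point of 兒, written that way so the comparison does not depend on how the source file or a pasted character was encoded.

```java
final class CodePointMatch {
    // Counts how many code points in text equal the target code point.
    // (Hypothetical helper, not part of the original question's code.)
    static long countOccurrences(String text, int target) {
        return text.codePoints().filter(c -> c == target).count();
    }

    public static void main(String[] args) {
        // Hypothetical sample input; \u5152 is the character from the question.
        String tmpChar = "hui\u5152 dian\u5152";
        System.out.println(countOccurrences(tmpChar, '\u5152')); // prints 2
    }
}
```

Because the comparison happens on int code points, the same helper would also work for characters outside the Basic Multilingual Plane, provided the target is passed as an int such as 0x20BB7.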

Either work with strings, treating the character as a substring of whatever length it happens to have:

String s = ...;
if (s.contains("兒")) { ... }
int position = s.indexOf("兒");
if (position != -1) {
    int position2 = position + "兒".length();
    s = s.substring(0, position) + "*" + s.substring(position2);
}
if (s.startsWith("兒", i)) {
    // At position i there is a 兒.
}

Or work with code points, where 兒 is a single code point. As that is not really easier, the variable-length substring approach seems fine.
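For comparison, here is a rough sketch of what an index-based loop over code points could look like. The class CodePointLoop and the method indicesOf are hypothetical names; the loop advances by Character.charCount(cp) instead of a fixed step of two, so it handles both one-unit and two-unit characters.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointLoop {
    // Returns the char indices at which the target code point starts.
    // (Illustrative helper; the name and return type are assumptions.)
    static List<Integer> indicesOf(String text, int target) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // full code point starting at index i
            if (cp == target) {
                hits.add(i);
            }
            i += Character.charCount(cp);   // advance by 1 or 2 chars as needed
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; \u5152 is the character from the question.
        System.out.println(indicesOf("hui\u5152 dian\u5152", 0x5152)); // prints [3, 9]
    }
}
```

The returned indices are char offsets, so they can be passed straight to substring() or startsWith() as in the snippet above.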

if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {

That line is your problem. 兒 is a single UTF-16 code unit (one Java char), so a two-char substring can never equal it. Java strings use UTF-16, and many Chinese characters fit in one code unit. Other characters, however, need two code units.
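A small sketch can make the difference visible. The class name CodeUnitDemo and the sample strings are made up for illustration; U+20BB7 is used here only as an example of a character outside the Basic Multilingual Plane.

```java
final class CodeUnitDemo {
    // Number of UTF-16 code units (Java chars) in s.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String er = "\u5152"; // the character from the question
        String supplementary = new String(Character.toChars(0x20BB7));

        System.out.println(codeUnits(er));             // prints 1
        System.out.println(codeUnits(supplementary));  // prints 2
        System.out.println(codePoints(supplementary)); // prints 1

        // A one-char substring matches; a two-char substring cannot.
        String sample = "hui\u5152!"; // hypothetical input
        System.out.println(sample.substring(3, 4).equals(er)); // prints true
        System.out.println(sample.substring(3, 5).equals(er)); // prints false
    }
}
```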

There are a variety of APIs on the String class for coping.

As offered in another answer, obtaining the IntStream from codePoints() gives you an int code point for each character. You can compare that to the code point value of the character you are looking for.

Or, you can use the ICU4J library with a richer set of facilities for all of this.
