
Android toLowerCase() issue with accented characters

My app has a feature that filters content based on some keywords. The filter is case-insensitive, so I first call String.toLowerCase() on the source content.

The issue appears when the source is in upper case and contains accented characters, as in the French word "INVITÉ".

Lowercasing this word with the device default locale returns "invité". The problem is that its last character is not the single lowercase character "é". Instead it is a sequence of two chars: "e" (101) and the combining acute accent (769).

Because of this, the lowercased "invité" does not match the keyword "invité", even though the two look identical.

How can I solve this? I would prefer not to remove accented characters altogether.
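The mismatch described above can be reproduced with a small sketch. The class name CodePointDemo and the helper codeUnits are hypothetical, introduced only for illustration; the two strings are written with Unicode escapes so that the decomposed and precomposed forms are unambiguous.

```java
public class CodePointDemo {
    // Returns the UTF-16 code units of s as space-separated decimal values.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append((int) s.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decomposed: 'e' (101) followed by the combining acute accent (769).
        String decomposed = "invite\u0301";
        // Precomposed: a single accented letter (233).
        String precomposed = "invit\u00e9";

        System.out.println(codeUnits(decomposed));           // 105 110 118 105 116 101 769
        System.out.println(codeUnits(precomposed));          // 105 110 118 105 116 233
        System.out.println(decomposed.equals(precomposed));  // false
    }
}
```

Both strings render the same way on screen, but equals() and contains() compare code units, so they are treated as different.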

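You should normalize the string after lowercasing it. java.text.Normalizer with the NFC form composes a base letter and its combining accent into a single character: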

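import java.text.Normalizer;

// "INVITÉ" in decomposed form: 'E' followed by the combining acute accent U+0301
String upper = "INVITE\u0301";
System.out.println(upper + " length=" + upper.length());
String lower = upper.toLowerCase();
System.out.println(lower + " length=" + lower.length());
// NFC composes 'e' + U+0301 into the single character 'é' (U+00E9)
String normalized = Normalizer.normalize(lower, Normalizer.Form.NFC);
System.out.println(normalized + " length=" + normalized.length());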

output:

INVITÉ length=7
invité length=7
invité length=6
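For the keyword filter from the question, one possible way to apply this is to bring both the content and the keyword into the same canonical form before comparing them. The following is a minimal sketch, not a definitive implementation: the class KeywordFilter and the methods canonical and containsKeyword are hypothetical names, and the use of Locale.ROOT instead of the device default locale is an assumption made to keep lowercasing independent of the user's language settings.

```java
import java.text.Normalizer;
import java.util.Locale;

public class KeywordFilter {
    // Canonical form used for comparison: lowercased, then composed to NFC.
    // Locale.ROOT is an assumption; the question uses the device default locale.
    static String canonical(String s) {
        String lower = s.toLowerCase(Locale.ROOT);
        return Normalizer.normalize(lower, Normalizer.Form.NFC);
    }

    // True if the keyword occurs in the content, ignoring case and
    // ignoring whether accents are stored composed or decomposed.
    static boolean containsKeyword(String content, String keyword) {
        return canonical(content).contains(canonical(keyword));
    }

    public static void main(String[] args) {
        // Content with a decomposed accent: 'E' followed by U+0301.
        String content = "Liste des INVITE\u0301S";
        // Keyword with the precomposed accented letter U+00E9.
        String keyword = "invit\u00e9";

        System.out.println(containsKeyword(content, keyword));  // true
        System.out.println(containsKeyword("INVITE", keyword)); // false
    }
}
```

Accents are preserved in this approach, so an unaccented "INVITE" still does not match the accented keyword.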

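The same approach works for Japanese, where a kana can be stored as a base character followed by a combining voiced sound mark.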

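// "が" in decomposed form: か (U+304B) followed by the combining voiced sound mark U+3099
String japanese = "か\u3099";
System.out.println(japanese + " length=" + japanese.length());
// NFC composes the pair into the single character が (U+304C)
String normalized = Normalizer.normalize(japanese, Normalizer.Form.NFC);
System.out.println(normalized + " length=" + normalized.length());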

output:

が length=2
が length=1
