
Replace Unicode Characters in a String

I need to replace characters with diacritics (e.g. ä, ó) with their 'base' character. For most characters, this solution works:

StringUtils.stripAccents(tmpStr);

but this misses four characters: æ, œ, ø, and ß.

I looked at the answers to "Is there a way to get rid of accents and convert a whole string to regular letters?". I expected the first solution there to work, but it does not.

How can I replace these characters with their 'base' character (e.g. replace æ with a)?

The source code of StringUtils.stripAccents ( https://commons.apache.org/proper/commons-lang/apidocs/src-html/org/apache/commons/lang3/StringUtils.html ) says:

public static String stripAccents(final String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    convertRemainingAccentCharacters(decomposed);
    // Note that this doesn't correctly remove ligatures...
    return STRIP_ACCENTS_PATTERN.matcher(decomposed).replaceAll(EMPTY);
}

It has a comment that says, // Note that this doesn't correctly remove ligatures...
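
To see why these four characters slip through, it may help to check which characters NFD normalization actually changes. The sketch below is only an illustration: the class name DecompositionCheck and the helper decomposes are hypothetical, and only java.text.Normalizer from the JDK is used. Characters such as ä split into a base letter plus a combining mark, while æ, œ, ø and ß have no canonical decomposition, so stripping combining marks afterwards cannot affect them.

```java
import java.text.Normalizer;

// Hypothetical helper class, not part of Commons Lang.
class DecompositionCheck {
    // Returns true if NFD normalization changes the string, i.e. it contains
    // at least one character that has a canonical decomposition.
    public static boolean decomposes(String s) {
        return !Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    public static void main(String[] args) {
        // Assumes the source file is compiled as UTF-8.
        for (String s : new String[] {"ä", "ó", "æ", "œ", "ø", "ß"}) {
            System.out.println(s + " decomposes under NFD: " + decomposes(s));
        }
    }
}
```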

So you may need to replace those characters manually. Something like:

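    // Requires: import java.text.Normalizer;
    // Decompose accented letters, then drop the combining marks (\p{M}).
    String string = Normalizer.normalize("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ", Normalizer.Form.NFKD);
    string = string.replaceAll("\\p{M}", "");

    // These letters have no Unicode decomposition, so map them by hand.
    string = string.replace("ß", "s");
    string = string.replace("ø", "o");
    string = string.replace("œ", "o");
    string = string.replace("æ", "a");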
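
The same idea can be wrapped in a reusable method. This is a minimal sketch rather than a definitive implementation: the class name AsciiFolder and the method toBase are hypothetical, the lowercase targets follow the snippet above, and the uppercase mappings (Ø, Œ, Æ) are an added assumption that the original snippet does not cover. Unicode escapes are used in the replacements so the result does not depend on the source file's encoding.

```java
import java.text.Normalizer;

// Hypothetical helper class; names are illustrative only.
class AsciiFolder {
    // Strips combining marks, then maps the letters that have no Unicode
    // decomposition to a single 'base' letter.
    public static String toBase(String input) {
        if (input == null) {
            return null;
        }
        String s = Normalizer.normalize(input, Normalizer.Form.NFKD);
        s = s.replaceAll("\\p{M}", "");   // remove all combining marks
        return s.replace("\u00DF", "s")   // ß
                .replace("\u00F8", "o")   // ø
                .replace("\u00D8", "O")   // Ø (assumed uppercase mapping)
                .replace("\u0153", "o")   // œ
                .replace("\u0152", "O")   // Œ (assumed uppercase mapping)
                .replace("\u00E6", "a")   // æ
                .replace("\u00C6", "A");  // Æ (assumed uppercase mapping)
    }

    public static void main(String[] args) {
        // Assumes the source file is compiled as UTF-8.
        System.out.println(toBase("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ"));
    }
}
```

If a two-letter transliteration is preferred (for example ß as "ss" or æ as "ae"), only the replacement strings need to change.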

See also: Diacritical Character to ASCII Character Mapping, https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html


 