
Removing accents from String

Recently I found a very helpful method in the StringUtils library:

StringUtils.stripAccents(String s)

I found it really helpful for removing special characters and converting them to an ASCII "equivalent", for instance ç becomes c.

Now I am working for a German customer who needs exactly this, but only for non-German characters. Any umlauts should stay untouched. I realised that stripAccents won't be useful in that case.

Does anyone have experience with this? Are there any useful tools, libraries, classes, or maybe regular expressions? I tried to write a class that parses the string and replaces such characters, but building a replacement map that covers all languages is very difficult...

Any suggestions appreciated...

It is best to build a custom function. It could look like the following. If you want to keep a character from being converted, remove it from the UNICODE constant and remove the character at the same position from the PLAIN_ASCII constant, so the two strings stay aligned.

private static final String UNICODE =
        "ÀàÈèÌìÒòÙùÁáÉéÍíÓóÚúÝýÂâÊêÎîÔôÛûŶŷÃãÕõÑñÄäËëÏïÖöÜüŸÿÅåÇçŐőŰű";
private static final String PLAIN_ASCII =
        "AaEeIiOoUuAaEeIiOoUuYyAaEeIiOoUuYyAaOoNnAaEeIiOoUuYyAaCcOoUu";

public static String toAsciiString(String str) {
    if (str == null) {
        return null;
    }
    StringBuilder sb = new StringBuilder();
    for (int index = 0; index < str.length(); index++) {
        char c = str.charAt(index);
        int pos = UNICODE.indexOf(c);
        if (pos > -1) {
            // Mapped character: use the ASCII character at the same position
            sb.append(PLAIN_ASCII.charAt(pos));
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}

public static void main(String[] args) {
    System.out.println(toAsciiString("Höchstalemannisch"));
}
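To make the "remove the pair" idea concrete for the German case, here is a minimal sketch that leaves the two constants as they are and skips a separate list of characters to keep. The names GermanSafeAscii, KEEP and toAsciiKeepingGerman are made up for this illustration, and the choice of which characters count as German is an assumption you should adjust. It also assumes the input uses precomposed characters (NFC), since the lookup works on single chars.

```java
import java.util.HashMap;
import java.util.Map;

final class GermanSafeAscii {
    private static final String UNICODE =
            "ÀàÈèÌìÒòÙùÁáÉéÍíÓóÚúÝýÂâÊêÎîÔôÛûŶŷÃãÕõÑñÄäËëÏïÖöÜüŸÿÅåÇçŐőŰű";
    private static final String PLAIN_ASCII =
            "AaEeIiOoUuAaEeIiOoUuYyAaEeIiOoUuYyAaOoNnAaEeIiOoUuYyAaCcOoUu";

    // Assumption: these are the characters the customer wants untouched.
    private static final String KEEP = "ÄäÖöÜüß";

    // Built once: every pair from the two constants, except the kept characters.
    private static final Map<Character, Character> REPLACEMENTS = new HashMap<>();
    static {
        for (int i = 0; i < UNICODE.length(); i++) {
            char c = UNICODE.charAt(i);
            if (KEEP.indexOf(c) < 0) {
                REPLACEMENTS.put(c, PLAIN_ASCII.charAt(i));
            }
        }
    }

    static String toAsciiKeepingGerman(String str) {
        if (str == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            Character replacement = REPLACEMENTS.get(c);
            // Unmapped characters (including the kept umlauts) pass through as-is.
            sb.append(replacement != null ? replacement : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiKeepingGerman("Höchstalemannisch façade"));
    }
}
```

Keeping the exclusions in a separate constant means the two long strings never get out of step, which is easy to do by accident when deleting characters from both by hand.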

My gut feeling tells me the easiest way to do this would be to just list allowed characters and strip accents from everything else. This would be something like

import java.util.regex.*;
import java.text.*;

public class Replacement {
    public static void main(String args[]) {
        String from = "aoeåöäìé";
        String result = stripAccentsFromNonGermanCharacters(from);
        
        System.out.println("Result: " + result);
    }

    private static String patternContainingAllValidGermanCharacters =
                                            "a-zA-Z0-9äÄöÖéÉüÜß";
    private static Pattern nonGermanCharactersPattern =
        Pattern.compile("([^" + patternContainingAllValidGermanCharacters + "])");

    public static String stripAccentsFromNonGermanCharacters(
           String from) {
        return stripAccentsFromCharactersMatching(
            from, nonGermanCharactersPattern);
    }

    public static String stripAccentsFromCharactersMatching(
        String target, Pattern myPattern) {

        StringBuffer myStringBuffer = new StringBuffer();
        Matcher myMatcher = myPattern.matcher(target);
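        while (myMatcher.find()) {
            // quoteReplacement keeps '$' and '\' in the input from being
            // interpreted as group references by appendReplacement
            myMatcher.appendReplacement(myStringBuffer,
                Matcher.quoteReplacement(stripAccents(myMatcher.group(1))));
        }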
        myMatcher.appendTail(myStringBuffer);

        return myStringBuffer.toString();
    }


    // pretty much the same thing as StringUtils.stripAccents(String s)
    // used here so I can demonstrate the code without StringUtils dependency
    public static String stripAccents(String text) {
        return Normalizer.normalize(text,
            Normalizer.Form.NFD)
           .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}

(I realize the pattern probably doesn't contain all the characters needed; add whatever is missing.)
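One caveat with the allow-list idea, as far as I can tell: the character class only recognises precomposed letters, so an umlaut that arrives as "u" followed by a combining diaeresis (U+0308) would still lose its dots. Below is a hedged sketch of a variant that composes the input first and then walks it code point by code point. The class name SelectiveAccentStripper, the method stripAccentsExcept and the KEEP constant are hypothetical names for this example; KEEP reuses the same accented characters as the pattern above.

```java
import java.text.Normalizer;

final class SelectiveAccentStripper {
    // Assumption: same accented characters as in the allow-list pattern above.
    private static final String KEEP = "äÄöÖéÉüÜß";

    static String stripAccentsExcept(String input) {
        if (input == null) {
            return null;
        }
        // Compose first so that "u" + U+0308 is seen as the single character "ü".
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        composed.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (KEEP.contains(ch)) {
                out.append(ch);
            } else {
                // Decompose this one character and drop its combining marks.
                out.append(Normalizer.normalize(ch, Normalizer.Form.NFD)
                        .replaceAll("\\p{M}+", ""));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("Result: " + stripAccentsExcept("aoeåöäìé"));
    }
}
```

Like stripAccents, this only handles letters that decompose into a base letter plus marks; characters such as ø or ł have no such decomposition and pass through unchanged.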

This might give you a workaround: you can detect the language and extract only the text in that language.

EDIT: You can pass in the raw string as input and set the language detection to German; it will then detect the German characters and discard the rest.
