How do I detect unicode characters in a Java string?

Question

Suppose I have a string that contains Ü. How would I find all those unicode characters? Should I test for their code? How would I do that?

For example, given the string "AÜXÜ", I'd like to transform it to "AYXY". I'd like to do the same for other unicode characters, and I would hate to have to store them in a translation map of some sort.

Answer 1

The definition of "unicode characters" is vague, since every character in a Java String is a Unicode character. I will take it to mean characters not covered by the standard ISO 8859 charset. If this is true in your case, then loop through all characters in the String and test each code point to determine whether it falls within the given character set.
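As a rough sketch of that loop, the standard CharsetEncoder can be asked whether each code point is encodable in a chosen charset. The class and method names (CharsetMembership, codePointsOutside) are made up for this illustration. Note that Ü itself is part of ISO 8859-1, so to flag it you would have to test against US-ASCII instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named only for this example.
final class CharsetMembership {

    /** Returns the code points of text that the given charset cannot encode. */
    static List<Integer> codePointsOutside(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        List<Integer> outside = new ArrayList<>();
        // codePoints() walks whole code points, so surrogate pairs stay together.
        text.codePoints().forEach(cp -> {
            if (!encoder.canEncode(new String(Character.toChars(cp)))) {
                outside.add(cp);
            }
        });
        return outside;
    }

    public static void main(String[] args) {
        String sample = "A\u00DCX\u20AC"; // A, U-umlaut, X, euro sign
        // prints [220, 8364]
        System.out.println(codePointsOutside(sample, StandardCharsets.US_ASCII));
        // prints [8364]
        System.out.println(codePointsOutside(sample, StandardCharsets.ISO_8859_1));
    }
}
```

Passing the charset as a parameter keeps the question of what counts as "special" in one place, so switching between ASCII and ISO 8859-1 does not require touching the loop.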

Alternatively, use a Map<Character, Character> and replace every character in the string that matches one of the map's keys. For example:

import java.util.HashMap;
import java.util.Map;

// Replacement characters, keyed by the character they replace.
Map<Character, Character> charReplacementMap = new HashMap<>();
charReplacementMap.put('Ü', 'Y');
// Put more here.

String originalString = "AÜXÜ";
StringBuilder builder = new StringBuilder();

for (char currentChar : originalString.toCharArray()) {
    // Keep the character unless the map holds a replacement for it.
    Character replacementChar = charReplacementMap.get(currentChar);
    builder.append(replacementChar != null ? replacementChar : currentChar);
}

String newString = builder.toString(); // "AYXY"

Or, do you mean "all characters with diacritics"? If so, then use java.text.Normalizer to remove diacritical marks:

import java.text.Normalizer;
import java.text.Normalizer.Form;

/**
 * Remove any diacritical marks (accents like ç, ñ, é, etc) from
 * the given string (so that it returns plain c, n, e, etc).
 * @param string The string to remove diacritical marks from.
 * @return The string with removed diacritical marks, if any.
 */
public static String removeDiacriticalMarks(String string) {
    // NFD splits each accented letter into a base letter plus combining marks.
    return Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

One pitfall: Ü would become U, not Y. I'm not sure that's what you're after. If you want to replace each character by the way it is pronounced, you'll really need to create a mapping. Sure, it's tedious work, but it takes less time than you needed to follow this topic.
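The two approaches can also be combined: a small map handles the few characters that need a pronunciation-based replacement, and normalization strips the accents from everything else. This is only a sketch. PhoneticFold, OVERRIDES and fold are names invented here, and the single Ü to Y entry is taken from the question.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, named only for this example.
final class PhoneticFold {

    // Assumed overrides: add only the pairs you actually need.
    private static final Map<Character, Character> OVERRIDES = new HashMap<>();
    static {
        OVERRIDES.put('\u00DC', 'Y'); // U-umlaut -> Y, as in the question
    }

    static String fold(String input) {
        // Compose first so that "U" + combining diaeresis also matches the map key.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder mapped = new StringBuilder(composed.length());
        for (char c : composed.toCharArray()) {
            Character replacement = OVERRIDES.get(c);
            mapped.append(replacement != null ? replacement : c);
        }
        // Decompose whatever is left and drop the combining marks.
        return Normalizer.normalize(mapped, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}
```

With this sketch, fold("AÜXÜ") gives "AYXY", while a character without an override, such as é, falls back to its base letter e.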

Answer 2

You could loop through your string and, for every character c, check:

if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
    // replace with Y
}
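Spelled out as a complete method, that check might look like the sketch below. The names BasicLatinFilter and replaceNonBasicLatin are assumptions for this example. It iterates over code points rather than chars so that a character outside the Basic Multilingual Plane is replaced once instead of twice.

```java
// Hypothetical helper class, named only for this example.
final class BasicLatinFilter {

    /** Replaces every code point outside the BASIC_LATIN block with the given char. */
    static String replaceNonBasicLatin(String input, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.BASIC_LATIN) {
                out.appendCodePoint(cp); // plain ASCII range, keep as is
            } else {
                out.append(replacement); // anything else becomes the replacement
            }
        });
        return out.toString();
    }
}
```

For example, replaceNonBasicLatin("AÜXÜ", 'Y') returns "AYXY".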

Answer 3

You could go the other way round and ask whether the character is an ASCII character.

public static boolean isAscii(char ch) {
    return ch < 128;
}

You'd have to analyse the string char by char then, of course.

(The method is from the commons-lang CharUtils class, which contains loads of useful Character methods.)

Answer 4

It isn't clear to me exactly what is gained by transforming "AÜXÜ" to "AYXY". Is this because Ü is pronounced like Y in a particular language? What language? And what other rules might apply?

In terms of terminology...

"a"

The above is a Unicode string. It contains a single UTF-16 encoded character.
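To make the terminology concrete: a Java String is a sequence of UTF-16 code units, and most characters, including "a" and "Ü", take one unit each, while characters outside the Basic Multilingual Plane take two. The sketch below shows the difference; CodeUnitDemo and its method names are made up for this illustration.

```java
// Hypothetical demo class, named only for this example.
final class CodeUnitDemo {

    /** Number of UTF-16 code units (what String.length() reports). */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair
        System.out.println(codeUnits("a") + " " + codePoints("a"));           // prints 1 1
        System.out.println(codeUnits("\u00DC") + " " + codePoints("\u00DC")); // prints 1 1
        System.out.println(codeUnits(emoji) + " " + codePoints(emoji));       // prints 2 1
    }
}
```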

If you wish to limit the range of characters to the English alphabet, have a look at the normalization performed in this answer.

Answer 5

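I'm not sure what you're trying to do with your example. If you're just trying to replace all non-ASCII values with Y, then you can loop through the string looking for code points outside the range 0 to 127 and replace those code points with Y.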

Answer 6

The class Character also offers some interesting methods. Take a look at it.

Character.UnicodeBlock.of('a') == Character.UnicodeBlock.BASIC_LATIN; //true

Character.UnicodeBlock.of('Ü') == Character.UnicodeBlock.BASIC_LATIN; //false

6 answers

solution1
15 ACCEPTED 2009-11-04 12:48:15

solution2
13 2009-11-04 12:48:53

solution3
12 2009-11-04 12:44:28

solution4
2 2009-11-04 12:50:12

solution5
1 2009-11-04 12:45:46

solution6
0 2017-06-06 09:28:03
