Suppose I have a string that contains Ü. How would I find all those unicode characters? Should I test for their code? How would I do that?
For example, given the string "AÜXÜ", I'd like to transform it to "AYXY". I'd like to do the same for other unicode characters, and I would hate to have to store them in a translation map of some sort.
The definition of "unicode characters" is vague, but will be taken here to mean Unicode characters not covered by a standard ISO 8859 charset. If this is true in your case, then loop through all characters in the String and test each one's code point to determine whether it is within the given character set. (Note that Ü, U+00DC, is itself part of ISO 8859-1, so for the example in the question the relevant boundary would be ASCII.)
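A minimal sketch of that check, assuming you want to replace every character a given charset cannot represent. The class and method names are hypothetical; CharsetEncoder.canEncode is the standard JDK call used to test membership.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class CharsetCheck {
    // Hypothetical helper: replaces every char the encoder cannot represent.
    // Works char by char, so a supplementary character (a surrogate pair)
    // is treated as two unencodable chars.
    static String replaceUnencodable(String input, CharsetEncoder encoder, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            out.append(encoder.canEncode(c) ? c : replacement);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "A\u00DCX\u00DC"; // "AÜXÜ", escaped to avoid source-encoding issues
        CharsetEncoder ascii = StandardCharsets.US_ASCII.newEncoder();
        System.out.println(replaceUnencodable(original, ascii, 'Y')); // prints "AYXY"
    }
}
```

With StandardCharsets.ISO_8859_1 instead of US_ASCII the string comes back unchanged, because Ü is representable in that charset.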
Alternatively, use a Map<Character, Character> and replace the characters in the string that match the map's keys. For example:
// Requires: import java.util.HashMap; import java.util.Map;
Map<Character, Character> charReplacementMap = new HashMap<>();
charReplacementMap.put('Ü', 'Y');
// Put more here.

String originalString = "AÜAÜ";
StringBuilder builder = new StringBuilder();
for (char currentChar : originalString.toCharArray()) {
    // Look up a replacement; keep the original character when none is mapped.
    Character replacementChar = charReplacementMap.get(currentChar);
    builder.append(replacementChar != null ? replacementChar : currentChar);
}
String newString = builder.toString(); // "AYAY"
Or, do you mean "all characters with diacritics"? If so, then use java.text.Normalizer to remove diacritical marks:
import java.text.Normalizer;
import java.text.Normalizer.Form;

/**
 * Remove any diacritical marks (accents like ç, ñ, é, etc) from
 * the given string (so that it returns plain c, n, e, etc).
 * @param string The string to remove diacritical marks from.
 * @return The string with removed diacritical marks, if any.
 */
public static String removeDiacriticalMarks(String string) {
    // NFD splits each accented character into a base letter plus combining marks;
    // the regex then strips the combining marks.
    return Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
One pitfall: Ü would become U, not Y. Not sure if that's what you're after. If you want to replace by pronounced character, you'll really need to create a mapping. Sure, it's tedious work, but it's done in less time than you needed to follow this topic.
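The two ideas can also be combined, as a rough sketch: keep a small map only for the characters that need a pronunciation-based replacement, and let Normalizer handle everything else. The class name, the map contents and the Ü to Y rule are illustrative assumptions taken from the question, not an established transliteration scheme.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class Transliterator {
    // Hypothetical override table: only characters whose replacement
    // differs from "base letter without the accent" need an entry.
    private static final Map<Character, Character> OVERRIDES = new HashMap<>();
    static {
        OVERRIDES.put('\u00DC', 'Y'); // precomposed Ü -> Y
    }

    static String transliterate(String input) {
        // Step 1: apply the explicit overrides.
        StringBuilder mapped = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            Character override = OVERRIDES.get(c);
            mapped.append(override != null ? override : c);
        }
        // Step 2: decompose what is left and strip the combining marks.
        return Normalizer.normalize(mapped, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(transliterate("A\u00DCX\u00DC"));       // prints "AYXY"
        System.out.println(transliterate("\u00E9\u00F1\u00E7"));   // prints "enc"
    }
}
```

Note that the override only matches the precomposed form of Ü; an input that already stores it as U followed by a combining diaeresis would end up as plain U unless it is normalized to NFC first.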
You could loop through your string and, for every character c, check its Unicode block:
if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
    // replace with Y
}
You could go the other way round and ask whether the character is an ASCII character.
public static boolean isAscii(char ch) {
    return ch < 128;
}
You'd then have to analyse the string char by char, of course.
(The method is from commons-lang CharUtils, which contains loads of useful Character methods.)
It isn't clear to me exactly what is gained by transforming "AÜXÜ" to "AYXY". Is this because Ü is pronounced like Y in a particular language? What language? And what other rules might apply?
In terms of terminology...
"a"
The above is a Unicode string. It contains a single UTF-16 encoded character.
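To make the terminology concrete, here is a small sketch that counts the same text three ways: as UTF-16 code units (Java char values), as Unicode code points, and as UTF-8 bytes. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class UnicodeTerms {
    // Number of UTF-16 code units, i.e. what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes when the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = { "a", "\u00DC", "\uD83D\uDE00" }; // a, Ü, U+1F600
        for (String s : samples) {
            System.out.println(utf16Units(s) + " " + codePoints(s) + " " + utf8Bytes(s));
        }
        // prints "1 1 1", then "1 1 2", then "2 1 4"
    }
}
```

This is why testing char values is enough for characters like Ü, while characters outside the Basic Multilingual Plane need code-point-aware iteration.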
If you wish to limit the range of characters to the English alphabet, have a look at the Normalization performed in this answer.
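I'm not sure what you're trying to do with your example. If you're just trying to replace all non-ASCII values with Y, then you can iterate over the string looking for code points outside the range 0 to 127 and replace those code points with Y.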
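A minimal sketch of that code-point loop, assuming every non-ASCII code point should become a single replacement character. The class and method names are hypothetical; String.codePoints() is available from Java 8 onwards.

```java
final class NonAsciiReplacer {
    // Replaces each code point outside 0..127 with the given character.
    // Iterating by code point means a surrogate pair yields one replacement,
    // not two.
    static String replaceNonAscii(String input, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (cp <= 127) {
                out.append((char) cp); // safe cast: ASCII fits in one char
            } else {
                out.append(replacement);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceNonAscii("A\u00DCX\u00DC", 'Y'));    // prints "AYXY"
        System.out.println(replaceNonAscii("A\uD83D\uDE00B", 'Y'));    // prints "AYB"
    }
}
```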
The class Character also offers some interesting methods. Take a look at it.
Character.UnicodeBlock.of('a') == Character.UnicodeBlock.BASIC_LATIN; //true
Character.UnicodeBlock.of('Ü') == Character.UnicodeBlock.BASIC_LATIN; //false