I'm writing code to remove all diacritics from a String.
For example: áÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑ
I'm using the Unicode property InCombiningDiacriticalMarks, but I want to skip the replacement for ñ and Ñ.
For now, I save these two characters before the replacement with:
s = s.replace('ñ', '\001');
s = s.replace('Ñ', '\002');
Is it possible to use InCombiningDiacriticalMarks while ignoring the diacritic of ñ and Ñ?
This is my code:
// Requires: import java.text.Normalizer;
public static String stripAccents(String s)
{
    /* Save ñ and Ñ as placeholder control characters */
    s = s.replace('ñ', '\001');
    s = s.replace('Ñ', '\002');
    /* Decompose, then strip every combining mark */
    s = Normalizer.normalize(s, Normalizer.Form.NFD);
    s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    /* Restore ñ and Ñ */
    s = s.replace('\001', 'ñ');
    s = s.replace('\002', 'Ñ');
    return s;
}
It works fine, but I want to know whether this code can be optimized.
It depends on what you mean by "optimize". It's tough to reduce the number of lines of code from what you have written, but since you are making six passes over the string, there's scope to improve performance by processing the input string only once, character by character:
public class App {

    // See SO answer https://stackoverflow.com/a/10831704/2985643 by virgo47
    private static final String tab00c0
            = "AAAAAAACEEEEIIII"
            + "DNOOOOO\u00d7\u00d8UUUUYI\u00df"
            + "aaaaaaaceeeeiiii"
            + "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey"
            + "AaAaAaCcCcCcCcDd"
            + "DdEeEeEeEeEeGgGg"
            + "GgGgHhHhIiIiIiIi"
            + "IiJjJjKkkLlLlLlL"
            + "lLlNnNnNnnNnOoOo"
            + "OoOoRrRrRrSsSsSs"
            + "SsTtTtTtUuUuUuUu"
            + "UuUuWwYyYZzZzZzF";

    public static void main(String[] args) {
        var input = "AaBbCcáÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑçÇ";
        var output = removeDiacritic(input);
        System.out.println("input = " + input);
        System.out.println("output = " + output);
    }

    public static String removeDiacritic(String input) {
        var output = new StringBuilder(input.length());
        for (var c : input.toCharArray()) {
            if (isModifiable(c)) {
                c = tab00c0.charAt(c - '\u00c0');
            }
            output.append(c);
        }
        return output.toString();
    }

    // Returns true if the supplied char is a candidate for diacritic removal.
    static boolean isModifiable(char c) {
        boolean modifiable;
        if (c < '\u00c0' || c > '\u017f') {
            modifiable = false;
        } else {
            modifiable = switch (c) {
                case 'ñ', 'Ñ' -> false;
                default -> true;
            };
        }
        return modifiable;
    }
}
This is the output from running the code:
input = AaBbCcáÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑçÇ
output = AaBbCcaAeEiIoOuUaAeEiIoOuUñÑcC
Characters without diacritics in the input string are not modified. Otherwise the diacritic is removed (e.g. Ç to C), except in the cases of ñ and Ñ.
Notes:
The code does not use the Normalizer class or InCombiningDiacriticalMarks at all. Instead it processes each character in the input string only once, removing its accent if appropriate. The conventional approach for removing diacritics (as used in the OP) does not support selective removal as far as I know.
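That said, if you would rather stay with the Normalizer approach, a regex with a negative lookbehind might give you selective removal. The following is only a sketch and not part of the code above: the class name SelectiveStrip and the constant MARKS are made up for illustration. It assumes that ñ and Ñ decompose under NFD to n or N followed by the combining tilde U+0303, and that a final NFC pass recomposes them. It has not been benchmarked, and it still makes several passes over the string.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
class SelectiveStrip {

    // Matches any combining diacritical mark, except a combining tilde
    // (U+0303) that directly follows 'n' or 'N'.
    private static final Pattern MARKS = Pattern.compile(
            "(?<![nN])\\u0303"
            + "|[\\p{InCombiningDiacriticalMarks}&&[^\\u0303]]");

    static String stripAccents(String s) {
        // Decompose: 'ñ' becomes 'n' + U+0303, 'é' becomes 'e' + U+0301, etc.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Drop every mark except the tilde that belongs to n or N.
        String stripped = MARKS.matcher(decomposed).replaceAll("");
        // Recompose so the surviving 'n' + U+0303 pairs become 'ñ' again.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("AaBbCcáÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑçÇãõ"));
    }
}
```

The tilde is still removed from other letters such as ã and õ, because the lookbehind only protects it after n or N. Note that the result is always returned in NFC form, which may differ from the normalization form of the input.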