
UTF-8 to ASCII conversion in java

I have a string that contains UTF-8 encoded text.

String str = "100ÂµF";

My desired output for the above string is "100µF".

I searched Stack Overflow and found the code below:

public static String decompose(String s) {
    return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+","");
}

But the output I got for the above string was "100AµF".

This is an XY problem.

The problem here is that your String was created from bytes using an incorrect charset, one that assumes each byte is one character, such as ISO 8859-1.

But the bytes are not ASCII and they are not ISO 8859-1. The bytes are a UTF-8 representation of text.

Do not replace any characters. Do not normalize the string. The only correct solution is to revert the incorrectly decoded String back to bytes, then correctly decode the bytes using UTF-8:

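// requires: import java.nio.charset.StandardCharsets;
byte[] originalBytes = str.getBytes(StandardCharsets.ISO_8859_1); // recover the raw bytes

str = new String(originalBytes, StandardCharsets.UTF_8); // decode them as UTF-8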
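To see the whole round trip in one place, here is a minimal sketch. The class name `MojibakeRepair` and the helper `repair` are made up for illustration, and the non-ASCII characters are written as `\u` escapes so the result does not depend on how the source file itself is encoded. It first reproduces the bug (UTF-8 bytes decoded as ISO 8859-1) and then reverses it:

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {

    // Hypothetical helper: undoes a wrong ISO-8859-1 decode of UTF-8 bytes.
    static String repair(String wronglyDecoded) {
        // ISO-8859-1 maps chars U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
        // so this step recovers the original bytes unchanged.
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes with the charset they were actually written in.
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String correct = "100\u00B5F"; // "100", MICRO SIGN, "F"

        // Reproduce the bug: UTF-8 bytes decoded as if they were ISO-8859-1.
        byte[] utf8 = correct.getBytes(StandardCharsets.UTF_8);
        String broken = new String(utf8, StandardCharsets.ISO_8859_1);

        // MICRO SIGN is the two bytes C2 B5 in UTF-8, which ISO-8859-1
        // reads as the two characters U+00C2 and U+00B5.
        System.out.println(broken.equals("100\u00C2\u00B5F")); // true
        System.out.println(repair(broken).equals(correct));    // true
    }
}
```

This relies on the wrong decoder having been ISO 8859-1, which is lossless for every byte value. If some other single-byte charset was used, the same idea applies with that charset in `getBytes`, but the round trip may not be exact.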

There is no µ char in ASCII, so you can't write it in ASCII.
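A quick way to check this for yourself is sketched below. The class `AsciiCheck` and its two helpers are illustrative names, not part of any library; the calls they wrap (`CharsetEncoder.canEncode` and `String.getBytes(Charset)`) are standard JDK methods.

```java
import java.nio.charset.StandardCharsets;

class AsciiCheck {

    // True if every character of s has a US-ASCII encoding.
    static boolean isAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    // Encodes to US-ASCII and back; getBytes(Charset) substitutes the
    // charset's replacement byte ('?') for characters it cannot map.
    static String toAsciiLossy(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String s = "100\u00B5F"; // "100", MICRO SIGN, "F"
        System.out.println(isAscii(s));      // false
        System.out.println(toAsciiLossy(s)); // 100?F
    }
}
```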

Java Strings are sequences of Unicode characters (internally encoded in UTF-16), so the problem you have comes either from how you read this string or from how you write it.

Normally this kind of problem is solved by creating an OutputStreamWriter(OutputStream out, String charsetName) or an InputStreamReader(InputStream in, String charsetName) with the correct character set.

So if, for example, you get your string from a UTF-8 encoded file, you should create a reader like this:

Reader in = new InputStreamReader(new FileInputStream("some_file.txt"), "UTF-8");

Or if you are writing to an ISO-Latin-1 file you should create the Writer like this:

Writer out = new OutputStreamWriter(new FileOutputStream("some_file.txt"), "ISO-8859-1");
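Putting the reader and the writer together, a rough sketch could look like this. The class `CharsetRoundTrip`, its helper methods, and the temporary file are assumptions made for the example. It uses the constructor overloads that take a `Charset` rather than a charset name, which avoids the checked `UnsupportedEncodingException`.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CharsetRoundTrip {

    // Writes text to the file using an explicit charset.
    static void write(Path file, String text, Charset cs) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(file.toFile()), cs)) {
            out.write(text);
        }
    }

    // Reads the whole file using an explicit charset.
    static String read(Path file, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new FileInputStream(file.toFile()), cs)) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Writes text to a temporary file with one charset, reads it back with another.
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                write(file, text, writeWith);
                return read(file, readWith);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "100\u00B5F"; // "100", MICRO SIGN, "F"

        // Same charset on both sides: the text survives.
        String ok = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(text)); // true

        // UTF-8 file read as ISO-8859-1: an extra U+00C2 appears.
        String bad = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(bad.equals("100\u00C2\u00B5F")); // true
    }
}
```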

Similar things can happen with HTTP requests and responses, depending on how the application server or the browser interprets the body. If that's your case, please add some detail to your question.

You are dealing with µ (U+00B5, MICRO SIGN) and Â (U+00C2, LATIN CAPITAL LETTER A WITH CIRCUMFLEX). Both of these characters belong to the Latin-1 Supplement Unicode block.
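If you want to confirm the code points and the block, something along these lines should do it. The class name `BlockCheck` and the helper `isLatin1Supplement` are illustrative only; `Character.UnicodeBlock.of` and `Character.getName` are standard JDK methods.

```java
class BlockCheck {

    // True if the code point lies in the Latin-1 Supplement block (U+0080..U+00FF).
    static boolean isLatin1Supplement(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.LATIN_1_SUPPLEMENT;
    }

    public static void main(String[] args) {
        String input = "100\u00C2\u00B5F"; // the wrongly decoded string

        // Print code point, Unicode name and block membership for each character.
        input.codePoints().forEach(cp ->
            System.out.printf("U+%04X %s latin1Supplement=%b%n",
                cp, Character.getName(cp), isLatin1Supplement(cp)));
    }
}
```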

If you want to allow µ but disallow Â, you have to do the filtering yourself. There won't be a separate character group (\\p{}) for each of the characters.

One way to do it is to define a white-list filter:

String input = "100ÂµF";
String allowedFilter = "[^\\p{ASCII}µ]"; // regular ASCII + µ sign
String output = input.replaceAll(allowedFilter, "");
System.out.println(output); // 100µF

Do note that both µ and Â can be represented in Extended ASCII, so filtering one and not the other is counterintuitive.
