
UTF-8 string conversion turns non-English characters into invalid characters

I am converting a byte array to a string with UTF-8 encoding:

new String(bytearray, StandardCharsets.UTF_8);

It changes the string Impresión to Impresi n . But if I execute the code below:

new String(bytearray);

it gives the proper string, Impresión.

I want to build the string using UTF-8 without any non-English character being turned into an invalid character.

Any help would be appreciated.

String objects in Java use the UTF-16 encoding and cannot be modified.

If you need to handle text in a different encoding, you must store the data in a byte[] array, and when you convert it to a String, you must specify the same encoding that was used to encode the byte array.

Therefore, when you construct a String from a byte array, the constructor has to know the original encoding so that it can convert the bytes into UTF-16. This is why your first snippet did not work: the charset you passed to the constructor was apparently not the one the bytes were actually encoded in, so Java was unable to decode the byte array properly. In the second snippet you did not specify an encoding, so Java used your system's default charset, which was probably the same encoding that was used to produce the byte array, and so the character came out correctly.
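To see which charset the no-argument constructor picks up, a small check along these lines may help. It is only a sketch: the class name DefaultCharsetDemo and the helper decodeWithDefault are made up for illustration, and the charset printed depends on your JVM and operating system.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    // Hypothetical helper: decodes explicitly with the platform default charset,
    // which is what new String(byte[]) does implicitly.
    static String decodeWithDefault(byte[] bytes) {
        return new String(bytes, Charset.defaultCharset());
    }

    public static void main(String[] args) {
        // The value printed here varies by machine and JVM settings.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // \u00f3 is the letter o with acute accent; escaped so the example
        // does not depend on the encoding of the source file.
        byte[] bytes = "Impresi\u00f3n".getBytes(); // encoded with the default charset

        // Both decodings use the same charset, so they agree.
        System.out.println(new String(bytes).equals(decodeWithDefault(bytes)));
    }
}
```

If the charset printed is not UTF-8, bytes produced elsewhere on the same system with the default charset will not decode correctly as UTF-8, which matches the behaviour described in the question.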

To fix this, ensure that the byte array is encoded with the same encoding that you specify when you decode it into a String.
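As a rough illustration of that rule, the sketch below assumes the original bytes were produced with ISO-8859-1. That is only a guess at what the asker's source might be using, and the class name CharsetRoundTrip and the helper decode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Hypothetical helper: decodes the bytes with the charset the caller names.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // \u00f3 is the letter o with acute accent, escaped to avoid
        // depending on the source file encoding.
        String original = "Impresi\u00f3n";

        // Assumption: the producer encoded the text as ISO-8859-1,
        // where the accented letter is the single byte 0xF3.
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);

        // Matching charset: the text is restored.
        String matched = decode(latin1, StandardCharsets.ISO_8859_1);

        // Mismatched charset: 0xF3 is not valid UTF-8 here, so the decoder
        // substitutes the replacement character U+FFFD.
        String mismatched = decode(latin1, StandardCharsets.UTF_8);

        System.out.println(matched.equals(original));
        System.out.println(mismatched.indexOf('\uFFFD') >= 0);
    }
}
```

The replacement character is often shown as ? or as a blank on consoles that cannot display it.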

For more information, see the link below, particularly the introduction, which describes how a String represents text in UTF-16:

https://docs.oracle.com/javase/7/docs/api/java/lang/String.html

The text changes because your source byte array is not UTF-8 encoded. The code below works fine for me:

    byte[] bytearray = "Impresión".getBytes(StandardCharsets.UTF_8);
    String s = new String(bytearray, StandardCharsets.UTF_8);
    System.out.println(s);

and the output is

Impresión

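But when I run the code below, where getBytes() with no argument uses the platform default charset (which has to be something other than UTF-8 to produce this result):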

    byte[] bytearray = "Impresión".getBytes();
    String s = new String(bytearray, StandardCharsets.UTF_8);
    System.out.println(s);

it prints

Impresi?n

You need to use the same charset for encoding and decoding.
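If you are not sure whether a byte array really is UTF-8, one option is to decode it strictly so that invalid input raises an exception instead of being silently replaced. The following is a minimal sketch of that idea; the class name StrictUtf8 and the methods decodeStrict and isValidUtf8 are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Decodes as UTF-8 and throws if the bytes are not well-formed UTF-8.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience wrapper: true only if strict decoding succeeds.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "Impresi\u00f3n"; // \u00f3 is o with acute accent

        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(utf8));   // well-formed UTF-8
        System.out.println(isValidUtf8(latin1)); // byte 0xF3 followed by 'n' is malformed
    }
}
```

A check like this only tells you that the bytes are not UTF-8. It does not tell you which charset they were actually written in, so that still has to come from whoever produced the data.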
