UTF-8 in Java's String.getBytes(Charset)

I read some documentation about the String.getBytes(Charset) method in Java.

It converts a String to a byte array (a byte can hold values from -2^7 to 2^7-1).

As far as I know, each character in UTF-8 is encoded with 1 to 4 bytes. What happens if the code of a character in UTF-8 is larger than 2^7-1?

I tried with

String s = "Hélô";

and then got the garbled string 'HÃ©lÃ´' from:

String sr=new String(s.getBytes("UTF-8"),Charset.forName("UTF-8"));

I want it to return the original value 'Hélô'.

Can anybody explain this? Thanks. (Sorry for my English)

As Jon already said, the reason is that you use different encodings. In UTF-8 encoding the characters é and ô are encoded as two bytes each.

ISO-8859-1: H  é  l  ô
     bytes: 48 E9 6C F4

UTF-8     : H  é    l  ô
     bytes: 48 C3A9 6C C3B4
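
The byte values in the two tables above can be checked with a short program. This is only a minimal sketch: the class name EncodingBytes and the helper toHex are made up for illustration, and the string is written with Unicode escapes so the result does not depend on the encoding of the source file. It also touches on the question about values above 2^7-1: Java's byte is signed, so the UTF-8 lead byte 0xC3 is stored as -61, and masking with 0xFF recovers the unsigned value.

```java
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    // Hypothetical helper: renders each byte as two uppercase hex digits.
    // The "& 0xFF" mask turns the signed byte into its unsigned value 0..255.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Hélô" written with escapes to avoid source-file encoding issues
        String s = "H\u00e9l\u00f4";

        System.out.println(toHex(s.getBytes(StandardCharsets.ISO_8859_1))); // 48 E9 6C F4
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));      // 48 C3 A9 6C C3 B4

        // The same 0xC3 byte seen as a signed Java byte
        System.out.println(s.getBytes(StandardCharsets.UTF_8)[1]);          // -61
    }
}
```

So a character whose code is larger than 2^7-1 is not squeezed into one byte; UTF-8 spreads it over several bytes, each of which still fits in a Java byte once read as a signed value.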

Your example of the wrong string result looks like this in bytes:

UTF-8 bytes interpreted as ISO-8859-1
H  Ã  ©  l  Ã  ´
48 C3 A9 6C C3 B4
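
The mismatch can be reproduced directly. The following is a rough sketch, not a diagnosis of the original program: the class name MojibakeDemo and the helper roundTrip are hypothetical, and it assumes the wrong charset is applied at the decoding step. In practice the same effect can also come from a console or a source file being read with a different encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: encodes with one charset, decodes with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // "Hélô" written with escapes to avoid source-file encoding issues
        String s = "H\u00e9l\u00f4";

        // Same charset on both sides: the original string comes back.
        String right = roundTrip(s, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(right.equals(s)); // true

        // UTF-8 bytes decoded as ISO-8859-1: each two-byte sequence
        // becomes two separate characters, giving "HÃ©lÃ´".
        String wrong = roundTrip(s, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals("H\u00c3\u00a9l\u00c3\u00b4")); // true
        System.out.println(wrong.length()); // 6
    }
}
```

Using the StandardCharsets constants on both sides, rather than a charset name on one side and the platform default on the other, keeps the encode and decode steps visibly in agreement.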
