import java.io.UnsupportedEncodingException;

public class TestChar {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String cnStr = "龙";
        String enStr = "a";
        byte[] cnBytes = cnStr.getBytes("UTF-8");
        byte[] enBytes = enStr.getBytes("UTF-8");
        System.out.println("bytes size of Chinese:" + cnBytes.length);
        System.out.println("bytes size of English:" + enBytes.length);
        // in Java, a char takes two bytes; the question is:
        char cnc = '龙'; // will '龙' take two or three bytes?
        char enc = 'a'; // will 'a' take one or two bytes?
    }
}
Output :
bytes size of Chinese:3
bytes size of English:1
Here, my JVM's default charset is UTF-8. From the output, we know the Chinese character '龙' takes 3 bytes and the English character 'a' takes 1 byte. My question is:
In Java, a char takes two bytes. Given char cnc = '龙'; and char enc = 'a';, will cnc take only two bytes instead of three? And will enc take two bytes instead of one?
The code point value of 龙 is 40857. That fits inside the two bytes of a char.
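You can verify this directly (a minimal sketch; any standard JVM will do): casting the char to int yields its Unicode code point, which fits comfortably in 16 bits.

```java
public class CodePointCheck {
    public static void main(String[] args) {
        char cnc = '龙';
        // A char widens to its unsigned 16-bit code unit value.
        System.out.println((int) cnc);                // prints 40857
        System.out.println(Integer.toHexString(cnc)); // prints 9f99 (i.e. U+9F99)
    }
}
```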
It takes 3 bytes to encode in UTF-8 because not all 2-byte sequences are valid in UTF-8.
UTF-8 is a variable-length character encoding, where characters take up 1 to 4 bytes.
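The variable width is easy to demonstrate. The sketch below (sample characters are my own choices, one from each width class) encodes a 1-, 2-, 3-, and 4-byte character:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Width {
    public static void main(String[] args) {
        // U+0061 'a', U+00E9 'é', U+9F99 '龙', U+1F600 '😀' (a surrogate pair in Java)
        String[] samples = {"a", "é", "龙", "\uD83D\uDE00"};
        for (String s : samples) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0))
                    + " -> " + b.length + " byte(s) in UTF-8");
        }
    }
}
```

ASCII stays at 1 byte, most of the Basic Multilingual Plane (including CJK) needs 3, and anything above U+FFFF needs 4.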
A Java char is 16 bits. See §3.1 Unicode in the Java Language Specification to understand how exactly Java handles Unicode.
Internally, Strings/chars are UTF-16, so it'll be the same for both: Each char will be 16bits.
byte[] cnBytes = cnStr.getBytes("UTF-8");
UTF-8 is a variable length encoding, so the Chinese char takes more bits because it's out of the ASCII character range.
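To see the in-memory view versus the encoded view side by side, here is a small sketch. UTF-16BE is used for the byte count because, unlike plain "UTF-16", it does not prepend a byte-order mark:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Width {
    public static void main(String[] args) {
        String cn = "龙";
        String en = "a";
        // Both strings are one char long: one 16-bit UTF-16 code unit each.
        System.out.println(cn.length()); // 1
        System.out.println(en.length()); // 1
        // Encoded as UTF-16BE (no BOM), each char is exactly 2 bytes,
        // regardless of script.
        System.out.println(cn.getBytes(StandardCharsets.UTF_16BE).length); // 2
        System.out.println(en.getBytes(StandardCharsets.UTF_16BE).length); // 2
    }
}
```

One caveat: since Java 9, the JVM may store Latin-1-only strings in a compact single-byte form internally, but at the language level a char is always a 16-bit UTF-16 code unit.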