How many bytes do English and Chinese characters take in Java?

Question

import java.io.UnsupportedEncodingException;

public class TestChar {

    public static void main(String[] args) throws UnsupportedEncodingException {
        String cnStr = "龙";
        String enStr = "a";
        byte[] cnBytes = cnStr.getBytes("UTF-8");
        byte[] enBytes = enStr.getBytes("UTF-8");

        System.out.println("bytes size of Chinese：" + cnBytes.length);
        System.out.println("bytes size of English：" + enBytes.length);

        // In Java, a char takes two bytes. The question is:
        char cnc = '龙'; // will '龙' take two or three bytes?
        char enc = 'a'; // will 'a' take one or two bytes?
    }
}

Output:

   bytes size of Chinese：3

   bytes size of English：1

Here, my JVM is set to UTF-8. From the output, we can see that the Chinese character '龙' takes 3 bytes and the English character 'a' takes 1 byte. My question is:

In Java, a char takes two bytes. So for char cnc = '龙'; and char enc = 'a';, will cnc take only two bytes instead of three? And will 'a' take two bytes instead of one?

Answer 1

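The code point value of 龙 is 40857 (U+9F99). That fits inside the two bytes of a char.

It takes 3 bytes to encode in UTF-8 because UTF-8 spends some bits of every byte on marking the sequence structure. A 2-byte UTF-8 sequence carries only 11 bits of payload, which covers code points up to U+07FF, so anything from U+0800 to U+FFFF needs 3 bytes.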
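To make the arithmetic concrete, here is a minimal sketch of how the UTF-8 length follows from the code point value. The class name Utf8Layout and the helper utf8Length are hypothetical names made up for this illustration; they are not part of the JDK. The character is written as the escape \u9F99 so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Layout {

    // Hypothetical helper: number of UTF-8 bytes needed for one code point,
    // based on how many payload bits each sequence length can carry.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;     // 7 payload bits (ASCII)
        if (codePoint < 0x800) return 2;    // 11 payload bits
        if (codePoint < 0x10000) return 3;  // 16 payload bits
        return 4;                           // 21 payload bits
    }

    public static void main(String[] args) {
        String cnStr = "\u9F99"; // 龙
        int cp = cnStr.codePointAt(0);

        System.out.println(cp);                       // 40857
        System.out.println(Integer.toHexString(cp));  // 9f99
        System.out.println(utf8Length(cp));           // 3
        // Cross-check against the JDK's own encoder.
        System.out.println(cnStr.getBytes(StandardCharsets.UTF_8).length); // 3
    }
}
```

Since 0x9F99 is larger than 0x7FF, it falls into the 3-byte range, which matches what getBytes reports.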

Answer 2

UTF-8 is a variable-length character encoding, where characters take up 1 to 4 bytes.

A Java char is 16 bits. See 3.1 Unicode in the Java Language Specification to understand how exactly Java handles Unicode.
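As a rough illustration of the 16-bit char, the sketch below checks the size constant and shows how many char values one code point occupies. CharWidth and charsNeeded are hypothetical names for this example; Character.BYTES (Java 8 and later) and Character.charCount are standard JDK members. U+1F600 is used only as a sample code point beyond U+FFFF.

```java
public class CharWidth {

    // Hypothetical helper: how many 16-bit char values a code point needs.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(Character.BYTES);       // 2
        System.out.println(charsNeeded('a'));      // 1
        System.out.println(charsNeeded(0x9F99));   // 1 (龙 fits in one char)
        System.out.println(charsNeeded(0x1F600));  // 2 (needs a surrogate pair)

        // A String built from a code point above U+FFFF has length 2,
        // because length() counts char values, not code points.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());            // 2
    }
}
```

So both 'a' and '龙' occupy exactly one char, while code points above U+FFFF cannot be stored in a single char at all.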

Answer 3

In Java, a char is a UTF-16 code unit and a String is a sequence of them, so it is the same for both: each char is 16 bits (2 bytes), whether it holds '龙' or 'a'.

byte[] cnBytes = cnStr.getBytes("UTF-8");

UTF-8 is a variable-length encoding, so the Chinese character takes more bytes because it is outside the ASCII range.
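The difference between the two views can be sketched by encoding the same strings with two charsets. EncodingCompare and encodedLength are hypothetical names for this example. It uses UTF_16BE rather than UTF_16 on the assumption that the byte order mark should not be counted, since the plain UTF_16 charset prepends a 2-byte BOM when encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCompare {

    // Hypothetical helper: number of bytes a string takes in a given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String enStr = "a";
        String cnStr = "\u9F99"; // 龙

        System.out.println(encodedLength(enStr, StandardCharsets.UTF_8));    // 1
        System.out.println(encodedLength(cnStr, StandardCharsets.UTF_8));    // 3
        System.out.println(encodedLength(enStr, StandardCharsets.UTF_16BE)); // 2
        System.out.println(encodedLength(cnStr, StandardCharsets.UTF_16BE)); // 2
    }
}
```

The byte count depends on the encoding chosen for getBytes, not on the char type itself.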
