
Java: size in bytes of a string of Japanese characters

I'm trying to calculate the length of a String of Japanese characters, '漢字仮名交じり文':

    String testStr = "漢字仮名交じり文";
    try {
        System.out.println("Length : " + testStr.getBytes("UTF-16").length);
    } catch (Exception ex) {
        .....
    }

There are 8 characters in the string, but this excerpt prints 18. Why is it 18?

It is 18 because you have 8 characters, and each of these particular characters is encoded in UTF-16 as a single 2-byte code unit. That gives 8*2=16 bytes, plus the 2-byte BOM that Java's "UTF-16" charset inserts at the beginning of the byte array.

This is your byte sequence (fe ff is the so-called BOM, or Byte Order Mark, which lets a reader detect whether the byte sequence uses little-endian or big-endian byte order):

fe ff 6f 22 5b 57 4e ee 54 0d 4e a4 30 58 30 8a 65 87

This is how I printed the byte sequence (it is crude code only meant for testing this out of course):

    final String text = "漢字仮名交じり文";
    byte[] bytes = text.getBytes("UTF-16");
    for (int i = 0; i < bytes.length; ++i) {
        System.out.printf("%02x ", bytes[i]);
    }
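One way to check that the two extra bytes really are the BOM is to compare the generic "UTF-16" charset with the explicit-endianness variants, which Java encodes without a BOM. This is only a minimal sketch; the class name `Utf16BomDemo` and the helper `byteCount` are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {
    // Hypothetical helper: number of bytes the text occupies in the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "漢字仮名交じり文";
        // "UTF-16" encodes big-endian and prepends the BOM fe ff.
        System.out.println(byteCount(text, StandardCharsets.UTF_16));   // 18
        // The BE/LE variants fix the byte order, so no BOM is written.
        System.out.println(byteCount(text, StandardCharsets.UTF_16BE)); // 16
        System.out.println(byteCount(text, StandardCharsets.UTF_16LE)); // 16
    }
}
```

Using the `StandardCharsets` constants instead of a charset name also avoids the checked `UnsupportedEncodingException`.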

You are getting the byte count, which is not the character count. The number of bytes per character depends on the encoding: in UTF-16 (which you used) a character takes 2 or 4 bytes, while in UTF-8 it takes from 1 to 4 bytes.
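To illustrate how the byte count varies with the encoding, here is a small sketch that encodes three sample characters in UTF-8 and UTF-16. The class and helper names are hypothetical, and UTF-16BE is used so that no BOM is added to the counts:

```java
import java.nio.charset.StandardCharsets;

class BytesPerCharDemo {
    // Hypothetical helper: bytes needed for the text in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical helper: bytes needed in UTF-16, big-endian, without a BOM.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "A";
        String kanji = "漢";
        // U+20BB7 lies outside the Basic Multilingual Plane,
        // so it needs a surrogate pair in UTF-16.
        String supplementary = new String(Character.toChars(0x20BB7));

        for (String s : new String[] {ascii, kanji, supplementary}) {
            System.out.println(utf8Bytes(s) + " bytes in UTF-8, "
                    + utf16Bytes(s) + " bytes in UTF-16");
        }
        // 1 bytes in UTF-8, 2 bytes in UTF-16
        // 3 bytes in UTF-8, 2 bytes in UTF-16
        // 4 bytes in UTF-8, 4 bytes in UTF-16
    }
}
```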

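If you actually want the character count of a given string, an easy (though not optimal) way is to take the length of its char array. Calling testStr.length() returns the same value without copying the string; both count UTF-16 code units: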

    String testStr = "漢字仮名交じり文";
    System.out.println(testStr.toCharArray().length);

Prints 8
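Both `toCharArray().length` and `length()` count UTF-16 code units, so they match the character count only while every character fits in one code unit, as is the case for this string. For text that may contain supplementary characters, `codePointCount` counts Unicode code points instead. A minimal sketch, with a hypothetical class name and helper:

```java
class CharCountDemo {
    // Hypothetical helper: counts Unicode code points rather than UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String testStr = "漢字仮名交じり文";
        System.out.println(testStr.length());    // 8
        System.out.println(codePoints(testStr)); // 8

        // Append U+20BB7, which needs two UTF-16 code units (a surrogate pair).
        String extended = testStr + new String(Character.toChars(0x20BB7));
        System.out.println(extended.length());    // 10
        System.out.println(codePoints(extended)); // 9
    }
}
```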


 