简体   繁体   中英

Java String UTF-8 limits

I'm trying to deserialize Strings from files directly and I have a question about very long Strings: Java Strings have a character count limit equal to Integer.MAX_VALUE , which is 31^2-1.

But here comes my question: what happens when I have a UTF-8 String with little less than that size but formed by characters with size more than 1 byte and then I ask Java to give me the byte array?

To make it clearer, what happens if I could run this code? (I haven't got RAM enough):

String toPrint = "";
String string100 = "";
int max = Integer.MAX_VALUE -100;
for (int i = 0; i < 100; i += 10) {
    string100 += "1234567ñ90";
}
for (int i = 0; i < max; i += 100) {
    toPrint += string100;
}
System.out.println("String complete!");
byte[] byteArray = toPrint.getBytes(StandardCharsets.UTF_8);
System.out.println(byteArray.length);
System.exit(0);

Does it print "String complete!"? Or does it break before?

Fundamentally, the limit on Strings is that the char arrays inside of them can't be longer than the maximum array length, which is roughly Integer.MAX_VALUE and greater than your variable max . Strings store their characters in UTF-16 and therefore the UTF-16 representation of a string can't exceed the maximum array length. The number of bytes in UTF-8 and the number of logical characters (Unicode code points, or UTF-32 characters) ultimately don't matter.

Now let's move to your particular example. Since each of the 10 characters in "1234567ñ90" is a single UTF-16 value, that string takes up 10 values of a String 's char array. Despite your code's horrible performance and high memory requirement, it should eventually get to "String complete!" if there is sufficient available memory. However, it will break when converting to UTF-8 because the UTF-8 representation of the string is longer than the maximum array length, since "ñ" requires more than one byte.

Array size is also limited to Integer.MAX_VALUE (which is why String size is limited, after all there's a char[] backing it) , so it's impossible to get the byte array if the encoding uses more bytes than that, no matter what the size of the String is in characters.

The end result would be an OutOfMemoryError , but creating the String in the first place would succeed.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM