
Java String UTF-8 limits

I'm trying to deserialize Strings directly from files, and I have a question about very long Strings. A Java String has a length limit of Integer.MAX_VALUE, which is 2^31 - 1 (2,147,483,647) chars.

But here comes my question: what happens when I have a String slightly shorter than that limit that contains characters taking more than 1 byte in UTF-8, and I then ask Java to give me the UTF-8 byte array?

To make it clearer, what would happen if I could run this code? (I don't have enough RAM.)

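import java.nio.charset.StandardCharsets;

// The class name is arbitrary; it only makes the snippet compilable.
public class LongStringTest {
    public static void main(String[] args) {
        String toPrint = "";
        String string100 = "";
        int max = Integer.MAX_VALUE - 100;
        // Build a 100-char block; "ñ" takes 2 bytes in UTF-8, the rest take 1.
        for (int i = 0; i < 100; i += 10) {
            string100 += "1234567ñ90";
        }
        // Append the block until the String is just under the length limit.
        for (int i = 0; i < max; i += 100) {
            toPrint += string100;
        }
        System.out.println("String complete!");
        // The UTF-8 encoding needs more bytes than the String has chars.
        byte[] byteArray = toPrint.getBytes(StandardCharsets.UTF_8);
        System.out.println(byteArray.length);
        System.exit(0);
    }
}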

Does it print "String complete!"? Or does it fail before that?

Fundamentally, the limit on Strings is that the char array inside them can't be longer than the maximum array length, which is roughly Integer.MAX_VALUE and greater than your variable max. Strings store their characters in UTF-16, so the UTF-16 representation of a string can't exceed the maximum array length. The number of bytes in UTF-8 and the number of logical characters (Unicode code points, or UTF-32 characters) ultimately don't matter for that limit.
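To make the three ways of measuring a string concrete, here is a small sketch. The class name StringUnits and its helper methods are made up for illustration; the sample string uses Unicode escapes for "ñ", "€" and an emoji so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, only for illustration.
class StringUnits {
    // Number of UTF-16 code units: this is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (logical characters).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the UTF-8 encoding.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+00F1 (ñ), U+20AC (€), U+1F600 (emoji, a surrogate pair in UTF-16)
        String s = "\u00F1\u20AC\uD83D\uDE00";
        System.out.println("UTF-16 units: " + utf16Units(s)); // 4
        System.out.println("code points:  " + codePoints(s)); // 3
        System.out.println("UTF-8 bytes:  " + utf8Bytes(s));  // 9
    }
}
```

Only the first number is bounded by the String length limit; the UTF-8 byte count can be larger than it, as here.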

Now let's move to your particular example. Since each of the 10 characters in "1234567ñ90" is a single UTF-16 value, that string takes up 10 values of a String's char array. Despite your code's horrible performance and high memory requirement, it should eventually get to "String complete!" if there is sufficient available memory. However, it will break when converting to UTF-8, because the UTF-8 representation of the string is longer than the maximum array length: "ñ" requires two bytes in UTF-8.
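The sizes involved can be worked out without allocating anything. The sketch below redoes the loop arithmetic from the question in long; the class name Utf8SizeEstimate and its methods are hypothetical, and the figure of 110 bytes per block assumes 9 one-byte characters plus one two-byte "ñ" in each "1234567ñ90".

```java
// Hypothetical helper class: arithmetic only, no large allocations.
class Utf8SizeEstimate {
    // How many times the body of `for (int i = 0; i < max; i += step)` runs.
    static long appendCount(int max, int step) {
        return ((long) max + step - 1) / step;
    }

    // chars in toPrint: 100 chars per appended block.
    static long totalChars() {
        return appendCount(Integer.MAX_VALUE - 100, 100) * 100L;
    }

    // UTF-8 bytes: each 10-char piece is 11 bytes, so a 100-char block is 110 bytes.
    static long totalUtf8Bytes() {
        return appendCount(Integer.MAX_VALUE - 100, 100) * 110L;
    }

    public static void main(String[] args) {
        System.out.println("chars:       " + totalChars());     // 2147483600
        System.out.println("UTF-8 bytes: " + totalUtf8Bytes()); // 2362231960
        System.out.println("bytes fit in an int index: "
                + (totalUtf8Bytes() <= Integer.MAX_VALUE));      // false
    }
}
```

The char count stays just below Integer.MAX_VALUE, while the UTF-8 byte count is about 10% larger and cannot be the length of any byte[].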

Array size is also limited to Integer.MAX_VALUE (which is why String size is limited; after all, there's a char[] backing it), so it's impossible to get the byte array if the encoding uses more bytes than that, no matter what the size of the String is in characters.

The end result would be an OutOfMemoryError, but creating the String in the first place would succeed.
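If the aim is only to find out in advance whether getBytes could succeed, one possible approach is to count the UTF-8 bytes as a long instead of materializing the array. This is a sketch, not something from the original answers: the class Utf8Length and its method are hypothetical, and it assumes unpaired surrogates should count as one byte, since String.getBytes replaces them with "?".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, only for illustration.
class Utf8Length {
    // Counts the bytes String.getBytes(UTF_8) would produce, without allocating them.
    static long utf8Length(CharSequence s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // e.g. "ñ" (U+00F1)
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // supplementary code point
                i++;                              // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                       // unpaired surrogate, encoded as '?'
            } else {
                bytes += 3;                       // rest of the Basic Multilingual Plane
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String block = "1234567\u00F190";         // the 10-char piece from the question
        long counted = utf8Length(block);
        int actual = block.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(counted + " bytes counted, " + actual + " bytes encoded");
        System.out.println("fits in a byte[]: " + (counted <= Integer.MAX_VALUE));
    }
}
```

When the count exceeds the array limit, the data would have to be encoded in pieces, for example by writing it through a stream, rather than as a single byte[].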


 