Why don't DataOutputStream.writeChars(str) and String(byte[]) use the same encoding?

I'm writing some marshaling/unmarshaling routines for a class project and am a bit perplexed about Java's default behavior in this case. Here are my "naive" subroutines for writing and reading strings to and from byte streams:

protected static void write(DataOutputStream dout, String str)
        throws IOException{
    dout.writeInt(str.length());
    dout.writeChars(str);
}

protected static String readString(DataInputStream din)
        throws IOException{
    int strLength = 2*din.readInt(); // b/c there are two bytes per char
    byte[] stringHolder = new byte[strLength];
    din.read(stringHolder);
    return new String(stringHolder);
}

Unfortunately, this simply doesn't work: the characters are written in UTF-16 format by default, but String(byte[]) seems to assume that each byte will contain a character, and since ASCII characters all start with a 0 byte in UTF-16, the constructor appears to just give up and return an empty string. The solution is to change readString to specify that it must use UTF-16 encoding:

protected static String readString(DataInputStream din)
        throws IOException{
    int strLength = 2*din.readInt();
    byte[] stringHolder = new byte[strLength];
    din.read(stringHolder);
    return new String(stringHolder, "UTF-16");
}
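
A reader that mirrors writeChars() directly is also possible: DataInputStream.readChar() reads two bytes, high byte first, which is the same layout writeChars() produces for each char, so no charset name is needed. The following is only a minimal sketch, assuming the same length-prefixed layout as write() above; the class name CharRoundTrip and the roundTrip helper are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical class name, for illustration only.
class CharRoundTrip {

    // Same layout as the question's write(): char count, then two bytes per char.
    static void write(DataOutputStream dout, String str) throws IOException {
        dout.writeInt(str.length());
        dout.writeChars(str);
    }

    // Reads the chars back one at a time, so no charset is involved.
    static String readString(DataInputStream din) throws IOException {
        int length = din.readInt();
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            chars[i] = din.readChar(); // two bytes, high byte first
        }
        return new String(chars);
    }

    // Hypothetical helper: writes to an in-memory buffer and reads it back.
    static String roundTrip(String str) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer), str);
            byte[] bytes = buffer.toByteArray();
            return readString(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello").equals("hello")); // prints true
    }
}
```

If a byte[] is filled from the stream instead, readFully() is the safer call, since read() may return before the array is full. Naming the charset as UTF-16BE rather than "UTF-16" also avoids having a leading byte-order mark interpreted during decoding.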

My question is, why is this necessary? Since Java uses UTF-16 for strings by default, why wouldn't it assume that UTF-16 is being used when reading chars from bytes? Or, alternatively, why wouldn't it just encode the chars as bytes in the first place by default? In short, why don't the default behaviors of the writeChars() method and the String(byte[]) constructor parallel each other?

The issue is that writeChars() writes out the String's underlying char values, two bytes per char with the high byte first, which amounts to a UTF-16 (big-endian) representation of the string; see the javadoc.
You are then reading with the String(byte[] bytes) constructor, which is designed for data encoded with the platform's default charset; in your case this is presumably UTF-8.
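
To make the mismatch concrete, here is a small sketch that dumps the bytes writeChars() produces and decodes them two ways. The class and method names are invented for the example, and UTF-8 is used explicitly in place of "the platform default", which is an assumption about the asker's machine.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class name, for illustration only.
class WriteCharsBytes {

    // Returns the raw bytes that writeChars() emits for str.
    static byte[] writeCharsBytes(String str) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new DataOutputStream(buffer).writeChars(str);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeCharsBytes("Hi");
        System.out.println(Arrays.toString(bytes)); // prints [0, 72, 0, 105]

        // Decoding with the matching charset recovers the two chars.
        String asUtf16 = new String(bytes, StandardCharsets.UTF_16BE);
        // Decoding as UTF-8 turns every byte into its own char, NULs included.
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(asUtf16.length()); // prints 2
        System.out.println(asUtf8.length());  // prints 4
    }
}
```

Decoded as UTF-8, the result is not actually empty: it contains NUL characters interleaved with the letters. Many consoles and debuggers display such a string as blank or truncated, which may be why it looked empty.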
You need to be consistent. In fact, the DataOutputStream.writeUTF() and DataInputStream.readUTF() methods are designed especially for this.
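
A minimal sketch of that pairing, using an in-memory buffer; the class name and the roundTrip helper are hypothetical. writeUTF() writes a two-byte length followed by the string in Java's modified UTF-8, and readUTF() reads exactly that format back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical class name, for illustration only.
class UtfRoundTrip {

    // Writes str with writeUTF() and reads it back with readUTF().
    static String roundTrip(String str) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream dout = new DataOutputStream(buffer);
            dout.writeUTF(str); // 2-byte length prefix, then modified UTF-8

            byte[] bytes = buffer.toByteArray();
            DataInputStream din = new DataInputStream(new ByteArrayInputStream(bytes));
            return din.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello").equals("hello")); // prints true
    }
}
```

One limitation to keep in mind: because the length prefix is two bytes, writeUTF() throws UTFDataFormatException when the encoded string exceeds 65535 bytes.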
If you want to work with a byte[] directly for some reason, you can easily get the UTF-8 representation of the String using String.getBytes("UTF-8"); again, see the javadoc.
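
As a rough sketch of that route, the routines from the question could be rewritten around an explicit charset on both sides. The class name and roundTrip helper are invented here, and StandardCharsets.UTF_8 stands in for the "UTF-8" string name; note that the length prefix now counts bytes rather than chars.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, for illustration only.
class Utf8LengthPrefixed {

    // Encodes explicitly as UTF-8 and prefixes the byte count.
    static void write(DataOutputStream dout, String str) throws IOException {
        byte[] encoded = str.getBytes(StandardCharsets.UTF_8);
        dout.writeInt(encoded.length); // byte count, not char count
        dout.write(encoded);
    }

    // Reads the byte count, then decodes with the same charset.
    static String readString(DataInputStream din) throws IOException {
        byte[] encoded = new byte[din.readInt()];
        din.readFully(encoded); // blocks until the whole array is filled
        return new String(encoded, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: writes to an in-memory buffer and reads it back.
    static String roundTrip(String str) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer), str);
            byte[] bytes = buffer.toByteArray();
            return readString(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello").equals("hello")); // prints true
    }
}
```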
To simplify matters, you could just use an ObjectOutputStream and an ObjectInputStream, which serialize the actual String object to the stream rather than just its char[] representation.
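
For completeness, a sketch of the object-stream variant follows; again the class name and the roundTrip helper are illustrative only. ObjectOutputStream buffers its output, so it is closed (via try-with-resources) before the bytes are read back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical class name, for illustration only.
class ObjectStreamRoundTrip {

    // Serializes str with writeObject() and deserializes it with readObject().
    static String roundTrip(String str) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(str);
            } // closing flushes the buffered data into 'buffer'

            byte[] bytes = buffer.toByteArray();
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return (String) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello").equals("hello")); // prints true
    }
}
```

The trade-off is that the stream then carries Java's serialization header and type markers, so the format is only convenient when both ends are Java programs.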

It's better to think of Java as not applying any encoding to its characters in memory. Its Strings are simply sequences of raw 16-bit char values, which is the same as UTF-16. The reason the "other" methods default to the system encoding is that different platforms use different default encodings. For example, it wouldn't make sense to write UTF-8, which is ASCII-compatible, to a mainframe that uses EBCDIC.
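
A short sketch of why the charset matters: the same one-char String encodes to different bytes under different charsets. The class and method names are made up for the example, and only charsets from StandardCharsets are used, since EBCDIC charsets are not guaranteed to be present on every Java runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class name, for illustration only.
class CharsetBytes {

    // Encodes str with the given charset and renders the bytes as text.
    static String encode(String str, Charset charset) {
        return Arrays.toString(str.getBytes(charset));
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE, a single char

        System.out.println(encode(s, StandardCharsets.UTF_8));      // prints [-61, -87]
        System.out.println(encode(s, StandardCharsets.ISO_8859_1)); // prints [-23]
        System.out.println(encode(s, StandardCharsets.UTF_16BE));   // prints [0, -23]

        // The no-argument getBytes() and String(byte[]) use this charset,
        // which varies by platform and configuration.
        System.out.println(Charset.defaultCharset());
    }
}
```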
