Java UTF-8 differences
The JavaDoc says "The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls."
But what does this even mean? What's an embedded null in this context?
I am trying to convert from a Java-saved UTF-8 string to "real" UTF-8.
In C, a string is terminated by a byte with the value 00.
Java strings may contain the character U+0000, so to avoid confusion when passing a string over to C (which native methods are typically written in), that character is encoded differently, namely as the two bytes 11000000 10000000 (according to the javadoc), neither of which is actually 00.
This is a hack to work around something you cannot change easily.
Also note that this is an overlong encoding: a lenient decoder turns it back into U+0000, but standard UTF-8 forbids overlong forms, so a strict decoder will reject it.
No "embedded nulls" means that the raw data does not contain a single 0x00 (NUL) byte.
'\u0000' gets encoded to (binary) 11000000 10000000, (hex) 0xC0 0x80.
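To make the difference visible, here is a minimal sketch that encodes the same string both ways and prints the bytes. The class and helper names (`EmbeddedNullDemo`, `modifiedUtf8`, `hex`) are made up for illustration. Note that the helper strips the 2-byte length prefix that `writeUTF` puts in front, and that prefix may itself contain 0x00 bytes; the "no embedded nulls" guarantee is about the encoded characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EmbeddedNullDemo {
    // Encode with DataOutputStream.writeUTF and drop the 2-byte length prefix.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String s = "a\u0000b";
        // Standard UTF-8 keeps the NUL as a single zero byte.
        System.out.println("standard: " + hex(s.getBytes(StandardCharsets.UTF_8))); // 61 00 62
        // writeUTF replaces it with the two-byte form.
        System.out.println("writeUTF: " + hex(modifiedUtf8(s)));                    // 61 C0 80 62
    }
}
```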
That's not a Java-wide difference; it applies only to DataInputStream/DataOutputStream. If the string data was written using DataOutputStream, then just read it in using DataInputStream.
If you need to write the string data to, say, a file, don't use DataOutputStream; use a Writer, which is meant for character streams.
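A rough sketch of the Writer route, assuming you want standard UTF-8 on disk. The class name `WriterDemo` and the helper `writeStandardUtf8` are hypothetical, and a `ByteArrayOutputStream` stands in for the file so the result can be inspected; with a real file you would wrap a `FileOutputStream` instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterDemo {
    // Write the string through a Writer with an explicit charset.
    // No length prefix is added and U+0000 stays a single 0x00 byte.
    static byte[] writeStandardUtf8(String s) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(s);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = writeStandardUtf8("a\u0000b");
        System.out.println(bytes.length); // 3
    }
}
```

Passing the charset explicitly avoids depending on the platform default encoding.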
This applies only to the writeUTF method of DataOutputStream, not to normal converting streams (OutputStreamWriter and the like).
It means that if you have a string "\u0000", it will be encoded as 0xC0 0x80 instead of simply 0x00.
And the other way around, the sequence 0xC0 0x80, which will never occur in normal UTF-8 strings, represents a NUL character.
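The decoding direction can be checked with a small experiment. This sketch feeds the two-byte NUL form to `readUTF` and to the standard UTF-8 decoder. The class name `OverlongNulDemo` and its helpers are invented for the example, and the second result assumes a reasonably current JDK, whose standard decoder treats the overlong form as malformed input.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class OverlongNulDemo {
    // Length prefix 0x0002 followed by the two-byte NUL form C0 80.
    static final byte[] RECORD = {0x00, 0x02, (byte) 0xC0, (byte) 0x80};

    // Decode the way the data was written: readUTF understands C0 80.
    static String viaReadUTF() throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(RECORD))) {
            return in.readUTF();
        }
    }

    // Decode only the payload with the standard UTF-8 charset.
    static String viaStandardDecoder() {
        return new String(RECORD, 2, 2, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(viaReadUTF().equals("\u0000"));         // true
        System.out.println(viaStandardDecoder().equals("\u0000")); // false on current JDKs
    }
}
```

So treating writeUTF output as ordinary UTF-8 is not safe once a NUL character is involved.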
Also, the documentation you linked seems to date from the time when Unicode was still a 16-bit character set. Nowadays it also allows characters above 0xFFFF, each of which is represented by two Java char values (a surrogate pair in UTF-16) and needs 4 bytes in UTF-8, if I calculated right.
I'm not sure about the implementation here, though. It looks like these are simply written in CESU-8 format (i.e. two 3-byte sequences, each corresponding to a UTF-16 surrogate, which together give one Unicode character). You will have to take care of this, too.
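The surrogate behaviour is easy to measure. Here is a minimal sketch using U+1F600 as an arbitrary character above 0xFFFF; the class name `SurrogateDemo` and the helper `writeUtfPayloadLength` are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // Number of bytes writeUTF emits for the string, excluding the 2-byte length prefix.
    static int writeUtfPayloadLength(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.size() - 2;
    }

    public static void main(String[] args) throws IOException {
        // One code point above 0xFFFF, stored as a surrogate pair.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());                                // 2 chars
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 bytes
        System.out.println(writeUtfPayloadLength(s));                  // 6 bytes (two 3-byte sequences)
    }
}
```

The six bytes are ED A0 BD ED B8 80, i.e. each surrogate encoded on its own, where standard UTF-8 would produce F0 9F 98 80.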
If you are using Java, the simplest thing would be to use DataInputStream to read this into a string, and then convert it (with getBytes("UTF-8") or an OutputStreamWriter) to real UTF-8 data.
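A sketch of that conversion, assuming the input holds exactly one string written by `writeUTF`. The class name `ModifiedUtf8Converter` and the method `toStandardUtf8` are hypothetical. The sample record contains 'a', a NUL in its two-byte form, and U+1F600 as two encoded surrogates.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Converter {
    // Read one writeUTF-style record and re-encode it as standard UTF-8.
    static byte[] toStandardUtf8(InputStream source) throws IOException {
        DataInputStream in = new DataInputStream(source);
        String s = in.readUTF(); // handles the length prefix, C0 80 and surrogates
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] record = {
            0x00, 0x09,                             // byte count of the payload
            0x61,                                   // 'a'
            (byte) 0xC0, (byte) 0x80,               // U+0000 in two-byte form
            (byte) 0xED, (byte) 0xA0, (byte) 0xBD,  // high surrogate D83D
            (byte) 0xED, (byte) 0xB8, (byte) 0x80   // low surrogate DE00
        };
        byte[] utf8 = toStandardUtf8(new ByteArrayInputStream(record));
        System.out.println(utf8.length); // 6: 61 00 F0 9F 98 80
    }
}
```

Using `StandardCharsets.UTF_8` instead of the string "UTF-8" avoids the checked `UnsupportedEncodingException`.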
If you are having difficulty reading a "saved" Java string, you need to look at the specification for the methods that read/write in that format:
If the string was written using DataOutput.writeUTF(), the DataInput.readUTF() javadoc is a definitive spec. In addition to the non-standard handling of NUL, it specifies that the string starts with an unsigned 16-bit byte count.
If the string was written using ObjectOutputStream.writeObject(), then the serialization spec is definitive.
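For the writeUTF case, the record layout can be inspected by hand. This is only a sketch: the class name `WriteUtfLayoutDemo` and the helpers are invented, and it covers just the length prefix, not the serialization format used by `writeObject`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfLayoutDemo {
    // Produce the raw bytes writeUTF emits for a string.
    static byte[] record(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    // Split off the payload using the unsigned 16-bit big-endian byte count.
    static byte[] payloadOf(byte[] record) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            int byteCount = in.readUnsignedShort(); // counts encoded bytes, not chars
            byte[] payload = new byte[byteCount];
            in.readFully(payload);
            return payload;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] rec = record("h\u00E9llo"); // 5 chars, one of them needs 2 bytes
        System.out.println(rec.length);            // 8
        System.out.println(payloadOf(rec).length); // 6
    }
}
```

Because the count is 16 bits wide, a single writeUTF call cannot hold more than 65535 encoded bytes.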