Java UTF-8 differences

The JavaDoc says "The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls."

But what does this even mean? What's an embedded null in this context? I am trying to convert from a Java saved UTF-8 string to "real" UTF-8.

In C, a string is terminated by the byte value 00.

The thing here is that you can have 0-chars in Java strings, but to avoid confusion when passing the string over to C (the language native methods are typically written in), the character is encoded in another way, namely as the two bytes

11000000 10000000

(according to the JavaDoc), neither of which is actually 00.

This is a hack to work around something you cannot change easily.

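Also note that this is not strictly valid UTF-8: 0xC0 0x80 is an overlong encoding, which the UTF-8 standard forbids. Lenient decoders, including Java's own readUTF, still decode it to 00, but a strict decoder will reject it.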

No "embedded nulls" means that the encoded string data does not contain a single 0x00 (NUL) byte.

\u0000 gets encoded to (binary) 11000000 10000000, (hex) 0xC080.
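To see the difference in practice, here is a minimal sketch that encodes the same string with DataOutputStream.writeUTF and with the standard UTF-8 charset, then prints both as hex. The class name NulEncodingDemo and the helpers modifiedUtf8 and hex are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class NulEncodingDemo {

    // Returns exactly what DataOutputStream.writeUTF emits:
    // a 2-byte length prefix followed by the modified UTF-8 bytes.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\u0000b";
        // writeUTF output: prints 00 04 61 C0 80 62
        System.out.println(hex(modifiedUtf8(s)));
        // Standard UTF-8: prints 61 00 62
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Note that the first two bytes (00 04) are the length prefix, not string data, so the "no embedded nulls" guarantee covers only the bytes after it.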

That's not a Java-wide difference; it applies only to DataInputStream/DataOutputStream. If the string data was written using DataOutputStream, then just read it in using DataInputStream.

If you need to write the string data to, say, a file, don't use DataOutputStream; use a Writer, which is meant for character streams.

This applies only to the writeUTF method of DataOutputStream, not to normal converted streams (OutputStreamWriter or such).
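As a rough illustration of the Writer route, the sketch below pushes a string through an OutputStreamWriter with an explicit UTF-8 charset. It writes to an in-memory buffer rather than a file so the bytes can be inspected; the class name WriterUtf8Demo and the helper standardUtf8 are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class WriterUtf8Demo {

    // Encodes a string through a Writer configured for standard UTF-8.
    static byte[] standardUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the writer flushes any pending bytes into the buffer.
        try (Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            writer.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // NUL is written as a single 0 byte, with no length prefix.
        System.out.println(Arrays.toString(standardUtf8("a\u0000b"))); // prints [97, 0, 98]
    }
}
```

Passing the charset explicitly avoids depending on the platform default encoding.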

It means that if you have a string "\u0000", it will be encoded as 0xC0 0x80 instead of simply 0x00.

And the other way around, the sequence 0xC0 0x80, which will never occur in normal UTF-8 strings, represents a NUL character.

Also, the documentation you linked seems to date from the time when Unicode was still a 16-bit character set. Nowadays it also allows characters above 0xFFFF, each of which is represented by two Java char values (a surrogate pair, in UTF-16) and needs 4 bytes in UTF-8, if I calculated right. I'm not sure about the implementation here, though: it looks like these are simply written in CESU-8 format (i.e. two 3-byte sequences, each corresponding to a UTF-16 surrogate, which together give one Unicode character). You will have to take care of this, too.
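The surrogate-pair behaviour can be checked with a small experiment. The sketch below encodes U+1F600 (an arbitrary character above 0xFFFF, chosen only as an example) both ways and compares the byte counts; SupplementaryDemo and writeUtfBytes are made-up names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class SupplementaryDemo {

    // Returns the raw writeUTF output, including the 2-byte length prefix.
    static byte[] writeUtfBytes(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // U+1F600 as a UTF-16 surrogate pair (two Java chars).
        String face = new String(Character.toChars(0x1F600));

        // writeUTF encodes each surrogate separately: 3 + 3 bytes after the prefix.
        System.out.println(writeUtfBytes(face).length - 2); // prints 6

        // Standard UTF-8 encodes the code point as one 4-byte sequence.
        System.out.println(face.getBytes(StandardCharsets.UTF_8).length); // prints 4
    }
}
```

So a converter that only rewrites 0xC0 0x80 would still leave these 6-byte sequences behind, which strict UTF-8 decoders do not accept.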

If you are using Java, the simplest thing would be to use DataInputStream to read this into a string, and then convert it (with getBytes("UTF-8") or an OutputStreamWriter) to real UTF-8 data.
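A minimal sketch of that conversion, assuming the input is exactly one record as produced by writeUTF (length prefix included). The class name ModifiedUtf8Converter and the method toStandardUtf8 are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class; the names are illustrative only.
final class ModifiedUtf8Converter {

    // Decodes one writeUTF record and re-encodes the string as standard UTF-8.
    static byte[] toStandardUtf8(byte[] writeUtfRecord) {
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(writeUtfRecord))) {
            String decoded = in.readUTF();
            return decoded.getBytes(StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Length prefix 00 04, then 'a', the two-byte NUL form C0 80, then 'b'.
        byte[] record = {0x00, 0x04, 0x61, (byte) 0xC0, (byte) 0x80, 0x62};
        System.out.println(Arrays.toString(toStandardUtf8(record))); // prints [97, 0, 98]
    }
}
```

Going through a String also takes care of surrogate pairs, since getBytes re-encodes each pair as a single 4-byte sequence.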

If you are having difficulty reading a "saved" Java string, you need to look at the specification for the methods that read/write in that format:

  • If the string was written using DataOutput.writeUTF(), the DataInput.readUTF() javadoc is a definitive spec. In addition to the non-standard handling of NUL, it specifies that the string starts with an unsigned 16-bit byte count.

  • If the string was written using ObjectOutputStream.writeObject(), then the serialization spec is definitive.
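For the first case, the length prefix can be inspected directly. The following is a sketch only, with made-up names (LengthPrefixDemo, writeUtfBytes, declaredLength), that writes a string with writeUTF and reads back the leading unsigned 16-bit count.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; the names are illustrative only.
final class LengthPrefixDemo {

    // Returns the raw writeUTF output for a string.
    static byte[] writeUtfBytes(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the unsigned 16-bit byte count at the start of a record.
    static int declaredLength(byte[] record) {
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(record))) {
            return in.readUnsignedShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Five characters, one of which (U+00E9) needs two bytes.
        byte[] record = writeUtfBytes("h\u00E9llo");
        System.out.println(declaredLength(record)); // prints 6
        System.out.println(record.length);          // prints 8
    }
}
```

The count is in encoded bytes, not characters, which is why a single writeUTF call cannot store more than 65535 encoded bytes.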
