
Java, UTF-8, international characters and byte interpretation

I have a String that gets input to my program.

4 letters: A, O, "E with an umlaut", L

The UTF-8 hex code for "E with an umlaut" is 0xc38b (see "UTF-8 encoding table and Unicode characters" and look for "LATIN CAPITAL LETTER E WITH DIAERESIS").

And then it gets weird.

My Java code is not printing "E with an umlaut" but "A with a ~" followed by 0x8b.

When I convert the string to a byte array and then print it out as hex, my 4-character string becomes 7 bytes:

byte[0]=41 "A"
byte[1]=4f "O"
byte[2]=c3 c383 is "A with a ~" (per above link)
byte[3]=83
byte[4]=c2 c28b is some kind of control character (per above link)
byte[5]=8b
byte[6]=4c "L"

I have verified my encoding is UTF-8 via Charset.defaultCharset().

It almost looks like it's interpreting the bytes incorrectly, but how is that possible?

Can anyone shed any light on why the byte interpretation of this string is getting screwed up and how I can correct it?

Somewhere along the line, your input is encoded with UTF-8, then decoded with ISO 8859-1 (or a similar single-byte encoding). At this point the string is corrupted.

Encoding "Ë" with UTF-8 results in the bytes [ 0xC3 0x8B ]. Decoding this with ISO 8859-1 produces the corrupt string "\u00C3\u008B" ("Ã" followed by the control character U+008B). Re-encoding that string with UTF-8 produces the byte sequence from the original question, [ 0xC3 0x83 0xC2 0x8B ].
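The chain described above can be reproduced in a few lines. This is only a sketch: the class name MojibakeDemo and the helpers hex and corrupt are made up for illustration, and the input "AOËL" is assumed from the question.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Formats bytes as space-separated lowercase hex, e.g. "c3 8b".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes with UTF-8, then (wrongly) decodes the same bytes as ISO 8859-1.
    static String corrupt(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "AO\u00CBL"; // A, O, E with diaeresis, L
        String corrupted = corrupt(original);
        // prints: 41 4f c3 8b 4c
        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8)));
        // prints: 41 4f c3 83 c2 8b 4c
        System.out.println(hex(corrupted.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The second line of output matches the 7 bytes listed in the question, which suggests the string was already corrupted before it was converted to bytes.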

Determine where ISO 8859-1 is erroneously used to decode UTF-8 data, and specify UTF-8 instead.
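As a rough illustration of "specify UTF-8 instead", the sketch below decodes bytes with an explicitly named charset. It also shows a stopgap that reverses a single UTF-8 to ISO 8859-1 mis-decoding. The class and method names (Utf8Repair, decodeUtf8, repair) are hypothetical, and the stopgap assumes the wrong decoder really was ISO 8859-1; a Windows code page would map some bytes differently and could not be reversed this way.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Repair {
    // Preferred fix: decode UTF-8 bytes with an explicitly named charset.
    static String decodeUtf8(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Stopgap: undo one UTF-8 -> ISO 8859-1 mis-decoding of a string.
    // ISO 8859-1 maps each char U+0000..U+00FF back to the same byte value,
    // so the original UTF-8 bytes are recovered and can be decoded properly.
    static String repair(String corrupted) {
        byte[] originalBytes = corrupted.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wire = {0x41, 0x4F, (byte) 0xC3, (byte) 0x8B, 0x4C};
        System.out.println(decodeUtf8(wire).length());             // prints: 4
        System.out.println(repair("AO\u00C3\u008BL").length());    // prints: 4
    }
}
```

Repairing after the fact only hides the problem; fixing the decoding step at its source remains the better option.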

This is a common problem when decoding web requests or responses. Standards specify ISO 8859-1 as the character encoding unless explicitly overridden, so frameworks fall back to this as a default.

Yes, everything is correct. Unicode characters above U+007F, that is, those outside 7-bit ASCII, are encoded with multiple bytes in UTF-8, like the (Dutch) Ë, U+00CB, whose UTF-8 bytes are 0xC3 0x8B. Every byte of that sequence has its high bit set. In other character sets, such as some single-byte Windows character sets, those bytes show up as two or more weird characters.

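import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// U+00EB is the lowercase e with diaeresis; the leading U+FEFF is the BOM.
String s = "Zee\u00EBn van tijd in Belgi\u00EB\r\n";
Path path = Paths.get("C:/temp/test.txt");
Files.write(path, ("\uFEFF" + s).getBytes(StandardCharsets.UTF_8));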

The above writes a text file with a BOM character (U+FEFF, zero-width no-break space) at the beginning. This is an ugly redundancy, but it helps Windows Notepad recognize the file as UTF-8.


Clarification

0xC38B is the UTF-8 byte sequence, not the code point. The Unicode character U+C38B, in Java the char '\uC38B', is actually 쎋. That is converted to 3 bytes in UTF-8 (0xEC 0x8E 0x8B).

Ë actually is U+00CB, or '\u00CB' in Java. Its byte representation in UTF-8 is as follows:

String s = new String(new byte[]{ (byte)0xC3, (byte)0x8B}, 0, 2, StandardCharsets.UTF_8);
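To make the difference between a code point and its UTF-8 bytes concrete, here is a small sketch that encodes a code point and prints the resulting bytes as hex. CodePointVsBytes and utf8Hex are illustrative names only.

```java
import java.nio.charset.StandardCharsets;

final class CodePointVsBytes {
    // Returns the UTF-8 encoding of one code point as space-separated hex.
    static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0x00CB)); // prints: c3 8b
        System.out.println(utf8Hex(0xC38B)); // prints: ec 8e 8b
    }
}
```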

UTF-8 does something quite different from simply splitting the (sequential) Unicode number of a character into bytes, and that serves several purposes: each byte is recognizable as part of a multibyte sequence (a start byte or a continuation byte), and normal ASCII such as / can never be part of such a byte sequence. So normal ASCII is safe.
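The start and continuation bytes can be told apart from their top bits alone. The following sketch classifies each byte of a UTF-8 encoded string; Utf8ByteKinds, kind and describe are hypothetical names, and the check is simplified in that it does not reject byte values that are never valid in UTF-8 (such as 0xFF).

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByteKinds {
    // Classifies a single byte by its high bits.
    static String kind(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return "ascii";                    // 0xxxxxxx
        if ((v & 0xC0) == 0x80) return "continuation";   // 10xxxxxx
        return "start";                                  // 11xxxxxx (simplified)
    }

    // Describes every byte of the UTF-8 encoding of s, space-separated.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(kind(b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // E with diaeresis followed by a slash
        System.out.println(describe("\u00CB/")); // prints: start continuation ascii
    }
}
```

Because an ASCII byte never has its high bit set, a byte-level search for / cannot accidentally match inside a multibyte character.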
