
Fixing incorrectly ISO-8859-1 decoded UTF-8 string in Java

I have to deal with a library which is outside my control. It delivers a string which it decoded from a byte stream with ISO-8859-1. However, the byte stream is actually UTF-8, so the resulting string I get is wrong whenever it contains non-ASCII characters.

To fix this, I convert the string back to the byte stream and decode it again with UTF-8. Like this:

byte[] raw = inputText.getBytes(StandardCharsets.ISO_8859_1);
String correctedText = new String(raw, StandardCharsets.UTF_8);

I tested it with many examples and it seems to work. Is this always correct, however, or are there cases where it would not work? In other words: are there cases where decoding and re-encoding an arbitrary byte array with ISO-8859-1 would not result in the original byte array?

Since ISO-8859-1 is a one-byte-per-character encoding, it will always work: in Java's ISO_8859_1 charset, each byte value from 0x00 to 0xFF maps to exactly one character (U+0000 to U+00FF) and back. The UTF-8 bytes are converted to incorrect characters, but luckily no information is lost.

Changing the characters back to bytes using ISO-8859-1 encoding gives you the original byte array, containing characters encoded in UTF-8, so you can then safely reinterpret it with the correct encoding.
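The claim above can be checked directly. The following is a minimal sketch, not part of the original answer: the class name `Latin1RoundTrip` and the helper names `repair` and `allBytesRoundTrip` are made up for illustration. It pushes all 256 byte values through an ISO-8859-1 decode and re-encode, then simulates the library's mistake on a sample string and repairs it with the approach from the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {

    /** Re-decodes a string that was wrongly decoded as ISO-8859-1 from UTF-8 bytes. */
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Returns true if every possible byte value survives decode + re-encode with ISO-8859-1. */
    static boolean allBytesRoundTrip() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        byte[] reencoded = decoded.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(all, reencoded);
    }

    public static void main(String[] args) {
        System.out.println("all bytes round-trip: " + allBytesRoundTrip());

        // Simulate the library: UTF-8 bytes decoded with the wrong charset.
        // Unicode escapes keep this source file pure ASCII.
        String original = "Gr\u00fc\u00dfe, \u4e16\u754c";
        String misdecoded = new String(
                original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

        System.out.println("repaired equals original: " + repair(misdecoded).equals(original));
    }
}
```

This only holds if the library really used ISO-8859-1. If the string contains any character above U+00FF, `getBytes(StandardCharsets.ISO_8859_1)` cannot represent it and substitutes a `?` byte, so the original bytes are not recovered.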

The opposite is not (always¹) true, as UTF-8 is a multibyte encoding. Decoding arbitrary bytes as UTF-8 may encounter invalid byte sequences, which are replaced with the replacement character (U+FFFD, �). At that point you've lost information and can't get the original bytes back anymore.

¹ If you stick to characters in the 0-127 range it will work, as they're encoded in UTF-8 using a single byte.
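To illustrate the reverse direction, here is a small sketch (the class name `Utf8RoundTrip` and the helper `roundTripUtf8` are hypothetical names chosen for this example). It decodes ISO-8859-1 bytes as UTF-8 and encodes them again: the lone byte 0xE9 is not a valid UTF-8 sequence, so `new String(bytes, UTF_8)` substitutes U+FFFD and the original byte is gone. Pure ASCII input, as the footnote says, survives.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {

    /** Decodes the bytes as UTF-8 and encodes the resulting string as UTF-8 again. */
    static byte[] roundTripUtf8(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is 'e' with acute accent in ISO-8859-1
        // but an incomplete (invalid) sequence in UTF-8.
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };

        String decoded = new String(latin1, StandardCharsets.UTF_8);
        System.out.println("contains U+FFFD: " + (decoded.indexOf('\uFFFD') >= 0));
        System.out.println("bytes preserved: " + Arrays.equals(latin1, roundTripUtf8(latin1)));

        // Bytes in the 0-127 range are valid single-byte UTF-8 and survive.
        byte[] ascii = "plain".getBytes(StandardCharsets.US_ASCII);
        System.out.println("ASCII preserved: " + Arrays.equals(ascii, roundTripUtf8(ascii)));
    }
}
```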

UTF-8 and ISO-8859-1 encode ASCII characters in the same way. Given this, you should not have any losses as long as your original input is ASCII.
