Fixing a UTF-8 string that was wrongly decoded as ISO-8859-1 in Java

Question

I have to deal with a library that is not under my control. It delivers a string that it decoded from a byte stream using ISO-8859-1. However, the byte stream is actually UTF-8, so the resulting string is wrong whenever it contains non-ASCII characters.

To fix this, I convert the string back to bytes and decode them again as UTF-8, like this:

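import java.nio.charset.StandardCharsets;

// Recover the original bytes, then decode them with the correct charset.
byte[] raw = inputText.getBytes(StandardCharsets.ISO_8859_1);
String correctedText = new String(raw, StandardCharsets.UTF_8);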

I tested it with many examples and it seems to work. Is it always correct, though, or are there cases where it would not work? In other words: are there cases where decoding an arbitrary byte array with ISO-8859-1 and re-encoding it would not yield the original byte array?

Answer 1

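Since ISO-8859-1 is a one-byte-per-character encoding, it will always work. Java's ISO-8859-1 charset maps each of the 256 possible byte values (0x00 to 0xFF) to the character with the same code point (U+0000 to U+00FF), so every byte has exactly one character and vice versa. The UTF-8 bytes are converted to incorrect characters, but luckily no information is lost.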

Changing the characters back to bytes using ISO-8859-1 gives you the original byte array, which contains the characters encoded in UTF-8, so you can then safely reinterpret it with the correct encoding.
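A minimal sketch of that claim, in case it helps to see it run. The class and method names (Latin1RoundTrip, roundTrip, repair) are made up for illustration and are not part of the question or of any library; only the java.nio.charset and java.util calls are real. It pushes all 256 byte values through an ISO-8859-1 decode and encode, then simulates the library's wrong decoding and applies the fix from the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative only.
class Latin1RoundTrip {
    // Decode bytes as ISO-8859-1, then encode the resulting string again.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    // The fix from the question: recover the bytes, decode them as UTF-8.
    static String repair(String wronglyDecoded) {
        byte[] raw = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Every possible byte value, 0x00 through 0xFF.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        System.out.println(Arrays.equals(all, roundTrip(all))); // true

        // Simulate the library: UTF-8 bytes wrongly decoded as ISO-8859-1.
        String original = "Gr\u00fc\u00dfe, \u4e16\u754c";
        String garbled = new String(
                original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(original.equals(repair(garbled))); // true
    }
}
```

Because the check covers every byte value, it does not depend on which UTF-8 sequences the input happens to contain.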

The opposite is not (always¹) true, as UTF-8 is a multibyte encoding. When bytes that are not valid UTF-8 are decoded as UTF-8, the decoder may encounter invalid byte sequences and replace them with the replacement character U+FFFD (often displayed as ?). At that point you've lost information and can't get the original bytes back anymore.

¹ If you stick to characters in the 0-127 range it will work, as they're encoded in UTF-8 using a single byte.
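To illustrate the lossy direction, here is a small sketch under the assumption that the input bytes are ISO-8859-1 text that gets wrongly decoded as UTF-8. The class name Utf8RoundTrip and its methods are hypothetical and chosen only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative only.
class Utf8RoundTrip {
    // Decode bytes as UTF-8, then encode the resulting string again.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // True if decoding the bytes as UTF-8 produced a replacement character.
    static boolean decodesWithReplacement(byte[] input) {
        return new String(input, StandardCharsets.UTF_8).indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1: the final byte 0xE9 is not valid UTF-8 on its own.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodesWithReplacement(latin1));            // true
        System.out.println(Arrays.equals(latin1, roundTrip(latin1)));  // false

        // Pure ASCII input survives, as the footnote says.
        byte[] ascii = "cafe".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(ascii, roundTrip(ascii)));    // true
    }
}
```

Once the invalid byte has been replaced by U+FFFD, re-encoding yields the bytes of U+FFFD rather than the original 0xE9, which is why this direction cannot be repaired after the fact.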

Answer 2

UTF-8 and ISO-8859-1 encode ASCII characters in the same way. Given this, you will have no losses only if your original input is ASCII.

Answer 1: score 3, accepted, 2018-02-01 16:59:47
Answer 2: score -3, 2018-02-01 17:07:25