Fixing incorrectly ISO-8859-1 decoded UTF-8 string in Java
I have to deal with a library which is not in my control. It delivers a string which it decoded from a byte stream with ISO-8859-1. However, the byte stream is UTF-8. So obviously the resulting string I get is wrong if it contains non-ASCII characters.
So what I do to fix this is to convert the string back to the byte stream and decode it again with UTF-8. Like this:
byte[] raw = inputText.getBytes(StandardCharsets.ISO_8859_1);
String correctedText = new String(raw, StandardCharsets.UTF_8);
I tested it with many examples and it seems to work. Is this always correct, however, or are there cases where this would not work? In other words: are there cases where decoding and re-encoding an arbitrary byte array with ISO-8859-1 would not result in the original byte array?
Since ISO-8859-1 is a 1 byte per character encoding, it will always work: in Java, each of the 256 possible byte values is decoded to exactly one character (U+0000 to U+00FF). The UTF-8 bytes are converted to incorrect characters, but luckily there's no information lost.
Changing the characters back to bytes using ISO-8859-1 encoding gives you the original byte array, containing characters encoded in UTF-8, so you can then safely reinterpret it with the correct encoding.
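The round trip described above can be checked with a minimal sketch like the following. The class name `Latin1RoundTrip` and the helper methods `roundTrip`, `allByteValues`, and `fix` are made up for illustration; only `String.getBytes(Charset)`, `new String(byte[], Charset)`, and `StandardCharsets` come from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Decode with ISO-8859-1, then encode again with ISO-8859-1.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Every possible byte value, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // The fix from the question: undo the wrong decoding, then decode as UTF-8.
    static String fix(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        // true: no byte value is changed by the ISO-8859-1 round trip
        System.out.println(Arrays.equals(all, roundTrip(all)));

        // "\u00C3\u00A9" is what the UTF-8 bytes C3 A9 look like when decoded
        // as ISO-8859-1; the fix recovers U+00E9 (e with acute accent).
        System.out.println(fix("\u00C3\u00A9").equals("\u00E9"));
    }
}
```

Because all 256 byte values survive, any byte array survives, whatever encoding it was originally written in.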
The opposite of this is not (always¹) true, as UTF-8 is a multibyte encoding. The decoding process may encounter invalid byte sequences and replace them with the replacement character U+FFFD (�). At that point you've lost information and can't get the original bytes back anymore.
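As a rough illustration of that loss, the sketch below decodes a single byte that is not valid UTF-8 on its own and encodes the result again. The class name `Utf8RoundTripLoss` and the method `roundTrip` are hypothetical; the behavior relied on is that `new String(byte[], Charset)` substitutes the charset's default replacement for malformed input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8RoundTripLoss {
    // Decode with UTF-8, then encode again with UTF-8.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE9 is "é" in ISO-8859-1, but in UTF-8 it starts a three-byte
        // sequence, so on its own it is malformed input.
        byte[] invalid = {(byte) 0xE9};

        String decoded = new String(invalid, StandardCharsets.UTF_8);
        // true: the malformed byte became the replacement character
        System.out.println(decoded.equals("\uFFFD"));

        // The original byte is gone; what comes back is the UTF-8 encoding
        // of U+FFFD (EF BF BD), printed as [-17, -65, -67].
        System.out.println(Arrays.toString(roundTrip(invalid)));
    }
}
```

If malformed input should be detected rather than silently replaced, a `CharsetDecoder` configured with `CodingErrorAction.REPORT` throws an exception instead.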
¹ If you stick to characters in the 0-127 range it will work, as they're encoded in UTF-8 using a single byte.
UTF-8 and ISO-8859-1 encode ASCII characters in the same way. Given this, you should not have any losses as long as your original input is ASCII.