Decoding double-encoded UTF-8 characters in Java

Question

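I'm parsing a websocket message, and because of a bug in a specific socket.io version (unfortunately I have no control over the server side), some of the payload is double-encoded as UTF-8:

The correct value would be Wrocławskiej (note the letter ł, LATIN SMALL LETTER L WITH STROKE), but what I actually get is WrocÅawskiej.

I have already tried decoding/encoding it again in Java: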

String str = new String(wrongEncoded.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);

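Unfortunately, the string stays the same. Any ideas on how to do the double decoding in Java? I saw a Python version where they first convert it to raw_unicode and then parse it again, but I don't know whether that works or whether there is a similar solution for Java. I have already read several posts on the topic, but none of them helped.

Edit: to clarify, in Fiddler I receive the following byte sequence for the word mentioned above: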

WrocÃÂawskiej

byte[] arrOutput = { 0x57, 0x72, 0x6F, 0x63, (byte) 0xC3, (byte) 0x85, (byte) 0xC2, (byte) 0x82, 0x61, 0x77, 0x73, 0x6B, 0x69, 0x65, 0x6A };

Answer 1

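Your text was encoded to UTF-8, then those bytes were interpreted as ISO-8859-1 and re-encoded to UTF-8.

Wrocławskiej in Unicode is: 0057 0072 006f 0063 0142 0061 0077 0073 006b 0069 0065 006a
Encoded as UTF-8 that is: 57 72 6f 63 c5 82 61 77 73 6b 69 65 6a

In ISO-8859-1, c5 is Å and 82 is undefined (Java maps it to the non-printing control character U+0082).
Read as ISO-8859-1, those bytes are: WrocÅawskiej
Encoded as UTF-8 that is: 57 72 6f 63 c3 85 c2 82 61 77 73 6b 69 65 6a
These are likely the bytes you are receiving.

So, to undo that, you need: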

String s = new String(bytes, StandardCharsets.UTF_8);

// fix "double encoding"
s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
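
As a check on the analysis above, here is a minimal sketch that reproduces the mangling from the original word and then applies the fix to the byte sequence from the question. The class name DoubleEncodingDemo and the helper names doubleEncode and fix are made up for illustration; only the two-step decode comes from this answer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DoubleEncodingDemo {

    // Bytes captured in Fiddler, as listed in the question.
    static final byte[] RECEIVED = {
        0x57, 0x72, 0x6F, 0x63, (byte) 0xC3, (byte) 0x85, (byte) 0xC2, (byte) 0x82,
        0x61, 0x77, 0x73, 0x6B, 0x69, 0x65, 0x6A
    };

    // Reproduces the mistake: UTF-8 bytes misread as ISO-8859-1, then encoded as UTF-8 again.
    static byte[] doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Undoes it: decode as UTF-8, map each char back to one byte, decode as UTF-8 again.
    static String fix(byte[] received) {
        String s = new String(received, StandardCharsets.UTF_8);
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Wroc\u0142awskiej"; // \u0142 is the letter ł
        System.out.println(Arrays.equals(doubleEncode(original), RECEIVED)); // true
        System.out.println(fix(RECEIVED).equals(original)); // true
    }
}
```

The attempt in the question changes nothing because encoding a String to UTF-8 and decoding the same bytes as UTF-8 is an identity operation; the intermediate ISO-8859-1 step is what removes one layer of encoding.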

Answer 2

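The problem I had was that sometimes I received double-encoded strings and sometimes correctly encoded ones. The following method, fixDoubleUTF8Encoding, handles both correctly: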

public static void main(String[] args) {
  String input = "werewrÃ¤Ã¼Ã¨Ã¶";
  String result = fixDoubleUTF8Encoding(input);
  System.out.println(result); // werewräüèö
  
  input = "üäöé";
  result = fixDoubleUTF8Encoding(input);
  System.out.println(result); // üäöé
}

private static String fixDoubleUTF8Encoding(String s) {
  // encode the string as UTF-8 bytes
  byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
  // now check if the bytes contain 0x83 0xC2, meaning double encoded garbage
  if(isDoubleEncoded(bytes)) {
    // if so, fix the string by treating its chars as ISO-8859-1 bytes and decoding those as UTF-8 once
    s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
  }
  }
  return s;
}

private static boolean isDoubleEncoded(byte[] bytes) {
  for (int i = 0; i < bytes.length; i++) {
    if(bytes[i] == -125 && i+1 < bytes.length && bytes[i+1] == -62) {
      return true;
    }
  }
  return false;
}
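
Note that the 0x83 0xC2 check only fires when the original character's first UTF-8 byte was 0xC3 (U+00C0 to U+00FF). The ł from the question double-encodes to C3 85 C2 82, which contains 0x85 0xC2 instead, so it would not be detected. A possible alternative, sketched below under my own naming (RoundTripFix and fixIfDoubleEncoded are not from this answer), is to attempt the repair with strict encoders and keep the original string if either step fails. It is still a heuristic: a correctly encoded string that happens to look like mis-decoded UTF-8 would be changed as well.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class RoundTripFix {

    // Returns the repaired string if s looks double-encoded, otherwise s unchanged.
    static String fixIfDoubleEncoded(String s) {
        try {
            // Step 1: every char must fit in ISO-8859-1 (U+0000 to U+00FF), otherwise this throws.
            ByteBuffer latin1 = StandardCharsets.ISO_8859_1.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            // Step 2: those bytes must form valid UTF-8, otherwise this throws.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(latin1)
                    .toString();
        } catch (CharacterCodingException e) {
            // Not reversible, so assume the string was encoded correctly in the first place.
            return s;
        }
    }

    public static void main(String[] args) {
        // Double-encoded input from the question (Å followed by U+0082).
        System.out.println(fixIfDoubleEncoded("Wroc\u00C5\u0082awskiej").equals("Wroc\u0142awskiej")); // true
        // Correctly encoded input "üäöé" is left alone.
        System.out.println(fixIfDoubleEncoded("\u00FC\u00E4\u00F6\u00E9").equals("\u00FC\u00E4\u00F6\u00E9")); // true
    }
}
```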

Answer 3

Well, double encoding may not be the only issue to deal with. Here is a solution that counters more than one cause:

String myString = "heartbroken ð";
// undo one layer of double encoding
myString = new String(myString.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
// resolve Java-style escapes such as \uXXXX (StringEscapeUtils is presumably the Apache Commons class)
String cleanedText = StringEscapeUtils.unescapeJava(myString);
byte[] bytes = cleanedText.getBytes(StandardCharsets.UTF_8);
String text = new String(bytes, StandardCharsets.UTF_8);
Charset charset = StandardCharsets.UTF_8;
// both coders silently drop anything they cannot convert
CharsetDecoder decoder = charset.newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
CharsetEncoder encoder = charset.newEncoder();
encoder.onMalformedInput(CodingErrorAction.IGNORE);
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
try {
    // The new ByteBuffer is ready to be read.
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(text));
    // The new CharBuffer is ready to be read.
    CharBuffer cbuf = decoder.decode(bbuf);
    String str = cbuf.toString();
} catch (CharacterCodingException e) {
    logger.error("Error Message if you want to");
}
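
To make the effect of CodingErrorAction.IGNORE in the snippet above more concrete, here is a small sketch (the class LenientDecodeDemo and the helper decodeUtf8 are illustrative names, not part of this answer) that decodes a deliberately broken UTF-8 sequence once leniently and once strictly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientDecodeDemo {

    // Decodes bytes as UTF-8, applying the given action to malformed or unmappable input.
    static String decodeUtf8(byte[] bytes, CodingErrorAction onError) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(onError)
                .onUnmappableCharacter(onError);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // 0xC3 starts a two-byte sequence, but 0x62 ('b') is not a continuation byte.
        byte[] broken = { 0x61, (byte) 0xC3, 0x62 };

        // IGNORE drops the stray lead byte and keeps the rest.
        System.out.println(decodeUtf8(broken, CodingErrorAction.IGNORE)); // ab

        // REPORT refuses the input instead of hiding the problem.
        try {
            decodeUtf8(broken, CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

Dropping bytes silently hides data loss, so IGNORE is probably best reserved for cases where losing a few characters is acceptable.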


Answer 1
11 accepted 2017-06-29 17:16:09

Answer 2
1 2020-10-26 15:01:15

Answer 3
0 2019-06-13 17:56:03
