
How to convert escape-decimal text back to Unicode in Java

A third-party library in our stack is munging strings that contain emoji and similar characters, like so:

"Ben \\240\\159\\144\\144\\240\\159\\142\\169"

That is, the escapes are decimal byte values, not hexadecimal 16-bit code units.

Surely there is an existing routine to turn this back into a proper Unicode string, but all the discussion I've found about this expects the format \u12AF, not \\123.
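
To make the format concrete, the sketch below treats the first four decimal values from the sample string as raw bytes and decodes them as UTF-8. The UTF-8 interpretation is an assumption here (it matches the sample data, and the answer below makes the same assumption); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class DecimalBytesDemo {
    // Interprets decimal byte values (0-255) as UTF-8 and returns the decoded text.
    public static String decodeUtf8(int... decimalBytes) {
        byte[] raw = new byte[decimalBytes.length];
        for (int i = 0; i < decimalBytes.length; i++) {
            raw[i] = (byte) decimalBytes[i]; // e.g. 240 becomes (byte) 0xF0
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The first escaped group in the sample: \240\159\144\144
        String decoded = decodeUtf8(240, 159, 144, 144);
        // Four bytes decode to a single code point outside the BMP.
        System.out.println(Integer.toHexString(decoded.codePointAt(0))); // 1f410
    }
}
```

The four bytes 240 159 144 144 are F0 9F 90 90 in hexadecimal, which is the UTF-8 encoding of U+1F410. A \uXXXX-style unescaper cannot help here, because it maps each escape to one 16-bit code unit rather than to one byte.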

I am not aware of any existing routine, but something simple like this should do the job (assuming the input is available as a string):

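import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public static String unEscapeDecimal(String s) {
  try {
    // Collects the raw UTF-8 bytes of the result.
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // Encodes unescaped characters as UTF-8 into the same buffer.
    Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8);
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '\\') {
        // Flush pending characters first so the byte order is preserved,
        // then write the escape as one raw byte. This assumes every
        // backslash is followed by exactly three decimal digits.
        writer.flush();
        baos.write(Integer.parseInt(s.substring(i + 1, i + 4)));
        i += 3;
      } else {
        writer.write(c);
      }
    }
    writer.flush();
    return new String(baos.toByteArray(), StandardCharsets.UTF_8);
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
}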

The writer is used only to make sure that any unescaped characters in the string with code points > 127 are encoded correctly, should they occur. If all non-ASCII characters are escaped, the byte array output stream alone should be sufficient.
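
The simpler case from the last remark could look roughly like the sketch below. It is not part of the original answer: the class name, method name, and the validation are assumptions added for illustration. It assumes every character outside an escape is ASCII and every escape is a backslash followed by exactly three decimal digits in the range 000 to 255, and it rejects anything else instead of guessing.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AsciiOnlyUnescaper {
    // Hypothetical helper: decodes \ddd escapes, assuming all other characters are ASCII.
    public static String unescapeAsciiOnly(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                if (i + 4 > s.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                // Parse exactly three decimal digits into one byte value.
                int value = 0;
                for (int k = 1; k <= 3; k++) {
                    char d = s.charAt(i + k);
                    if (d < '0' || d > '9') {
                        throw new IllegalArgumentException("Bad escape at index " + i);
                    }
                    value = value * 10 + (d - '0');
                }
                if (value > 255) {
                    throw new IllegalArgumentException("Escape out of byte range at index " + i);
                }
                out.write(value);
                i += 3; // skip the three digits just consumed
            } else if (c > 127) {
                throw new IllegalArgumentException("Unescaped non-ASCII character at index " + i);
            } else {
                out.write(c); // ASCII maps to a single identical UTF-8 byte
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // In Java source, each "\\" is one backslash in the actual string.
        String munged = "Ben \\240\\159\\144\\144\\240\\159\\142\\169";
        // Decodes to "Ben " followed by U+1F410 and U+1F3A9.
        System.out.println(unescapeAsciiOnly(munged));
    }
}
```

Failing loudly on malformed input is a design choice. If the library can also emit a literal backslash, or escapes with fewer than three digits, the parsing rules would need to be adjusted to match its actual output.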
