I am parsing a websocket message, and due to a bug in a specific socket.io version (unfortunately I don't have control over the server side), some of the payload is double encoded as UTF-8:
The correct value would be Wrocławskiej (note the letter ł, which is LATIN SMALL LETTER L WITH STROKE), but I actually get back WrocÅawskiej.
I already tried to decode/encode it again in Java:
String str = new String(wrongEncoded.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
Unfortunately the string stays the same. Any idea how to do a double decoding in Java? I saw a Python version where they convert it to raw_unicode first and then parse it again, but I don't know whether this works or whether there is a similar solution for Java. I already read through a couple of posts on that topic, but none helped.
Edit: To clarify, in Fiddler I receive the following byte sequence for the above-mentioned word:
WrocÃÂawskiej
// Values above 0x7F need an explicit (byte) cast to compile in Java.
byte[] arrOutput = { 0x57, 0x72, 0x6F, 0x63, (byte) 0xC3, (byte) 0x85, (byte) 0xC2, (byte) 0x82, 0x61, 0x77, 0x73, 0x6B, 0x69, 0x65, 0x6A };
Your text was encoded to UTF-8; those bytes were then interpreted as ISO-8859-1 and re-encoded to UTF-8.
Wrocławskiej is, as Unicode code points: 0057 0072 006f 0063 0142 0061 0077 0073 006b 0069 0065 006a
Encoded to UTF-8, it is: 57 72 6f 63 c5 82 61 77 73 6b 69 65 6a
In ISO-8859-1, c5 is Å and 82 is undefined (Java's ISO_8859_1 charset maps it to the control character U+0082, which is invisible when printed).
Read as ISO-8859-1, those bytes are: WrocÅawskiej
Encoded to UTF-8 again, that is: 57 72 6f 63 c3 85 c2 82 61 77 73 6b 69 65 6a
Those are likely the bytes you are receiving.
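The chain described above can be reproduced in a few lines. This is only a sketch to make the byte sequences visible; the class name DoubleEncodingDemo and the helpers toHex and doubleEncode are made up for illustration and are not part of the question.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    // Formats bytes as lowercase hex pairs separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Simulates the faulty step: UTF-8 bytes misread as ISO-8859-1, then encoded to UTF-8 again.
    static byte[] doubleEncode(String correct) {
        byte[] once = correct.getBytes(StandardCharsets.UTF_8);
        String misread = new String(once, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String correct = "Wroc\u0142awskiej"; // \u0142 is LATIN SMALL LETTER L WITH STROKE
        // prints 57 72 6f 63 c5 82 61 77 73 6b 69 65 6a
        System.out.println(toHex(correct.getBytes(StandardCharsets.UTF_8)));
        // prints 57 72 6f 63 c3 85 c2 82 61 77 73 6b 69 65 6a
        System.out.println(toHex(doubleEncode(correct)));
    }
}
```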
So, to undo that, you need:
String s = new String(bytes, StandardCharsets.UTF_8);
// fix "double encoding"
s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
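As a quick check, the two lines can be applied to the byte sequence from the question. This is a minimal sketch; the class name UndoDoubleEncoding and the helper decodeDoubleEncoded are hypothetical names chosen for the example, and it assumes the received bytes are always double encoded.

```java
import java.nio.charset.StandardCharsets;

public class UndoDoubleEncoding {
    // Decodes UTF-8 once, maps each char back to a single byte via ISO-8859-1,
    // then decodes the resulting bytes as UTF-8 a second time.
    static String decodeDoubleEncoded(byte[] received) {
        String s = new String(received, StandardCharsets.UTF_8);
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Byte sequence from the question (values above 0x7F need a cast).
        byte[] received = {
            0x57, 0x72, 0x6F, 0x63, (byte) 0xC3, (byte) 0x85, (byte) 0xC2, (byte) 0x82,
            0x61, 0x77, 0x73, 0x6B, 0x69, 0x65, 0x6A
        };
        String fixed = decodeDoubleEncoded(received);
        System.out.println(fixed.equals("Wroc\u0142awskiej")); // prints true
    }
}
```

Note that the original attempt in the question could not work: encoding and decoding with the same charset is a round trip that returns the same string. The fix relies on using ISO-8859-1 for the encoding step and UTF-8 for the decoding step.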
I had the problem that I sometimes received double-encoded strings and sometimes properly encoded strings. The following method, fixDoubleUTF8Encoding, handles both:
// Requires: import java.nio.charset.StandardCharsets;
public static void main(String[] args) {
    String input = "werewrÃ¤Ã¼Ã¨Ã¶"; // double-encoded form of "werewräüèö"
    String result = fixDoubleUTF8Encoding(input);
    System.out.println(result); // werewräüèö
    input = "üäöé"; // already correct, must stay unchanged
    result = fixDoubleUTF8Encoding(input);
    System.out.println(result); // üäöé
}

private static String fixDoubleUTF8Encoding(String s) {
    // encode the string as UTF-8 to inspect its bytes
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    // now check if the bytes contain 0x83 0xC2, meaning double-encoded garbage
    if (isDoubleEncoded(bytes)) {
        // if so, fix the string by mapping it back to ISO-8859-1 bytes and decoding them as UTF-8 once
        s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }
    return s;
}

// Looks for the byte pair 0x83 0xC2. This pair shows up when the original UTF-8
// lead byte was 0xC3 (characters U+00C0 to U+00FF), so other double-encoded
// characters, such as ł (c3 85 c2 82), are not detected by this check.
private static boolean isDoubleEncoded(byte[] bytes) {
    for (int i = 0; i < bytes.length; i++) {
        if (bytes[i] == (byte) 0x83 && i + 1 < bytes.length && bytes[i + 1] == (byte) 0xC2) {
            return true;
        }
    }
    return false;
}
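A check for one specific byte pair only covers part of the Latin-1 range. One possible alternative, sketched below under the assumption that correctly encoded input never happens to look like UTF-8 misread as ISO-8859-1, is to attempt the repair with a strict decoder and keep the original string when the repair fails. The class name StrictDoubleEncodingFix and the method fixIfDoubleEncoded are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDoubleEncodingFix {
    // Returns the repaired string if s looks like UTF-8 misread as ISO-8859-1,
    // otherwise returns s unchanged.
    static String fixIfDoubleEncoded(String s) {
        boolean hasNonAscii = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) {
                return s; // cannot come from an ISO-8859-1 misread
            }
            if (c > 0x7F) {
                hasNonAscii = true;
            }
        }
        if (!hasNonAscii) {
            return s; // pure ASCII is identical in both encodings
        }
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // REPORT makes the decoder throw instead of silently substituting characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(latin1))
                    .toString();
        } catch (CharacterCodingException e) {
            return s; // bytes are not valid UTF-8, so assume s was already correct
        }
    }

    public static void main(String[] args) {
        // "Wroc" + U+00C5 + U+0082 + "awskiej" is the double-encoded form from the question.
        System.out.println(fixIfDoubleEncoded("Wroc\u00C5\u0082awskiej").equals("Wroc\u0142awskiej")); // true
        // U+00FC U+00E4 U+00F6 U+00E9 ("üäöé") is left untouched.
        System.out.println(fixIfDoubleEncoded("\u00FC\u00E4\u00F6\u00E9").equals("\u00FC\u00E4\u00F6\u00E9")); // true
    }
}
```

The trade-off is that a correct string which legitimately contains a sequence such as Ã followed by ¤ would also be rewritten, so this heuristic is not safe for arbitrary text.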
Well, double encoding may not be the only issue to deal with. Here is a solution that accounts for more than one cause:
// Requires java.nio.ByteBuffer, java.nio.CharBuffer, java.nio.charset.*, and
// StringEscapeUtils from Apache Commons Text (or Commons Lang); logger is assumed to exist.
String myString = "heartbroken ð";
// undo one round of double encoding
myString = new String(myString.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
// turn literal Java escapes such as \\uXXXX into the characters they stand for
String cleanedText = StringEscapeUtils.unescapeJava(myString);
byte[] bytes = cleanedText.getBytes(StandardCharsets.UTF_8);
String text = new String(bytes, StandardCharsets.UTF_8);
Charset charset = StandardCharsets.UTF_8;
CharsetDecoder decoder = charset.newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
CharsetEncoder encoder = charset.newEncoder();
encoder.onMalformedInput(CodingErrorAction.IGNORE);
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
try {
    // Encode the text; input that cannot be encoded is dropped.
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(text));
    // Decode the bytes again; the new CharBuffer is ready to be read.
    CharBuffer cbuf = decoder.decode(bbuf);
    String str = cbuf.toString();
} catch (CharacterCodingException e) {
    logger.error("Error Message if you want to");
}
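To make the effect of CodingErrorAction.IGNORE concrete, here is a small sketch that decodes a deliberately broken byte sequence with IGNORE and with REPORT. The class name IgnoreVersusReport, the decode helper, and the sample bytes are assumptions made for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class IgnoreVersusReport {
    // Decodes bytes as UTF-8 using the given action for malformed or unmappable input.
    static String decode(byte[] bytes, CodingErrorAction action) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // 0xC3 starts a two-byte sequence, but 0x62 ('b') is not a continuation byte.
        byte[] broken = { 0x61, (byte) 0xC3, 0x62 };
        // IGNORE silently drops the stray 0xC3.
        System.out.println(decode(broken, CodingErrorAction.IGNORE)); // prints ab
        try {
            decode(broken, CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            // REPORT surfaces the problem instead of hiding it.
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
    }
}
```

IGNORE is convenient for cleaning up display text, but it discards data without any trace, so REPORT may be the better choice when the goal is to detect a broken payload.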