
Convert string representation of a hexadecimal byte array to a string with non-ASCII characters in Java

I have a String being sent in the request payload by a client as:

"[0xc3][0xa1][0xc3][0xa9][0xc3][0xad][0xc3][0xb3][0xc3][0xba][0xc3][0x81][0xc3][0x89][0xc3][0x8d][0xc3][0x93][0xc3][0x9a]Departms"

I want to get the String "áéíóúÁÉÍÓÚDepartms". How can I do this in Java?

The problem is that I have no control over the way the client encodes this string. It seems the client encodes only the non-ASCII characters in this format and sends the ASCII characters as they are (see 'Departms' at the end).

The stuff within the square brackets seems to be characters encoded in UTF-8, with each byte then written out as a hexadecimal string in a weird way. What you can do is find each instance that looks like [0xc3], convert it into the corresponding byte, and then create a new string from the bytes.

Unfortunately, Java's regex API works on strings rather than byte arrays, so there is no convenient way to do the replacement directly on bytes. Here's a quick and dirty solution that uses regex to find and replace these hex codes with the corresponding character in Latin-1, and then fixes that by re-interpreting the bytes as UTF-8.

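import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Note: Matcher.appendReplacement(StringBuilder, String) requires Java 9 or later
String bracketDecode(String str) {
    // match tokens such as [0xc3]; accept upper- and lower-case hex digits
    Pattern p = Pattern.compile("\\[(0x[0-9a-fA-F]{2})\\]");
    Matcher m = p.matcher(str);
    StringBuilder sb = new StringBuilder();
    while (m.find()) {
        int value = Integer.decode(m.group(1));
        // temporarily treat the byte as the Latin-1 character with the same code point;
        // quoteReplacement stops '$' (0x24) and '\' (0x5c) from being read as group references
        m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf((char) value)));
    }
    m.appendTail(sb);
    // Latin-1 maps each char back to its original byte; decode those bytes as UTF-8.
    // Caveat: literal characters outside Latin-1 in the input would be replaced by '?' here.
    byte[] bytes = sb.toString().getBytes(StandardCharsets.ISO_8859_1);
    return new String(bytes, StandardCharsets.UTF_8);
}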
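
As an alternative to the Latin-1 round trip, the bytes can be collected directly in a ByteArrayOutputStream and decoded once at the end. The following is only a minimal sketch under the same assumption as above (every [0xNN] token stands for one byte of a UTF-8 sequence); the class name BracketDecoder and its methods are hypothetical and not part of the original answer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
final class BracketDecoder {
    // Matches tokens such as [0xc3] or [0xC3]; group 1 holds the two hex digits.
    private static final Pattern HEX_TOKEN = Pattern.compile("\\[0x([0-9a-fA-F]{2})\\]");

    static String decode(String input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Matcher m = HEX_TOKEN.matcher(input);
        int last = 0;
        while (m.find()) {
            // Copy the literal text before this token as UTF-8 bytes.
            writeUtf8(out, input.substring(last, m.start()));
            // Append the single byte that the token stands for.
            out.write(Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        // Copy whatever literal text follows the final token.
        writeUtf8(out, input.substring(last));
        // Decode the whole byte sequence as UTF-8 in one step.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    private static void writeUtf8(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }
}
```

With this sketch, BracketDecoder.decode("[0xc3][0xa1]Departms") returns "áDepartms". Because literal text is written as UTF-8 rather than Latin-1, characters outside Latin-1 that appear unescaped in the input should pass through unchanged. Malformed byte sequences are not rejected; new String(...) replaces them with U+FFFD.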
