简体   繁体   中英

Unicode base 64 encoding with java

I am trying to encode and decode a UTF8 string to base64. In theory not a problem but when decoding and never seem to output the correct characters but the ?.


        String original = "خهعسيبنتا";
        B64encoder benco = new B64encoder();
        String enc = benco.encode(original);
        try
        {
            String dec = new String(benco.decode(enc.toCharArray()), "UTF-8");
            PrintStream out = new PrintStream(System.out, true, "UTF-8");
            out.println("Original: " + original);
            prtHx("ara", original.getBytes());
            out.println("Encoded: " + enc);
            prtHx("enc", enc.getBytes());
            out.println("Decoded: " + dec);
            prtHx("dec", dec.getBytes());
        } catch (UnsupportedEncodingException e)
        {
            e.printStackTrace();
        }

The output to the console is as follow:

Original: خهعسيبنتا
ara = 3F, 3F, 3F, 3F, 3F, 3F, 3F, 3F, 3F
Encoded: Pz8/Pz8/Pz8/
enc = 50, 7A, 38, 2F, 50, 7A, 38, 2F, 50, 7A, 38, 2F
Decoded: ?????????
dec = 3F, 3F, 3F, 3F, 3F, 3F, 3F, 3F, 3F

prtHx simply writes the hex value of the bytes to the output. Am I doing something obviously wrong here?


Andreas pointed to the correct solution by highlighting that the getBytes() method uses the platform default encoding (Cp1252) even though the source file itself is UTF-8. By using the getBytes("UTF-8") I was able to notice that the bytes encoded and decoded were actually different. further investigation shown that the encode method used getBytes(). Changing this did the trick nicely.


try
        {
            String enc = benco.encode(original);
            String dec = new String(benco.decode(enc.toCharArray()), "UTF-8");
            PrintStream out = new PrintStream(System.out, true, "UTF-8");
            out.println("Original: " + original);
            prtHx("ori", original.getBytes("UTF-8"));
            out.println("Encoded: " + enc);
            prtHx("enc", enc.getBytes("UTF-8"));
            out.println("Decoded: " + dec);
            prtHx("dec", dec.getBytes("UTF-8"));

        } catch (UnsupportedEncodingException e)
        {
            e.printStackTrace();
        }

System encoding Cp1252
Original: خهعسيبنتا
ori = D8, AE, D9, 87, D8, B9, D8, B3, D9, 8A, D8, A8, D9, 86, D8, AA, D8, A7
Encoded: 2K7Zh9i52LPZitio2YbYqtin
enc = 32, 4B, 37, 5A, 68, 39, 69, 35, 32, 4C, 50, 5A, 69, 74, 69, 6F, 32, 59, 62, 59, 71, 74, 69, 6E
Decoded: خهعسيبنتا
dec = D8, AE, D9, 87, D8, B9, D8, B3, D9, 8A, D8, A8, D9, 86, D8, AA, D8, A7

Thanks.

String#getBytes() encodes the characters using the platform's default charset. The actual encoding of the String literal "خهعسيبنتا" is "defined" in the java source file (you choose a character encoding when you create or save the file)

This could be the reason, why ara is encode to 0x3f bytes..

Give this a try:

out.println("Original: " + original);
prtHx("ara", original.getBytes("UTF-8"));
out.println("Encoded: " + enc);
prtHx("enc", enc.getBytes("UTF-8"));
out.println("Decoded: " + dec);
prtHx("dec", dec.getBytes("UTF-8"));

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM