
Base64 UTF-16 encoding between Java, Python and JavaScript applications

As a sample I have the following string, which I presume to be UTF-16 encoded: "hühühüh".

In Python I get the following result when encoding:

>>> base64.b64encode("hühühüh".encode("utf-16"))
b'//5oAPwAaAD8AGgA/ABoAA=='

In Java:

>>> String test = "hühühüh";
>>> byte[] encodedBytes = Base64.getEncoder().encode(test.getBytes(StandardCharsets.UTF_16));
>>> String testBase64Encoded = new String(encodedBytes, StandardCharsets.US_ASCII);
>>> System.out.println(testBase64Encoded);
/v8AaAD8AGgA/ABoAPwAaA==

In JavaScript I define a binary encoding function as per the Mozilla dev guidelines and then encode the same string:

>> function toBinary(string) {                                                                                                                            
      const codeUnits = new Uint16Array(string.length);
      for (let i = 0; i < codeUnits.length; i++) {
          codeUnits[i] = string.charCodeAt(i);
      }
      return String.fromCharCode(...new Uint8Array(codeUnits.buffer));
  }
>> btoa(toBinary("hühühüh"))

aAD8AGgA/ABoAPwAaAA=

As you can see, each encoder created a distinct Base64 string. So let's reverse the encoding again.

In Python all the generated strings decode fine again:

>>> base64.b64decode("//5oAPwAaAD8AGgA/ABoAA==").decode("utf-16")
'hühühüh'
>>> base64.b64decode("/v8AaAD8AGgA/ABoAPwAaA==").decode("utf-16")
'hühühüh'
>>> base64.b64decode("aAD8AGgA/ABoAPwAaAA=").decode("utf-16")
'hühühüh'

In JavaScript, using the fromBinary function again as per the Mozilla dev guidelines:

>>> function fromBinary(binary) {
        const bytes = new Uint8Array(binary.length);
        for (let i = 0; i < bytes.length; i++) {
            bytes[i] = binary.charCodeAt(i);
        }
        console.log(...bytes);
        return String.fromCharCode(...new Uint16Array(bytes.buffer));
    }
>>> fromBinary(window.atob("//5oAPwAaAD8AGgA/ABoAA=="))
"\ufeffhühühüh"
>>> fromBinary(window.atob("/v8AaAD8AGgA/ABoAPwAaA=="))
"\ufffe栀ﰀ栀ﰀ栀ﰀ栀"
>>> fromBinary(window.atob("aAD8AGgA/ABoAPwAaAA="))
"hühühüh"

And finally in Java:

>>> String base64Encoded = "//5oAPwAaAD8AGgA/ABoAA==";
>>> byte[] asBytes = Base64.getDecoder().decode(base64Encoded);
>>> String base64Decoded = new String(asBytes, StandardCharsets.UTF_16);
>>> System.out.println(base64Decoded);
hühühüh
>>> String base64Encoded = "/v8AaAD8AGgA/ABoAPwAaA==";
>>> byte[] asBytes = Base64.getDecoder().decode(base64Encoded);
>>> String base64Decoded = new String(asBytes, StandardCharsets.UTF_16);
>>> System.out.println(base64Decoded);
hühühüh
>>> String base64Encoded = "aAD8AGgA/ABoAPwAaAA=";
>>> byte[] asBytes = Base64.getDecoder().decode(base64Encoded);
>>> String base64Decoded = new String(asBytes, StandardCharsets.UTF_16);
>>> System.out.println("Decoded" + base64Decoded);
hühühüh

We can see that Python is able to encode and decode messages for and from the other two parsers, but the Java and JavaScript parsers do not seem to be compatible with each other, and I do not understand why. Is this a problem with the Base64 libraries in Java and JavaScript, and if so, are there other tools or routes that let us pass Base64-encoded UTF-16 strings between a Java and a JavaScript application? How can I ensure safe Base64 string transport between Java and JavaScript applications using tools as close to core language functionality as possible?

EDIT: As said in the accepted answer, the problem is that different UTF-16 encodings are in play. The compatibility problem between Java and JavaScript can be solved either by generating the UTF-16 bytes in JavaScript in reversed (big-endian) byte order, or by decoding the received string on the Java side with StandardCharsets.UTF_16LE.
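A minimal sketch of that Java-side fix (not part of the original question; it assumes the same Base64 and StandardCharsets classes used in the Java snippets above):

>>> // Base64 produced by the JavaScript toBinary/btoa snippet: UTF-16LE code units, no BOM
>>> String fromJavaScript = "aAD8AGgA/ABoAPwAaAA=";
>>> byte[] leBytes = Base64.getDecoder().decode(fromJavaScript);
>>> // StandardCharsets.UTF_16 assumes big-endian when no BOM is present and garbles this input,
>>> // so the byte order is stated explicitly instead
>>> String decoded = new String(leBytes, StandardCharsets.UTF_16LE);
>>> System.out.println(decoded);
hühühüh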

The problem is that there are 4 variants of UTF-16.

This character encoding uses two bytes per code unit. Which of the two bytes should come first? This creates two variants:

  • UTF-16BE stores the most significant byte first.
  • UTF-16LE stores the least significant byte first.

To allow telling the difference between these two, there is an optional "byte order mark" (BOM) character, U+FEFF, at the start of the text. So UTF-16BE with BOM starts with the bytes fe ff, while UTF-16LE with BOM starts with ff fe. Since the BOM is optional, its presence or absence doubles the number of possible encodings.
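As a rough illustration (a Java sketch, not part of the original answer): Java's StandardCharsets.UTF_16 writes a big-endian BOM when encoding, while UTF_16BE and UTF_16LE write no BOM at all, which the raw bytes of a single character make visible:

>>> byte[] bom = "h".getBytes(StandardCharsets.UTF_16);   // fe ff 00 68  (BE with BOM)
>>> byte[] be  = "h".getBytes(StandardCharsets.UTF_16BE); // 00 68        (BE, no BOM)
>>> byte[] le  = "h".getBytes(StandardCharsets.UTF_16LE); // 68 00        (LE, no BOM)
>>> // There is no StandardCharsets constant for "LE with BOM"; that prefix (ff fe) would have to be written by hand.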

It looks like you are using 3 of the 4 possible encodings (a quick check is sketched after this list):

  • Python used UTF-16LE with BOM
  • Java used UTF-16BE with BOM
  • JavaScript used UTF-16LE without BOM
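This can be verified against the Base64 strings from the question by decoding them and looking at the leading bytes (again a Java sketch with illustrative variable names, not part of the original answer):

>>> byte[] fromPython = Base64.getDecoder().decode("//5oAPwAaAD8AGgA/ABoAA==");
>>> byte[] fromJava   = Base64.getDecoder().decode("/v8AaAD8AGgA/ABoAPwAaA==");
>>> byte[] fromJs     = Base64.getDecoder().decode("aAD8AGgA/ABoAPwAaAA=");
>>> // fromPython starts with ff fe (an LE BOM), fromJava with fe ff (a BE BOM),
>>> // and fromJs with 68 00, i.e. the first code unit of "h" in LE with no BOM.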

One of the reasons why people prefer UTF-8 to UTF-16 is to avoid this confusion.
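For comparison, a short sketch of the UTF-8 route on the Java side (not from the original answer): UTF-8 bytes have no byte order, so there is only one possible Base64 string and no BOM to negotiate.

>>> String b64 = Base64.getEncoder().encodeToString("hühühüh".getBytes(StandardCharsets.UTF_8));
>>> System.out.println(b64);
aMO8aMO8aMO8aA==
>>> System.out.println(new String(Base64.getDecoder().decode(b64), StandardCharsets.UTF_8));
hühühüh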
