
Getting the original UTF-8 string back from a US-ASCII encoded string

I have a UTF-8 encoded string "Château", and it gets converted to US-ASCII as "Ch??teau" (in the underlying library of my app).

Now I want to get the original string "Château" back from the US-ASCII converted string "Ch??teau", but I am not able to do that using the code below.

StringBuilder masterBuffer = new StringBuilder();
byte[] rawDataBuffer = ...; // read from InputStream; say the content here is "Château"
String rawString = new String(rawDataBuffer, "UTF-8");
masterBuffer.append(rawString);
// getBytes() uses the platform's default charset, which here is US-ASCII
onMessageReceived(masterBuffer.toString().getBytes());

My application receives the US-ASCII encoded byte array. On the application side, even if I try to get a UTF-8 string out of it, it is of no use; the conversion attempt still gives "Ch??teau".

String asciiString = "Ch??teau";
String originalString = new String(asciiString.getBytes("UTF-8"), "UTF-8");
System.out.println("originalString: " + originalString);

The value of "originalString" is still "Ch??teau".

Is this the right way to do this?

Thanks,

You can't. You lost information by converting to US-ASCII. You can't get back what was lost.

Your code is receiving a UTF-8 encoded byte array, correctly converting it to a Java String, but is then converting that string to an ASCII-encoded byte array. ASCII does not support the Ã and ¢ characters, which is why they are being converted to ?. Once that conversion has been done, there is no going back. ASCII is a subset of UTF-8, and ? in ASCII is also ? in UTF-8.
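To see both points concretely, here is a minimal, self-contained sketch (standard Java only, class name is mine) showing that encoding to US-ASCII replaces unmappable characters with ? and that round-tripping the damaged string through UTF-8 changes nothing:

```java
import java.nio.charset.StandardCharsets;

public class LossyEncodingDemo {
    public static void main(String[] args) {
        String original = "Château";

        // Encoding to US-ASCII replaces every unmappable character with '?'
        byte[] asciiBytes = original.getBytes(StandardCharsets.US_ASCII);
        String damaged = new String(asciiBytes, StandardCharsets.US_ASCII);
        System.out.println(damaged); // the â is gone for good

        // Round-tripping the damaged string through UTF-8 is a no-op:
        // '?' is a valid character in both charsets, so it maps to itself
        String roundTrip = new String(damaged.getBytes(StandardCharsets.UTF_8),
                                      StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(damaged)); // true
    }
}
```

Note that a correctly decoded "Château" produces one ? per â; seeing two, as in "Ch??teau", suggests the bytes were already mis-decoded somewhere upstream.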

The solution is to stop converting to ASCII to begin with. You should convert back to UTF-8 instead:

StringBuilder masterBuffer = new StringBuilder();
byte[] rawDataBuffer = ...; // Read from InputStream
String rawString = new String(rawDataBuffer, "UTF-8");
masterBuffer.append(rawString);
onMessageReceived(masterBuffer.toString().getBytes("UTF-8"));

At least that way, for true ASCII characters the receiver will never know the difference (since ASCII is a subset of UTF-8), and non-ASCII characters will no longer be lost. The receiver just needs to know to expect UTF-8 rather than ASCII. Your code will also be more portable, since it will no longer depend on a platform-specific default charset (not all platforms use ASCII by default).

Of course, in your example, your StringBuilder is redundant since you are not adding anything else to it, so you could just remove it:

byte[] rawDataBuffer = ...; // Read from InputStream
String rawString = new String(rawDataBuffer, "UTF-8");
onMessageReceived(rawString.getBytes("UTF-8"));

And then the String becomes redundant, too:

byte[] rawDataBuffer = ...; // Read from InputStream
onMessageReceived(rawDataBuffer);

If onMessageReceived() expects bytes as input, why waste overhead converting bytes to String to bytes again?
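As a side note, since Java 7 you can pass a StandardCharsets constant instead of a charset name string, which avoids the checked UnsupportedEncodingException and catches typos at compile time. A sketch of the sender with that change (the Sender class, forward() method, and onMessageReceived() stub are mine, mirroring the code in the question):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sender mirroring the structure of the code in the question
class Sender {
    void forward(InputStream in) throws IOException {
        byte[] rawDataBuffer = in.readAllBytes(); // Java 9+; read from InputStream
        // Decode and re-encode explicitly as UTF-8 — no platform default involved
        String rawString = new String(rawDataBuffer, StandardCharsets.UTF_8);
        onMessageReceived(rawString.getBytes(StandardCharsets.UTF_8));
    }

    void onMessageReceived(byte[] message) {
        // The receiver must also decode with UTF-8
        System.out.println(new String(message, StandardCharsets.UTF_8));
    }
}
```

With both sides agreeing on UTF-8, "Château" survives the round trip byte for byte.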

