
Serialize Java Object into String using UTF-8

I am trying to write a function which serializes a Java object into a String using UTF-8 encoding. This is my implementation:

public static String serializeToString(DefaultMutableTreeNode tree) {
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    try {
        ObjectOutput out = new ObjectOutputStream(byteArrayOutputStream);
        out.writeObject(tree);
        return byteArrayOutputStream.toString("UTF-8");
    } catch (IOException e) {
        return null;
    }
}

However, it doesn't seem to work. When I try to store the resulting String in a database that only accepts UTF-8, the insert fails with an encoding error.

My questions are:

  1. What is the problem with my implementation?
  2. How can I check whether the resulting String is valid UTF-8?

Many thanks

Regards

This is not a good idea: an arbitrary binary array doesn't always translate into a valid UTF-8 sequence. You should instead put the array in the database as a binary blob, or transform the array into a string with something like Base64 encoding.
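For example, a minimal sketch of the Base64 approach (assuming Java 8 or later, where java.util.Base64 is available; untested):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.Base64;

public class SerializeUtil {

    // Serialize the object to bytes, then encode those bytes as a Base64 string.
    // Base64 output contains only printable ASCII, so it is safe to store in a
    // UTF-8 text column.
    public static String serializeToBase64(Object obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }
}

To get the object back, decode with Base64.getDecoder().decode(...) and feed the resulting bytes to an ObjectInputStream.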

You are bound to get unprintable bytes in your string, which the DB won't like at all. The documentation for ByteArrayOutputStream.toString(String) says that malformed byte sequences are replaced with the charset's default replacement character, so the string you get back no longer corresponds to the bytes you wrote. Nor can I see what you would do with such a string in the future.

Only some of the 256 possible values of a byte are printable ASCII characters (fewer than half). Most databases won't take the rest as part of a character string. Hence your error message. (Unicode and UTF-8 have the same problem: not every byte sequence is a valid encoding.)

I did once store binary data in a database by converting it to printable characters, mapping every 6 bits to a byte containing a printable character. But I used simple ASCII encoding, and I wrote code to convert the characters back to binary. I was then able to store binary data in a database character column and retrieve it later. I was rather forced into it; I wouldn't recommend you do it.

If you want to see what your "character string" looks like, just print out each byte as an integer and compare it to an ASCII table. You'll probably see the problem without needing to consider the fine points of Unicode.
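For instance, something along these lines (just a quick diagnostic, reusing the byteArrayOutputStream from your own code):

// Dump each serialized byte as an unsigned integer; anything outside the
// printable ASCII range (32..126) will not survive as "text" in the database.
byte[] bytes = byteArrayOutputStream.toByteArray();
for (byte b : bytes) {
    System.out.println(b & 0xFF);
}

The serialization stream even starts with the magic bytes 0xAC 0xED, which are not printable ASCII at all.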

I am trying to write a function which serializes a Java object into a String using UTF-8 encoding.

Yes ... well, what your code is actually doing is serializing the object to bytes, and then telling the UTF-8 decoder "these bytes are a valid UTF-8 encoding of some Unicode code points". The problem is that (generally speaking) they are NOT, and when the UTF-8 decoder attempts to convert them to the UTF-16 representation used in a Java String, it finds sequences that are invalid and replaces them with the Unicode replacement character (U+FFFD).
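You can see the data loss with a quick round-trip test, something along these lines (a self-contained sketch, not your exact code):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    public static void main(String[] args) throws Exception {
        // Serialize an arbitrary object to bytes.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(new java.util.Date());
        }
        byte[] original = bos.toByteArray();

        // Pretend the bytes are UTF-8 text, then convert back to bytes.
        String asString = new String(original, StandardCharsets.UTF_8);
        byte[] roundTripped = asString.getBytes(StandardCharsets.UTF_8);

        // Prints "false": invalid sequences were replaced with U+FFFD,
        // so the original bytes cannot be recovered from the String.
        System.out.println(Arrays.equals(original, roundTripped));
    }
}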

If you want to represent arbitrary bytes as a Java String, then you need to use something like base64 encoding. A better idea would be to put the bytes into the database as a Blob.
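If you go the Blob route, a rough sketch with plain JDBC could look like the following (the trees table and data column are invented names here; adapt them to your schema, e.g. a BLOB column on MySQL or BYTEA on PostgreSQL):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class TreeDao {

    // Serialize the object and store the raw bytes in a binary column,
    // avoiding any character-encoding issues entirely.
    public static void saveTree(Connection conn, Object tree) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(tree);
        }
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO trees (data) VALUES (?)")) {
            ps.setBytes(1, bos.toByteArray());
            ps.executeUpdate();
        }
    }
}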
