Checking if string from database is utf-8 encoded in Java

Question

For two days now, I've been searching for a way to check in Java whether a value from the database is UTF-8 encoded. So far, I've read that Java strings use Unicode (UTF-16) internally. I've tried the suggested answers from here and here, but neither seems to work properly: the first always returns false, while the second always returns true.

Examples of the strings I'm checking are below; all except the last are UTF-8 encoded:

ABCDEF, ｋａｔａｋａｎａ, カタカナ and K { ` F b N G [

One idea I've been trying is to get the string's bytes using UTF-8, get them again using the default encoding, and compare the two:

byte[] utf8byte = str.getBytes("UTF-8");
byte[] bytes = str.getBytes();
if(utf8byte.length == bytes.length) {
   return true;
}

However, with this logic only the first string returns true. From my understanding, this is because not every character is encoded in a single byte.

So what is the best approach to check whether a string from the database is UTF-8 encoded? I'd really appreciate any ideas. Thanks in advance.

Answer 1

You can't.

The Java database driver reads the encoded byte string from the database and converts it to a Java string. The database may send the string as UTF-8, UTF-16, or any other encoding the driver understands.

Once it's a Java string, it no longer contains any trace of the original encoding. getBytes() with no argument encodes the string using your platform's default charset; it does not decode anything, and its result has no relation to the database encoding.

Yes, Java uses UTF-16 under the hood but it's irrelevant.
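
To illustrate the point, here is a minimal sketch of where such a check can work at all: on the raw bytes, before they are decoded into a String. The class name Utf8Check and the helper isValidUtf8 are hypothetical names made up for this example. It assumes you can get at the undecoded bytes somehow (for instance through JDBC's ResultSet.getBytes, although whether a driver hands back the stored bytes unchanged for a text column is driver-specific and something you would need to verify).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of any library.
final class Utf8Check {

    // Returns true if the raw bytes form a well-formed UTF-8 sequence.
    // A strict decoder is used so malformed input raises an exception
    // instead of being silently replaced with U+FFFD.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "\u30AB\u30BF\u30AB\u30CA" is the katakana string from the question.
        String katakana = "\u30AB\u30BF\u30AB\u30CA";
        byte[] utf8Bytes = katakana.getBytes(StandardCharsets.UTF_8);

        // Sample legacy-encoded bytes: 0x83 is a UTF-8 continuation byte,
        // so it can never start a UTF-8 character.
        byte[] legacyBytes = {(byte) 0x83, 0x4A, (byte) 0x83, 0x5E};

        System.out.println(isValidUtf8(utf8Bytes));
        System.out.println(isValidUtf8(legacyBytes));

        // Once decoded, the String is identical no matter how the bytes
        // were encoded; nothing about the source encoding survives.
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(katakana));
    }
}
```

Note that even on raw bytes this only tells you the data is well-formed UTF-8, not that it was written as UTF-8: plain ASCII such as ABCDEF is valid in UTF-8 and in many other encodings at once.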
