Checking if string from database is utf-8 encoded in Java

For two days now, I've been searching for a way to check, in Java, whether a value from the database is UTF-8 encoded. So far, I've read that strings in Java use Unicode (UTF-16) encoding. I've tried following the suggested answers from here and here, but neither seems to work properly: the first always returns false, while the second always returns true.

Examples of the strings I'm trying to check are as follows; everything except the last string is UTF-8 encoded:

ABCDEF, katakana, カタカナ and K { ` F b N G [

One idea I've been trying is to get the bytes of the string using UTF-8 encoding, then get the bytes using the default encoding, and compare the two like so:

byte[] utf8byte = str.getBytes("UTF-8");
byte[] bytes = str.getBytes();
if(utf8byte.length == bytes.length) {
   return true;
}

However, given this logic, only the first string would return true. From my understanding, this is because not all characters use only 1 byte.
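To illustrate why the length comparison behaves this way, here is a minimal sketch. It assumes ISO-8859-1 as a stand-in for a single-byte default charset (the actual default depends on the JVM and platform), and the class and method names are hypothetical. The comparison only measures how the characters encode under two charsets; it says nothing about how the data was stored.

```java
import java.nio.charset.StandardCharsets;

class ByteLengthDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as ISO-8859-1 (assumed stand-in for a
    // single-byte default charset). Unmappable characters become '?'.
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        // The last entry is カタカナ written with Unicode escapes so the
        // example does not depend on the source file encoding.
        String[] samples = {"ABCDEF", "katakana", "\u30AB\u30BF\u30AB\u30CA"};
        for (String s : samples) {
            System.out.println(s + ": utf8=" + utf8Length(s)
                    + ", latin1=" + latin1Length(s));
        }
    }
}
```

Pure ASCII strings give equal lengths under both charsets, while カタカナ takes three bytes per character in UTF-8, so the lengths differ even though the string is perfectly valid.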

So what is the best approach you can suggest to check whether a string from the database is UTF-8 encoded or not? I'd really appreciate any ideas. Thanks in advance.

You can't.

The Java database driver reads the encoded byte string from the database and converts it to a Java string. The database may choose to send the string as UTF-8, UTF-16 or any other encoding the driver understands.
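Since the encoding only exists at the byte level, any validity check has to happen on raw bytes before they are decoded. The following is a rough sketch of a strict UTF-8 check using the JDK's CharsetDecoder. It assumes you can get hold of the undecoded bytes at all (whether a driver exposes them, for example through ResultSet.getBytes, depends on the driver and column type), and the class and method names are hypothetical. Note also that passing the check does not prove the data was written as UTF-8: plain ASCII, for instance, is valid in many encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8ByteCheck {
    // Returns true if the raw bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "\u30AB\u30BF\u30AB\u30CA".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {(byte) 0xC3, (byte) 0x28}; // lead byte without a valid continuation
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```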

Once it's a Java string, it no longer contains any trace of the original encoding. getBytes() will use your system's default character encoding to encode the string. It has no relevance to the database encoding.
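A small sketch of this point, with hypothetical class and method names: the same text is encoded two different ways, decoded the way a driver might, and the resulting strings are indistinguishable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingTraceDemo {
    // Decode raw bytes with a known charset, producing an ordinary Java String.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String original = "\u30AB\u30BF\u30AB\u30CA"; // カタカナ

        // Two different wire representations of the same text.
        byte[] asUtf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] asUtf16 = original.getBytes(StandardCharsets.UTF_16BE);

        String fromUtf8 = decode(asUtf8, StandardCharsets.UTF_8);
        String fromUtf16 = decode(asUtf16, StandardCharsets.UTF_16BE);

        // The byte arrays differ in length, but the decoded strings are equal,
        // so nothing in the String reveals which encoding it came from.
        System.out.println(asUtf8.length + " bytes vs " + asUtf16.length + " bytes");
        System.out.println(fromUtf8.equals(fromUtf16));
    }
}
```

Passing an explicit charset to getBytes, as done here, also avoids depending on the platform default.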

Yes, Java uses UTF-16 under the hood, but that's irrelevant.
