
Checking if a string is UTF-8 compatible for MySQL

We have an older MySQL database that only supports the UTF-8 charset. Is there a way in Java to detect whether a given string will be UTF-8 compatible?

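import java.nio.charset.StandardCharsets;

public static boolean isUTF8MB4(String s) {
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        // Encode one whole code point; a lone surrogate half would be encoded as '?' (1 byte).
        int bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
        if (bytes > 3) {
            return true;
        }
        i += Character.charCount(codePoint);
    }
    return false;
}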

The implementation above seems best; alternatively:

public static boolean isUTF8MB4(String s) {
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        // charCount is 2 only for supplementary code points, which need 4 bytes in UTF-8.
        int chars = Character.charCount(codePoint);
        if (chars > 1) {
            return true;
        }
        i += chars;
    }
    return false;
}

which might fail more often.

Every String is UTF-8 compatible. Just set the encoding correctly in the database and in the MySQL driver and you're set.

The only gotcha is that the length in bytes of the UTF-8 encoded string may be larger than what .length() says. Here's a Java implementation of a function to measure how many bytes a string will take after encoding to UTF-8.
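A minimal sketch of such a byte-counting function follows. The class name Utf8Length and the method name utf8Length are illustrative assumptions, not taken from the original answer. It walks the string by code point and adds up the UTF-8 width of each one, so no encoded copy is allocated. Unpaired surrogates are counted as one byte, on the assumption that the result should match String.getBytes, which substitutes '?' for them.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    /** Counts the bytes s would occupy in UTF-8 without building the byte array. */
    public static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                count += 1;                       // ASCII
            } else if (cp < 0x800) {
                count += 2;
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                count += 1;                       // unpaired surrogate: getBytes emits '?'
            } else if (cp < 0x10000) {
                count += 3;                       // rest of the BMP
            } else {
                count += 4;                       // supplementary code point
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\u00e9\u20ac\uD83D\uDE00";   // 'a', e-acute, euro sign, one emoji
        System.out.println(s.length() + " chars, " + utf8Length(s) + " bytes");
        // Cross-check against the JDK encoder.
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length == utf8Length(s));
    }
}
```

For this sample the string has 5 chars but 10 UTF-8 bytes, which is the gap to keep in mind when sizing VARCHAR columns.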

EDIT: Since Saqib pointed out that older MySQL doesn't actually support full UTF-8, but only its BMP subset, you can check whether a string contains code points outside the BMP with string.length() == string.codePointCount(0, string.length()) ("true" means "all code points are in the BMP") and remove them with string.replaceAll("[^\\u0000-\\uFFFF]", "").
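The check and the cleanup from the edit above can be wrapped into two small helpers. This is only a sketch: the class name BmpFilter and the method names isBmpOnly and stripNonBmp are hypothetical, chosen for illustration.

```java
public class BmpFilter {
    /** True when no surrogate pair is present, i.e. every code point is in the BMP. */
    public static boolean isBmpOnly(String s) {
        return s.length() == s.codePointCount(0, s.length());
    }

    /** Removes every code point above U+FFFF; the regex engine matches by code point. */
    public static String stripNonBmp(String s) {
        return s.replaceAll("[^\\u0000-\\uFFFF]", "");
    }

    public static void main(String[] args) {
        String s = "ok \uD83D\uDE00!";            // contains one emoji (U+1F600)
        System.out.println(isBmpOnly(s));              // false
        System.out.println(stripNonBmp(s));            // ok !
        System.out.println(isBmpOnly(stripNonBmp(s))); // true
    }
}
```

Note that an unpaired surrogate counts as a single code point, so isBmpOnly returns true for it; malformed input of that kind would need a separate check.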

MySQL defines:

The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.

Therefore this function should work:

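private boolean isValidUTF8(final String string) {
    for (int i = 0; i < string.length(); i++) {
        final char c = string.charAt(i);
        // A surrogate char signals a code point outside the BMP (or malformed text).
        // Character.isBmpCodePoint(c) would be true for every char value.
        if (Character.isSurrogate(c)) {
            return false;
        }
    }
    return true;
}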
