Check whether a string is compatible with MySQL's UTF-8

Question

We have an older MySQL database that supports only the UTF-8 charset. Is there a way in Java to detect whether a given string will be UTF-8 compatible?

Answer 1

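// Requires: import java.nio.charset.StandardCharsets;
// Returns true if any code point in s needs 4 bytes in UTF-8 (i.e. requires utf8mb4).
public static boolean isUTF8MB4(String s) {
    for (int i = 0; i < s.length(); ) {
        // Step by whole code points so surrogate pairs are not split.
        int end = s.offsetByCodePoints(i, 1);
        int bytes = s.substring(i, end).getBytes(StandardCharsets.UTF_8).length;
        if (bytes > 3) {
            return true;
        }
        i = end;
    }
    return false;
}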

The above implementation seems best, but otherwise:

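public static boolean isUTF8MB4(String s) {
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        // Code points above U+FFFF are the ones that take 4 bytes in UTF-8.
        if (Character.isSupplementaryCodePoint(codePoint)) {
            return true;
        }
        // charCount is the number of Java chars (1 or 2), not UTF-8 bytes.
        i += Character.charCount(codePoint);
    }
    return false;
}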

which might fail more often.

Answer 2

Every String is UTF-8 compatible. Just set the encoding in the database and in the MySQL driver correctly and you're set.
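As a rough illustration of the driver side, the sketch below builds a JDBC URL that asks the driver to use UTF-8. It assumes MySQL Connector/J and its `useUnicode` and `characterEncoding` connection properties; the class name, host and schema are hypothetical placeholders, and the database-side charset still has to be configured separately.

```java
public class MysqlUrlSketch {
    // Builds a Connector/J style URL requesting UTF-8 on the connection.
    // Host and schema are supplied by the caller; the values used in main are made up.
    static String jdbcUrl(String host, String schema) {
        return "jdbc:mysql://" + host + "/" + schema
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // Hypothetical host and schema, for illustration only.
        System.out.println(jdbcUrl("localhost", "mydb"));
    }
}
```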

The only gotcha is that the length in bytes of the UTF-8 encoded string may be larger than what .length() says. Here's a Java implementation of a function to measure how many bytes a string will take after encoding to UTF-8.
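One simple way to measure this, not necessarily the implementation the answer had in mind, is to encode the string and count the bytes. The class and method names below are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // Number of bytes the string occupies once encoded as UTF-8.
    // This allocates the encoded byte array, which is fine for a sketch.
    static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String euro = "\u20AC";          // EURO SIGN, one Java char
        String emoji = "\uD83D\uDE00";   // U+1F600, two Java chars (a surrogate pair)
        System.out.println(euro.length() + " char(s), " + utf8ByteLength(euro) + " byte(s)");
        System.out.println(emoji.length() + " char(s), " + utf8ByteLength(emoji) + " byte(s)");
    }
}
```

A column sized in bytes should be checked against `utf8ByteLength`, not against `.length()`.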

EDIT: Since Saqib pointed out that older MySQL doesn't actually support full UTF-8, but only its BMP subset, you can check whether a string contains code points outside the BMP with string.length()==string.codePointCount(0,string.length()) ("true" means "all code points are in the BMP") and remove them with string.replaceAll("[^\u0000-\uFFFF]", "")
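The two expressions from the edit can be wrapped up roughly as follows. The class and method names are invented for the example, and the regex is written with regex-level `\\u` escapes so the pattern stays readable in source.

```java
public class BmpOnly {
    // True when every code point is in the BMP: a supplementary code point
    // takes two chars, so it makes length() larger than the code point count.
    // Note: an unpaired surrogate counts as one code point and is not detected here.
    static boolean isBmpOnly(String s) {
        return s.length() == s.codePointCount(0, s.length());
    }

    // Removes every code point above U+FFFF. Java regexes match by code point,
    // so a surrogate pair is removed as a whole.
    static String stripNonBmp(String s) {
        return s.replaceAll("[^\\u0000-\\uFFFF]", "");
    }

    public static void main(String[] args) {
        String withEmoji = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(isBmpOnly(withEmoji));
        System.out.println(stripNonBmp(withEmoji));
    }
}
```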

Answer 3

MySQL defines:

The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.

Therefore this function should work:

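private boolean isValidUTF8(final String string) {
    for (int i = 0; i < string.length(); i++) {
        final char c = string.charAt(i);
        // A single char is always within the BMP range, so test for surrogates:
        // any surrogate means a non-BMP code point (or a malformed string).
        if (Character.isSurrogate(c)) {
            return false;
        }
    }
    return true;
}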
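Building on the MySQL definition quoted above, it can also help to know where the offending character is, for instance to produce a useful error message. The sketch below is one possible way to do that; the class and method names are hypothetical.

```java
public class NonBmpLocator {
    // Returns the char index of the first code point outside the BMP,
    // or -1 if the whole string fits MySQL's three-byte utf8.
    static int firstNonBmpIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isBmpCodePoint(cp)) {
                return i;
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstNonBmpIndex("plain text"));
        System.out.println(firstNonBmpIndex("ab\uD83D\uDE00")); // U+1F600 after "ab"
    }
}
```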
