Checking if a string is UTF-8 compatible for MySQL
We have an older MySQL DB that only supports the UTF-8 charset. Is there a way in Java to detect if a given string will be UTF-8 compatible?
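import java.nio.charset.StandardCharsets;

// Returns true if any code point in s needs four bytes in UTF-8 (i.e. requires utf8mb4).
public static boolean isUTF8MB4(String s) {
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        // Encode one whole code point; half of a surrogate pair on its own would encode as "?".
        int bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
        if (bytes > 3) {
            return true;
        }
        i += Character.charCount(codePoint);
    }
    return false;
}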
The above implementation seems best, but otherwise:
public static boolean isUTF8MB4(String s) {
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        // charCount is 2 only for supplementary code points, which take four bytes in UTF-8.
        int chars = Character.charCount(codePoint);
        if (chars > 1) {
            return true;
        }
        i += chars;
    }
    return false;
}
which might fail more often.
Every String is UTF-8 compatible. Just set the encoding in the database and the MySQL driver correctly and you're set.
The only gotcha is that the length in bytes of the UTF-8 encoded string may be larger than what .length() says. Here's a Java implementation of a function to measure how many bytes a string will take after encoding to UTF-8.
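The implementation linked above is not reproduced here. As a rough sketch of the idea (not the linked code), the byte count can be derived from the code point ranges of UTF-8. The class name Utf8Length and the method name utf8Length are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // Hypothetical helper: counts the bytes s would occupy in UTF-8
    // without allocating the encoded byte array.
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                bytes += 1;          // ASCII
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3;          // rest of the BMP
            } else {
                bytes += 4;          // supplementary planes
            }
            i += Character.charCount(cp);
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "a\u00e9\u20ac\uD83D\uDE00"; // a, e-acute, euro sign, one emoji
        System.out.println(s.length());                                 // 5
        System.out.println(utf8Length(s));                              // 10
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 10
    }
}
```

For well-formed strings this agrees with s.getBytes(StandardCharsets.UTF_8).length. An unpaired surrogate is counted as three bytes here, whereas getBytes replaces it with a one-byte "?", so the two can differ on malformed input.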
EDIT: Since Saqib pointed out that older MySQL doesn't actually support full UTF-8, but only its BMP subset, you can check whether a string contains code points outside the BMP with string.length()==string.codePointCount(0,string.length()) ("true" means "all code points are in the BMP") and remove them with string.replaceAll("[^\u0000-\uFFFF]", "").
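A small sketch of the check and the removal side by side may help. The class name BmpFilter and both method names are invented for this example, and the regex is written with doubled backslashes so that the regex engine, rather than the Java compiler, interprets the \u escapes.

```java
public class BmpFilter {
    // True when no code point needs a surrogate pair, i.e. everything is in the BMP.
    static boolean isBmpOnly(String s) {
        return s.length() == s.codePointCount(0, s.length());
    }

    // Removes every code point above U+FFFF; Java regexes match by code point,
    // so a surrogate pair is treated as one character outside the range.
    static String stripNonBmp(String s) {
        return s.replaceAll("[^\\u0000-\\uFFFF]", "");
    }

    public static void main(String[] args) {
        String s = "ok \uD83D\uDE00!"; // contains one emoji outside the BMP
        System.out.println(isBmpOnly(s));               // false
        System.out.println(stripNonBmp(s));             // ok !
        System.out.println(isBmpOnly(stripNonBmp(s)));  // true
    }
}
```

Silently dropping characters changes the user's data, so whether to strip, replace, or reject such input is a design decision for the application.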
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.
Therefore this function should work:
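private boolean isValidUTF8(final String string) {
    for (int i = 0; i < string.length(); i++) {
        final char c = string.charAt(i);
        // A char never exceeds U+FFFF, so Character.isBmpCodePoint(c) would always be true.
        // Characters outside the BMP show up as surrogates instead.
        if (Character.isSurrogate(c)) {
            return false;
        }
    }
    return true;
}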
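Character.isBmpCodePoint is only informative when it is given whole code points rather than individual chars. A possible variant, assuming Java 8 or later for String.codePoints(), is sketched below; the class name BmpCheck and the method name fitsThreeByteUtf8 are hypothetical.

```java
public class BmpCheck {
    // True when every code point is in the BMP and so fits MySQL's
    // three-byte utf8 character set.
    static boolean fitsThreeByteUtf8(String s) {
        return s.codePoints().allMatch(Character::isBmpCodePoint);
    }

    public static void main(String[] args) {
        System.out.println(fitsThreeByteUtf8("caf\u00e9 \u20ac"));  // true
        System.out.println(fitsThreeByteUtf8("hi \uD83D\uDE00"));   // false
    }
}
```

Note that an unpaired surrogate is reported by codePoints() as a BMP value, so this variant accepts malformed strings that a surrogate-based check would reject.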