Checking whether UTF-8 data is 3-byte or 4-byte Unicode

Question

In my database I get the error:

com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column

I use Java and MySQL 5. As far as I know, 4-byte Unicode is legal in Java but illegal in MySQL 5. I think this may be causing my problem, so I want to check what kind of data I have. So here's my question: how can I check whether my UTF-8 data is 3-byte or 4-byte Unicode?

Answer 1

UTF-8 encodes everything in the Basic Multilingual Plane (i.e. U+0000 to U+FFFF inclusive) in 1 to 3 bytes. Therefore, you just need to check whether everything in your string is in the BMP.

In Java, that means checking whether any char (which is a UTF-16 code unit) is a high or low surrogate, as Java uses surrogate pairs to encode non-BMP characters:

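// Returns true if no char in the text is a surrogate, i.e. every character
// is in the BMP and needs at most 3 bytes in UTF-8.
// Character.isSurrogate(char) requires Java 7 or later.
public static boolean isEntirelyInBasicMultilingualPlane(String text) {
    for (int i = 0; i < text.length(); i++) {
        if (Character.isSurrogate(text.charAt(i))) {
            return false;
        }
    }
    return true;
}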
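
To see directly how many bytes each character needs, the same idea can be expressed per code point. The following is a minimal sketch, assuming Java 8 or later (for String.codePoints()); the class name Utf8Width and its method names are hypothetical, chosen only for illustration.

```java
final class Utf8Width {
    /** Number of bytes UTF-8 uses for one Unicode code point (1 to 4). */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;      // U+0000..U+007F
        if (codePoint < 0x800) return 2;     // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;   // rest of the BMP
        return 4;                            // supplementary planes
    }

    /** Largest per-character UTF-8 width in the text; 0 for an empty string. */
    static int maxUtf8Length(String text) {
        return text.codePoints().map(Utf8Width::utf8Length).max().orElse(0);
    }

    public static void main(String[] args) {
        String bmpOnly = "caf\u00e9 \u20ac";   // U+00E9 takes 2 bytes, U+20AC takes 3
        String withEmoji = "ok \uD83D\uDE00";  // surrogate pair for U+1F600, 4 bytes
        System.out.println(maxUtf8Length(bmpOnly));    // prints 3
        System.out.println(maxUtf8Length(withEmoji));  // prints 4
    }
}
```

A result of 4 means the string contains at least one character outside the BMP; anything from 0 to 3 means it is entirely within the BMP.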

Answer 2

If you do not want to support characters beyond the BMP, you can just strip them before handing the string to MySQL:

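public static String withNonBmpStripped( String input ) {
    if( input == null ) throw new IllegalArgumentException("input must not be null");
    // Java regexes match by code point, so this removes each non-BMP character whole.
    return input.replaceAll("[^\\u0000-\\uFFFF]", "");
}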

If you want to support characters beyond the BMP, you need MySQL 5.5+ and you need to change everything that's utf8 to utf8mb4 (collations, charsets, ...). You also need support for this in the driver, which I am not familiar with. Handling these characters in Java is also a pain because each one is spread over 2 chars and thus needs special handling in many operations.
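
To illustrate the kind of special handling meant here, the sketch below contrasts char-based and code-point-based operations on a string containing one non-BMP character. The class name SurrogateDemo and the helper firstCodePoints are hypothetical names introduced for this example only.

```java
final class SurrogateDemo {
    /** Number of Unicode characters (code points), not UTF-16 code units. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /**
     * First n code points of s, never cutting a surrogate pair in half.
     * Throws IndexOutOfBoundsException if s has fewer than n code points.
     */
    static String firstCodePoints(String s, int n) {
        return s.substring(0, s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 as a surrogate pair, 'b'

        System.out.println(s.length());          // prints 4 (UTF-16 code units)
        System.out.println(codePointLength(s));  // prints 3 (actual characters)

        // A naive substring by char index leaves a lone high surrogate at the end.
        String broken = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(broken.charAt(1))); // prints true

        // Cutting by code point keeps the pair together.
        String safe = firstCodePoints(s, 2);
        System.out.println(safe.length());       // prints 3
    }
}
```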

Answer 3

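The best way I have found to strip non-BMP characters in Java is the following, which replaces each of them with U+FFFD (the Unicode replacement character):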

inputString.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
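
As a quick comparison of the two regex-based approaches in this thread, here is a small sketch showing what each one does to a string that contains one non-BMP character. The class name NonBmpCleanup and its method names are hypothetical; the regular expression is the one from the answers.

```java
final class NonBmpCleanup {
    /** Removes every character outside the BMP. */
    static String strip(String input) {
        return input.replaceAll("[^\\u0000-\\uFFFF]", "");
    }

    /** Replaces every character outside the BMP with U+FFFD. */
    static String replaceWithMarker(String input) {
        return input.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 as a surrogate pair, 'b'

        // Strings are immutable, so the result has to be assigned or returned.
        String stripped = strip(s);
        String marked = replaceWithMarker(s);

        System.out.println(stripped.equals("ab"));       // prints true
        System.out.println(marked.equals("a\uFFFDb"));   // prints true
    }
}
```

Replacing with U+FFFD keeps a visible trace that something was removed, at the cost of storing a character that was not in the original input; stripping loses that information.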

Answer 1
17, accepted, 2013-02-20 13:37:08

Answer 2
10, 2013-02-20 15:29:16

Answer 3
3, 2013-11-18 04:39:02
