
Java - Count exactly 60 characters from a string with a mixture of UTF-8 and non UTF-8 characters

I have a string that I want to save in an Oracle database that only supports UTF-8 characters. If the string is longer than 60 characters, I want to truncate it and store only the first 60 characters.

Using String.substring(0, 60) in Java returns 60 characters, but when I save the result, the database rejects it, claiming that the string is longer than 60 characters.

  • Is there a way to find out whether a particular string contains non-UTF-8 characters? One option I found is:

    try {
        bytes = returnString.getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
        // Do something
    }

  • Is there a way to truncate it to exactly x characters (loss of data is not an issue) and make sure that only x characters are saved in the database? For example, if I have the string §8§8§8§8§8§8§8 and ask to truncate and save only 5 characters, it should save only §8§
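To illustrate the mismatch described above, here is a minimal sketch that compares the Java character count with the UTF-8 byte count of the example string. It assumes the database enforces the column limit in bytes rather than characters, which the post does not state. The class name `Utf8LengthDemo` and the helper `utf8Length` are hypothetical, introduced only for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The example string from the question; \u00A7 is the section sign.
        String s = "\u00A78\u00A78\u00A78\u00A78\u00A78\u00A78\u00A78";
        System.out.println(s.length());     // 14 (UTF-16 chars)
        System.out.println(utf8Length(s));  // 21 (2 bytes per section sign, 1 per digit)
    }
}
```

Under that assumption, a string that passes a 60-character check can still need more than 60 bytes, which would explain the rejection.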

As far as I understand, you want to limit the string length so that its UTF-8 encoded representation does not exceed 60 bytes. You can do it this way:

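import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

String s = …;
CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
ByteBuffer bb = ByteBuffer.allocate(60); // note the limit
CharBuffer cb = CharBuffer.wrap(s);
// Encoding stops with an overflow as soon as the next character no longer fits into bb.
CoderResult r = enc.encode(cb, bb, true);
if (r.isOverflow()) {
    System.out.println(s + " is too long for "
                       + bb.capacity() + " " + enc.charset() + " bytes");
    // After flip(), cb covers exactly the characters that were encoded.
    s = cb.flip().toString();
    System.out.println("truncated to " + s);
}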
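The encoder approach above can also be wrapped in a reusable method. The following is only a sketch: the class name `Utf8Truncator` and the method `truncateToUtf8Bytes` are hypothetical, and throwing an exception on unpaired surrogates is an assumption of this example, not something the answer prescribes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class Utf8Truncator {
    // Returns the longest prefix of s whose UTF-8 encoding fits into maxBytes.
    static String truncateToUtf8Bytes(String s, int maxBytes) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer bb = ByteBuffer.allocate(maxBytes);
        CharBuffer cb = CharBuffer.wrap(s);
        CoderResult r = enc.encode(cb, bb, true);
        if (r.isError()) {
            // Assumption of this sketch: reject input with unpaired surrogates.
            throw new IllegalArgumentException("input is not well-formed UTF-16");
        }
        if (r.isOverflow()) {
            // cb.position() is the number of chars that were fully encoded.
            return s.substring(0, cb.position());
        }
        return s; // everything fit
    }

    public static void main(String[] args) {
        String s = "\u00A78\u00A78\u00A78\u00A78\u00A78\u00A78\u00A78";
        // Keeps 3 chars (section sign, 8, section sign), which take 5 bytes.
        System.out.println(truncateToUtf8Bytes(s, 5).length()); // 3
    }
}
```

Because the encoder never emits part of a character, a supplementary character (a surrogate pair in Java) is either kept whole or dropped whole.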

This is my quick hack: a function that truncates a string to a given number of bytes in UTF-8 encoding:

public static String truncateUtf8(String original, int byteCount) {
    if (original.length() * 3 <= byteCount) {
        return original;
    }
    StringBuilder sb = new StringBuilder();
    int count = 0;
    for (int i = 0; i < original.length(); i++) {
        char c = original.charAt(i);
        int newCount;
        if (c <= 0x7f) newCount = count + 1;
        else if (c <= 0x7ff) newCount = count + 2;
        else newCount = count + 3;
        if (newCount > byteCount) {
            break;
        }
        count = newCount;
        sb.append(c);
    }
    return sb.toString();
}

It does not work as expected for characters outside the BMP: each one is counted as 6 bytes instead of 4, and the cut can fall between the two halves of a surrogate pair. It may also break grapheme clusters. But for most simple tasks it should be OK.

truncateUtf8("e", 1) => "e"
truncateUtf8("ée", 1) => ""
truncateUtf8("ée", 2) => "é"
truncateUtf8("ée", 3) => "ée"
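The limitation for characters outside the BMP could be avoided by iterating over code points instead of chars. This is a sketch of one possible variant, not the original author's code; the names `Utf8CodePointTruncator` and `truncateUtf8ByCodePoint` are hypothetical. It still ignores grapheme clusters.

```java
final class Utf8CodePointTruncator {
    // Truncates to at most byteCount UTF-8 bytes without splitting a surrogate pair.
    static String truncateUtf8ByCodePoint(String original, int byteCount) {
        int bytes = 0;
        int i = 0;
        while (i < original.length()) {
            int cp = original.codePointAt(i);
            int size;
            if (cp <= 0x7f) size = 1;
            else if (cp <= 0x7ff) size = 2;
            else if (cp <= 0xffff) size = 3; // an unpaired surrogate is also counted as 3
            else size = 4;                   // supplementary code point
            if (bytes + size > byteCount) {
                break;
            }
            bytes += size;
            i += Character.charCount(cp);    // advance by 1 or 2 chars
        }
        return original.substring(0, i);
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, 4 bytes in UTF-8
        System.out.println(truncateUtf8ByCodePoint(emoji, 3).length()); // 0
        System.out.println(truncateUtf8ByCodePoint(emoji, 4).length()); // 2
    }
}
```

For BMP-only input this returns the same results as the examples above.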
