
Multi-byte character split generates junk symbols on saving to database

In my application, long strings are generated dynamically. I save these values in a database column with a maximum length. When the maximum length is exceeded, the string is split using custom code and a new row is inserted into the database.

The problem occurs when multi-byte characters are used. If the split lands in the middle of a word, at a vowel sign (matra), junk symbols are generated, such as a diamond with a question mark inside.

    int blockSize = 12;
    String str1 = "<SOME STRING>";

    byte[] b = str1.getBytes("UTF-8");

    int loopCount = b.length / blockSize; // in the actual code this is generated dynamically
    String outString = "";
    for (int i = 0; i <= loopCount; i++) {
        if (i != loopCount) {
            // Cuts the raw bytes at a fixed offset -- this can slice a
            // multi-byte UTF-8 sequence in half.
            outString = new String(b, i * blockSize, blockSize, "UTF-8");
        } else {
            outString = new String(b, i * blockSize,
                    b.length - loopCount * blockSize, "UTF-8");
        }
    }
  1. How can I avoid splitting the string in the middle of a word, and instead carry the whole word over to the next row?
  2. Or is there any other way to stop the junk symbols from being generated?

Text as conceived in Unicode has its problems on several levels.

As pure text composed from Unicode code points: ĉ can be represented as one code point, U+0109, in UTF-16 (Java's binary format) as the single char '\u0109', or as c plus a zero-width, so-called combining diacritical mark for ^. So splitting between code points already is problematic. java.text.Normalizer can normalize to either the composed or the decomposed form. Then there are the Left-To-Right and Right-To-Left markers to consider when using a part of a text.
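The composed/decomposed distinction can be checked directly with java.text.Normalizer; a minimal sketch (the strings and class name are illustrative, not from the question):

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed = "\u0109";     // ĉ as a single code point
        String decomposed = "c\u0302";  // c + combining circumflex accent

        // Visually identical, yet different char sequences:
        System.out.println(composed.equals(decomposed)); // false
        System.out.println(composed.length());           // 1
        System.out.println(decomposed.length());         // 2

        // Normalizer maps both spellings onto one canonical form.
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals(composed));                      // true
        System.out.println(Normalizer.normalize(composed, Normalizer.Form.NFD)
                .equals(decomposed));                    // true
    }
}
```

Normalizing to NFC before measuring or splitting removes one source of surprise: a cut can no longer land between a base letter and its combining mark when the pair has a composed form.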

On the UTF-16 level (the Java char), some code points need two chars, a so-called surrogate pair. This is testable in Java using the Character class. Character, and also the regular-expression Pattern class, have rather good Unicode support; one can match categories such as combining diacritical marks.
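The surrogate-pair point can be illustrated with the Character API; the code point below was chosen for the example, any character outside the Basic Multilingual Plane behaves the same way:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) lies outside the BMP,
        // so UTF-16 stores it as a surrogate pair of two chars.
        String s = new String(Character.toChars(0x1D49C));

        System.out.println(s.length());                      // 2 chars
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        // Splitting between charAt(0) and charAt(1) leaves an unpaired
        // surrogate, which no charset can encode meaningfully.
    }
}
```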

On the UTF-8 level, some (non-ASCII) chars or code points need multi-byte sequences, so splitting a byte array at an arbitrary offset produces illegal UTF-8 garbage at the split point.
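This failure mode is easy to reproduce: cutting a UTF-8 byte array inside a multi-byte sequence makes the decoder substitute U+FFFD, the replacement character (the diamond with the question mark the question describes). A small demonstration, with an example string of my choosing:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteSplitDemo {
    public static void main(String[] args) {
        String s = "n\u00e9";  // "né"; the é takes two bytes in UTF-8
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(b.length); // 3

        // Cut after 2 bytes: only the first byte of é survives.
        String left = new String(Arrays.copyOfRange(b, 0, 2), StandardCharsets.UTF_8);
        System.out.println(left.charAt(1) == '\uFFFD'); // true: the junk symbol
    }
}
```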

The solution?

  1. It may be sensible to normalize the text; mind file names.
  2. Do not consider byte sub-arrays to be valid text.
  3. Treat the boundaries of byte arrays carefully: even a plain c at the end might really be the start of ĉ, so consider a shifting boundary buffer.
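Points 2 and 3 can be combined into a splitter that budgets by encoded bytes but only ever cuts at boundaries reported by BreakIterator. This is a sketch of one possible approach, not the poster's code: the class name and sample string are mine, and maxBytes must be at least as large as the biggest single character, otherwise empty pieces are emitted.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class SafeSplitter {
    /**
     * Splits text into pieces whose UTF-8 encoding is at most maxBytes,
     * cutting only at boundaries reported by BreakIterator, so no
     * multi-byte sequence or surrogate pair is ever halved.
     */
    public static List<String> split(String text, int maxBytes) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int start = 0;
        int prev = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            int bytes = text.substring(start, end)
                    .getBytes(StandardCharsets.UTF_8).length;
            if (bytes > maxBytes) {
                // Adding the next character would bust the budget:
                // emit what fits and start a new piece.
                out.add(text.substring(start, prev));
                start = prev;
            }
            prev = end;
        }
        if (start < text.length()) out.add(text.substring(start));
        return out;
    }

    public static void main(String[] args) {
        // "ĉiuĵaŭde" -> pieces of at most 4 UTF-8 bytes each
        for (String piece : split("\u0109iu\u0135a\u016Dde", 4)) {
            System.out.println(piece); // ĉiu / ĵa / ŭde
        }
    }
}
```

Swapping BreakIterator.getCharacterInstance() for getWordInstance() would carry whole words over to the next row, which addresses the first question directly.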
