
Multi-byte character split generates junk symbols on saving to database

In my application, long strings are generated dynamically. I save these values in a database column with a maximum length. When the maximum length is exceeded, the string is split using custom code and a new row is inserted into the database.

The problem occurs when multi-byte characters are used. If the split lands in the middle of a word, at a vowel sign (matra), junk symbols are generated, such as a diamond with a question mark inside.

    int blockSize = 12;
    String str1 = "<SOME STRING>";

    byte[] b = str1.getBytes("UTF-8");

    int loopCount = b.length / blockSize; // in the actual code this is generated dynamically
    String outString = "";
    for (int i = 0; i <= loopCount; i++) {
        if (i != loopCount) {
            // Cuts the raw bytes at a fixed offset -- this can slice a
            // multi-byte UTF-8 sequence in half.
            outString = new String(b, i * blockSize, blockSize, "UTF-8");
        } else {
            outString = new String(b, i * blockSize,
                    b.length - loopCount * blockSize, "UTF-8");
        }
    }
  1. How can I avoid splitting the string in the middle of a word, and instead carry the whole word over to the next row?
  2. Or is there any other way to stop the junk symbols from being generated?

Text as conceived in Unicode has its problems on several levels.

As pure text composed from Unicode code points: ĉ can be represented as one code point, U+0109, in UTF-16 (Java's binary format) as the single char '\u0109', or as c plus a zero-width, so-called combining diacritical mark for ^. So splitting between code points already is problematic. java.text.Normalizer can normalize to either the composed or the decomposed form. Then there are the Left-To-Right and Right-To-Left markers to consider when using a part of a text.
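The composed/decomposed distinction can be checked directly with java.text.Normalizer; a minimal sketch (the strings and class name are illustrative, not from the question):

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed = "\u0109";     // ĉ as a single code point
        String decomposed = "c\u0302";  // c + combining circumflex accent

        // Visually identical, yet different char sequences:
        System.out.println(composed.equals(decomposed)); // false
        System.out.println(composed.length());           // 1
        System.out.println(decomposed.length());         // 2

        // Normalizer maps both spellings onto one canonical form.
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals(composed));                      // true
        System.out.println(Normalizer.normalize(composed, Normalizer.Form.NFD)
                .equals(decomposed));                    // true
    }
}
```

Normalizing to NFC before measuring or splitting removes one source of surprise: a cut can no longer land between a base letter and its combining mark when the pair has a composed form.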

On the UTF-16 level (the Java char), some code points need two chars, a so-called surrogate pair. This is testable in Java using the Character class. Character, and also the regular-expression Pattern class, have rather good Unicode support; one can match categories such as combining diacritical marks.
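The surrogate-pair point can be illustrated with the Character API; the code point below was chosen for the example, any character outside the Basic Multilingual Plane behaves the same way:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) lies outside the BMP,
        // so UTF-16 stores it as a surrogate pair of two chars.
        String s = new String(Character.toChars(0x1D49C));

        System.out.println(s.length());                      // 2 chars
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        // Splitting between charAt(0) and charAt(1) leaves an unpaired
        // surrogate, which no charset can encode meaningfully.
    }
}
```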

On the UTF-8 level, some (non-ASCII) chars or code points need multi-byte sequences, so splitting a byte array at an arbitrary offset produces illegal UTF-8 garbage at the split point.
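This failure mode is easy to reproduce: cutting a UTF-8 byte array inside a multi-byte sequence makes the decoder substitute U+FFFD, the replacement character (the diamond with the question mark the question describes). A small demonstration, with an example string of my choosing:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteSplitDemo {
    public static void main(String[] args) {
        String s = "n\u00e9";  // "né"; the é takes two bytes in UTF-8
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(b.length); // 3

        // Cut after 2 bytes: only the first byte of é survives.
        String left = new String(Arrays.copyOfRange(b, 0, 2), StandardCharsets.UTF_8);
        System.out.println(left.charAt(1) == '\uFFFD'); // true: the junk symbol
    }
}
```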

The solution?

  1. It may be sensible to normalize the text; mind file names.
  2. Do not consider byte sub-arrays to be valid text.
  3. Treat the boundaries of byte arrays carefully: even a plain c at the end might really be the start of ĉ, so consider a shifting boundary buffer.
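Points 2 and 3 can be combined into a splitter that budgets by encoded bytes but only ever cuts at boundaries reported by BreakIterator. This is a sketch of one possible approach, not the poster's code: the class name and sample string are mine, and maxBytes must be at least as large as the biggest single character, otherwise empty pieces are emitted.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class SafeSplitter {
    /**
     * Splits text into pieces whose UTF-8 encoding is at most maxBytes,
     * cutting only at boundaries reported by BreakIterator, so no
     * multi-byte sequence or surrogate pair is ever halved.
     */
    public static List<String> split(String text, int maxBytes) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int start = 0;
        int prev = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            int bytes = text.substring(start, end)
                    .getBytes(StandardCharsets.UTF_8).length;
            if (bytes > maxBytes) {
                // Adding the next character would bust the budget:
                // emit what fits and start a new piece.
                out.add(text.substring(start, prev));
                start = prev;
            }
            prev = end;
        }
        if (start < text.length()) out.add(text.substring(start));
        return out;
    }

    public static void main(String[] args) {
        // "ĉiuĵaŭde" -> pieces of at most 4 UTF-8 bytes each
        for (String piece : split("\u0109iu\u0135a\u016Dde", 4)) {
            System.out.println(piece); // ĉiu / ĵa / ŭde
        }
    }
}
```

Swapping BreakIterator.getCharacterInstance() for getWordInstance() would carry whole words over to the next row, which addresses the first question directly.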
