Multi-byte character split generates junk symbols on saving to database
My application generates long strings dynamically. I save these values in a database column with a maximum length. When a string exceeds the maximum length, custom code splits it and a new row is inserted in the database.
The problem occurs when multi-byte characters are used. If the split falls inside a word at a vowel sign (matra), it produces junk symbols such as a diamond with a question mark inside.
int blockSize = 12;
String str1 = "<SOME STRING>";
byte[] b = str1.getBytes("UTF-8");
int loopCount = x; // in actual code dynamically generated
String outString = "";
for (int i = 0; i <= loopCount; i++) {
    if (i != loopCount) {
        // Full block: cut at a fixed byte offset, which may fall inside a multi-byte character.
        outString = new String(b, i * blockSize, blockSize, "UTF-8");
    } else {
        // Last block: the remaining bytes, decoded with the same charset as above.
        outString = new String(b, i * blockSize, b.length - loopCount * blockSize, "UTF-8");
    }
}
Text as conceived in Unicode has its problems on several levels.
As pure text composed from Unicode code points. ĉ can be represented as one code point, U+0109 (in UTF-16, the binary format, as one char, '\u0109'), or as c plus a zero-width, so-called combining diacritical mark for ^. So splitting between code points is already problematic.
java.text.Normalizer can normalize to either composed or decomposed form. Then there are the left-to-right and right-to-left markers to consider when using a part of a text.
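To make the two representations of ĉ concrete, here is a minimal sketch using java.text.Normalizer. The class and method names (NormalizeDemo, compose, decompose) are made up for illustration. Note that many scripts, including the Indic vowel signs from the question, have no precomposed forms, so normalization alone does not make splitting safe.

```java
import java.text.Normalizer;

final class NormalizeDemo {
    /** Returns the composed (NFC) form: c + U+0302 becomes the single code point U+0109. */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Returns the decomposed (NFD) form: U+0109 becomes c + U+0302. */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u0109";     // one code point
        String decomposed = "c\u0302";  // base letter + combining circumflex

        // The two strings render the same but are not equal as char sequences.
        System.out.println(composed.equals(decomposed));
        // After normalizing to the same form they compare equal.
        System.out.println(compose(decomposed).equals(composed));
        System.out.println(decompose(composed).equals(decomposed));
    }
}
```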
On the UTF-16 level (Java char), some code points need 2 chars, a so-called surrogate pair. This is testable in Java using Character. The Character class and the regular expression class Pattern both have rather good Unicode support. One can find categories like combining diacritical marks.
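As a rough illustration of those checks, the sketch below uses Character to detect a split inside a surrogate pair and to recognise combining marks, and Pattern's \p{M} category to strip them. The class and method names are hypothetical; treating general categories Mn, Mc and Me as "combining" is an assumption that covers diacritics and most Indic vowel signs, not every clustering rule.

```java
import java.util.regex.Pattern;

final class CharInspect {
    // \p{M} matches any code point in a Unicode mark category (Mn, Mc, Me).
    private static final Pattern MARKS = Pattern.compile("\\p{M}");

    /** True if char index i lies between the two chars of a surrogate pair. */
    static boolean splitsSurrogatePair(String s, int i) {
        return i > 0 && i < s.length()
                && Character.isHighSurrogate(s.charAt(i - 1))
                && Character.isLowSurrogate(s.charAt(i));
    }

    /** True if the code point is a combining mark (categories Mn, Mc or Me). */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** Removes all combining marks, leaving only base characters. */
    static String stripMarks(String s) {
        return MARKS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // one code point, two chars
        System.out.println(splitsSurrogatePair(emoji, 1));
        System.out.println(isCombiningMark(0x0302)); // combining circumflex
        System.out.println(isCombiningMark(0x093F)); // Devanagari vowel sign I
        System.out.println(stripMarks("c\u0302"));
    }
}
```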
On the UTF-8 level, some (non-ASCII) chars or code points need multi-byte sequences, so splitting a byte array causes illegal UTF-8 garbage at the split point.
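The effect is easy to reproduce. In this small sketch (class and method names are illustrative only), the string "aĉ" encodes to three bytes, and cutting after the second byte leaves half of ĉ on each side. Decoding either half yields U+FFFD, the replacement character that usually renders as a diamond with a question mark.

```java
import java.nio.charset.StandardCharsets;

final class Utf8SplitDemo {
    /** Decodes bytes[from, to) as UTF-8; malformed input is replaced with U+FFFD. */
    static String decode(byte[] bytes, int from, int to) {
        return new String(bytes, from, to - from, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a' is 1 byte, U+0109 is 2 bytes in UTF-8, so the array has 3 bytes.
        byte[] b = "a\u0109".getBytes(StandardCharsets.UTF_8);

        String head = decode(b, 0, 2); // 'a' plus the first byte of the 2-byte sequence
        String tail = decode(b, 2, 3); // the orphaned continuation byte

        System.out.println(head.indexOf('\uFFFD') >= 0);
        System.out.println(tail.indexOf('\uFFFD') >= 0);
    }
}
```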
The solution? Handle the boundaries of the byte array carefully; even a c at the end might really be ĉ. Consider a shifting boundary buffer.
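One possible way to shift the boundary, sketched under some assumptions: split the String rather than the byte array, measure each piece in UTF-8 bytes, and move the cut back to the start of the current base character so that its trailing combining marks (including matras) stay with it. Utf8Chunker and its methods are hypothetical names, and treating "base code point plus following marks" as the unit is a simplification; it does not handle conjuncts joined by virama or ZWJ sequences. java.text.BreakIterator.getCharacterInstance() is a standard-library alternative for finding such boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Utf8Chunker {
    /** True for combining marks (general categories Mn, Mc, Me). */
    private static boolean isMark(int cp) {
        int t = Character.getType(cp);
        return t == Character.NON_SPACING_MARK
                || t == Character.COMBINING_SPACING_MARK
                || t == Character.ENCLOSING_MARK;
    }

    /**
     * Splits text into chunks whose UTF-8 encoding is at most maxBytes long,
     * never cutting inside a code point or between a base character and its marks.
     */
    static List<String> split(String text, int maxBytes) {
        List<String> chunks = new ArrayList<>();
        int chunkStart = 0; // char index where the current chunk begins
        int chunkBytes = 0; // UTF-8 size of the current chunk so far
        int i = 0;
        while (i < text.length()) {
            // A cluster is one code point followed by any combining marks.
            int j = i + Character.charCount(text.codePointAt(i));
            while (j < text.length() && isMark(text.codePointAt(j))) {
                j += Character.charCount(text.codePointAt(j));
            }
            int clusterBytes = text.substring(i, j).getBytes(StandardCharsets.UTF_8).length;
            if (clusterBytes > maxBytes) {
                throw new IllegalArgumentException("a single cluster exceeds maxBytes");
            }
            if (chunkBytes + clusterBytes > maxBytes) {
                // The cluster does not fit: close the chunk before it.
                chunks.add(text.substring(chunkStart, i));
                chunkStart = i;
                chunkBytes = 0;
            }
            chunkBytes += clusterBytes;
            i = j;
        }
        if (chunkStart < text.length()) {
            chunks.add(text.substring(chunkStart));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Devanagari sample: KA + vowel sign I, TA + vowel sign AA, BA (3 bytes each).
        String word = "\u0915\u093F\u0924\u093E\u092C";
        for (String chunk : split(word, 7)) {
            System.out.println(chunk + " (" + chunk.getBytes(StandardCharsets.UTF_8).length + " bytes)");
        }
    }
}
```

Each chunk can then be stored as its own row, and concatenating the rows in order gives back the original string.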