How to replace/remove 4(+)-byte characters from a UTF-8 string in Java?

Question

Because MySQL 5.1 does not support 4-byte UTF-8 sequences, I need to replace or drop the 4-byte sequences in these strings.

I'm looking for a clean way to replace these characters.

Replacing the characters with a question mark, as the Apache libraries do, is fine for this case, although an ASCII equivalent would of course be nicer.

NB: The input comes from external sources (e-mail names), and upgrading the database is not an option at this point in time.

Answer 1

We ended up implementing the following method in Java for this problem. Basically, it replaces every character whose code point is higher than that of the last 3-byte UTF-8 character (U+FFFF) with the Unicode replacement character (U+FFFD).

The offset calculations make sure we always advance by whole Unicode code points, so a surrogate pair is never split.

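// Both constants are referenced below as members of a class named CharUtils.
// U+FFFF is the highest code point that UTF-8 encodes in 3 bytes.
public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
// U+FFFD is the Unicode replacement character.
public static final String REPLACEMENT_CHAR = "\uFFFD";

public static String toValid3ByteUTF8String(String s) {
    final int length = s.length();
    StringBuilder b = new StringBuilder(length);
    for (int offset = 0; offset < length; ) {
        final int codepoint = s.codePointAt(offset);

        if (codepoint > CharUtils.LAST_3_BYTE_UTF_CHAR.codePointAt(0)) {
            // Supplementary code point: it would need 4 bytes in UTF-8, so replace it.
            b.append(CharUtils.REPLACEMENT_CHAR);
        } else if (Character.isValidCodePoint(codepoint)) {
            b.appendCodePoint(codepoint);
        } else {
            // Defensive: codePointAt only returns valid code points,
            // so this branch is not expected to run.
            b.append(CharUtils.REPLACEMENT_CHAR);
        }
        // Advance by 1 or 2 chars, depending on whether this was a surrogate pair.
        offset += Character.charCount(codepoint);
    }
    return b.toString();
}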
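To check the effect of this approach, here is a minimal self-contained sketch. It is not the answerer's code: it is a condensed variant of the same idea built on `String.codePoints()` (Java 8+), and the class name `ThreeByteUtf8Demo`, the helper `maxUtf8BytesPerCodePoint`, and the sample input are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original answer.
final class ThreeByteUtf8Demo {

    // Replaces every code point above U+FFFF with U+FFFD (condensed variant of the loop above).
    static String toValid3ByteUtf8(String s) {
        StringBuilder b = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp > 0xFFFF) {
                b.append('\uFFFD');
            } else {
                b.appendCodePoint(cp);
            }
        });
        return b.toString();
    }

    // Returns the length in bytes of the longest UTF-8 encoding among the string's code points.
    static int maxUtf8BytesPerCodePoint(String s) {
        return s.codePoints()
                .map(cp -> new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8).length)
                .max()
                .orElse(0);
    }

    public static void main(String[] args) {
        // Sample input containing U+1F600, which needs 4 bytes in UTF-8.
        String input = "Joe \uD83D\uDE00 Smith";
        String cleaned = toValid3ByteUtf8(input);
        System.out.println(maxUtf8BytesPerCodePoint(input));   // 4
        System.out.println(maxUtf8BytesPerCodePoint(cleaned)); // 3
    }
}
```

The replacement character U+FFFD itself takes 3 bytes in UTF-8, so the cleaned string stays within what a 3-byte-only column can store.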

Answer 2

Another simple solution is to use the regular expression [^\u0000-\uFFFF]. For example, in Java:

text.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
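Since Java strings are immutable, the result of `replaceAll` has to be assigned or returned. The following is a small self-contained sketch of that one-liner in use; the class name `RegexReplaceDemo`, the method name `replaceSupplementary`, and the sample strings are assumptions for illustration.

```java
// Hypothetical demo class wrapping the regular expression from this answer.
final class RegexReplaceDemo {

    // Replaces every code point outside U+0000 to U+FFFF with U+FFFD.
    // Java's regex engine matches by code point, so a surrogate pair
    // is replaced as one unit rather than as two separate chars.
    static String replaceSupplementary(String text) {
        return text.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
    }

    public static void main(String[] args) {
        // U+1F600 is encoded as a surrogate pair in a Java string.
        String input = "a\uD83D\uDE00b";
        String cleaned = replaceSupplementary(input);
        System.out.println(cleaned.length()); // 3: 'a', U+FFFD, 'b'
    }
}
```

Characters that already fit in 3 bytes, such as the euro sign U+20AC, are left untouched by this pattern.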

Answer 3

5-byte UTF-8 sequences begin with a 111110xx byte, and 6-byte UTF-8 sequences begin with a 1111110x byte. It is important to note that no continuation byte of a 1- to 4-byte UTF-8 sequence is that large, because continuation bytes always have the form 10xxxxxx.

Therefore you can just go through the bytes and, every time you see a byte of the form 111110xx, emit a single '?' to the output stream/array while skipping the next 4 bytes of the input; 6-byte sequences are handled analogously by skipping the next 5 bytes. The 4-byte sequences from the question begin with a 11110xxx byte and can be handled the same way by skipping the next 3 bytes.
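The byte-level scan described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a hardened decoder: it assumes the input is well-formed UTF-8, it also treats 11110xxx lead bytes (the 4-byte case from the question) as sequences to replace, and the names `Utf8ByteFilter`, `longSequenceLength`, and `replaceLongSequences` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes well-formed input and does no validation.
final class Utf8ByteFilter {

    // Returns the total sequence length announced by a 4-, 5- or 6-byte
    // lead byte, or 0 if the byte does not start such a sequence.
    static int longSequenceLength(int b) {
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        if ((b & 0xFC) == 0xF8) return 5; // 111110xx
        if ((b & 0xFE) == 0xFC) return 6; // 1111110x
        return 0;
    }

    // Copies the input, emitting a single '?' for each sequence of 4 or more bytes.
    static byte[] replaceLongSequences(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        int i = 0;
        while (i < in.length) {
            int len = longSequenceLength(in[i] & 0xFF);
            if (len == 0) {
                out.write(in[i]); // part of a 1- to 3-byte sequence: keep as-is
                i++;
            } else {
                out.write('?');
                i += len;         // skip the lead byte and its continuation bytes
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] input = "a\uD83D\uDE00b".getBytes(StandardCharsets.UTF_8);
        byte[] cleaned = replaceLongSequences(input);
        System.out.println(new String(cleaned, StandardCharsets.UTF_8)); // a?b
    }
}
```

Working on bytes is only useful if the data is still in encoded form; once the text is a Java `String`, the code-point based approaches in the other answers are simpler.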

Answer 1: score 11, accepted, 2013-05-16 07:38:24
Answer 2: score 10, 2014-08-01 07:32:33
Answer 3: score 2, 2012-02-13 12:56:32
