Remove characters not suitable for UTF-8 encoding from a string

Question

I have a text area on a website where users can write anything. The problem occurs when a user copies and pastes text that contains non-UTF-8 characters and submits it to the server.

Java handles it successfully, as it supports UTF-16, but my MySQL table supports UTF-8, and thus the insertion fails.

I was trying to implement something in the business logic itself to remove any characters that are not suitable for UTF-8 encoding.

Currently I am using this code:

new String(java.nio.charset.Charset.forName("UTF-8").encode(myString).array());

But it replaces characters not suitable for UTF-8 with some other obscure characters, which also does not look good to the end user. Could someone please shed some light on a possible way to tackle this using Java code?

EDIT: For example, these are the exceptions I got while inserting such values:

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x8A\x0D\x0A...' for column

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x80\xF0\x9F...' for column

Answer 1

UTF-8 is not a character set; it is a character encoding, just like UTF-16.

UTF-8 can encode any Unicode character and any Unicode text as a sequence of bytes, so there is no such thing as a character not suitable for UTF-8.

You're using a constructor of String that takes only a byte array (String(byte[] bytes)), which according to the javadocs:

Constructs a new String by decoding the specified array of bytes using the platform's default charset.

It uses the default charset of the platform to interpret the bytes (to convert the bytes to characters). Do not use this. Instead, when converting a byte array to a String, specify the encoding explicitly with the String(byte[] bytes, Charset charset) constructor.
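As a minimal sketch of the point above (the class and method names are made up for illustration), naming the charset on both the encoding and the decoding side gives a lossless round trip, even for characters outside the Basic Multilingual Plane:

```java
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {
    // Encodes to UTF-8 and decodes again, naming the charset at both steps
    // so the platform default never comes into play.
    static String roundTrip(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent, followed by U+1F60A (a smiley)
        String original = "caf\u00E9 \uD83D\uDE0A";
        System.out.println(original.equals(roundTrip(original))); // prints "true"
    }
}
```

Note that this only demonstrates the conversion; as far as I can tell there is no need to do such a round trip before handing the string to the database.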

If you have issues with certain characters, that is most likely due to using different character sets or encodings on the server side and on the client side (browser + HTML). Make sure you use UTF-8 everywhere, do not mix encodings, and do not use the default encoding of the platform.

Some reading on how to achieve this:

How to get UTF-8 working in Java webapps?

Answer 2

Maybe the answer using CharsetDecoder from this question helps. You could change the CodingErrorAction to REPLACE and set a replacement, in my example "?". The decoder then outputs the given replacement string for invalid byte sequences. In this example, a UTF-8 decoder capability and stress test file is read and decoded:

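import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Replace malformed or unmappable input with "?" instead of throwing
CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
utf8Decoder.onMalformedInput(CodingErrorAction.REPLACE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
utf8Decoder.replaceWith("?");

// Read stress file
Path path = Paths.get("<path>/UTF-8-test.txt");
byte[] data = Files.readAllBytes(path);
ByteBuffer input = ByteBuffer.wrap(data);

// UTF-8 decoding
CharBuffer output = utf8Decoder.decode(input);

// Char buffer to string
String outputString = output.toString();

System.out.println(outputString);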
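If you want to try this without the stress test file, here is a small self-contained sketch that decodes an in-memory byte array. The class name, the method name and the sample bytes are my own, chosen only for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class LenientUtf8Decoder {
    // Decodes bytes as UTF-8, substituting "?" for each invalid byte sequence.
    static String decodeLeniently(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("?");
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, which substitutes instead of reporting
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xFF can never occur in well-formed UTF-8
        byte[] data = {'a', (byte) 0xFF, 'b'};
        System.out.println(decodeLeniently(data)); // prints "a?b"
    }
}
```

Keep in mind that this works on bytes coming in; a Java String that already holds valid text has nothing for the decoder to replace.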

Answer 3

The problem in your code is that you are calling new String on a byte[]. The result of encode is a ByteBuffer, and the result of array on a ByteBuffer is a byte[]. The constructor new String(byte[]) will use the platform default encoding of your computer; it can be different on each computer you run on, so that's not something you want. You should at least pass a character set as the second argument to the String constructor, although I'm not sure which character set you would have in mind.

I'm not sure why you're doing it: if your database uses UTF-8, it will do the encoding for you. You just need to pass un-encoded strings into it.

UTF-8 and UTF-16 can both encode the entire Unicode 6 character set; there are no characters that can be encoded by UTF-16 but not by UTF-8. So that part of your question is unfortunately unanswerable.
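To make that concrete, here is a small sketch (the class and helper names are hypothetical) that prints the bytes of U+1F60A in both encodings. The UTF-8 bytes are the same F0 9F 98 8A that show up in the exception message in the question:

```java
import java.nio.charset.StandardCharsets;

final class EncodingLengths {
    // Renders a byte array as upper-case hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE0A"; // U+1F60A as a surrogate pair
        // UTF-8 needs one 4-byte sequence
        System.out.println(hex(smiley.getBytes(StandardCharsets.UTF_8)));    // prints "F09F988A"
        // UTF-16 needs two 2-byte code units
        System.out.println(hex(smiley.getBytes(StandardCharsets.UTF_16BE))); // prints "D83DDE0A"
    }
}
```

So the character is perfectly encodable in UTF-8; it simply takes four bytes.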

For some background:

http://unicodebook.readthedocs.org/en/latest/unicode_encodings.html

Answer 4

I think this may be useful to you: Easy way to remove UTF-8 accents from a string?

Try using Normalizer, as in:

s = Normalizer.normalize(s, Normalizer.Form.NFD);
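NFD normalization on its own only splits an accented letter into a base letter plus combining marks. A possible follow-up step, sketched here with a hypothetical helper name, is to strip the combining marks with a regex:

```java
import java.text.Normalizer;

final class AccentStripper {
    // Decomposes accented letters (NFD), then removes the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "cafe" written with a precomposed e-acute (U+00E9)
        System.out.println(stripAccents("caf\u00E9")); // prints "cafe"
    }
}
```

This only affects accents; it leaves 4-byte characters such as emoji in the string.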

Answer 5

You will run into this problem when the MySQL column uses the old utf8 encoding, which uses at most 3 bytes per character, and the value contains a 4-byte character.

The actual solution is to use utf8mb4 instead of utf8 in MySQL.

Otherwise, here is my dirty workaround to remove all 4-byte characters:

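import java.nio.charset.StandardCharsets;
import java.util.StringTokenizer;

public String removeUtf8Mb4(String text) {
    StringBuilder result = new StringBuilder();
    // Using the text itself as the delimiter set, with returnDelims = true,
    // makes the tokenizer return every character as its own token.
    StringTokenizer st = new StringTokenizer(text, text, true);
    while (st.hasMoreTokens()) {
        String current = st.nextToken();
        // Keep only characters that need at most 3 bytes in UTF-8; name the
        // charset explicitly rather than relying on the platform default.
        if (current.getBytes(StandardCharsets.UTF_8).length <= 3) {
            result.append(current);
        }
    }
    return result.toString();
}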
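An alternative sketch of the same workaround, assuming Java 8 or later for String.codePoints(): the characters that take 4 bytes in UTF-8 are exactly the supplementary code points (above U+FFFF), so they can be filtered out directly. The class and method names are my own:

```java
final class SupplementaryStripper {
    // Drops every code point above U+FFFF, i.e. every character that
    // would need 4 bytes in UTF-8; everything else is kept as-is.
    static String removeSupplementary(String text) {
        StringBuilder result = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> !Character.isSupplementaryCodePoint(cp))
            .forEach(result::appendCodePoint);
        return result.toString();
    }

    public static void main(String[] args) {
        // "a", then U+1F60A as a surrogate pair, then "b"
        System.out.println(removeSupplementary("a\uD83D\uDE0Ab")); // prints "ab"
    }
}
```

Like the version above, this silently throws away user input, so switching the column to utf8mb4 is still the cleaner fix.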

Answer 1
7 Accepted 2015-01-06 09:13:21

Answer 2
5 2015-01-06 09:21:06

Answer 3
1 2015-01-06 09:13:33

Answer 4
1 2015-01-06 09:28:37

Answer 5
1 2020-12-22 11:34:35