Java - Removing strange characters from a string

Question

How do I remove strange and unwanted Unicode characters (such as a black diamond with a question mark) from a String?

Updated:

Please tell me the Unicode character or regex that corresponds to "a black diamond with a question mark in it".

Answer 1

A black diamond with a question mark is not a character from your original text: it is a placeholder shown in place of a character that could not be decoded or displayed. If the string contains a character that has no glyph in the font you're using to display it, you will see a placeholder. The black diamond itself is defined as U+FFFD (REPLACEMENT CHARACTER, �). Its appearance varies depending on the font you're using.
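
If the placeholder really is stored in the string as U+FFFD (rather than only being drawn by the font), it can be removed like any other character. The following is a minimal sketch under that assumption; the class and method names are made up for illustration.

```java
public class ReplacementCharDemo {
    // U+FFFD REPLACEMENT CHARACTER, usually drawn as a black diamond with a question mark
    static final String REPLACEMENT = "\uFFFD";

    // Hypothetical helper: returns a copy of s without any U+FFFD characters.
    static String stripReplacementChar(String s) {
        // replace() matches the text literally, no regex involved
        return s.replace(REPLACEMENT, "");
    }

    public static void main(String[] args) {
        String damaged = "caf\uFFFD au lait";
        System.out.println(stripReplacementChar(damaged)); // prints "caf au lait"
    }
}
```

Note that removing U+FFFD only hides the symptom: the character that was originally there is already lost, so fixing the decoding step is usually the better cure.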

You can use java.text.Normalizer to decompose accented characters, and then strip whatever is left outside the "normal" ASCII character set.
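
One possible way to combine the normalizer with a regex is sketched below. It assumes that dropping accents is acceptable and that anything without an ASCII base letter may simply disappear; the class and method names are illustrative only.

```java
import java.text.Normalizer;

public class AsciiFold {
    // Hypothetical helper: decomposes accented letters, then drops every non-ASCII char.
    static String toAscii(String s) {
        // NFD splits a letter such as e-acute into 'e' plus a combining accent mark
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // The combining marks (and any other non-ASCII characters) are removed here
        return decomposed.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // "Creme brulee" written with accents, as Unicode escapes to avoid source-encoding issues
        System.out.println(toAscii("Cr\u00E8me br\u00FBl\u00E9e")); // prints "Creme brulee"
    }
}
```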

Answer 2

You can use String.replaceAll("[my-list-of-strange-and-unwanted-chars]", "").

There is no Character.isStrangeAndUnWanted(); you have to define what you want.

If you want to remove control characters, you can do:

String str = "\u0000\u001f hi \n";
str = str.replaceAll("[\u0000-\u001f]", "");

This leaves " hi " (the spaces are kept; the newline is removed because it is a control character).

EDIT: If you want to know the Unicode value of any 16-bit character, you can do:

int num = string.charAt(n);
System.out.println(num);
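
To find out which characters are actually in a string, it can help to dump every code point in the usual U+XXXX notation. This is a small sketch with a made-up class and method name; it uses code points rather than chars so that characters outside the 16-bit range are reported as one value.

```java
public class CodePointDump {
    // Hypothetical helper: lists each code point of s as U+XXXX, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        // codePoints() merges surrogate pairs into a single int value
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(describe("A\uFFFD")); // prints "U+0041 U+FFFD"
    }
}
```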

Answer 3

To delete non-Latin symbols from the string I use the following code:

String s = "小米体验版 latin string 01234567890";
s = s.replaceAll("[^\\x00-\\x7F]", "");

The output string will be: " latin string 01234567890"

Answer 4

Justin Thomas's answer was close, but this is probably closer to what you're looking for:

String nonStrange = strangeString.replaceAll("\\p{Cntrl}", "");

The selector \p{Cntrl} (written as "\\p{Cntrl}" in a Java string literal) selects "A control character: [\x00-\x1F\x7F]".
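
Because \p{Cntrl} covers only the ASCII control characters by default, control characters above U+007F are left alone. The sketch below contrasts it with the Unicode general category \p{Cc}, which also covers the range U+0080 to U+009F; the class and method names are illustrative.

```java
public class ControlCharDemo {
    // POSIX class: matches only the ASCII controls [\x00-\x1F\x7F] by default
    static String stripAsciiControls(String s) {
        return s.replaceAll("\\p{Cntrl}", "");
    }

    // Unicode category Cc: additionally matches the controls U+0080..U+009F
    static String stripUnicodeControls(String s) {
        return s.replaceAll("\\p{Cc}", "");
    }

    public static void main(String[] args) {
        // U+0007 is an ASCII control (BEL), U+0085 is a non-ASCII control (NEL)
        String input = "a\u0007b\u0085c";
        System.out.println(stripAsciiControls(input).length());   // prints 4 (U+0085 remains)
        System.out.println(stripUnicodeControls(input).length()); // prints 3
    }
}
```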

Answer 5

Use String.replaceAll():

String clean = "♠clean".replaceAll("♠", "");

Answer 6

I did it the other way around: I remove every character that is not in an allowed set (the ^ negates the character class):

str.replaceAll("[^a-zA-Z0-9:;.?! ]","")

So for a string like "小米体验版 latin string 01234567890" we will get " latin string 01234567890" (the leading space survives because the space is in the allowed set).
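
If the allowed set should not be limited to Latin letters, the same whitelist idea can be written with Unicode categories. This is only a sketch of one possible choice of categories (letters, numbers, punctuation and separators); whether that matches "wanted" characters is an assumption, and the names are made up.

```java
public class WhitelistDemo {
    // Hypothetical helper: keeps letters (L), numbers (N), punctuation (P) and
    // separators such as the space (Z) from any script, and removes everything else.
    static String keepTextOnly(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}\\p{P}\\p{Z}]", "");
    }

    public static void main(String[] args) {
        // Two Chinese characters, " ok!", a BEL control char and a U+FFFD placeholder
        String input = "\u5C0F\u7C73 ok!\u0007\uFFFD";
        System.out.println(keepTextOnly(input).length()); // prints 6
    }
}
```

With this pattern Chinese text is kept, while control characters and the U+FFFD placeholder (a symbol, not a letter) are removed.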

Answer 7

The same thing happened to me when I was converting a Clob to a String using getAsciiStream.

I solved it by reading the character stream instead:

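import java.io.Reader;
import java.io.StringWriter;
import java.sql.Clob;

public String getstringfromclob(Clob cl) {
    StringWriter write = new StringWriter();
    // getCharacterStream() delivers decoded characters, unlike getAsciiStream()
    try (Reader read = cl.getCharacterStream()) {
        int c;
        // copy one character at a time until the end of the stream
        while ((c = read.read()) != -1) {
            write.write(c);
        }
    } catch (Exception ec) {
        ec.printStackTrace();
    }
    return write.toString();
}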

Answer 8

Keep only English letters, Chinese characters, digits and punctuation (spaces are removed as well):

str = str.replaceAll("[^!-~\\u2000-\\uFE1F\\uFF00-\\uFFEF]", "");

Answer 9

Put the characters that you want to get rid of in a list, then iterate through the list and replace each one:

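import java.util.Arrays;
import java.util.List;

String str = "Some text with unicode !@#$";
List<String> badChars = Arrays.asList("@", "~", "!"); // modify this to contain the unwanted characters

String resultStr = str;
for (String s : badChars) {
    // replace() treats s literally, so regex metacharacters need no escaping
    resultStr = resultStr.replace(s, "");
}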

You will end up with the cleaned string in resultStr. I haven't tested this, but it should be something along these lines.

Answer 10

Most probably the text that you got was encoded in something other than UTF-8. What you could do is to not allow text with other encodings (for example Latin-1) to be uploaded:

try {

  CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
  charsetDecoder.onMalformedInput(CodingErrorAction.REPORT);

  return IOUtils.toString(new InputStreamReader(new FileInputStream(filePath), charsetDecoder));
}
catch (MalformedInputException e) {
  // throw an exception saying the file was not saved with UTF-8 encoding.
}
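
The same strict-decoding idea can be tried on a byte array without touching the file system. The following is a self-contained sketch using only java.nio classes; the class and method names are hypothetical, and it assumes the whole content fits in memory.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Hypothetical helper: true if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // With REPORT, bad input raises an exception instead of producing U+FFFD
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // prints true
        System.out.println(isValidUtf8(latin1)); // prints false
    }
}
```

Decoding leniently (the default for new String(bytes, charset)) is what typically turns such bytes into U+FFFD placeholders, so rejecting the input early avoids having to clean up afterwards.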

Answer 11

You can't, because strings are immutable.

It is possible, though, to make a new string that has the unwanted characters removed. Look up String#replaceAll().

11 answers

Answer 1: score 19, accepted, 2011-03-28 17:31:14
Answer 2: score 18, 2011-03-28 17:29:41
Answer 3: score 7, 2014-04-18 07:44:42
Answer 4: score 4, 2011-06-14 19:13:32
Answer 5: score 2, 2011-03-28 17:31:02
Answer 6: score 1, 2020-09-21 19:39:17
Answer 7: score 0, 2014-04-08 11:31:37
Answer 8: score 0, 2017-09-13 10:15:12
Answer 9: score 0, 2011-03-28 17:42:47
Answer 10: score 0, 2020-05-25 08:11:48
Answer 11: score -3, 2011-03-28 17:30:20
