
Java - removing strange characters from a String

How do I remove strange and unwanted Unicode characters (such as a black diamond with a question mark) from a String?

Updated:

Please tell me the Unicode character string or regex that corresponds to "a black diamond with a question mark in it".

The black diamond with a question mark is the Unicode replacement character, U+FFFD (�). It is a placeholder: a decoder substitutes it for bytes it cannot interpret in the charset it was told to use, and some renderers draw a similar symbol for a character that the current font has no glyph for. If it is actually stored in your String, it can be matched as "\uFFFD". Its appearance varies depending on the font you're using.
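As a minimal sketch of how U+FFFD typically ends up in a String and how to drop it again: the class and method names below (ReplacementCharDemo, decodeLenient, strip) are made up for illustration, and the byte array stands in for Latin-1 input that is wrongly decoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {
    // U+FFFD REPLACEMENT CHARACTER, the "black diamond with a question mark"
    static final char REPLACEMENT = '\uFFFD';

    // new String(bytes, charset) does not throw on malformed input;
    // it substitutes U+FFFD for the bytes it cannot decode.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static boolean containsReplacement(String s) {
        return s.indexOf(REPLACEMENT) >= 0;
    }

    // replace(char, char) cannot delete a character, so use the CharSequence overload.
    static String strip(String s) {
        return s.replace(String.valueOf(REPLACEMENT), "");
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute, encoded as ISO-8859-1 (hypothetical input)
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};
        String decoded = decodeLenient(latin1);
        System.out.println(containsReplacement(decoded)); // true
        System.out.println(strip(decoded));               // caf
    }
}
```

Note that stripping U+FFFD only hides the symptom: the original character is already lost, so fixing the charset used for decoding is usually the better cure.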

You can use java.text.Normalizer to decompose accented characters, so that everything outside the "normal" ASCII character set can then be stripped with a regex.
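A small sketch of the Normalizer approach, assuming the goal is plain ASCII output. The class name AsciiFold and the method toAscii are hypothetical; Normalizer.normalize and the \p{ASCII} class are standard JDK features.

```java
import java.text.Normalizer;

final class AsciiFold {
    static String toAscii(String s) {
        // NFD splits e.g. U+00E9 into 'e' followed by U+0301 (combining acute accent)
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Drop everything outside US-ASCII, which includes the combining marks
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("caf\u00e9 na\u00efve")); // cafe naive
    }
}
```

Characters that have no decomposition (for example ß or ø) are simply removed by this approach, as are all non-Latin scripts.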

You can use a String.replaceAll("[my-list-of-strange-and-unwanted-chars]","")

There is no Character.isStrangeAndUnWanted(), you have to define what you want.
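Since the definition of "strange" is up to you, one option is to write it as a predicate over code points. This is only a sketch: CharFilter, removeIf and isStrange are invented names, and the rule shown (ISO control characters plus U+FFFD) is just one possible definition.

```java
import java.util.function.IntPredicate;

final class CharFilter {
    // Rebuilds the string from every code point the predicate does not reject.
    static String removeIf(String s, IntPredicate unwanted) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> !unwanted.test(cp))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // One possible definition of "strange": control characters and U+FFFD.
    static boolean isStrange(int cp) {
        return Character.isISOControl(cp) || cp == 0xFFFD;
    }

    public static void main(String[] args) {
        System.out.println(removeIf("a\u0000b\uFFFDc", CharFilter::isStrange)); // abc
    }
}
```

Working on code points rather than chars keeps surrogate pairs intact, which a char-by-char loop would not guarantee.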

If you want to remove control characters you can do

String str = "\u0000\u001f hi \n";
str = str.replaceAll("[\u0000-\u001f]", "");

This leaves " hi " (the surrounding spaces are kept; the newline is removed along with the other control characters).

EDIT: If you want to know the Unicode value of any 16-bit character you can do

int num = string.charAt(n);
System.out.println(num);
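To inspect a whole string at once, including characters outside the 16-bit range, something like the following sketch may help. CodePointDump and describe are hypothetical names; String.codePoints() requires Java 8 or later.

```java
import java.util.stream.Collectors;

final class CodePointDump {
    // Formats every code point of s in the usual U+XXXX notation.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(describe("A\uFFFD")); // U+0041 U+FFFD
    }
}
```

Running this on a problematic string shows exactly which code points need to go into your replaceAll pattern.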

To delete non-Latin symbols from the string I use the following code:

String s = "小米体验版 latin string 01234567890";
s = s.replaceAll("[^\\x00-\\x7F]", "");

The output string will be: " latin string 01234567890"

Justin Thomas's answer was close, but this is probably closer to what you're looking for:

String nonStrange = strangeString.replaceAll("\\p{Cntrl}", ""); 

The selector \\p{Cntrl} selects "A control character: [\\x00-\\x1F\\x7F]".
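One caveat worth illustrating: by default \p{Cntrl} is the ASCII-only POSIX class, so C1 control characters (U+0080 to U+009F) survive it, while the Unicode category \p{Cc} covers both ranges. The class and method names in this sketch are made up.

```java
final class ControlClasses {
    // POSIX class: matches only [\x00-\x1F\x7F]
    static String stripAsciiControls(String s) {
        return s.replaceAll("\\p{Cntrl}", "");
    }

    // Unicode general category Cc: also matches U+0080..U+009F
    static String stripAllControls(String s) {
        return s.replaceAll("\\p{Cc}", "");
    }

    public static void main(String[] args) {
        String s = "a\u0080b\u0007";
        System.out.println(stripAsciiControls(s).length()); // 3
        System.out.println(stripAllControls(s));            // ab
    }
}
```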

Use String.replaceAll()

String clean = "♠clean".replaceAll("♠", "");

I did it the other way around: I remove every character that is not in an allowed set (the ^ negates the character class):

str.replaceAll("[^a-zA-Z0-9:;.?! ]","")

So for input like "小米体验版 latin string 01234567890" we get " latin string 01234567890" (the leading space stays, because space is in the allowed set).

The same thing happened to me when I was converting a Clob to a String using getAsciiStream.

I solved it by reading from getCharacterStream instead:

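import java.io.Reader;
import java.io.StringWriter;
import java.sql.Clob;

public String getstringfromclob(Clob cl)
{
    StringWriter write = new StringWriter();
    // getCharacterStream() returns characters, unlike getAsciiStream(), which returns ASCII bytes
    try (Reader read = cl.getCharacterStream())
    {
        int c;
        while ((c = read.read()) != -1)
        {
            write.write(c);
        }
    }
    catch (Exception ec)
    {
        ec.printStackTrace();
    }
    return write.toString();
}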

Keep only English, Chinese, digits, and punctuation:

str = str.replaceAll("[^!-~\\u2000-\\uFE1F\\uFF00-\\uFFEF]", "");
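An alternative sketch that uses named classes instead of numeric ranges. It is not equivalent to the ranges above: it keeps Han characters, ASCII letters and digits, punctuation and the plain space, and drops everything else. KeepHanAscii and keep are hypothetical names; \p{IsHan} and \p{P} are standard java.util.regex classes.

```java
final class KeepHanAscii {
    // \p{IsHan}  : characters in the Han script
    // \p{Alnum}  : ASCII letters and digits
    // \p{Punct}  : ASCII punctuation and symbols
    // \p{P}      : any Unicode punctuation, e.g. the fullwidth comma U+FF0C
    private static final String UNWANTED = "[^\\p{IsHan}\\p{Alnum}\\p{Punct}\\p{P} ]";

    static String keep(String s) {
        return s.replaceAll(UNWANTED, "");
    }

    public static void main(String[] args) {
        // Removes the BEL control character and the spade symbol, keeps the rest
        System.out.println(keep("\u5c0f\u7c73\uff0c ok!\u0007\u2660"));
    }
}
```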

Put the characters that you want to get rid of in a list, then iterate through the list with a replaceAll method:

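import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

String str = "Some text with unicode !@#$";
List<String> badChar = Arrays.asList("@", "~", "!"); // modify this to contain the unwanted characters

String resultStr = str;
for (String s : badChar) {
    // Pattern.quote makes replaceAll treat s literally rather than as a regex
    resultStr = resultStr.replaceAll(Pattern.quote(s), "");
}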

You will end up with the cleaned string in resultStr. I haven't tested this, but it should be along these lines.

Most probably the text that you got was encoded in something other than UTF-8. What you could do is reject text in other encodings (for example Latin-1) when it is uploaded:

try {

  CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
  charsetDecoder.onMalformedInput(CodingErrorAction.REPORT);

  return IOUtils.toString(new InputStreamReader(new FileInputStream(filePath), charsetDecoder));
}
catch (MalformedInputException e) {
  // throw an exception saying the file was not saved with UTF-8 encoding.
}
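The same strict-decoding idea can be sketched with the JDK alone, for the case where the bytes are already in memory. StrictUtf8, decode and isValidUtf8 are hypothetical names; the CharsetDecoder calls are standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Throws instead of silently inserting U+FFFD for malformed bytes.
    static String decode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    static boolean isValidUtf8(byte[] bytes) {
        try {
            decode(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9}; // Latin-1 bytes, not valid UTF-8
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```

A fresh decoder is created per call because CharsetDecoder instances are stateful and not safe to share between threads.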

You can't, because strings are immutable.

It is possible, though, to make a new string that has the unwanted characters removed. Look up String#replaceAll().


 