
Remove non-ASCII non-printable characters from a String

I get user input that includes non-ASCII and non-printable characters, such as:

\xc2d
\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0

For example:

email : abc@gmail.com\xa0\xa0
street : 123 Main St.\xc2\xa0

Desired output:

  email : abc@gmail.com
  street : 123 Main St.

What is the best way to remove them using Java?
I tried the following, but it doesn't seem to work:

public static void main(String[] args) {
    // Note: "\\x" in Java source is a literal backslash followed by 'x',
    // so these strings contain plain ASCII text, not the characters U+00E9 or U+00A0.
    String s = "abc@gmail\\xe9.com";
    String email = "abc@gmail.com\\xa0\\xa0";

    System.out.println(s.replaceAll("\\P{Print}", ""));
    System.out.println(email.replaceAll("\\P{Print}", ""));
}

Output:

abc@gmail\xe9.com
abc@gmail.com\xa0\xa0

Your requirements are not clear. All characters in a Java String are Unicode characters, so if you removed them all, you would be left with an empty string. I assume you mean that you want to remove any non-ASCII, non-printable characters.

String clean = str.replaceAll("\\P{Print}", "");

Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash in the Java source is needed because \ starts an escape sequence in string literals.)
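As a quick check of the expression above, here is a minimal sketch that applies it to strings holding real non-ASCII characters (written with Java Unicode escapes, not the literal backslash text from the question). The class and method names are made up for illustration.

```java
final class PrintableFilterDemo {

    // Removes every character that is not printable ASCII (0x20 through 0x7E).
    static String stripNonPrintable(String input) {
        return input.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // These literals contain a real no-break space (U+00A0) and a real e-acute (U+00E9).
        System.out.println(stripNonPrintable("abc@gmail.com\u00a0\u00a0")); // abc@gmail.com
        System.out.println(stripNonPrintable("abc@gmail\u00e9.com"));       // abc@gmail.com
    }
}
```

A string that merely contains a backslash followed by xa0 passes through unchanged, because every one of those characters is itself printable ASCII.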


Apparently, all the input characters are actually ASCII characters that form a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.

This all sounds a little fishy to me. What I believe is happening is that the data really does contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing them with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.

That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.

String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
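For illustration, a small sketch that runs this pattern against two of the sample values from the question, with the escapes present as literal text. The class and method names are hypothetical.

```java
final class LiteralEscapeDemo {

    // Removes literal four-character sequences: backslash, 'x', two hex digits.
    static String stripHexEscapes(String input) {
        return input.replaceAll("\\\\x\\p{XDigit}{2}", "");
    }

    public static void main(String[] args) {
        // "\\x" in Java source is a backslash followed by 'x' at run time.
        System.out.println(stripHexEscapes("abc@gmail.com\\xa0\\xa0")); // abc@gmail.com
        System.out.println(stripHexEscapes("123 Main St.\\xc2\\xa0"));  // 123 Main St.
    }
}
```

This pattern only matches the \xHH form, so a literal \ufffd such as the one in the question's sample list would be left in place.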

The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more detail on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.

With Google Guava's CharMatcher, you can remove any non-printable characters and then retain only the ASCII characters (dropping any accented characters) like this:

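// Recent Guava releases expose these matchers as static factory methods;
// older releases used the constants CharMatcher.INVISIBLE and CharMatcher.ASCII.
String printable = CharMatcher.invisible().removeFrom(input);
String clean = CharMatcher.ascii().retainFrom(printable);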

I'm not sure if that's what you really want, but it removes anything expressed as an escape sequence in your question's sample data.

I know this may be late, but for future reference:

String clean = str.replaceAll("\\P{Print}", "");

This removes all non-printable characters, but that includes \n (line feed), \t (tab), and \r (carriage return), and sometimes you want to keep those characters.

For that problem, use inverted logic:

String clean = str.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
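To make the difference concrete, here is a minimal sketch that runs the inverted pattern on a multi-line value containing a no-break space and a bell character. The input string and the names are invented for this example.

```java
final class KeepWhitespaceDemo {

    // Removes everything except printable ASCII, line feed, carriage return, and tab.
    static String stripButKeepWhitespace(String input) {
        return input.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
    }

    public static void main(String[] args) {
        // Contains a no-break space (U+00A0) and a bell character (U+0007).
        String input = "name:\tabc\u00a0\nstreet:\t123 Main St.\u0007\n";
        System.out.print(stripButKeepWhitespace(input));
    }
}
```

With the plain \P{Print} pattern, the same input would collapse onto a single line because the tabs and line feeds would be removed as well.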

You can try this code:

public String cleanInvalidCharacters(String in) {
    if (in == null || in.isEmpty()) {
        return "";
    }
    StringBuilder out = new StringBuilder(in.length());
    // Iterate by code point: a single char can never exceed 0xFFFF, so the
    // supplementary range check below only works on whole code points.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i);
        i += Character.charCount(current);
        if (current == 0x9
                || current == 0xA
                || current == 0xD
                || (current >= 0x20 && current <= 0xD7FF)
                || (current >= 0xE000 && current <= 0xFFFD)
                || (current >= 0x10000 && current <= 0x10FFFF)) {
            out.appendCodePoint(current);
        }
    }
    // Replace each remaining whitespace character (tab, CR, LF) with a plain space.
    return out.toString().replaceAll("\\s", " ");
}

This works for me to remove invalid characters from a String.

You can use java.text.Normalizer.
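The answer does not show how, so the following is only one possible reading of the java.text.Normalizer suggestion: decompose the text with NFKD and then drop whatever is still outside ASCII. The class name, method name, and choice of NFKD are assumptions made for this sketch.

```java
import java.text.Normalizer;

final class NormalizerDemo {

    // Decomposes characters (NFKD), then drops whatever is still outside ASCII.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\P{ASCII}", "");
    }

    public static void main(String[] args) {
        // The e-acute decomposes to 'e' plus a combining accent; only the accent is dropped.
        System.out.println(toAscii("abc@gmail\u00e9.com"));
        // The no-break space decomposes to an ordinary space, shown here inside brackets.
        System.out.println("[" + toAscii("123 Main St.\u00a0") + "]");
    }
}
```

Unlike the \P{Print} approach, this keeps the base letter of an accented character and turns a no-break space into a regular space, so a trim() may still be needed afterwards.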

Input => "This \u7279 text \u7279 is what I need"
Output => "This text is what I need"

If you are trying to remove literal Unicode escape sequences like the ones above from a string, this code will work:

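// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern unicodeCharsPattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher unicodeMatcher = unicodeCharsPattern.matcher(data);
// replaceAll returns the input unchanged when nothing matches, so no find() check is needed.
String cleanData = unicodeMatcher.replaceAll("");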
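A self-contained sketch of the same idea, wrapped in a hypothetical helper class so it can be run directly. It assumes the escapes are present as literal text (a backslash, the letter u, and four hex digits), not as already-decoded characters.

```java
import java.util.regex.Pattern;

final class UnicodeEscapeDemo {

    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u\\p{XDigit}{4}");

    static String stripUnicodeEscapes(String data) {
        return UNICODE_ESCAPE.matcher(data).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripUnicodeEscapes("This \\u7279 text \\u7279 is what I need"));
    }
}
```

Only the escape sequences themselves are removed, so the spaces on either side of each one remain and may need collapsing in a separate step.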
