
How can I replace non-printable Unicode characters in Java?

The following will replace ASCII control characters (\p{Cntrl} is shorthand for [\x00-\x1F\x7F]):

my_string.replaceAll("\\p{Cntrl}", "?");
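To make the ASCII-only scope concrete, here is a minimal sketch. The class and method names (CntrlDemo, replaceAsciiControls) are hypothetical, and it assumes the default Pattern behavior, without the UNICODE_CHARACTER_CLASS flag:

```java
class CntrlDemo {
    // Replaces ASCII control characters (U+0000-U+001F and U+007F) with '?'.
    static String replaceAsciiControls(String s) {
        return s.replaceAll("\\p{Cntrl}", "?");
    }

    public static void main(String[] args) {
        // U+0007 (BEL) is an ASCII control character, so it is replaced.
        System.out.println(replaceAsciiControls("a\u0007b"));
        // U+0085 (NEXT LINE) is a control character outside ASCII, so it is left alone.
        System.out.println(replaceAsciiControls("a\u0085b").equals("a\u0085b"));
    }
}
```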

The following will replace every character that is not printable ASCII (\p{Print} is shorthand for [\p{Graph}\x20]), so accented characters are replaced too:

my_string.replaceAll("[^\\p{Print}]", "?");
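A short sketch of what that means in practice, again with hypothetical names (PrintDemo, replaceNonPrintableAscii) and the default Pattern flags assumed:

```java
class PrintDemo {
    // Replaces everything outside printable ASCII (space through tilde) with '?'.
    static String replaceNonPrintableAscii(String s) {
        return s.replaceAll("[^\\p{Print}]", "?");
    }

    public static void main(String[] args) {
        // The accented letter U+00E9 and the tab are both replaced,
        // because neither is printable ASCII.
        System.out.println(replaceNonPrintableAscii("caf\u00E9\t!"));
    }
}
```

The accented letter is lost even though it is perfectly printable, which is why this pattern is a poor fit for Unicode text.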

However, neither works for Unicode strings. Does anyone have a good way to remove non-printable characters from a Unicode string?

my_string.replaceAll("\\p{C}", "?");

See more about Unicode regex; java.util.regex.Pattern and String.replaceAll both support it.
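If the replacement runs often, the pattern can be compiled once and reused. This is only a sketch; the names OtherCategoryDemo, OTHER and replaceOther are made up for illustration:

```java
import java.util.regex.Pattern;

class OtherCategoryDemo {
    // \p{C} is the Unicode general category "Other": control, format,
    // surrogate, private-use and unassigned code points.
    private static final Pattern OTHER = Pattern.compile("\\p{C}");

    static String replaceOther(String s) {
        return OTHER.matcher(s).replaceAll("?");
    }

    public static void main(String[] args) {
        // U+200B (ZERO WIDTH SPACE) is a format character and U+0000 is a
        // control character; both belong to \p{C}.
        System.out.println(replaceOther("a\u200Bb\u0000c"));
    }
}
```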

Op De Cirkel is mostly right. His suggestion will work in most cases:

 myString.replaceAll("\\p{C}", "?");

But if myString might contain non-BMP code points, it is more complicated. \p{C} includes the surrogate code points of \p{Cs}. The replacement method above will corrupt non-BMP code points by sometimes replacing only half of a surrogate pair. It's possible this is a Java bug rather than intended behavior.

Using the other constituent categories is an option:

myString.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");
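As an illustration of how this category list treats a non-BMP character, here is a small sketch. CategoryListDemo and replaceListed are hypothetical names, and it assumes a JDK whose Unicode tables include U+1F600 (Java 8 or later):

```java
class CategoryListDemo {
    // Replaces control, format, private-use and unassigned code points,
    // but deliberately leaves the surrogate category (Cs) out.
    static String replaceListed(String s) {
        return s.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");
    }

    public static void main(String[] args) {
        // U+1F600 is written as the surrogate pair D83D DE00; it should
        // survive intact, while the trailing U+200B is replaced.
        String out = replaceListed("a\uD83D\uDE00b\u200B");
        System.out.println(out.equals("a\uD83D\uDE00b?"));
    }
}
```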

However, solitary surrogate characters that are not part of a pair (each surrogate character has an assigned code point) will not be removed. A non-regex approach is the only way I know to properly handle \p{C}:

StringBuilder newString = new StringBuilder(myString.length());
for (int offset = 0; offset < myString.length();)
{
    int codePoint = myString.codePointAt(offset);
    offset += Character.charCount(codePoint);

    // Replace invisible control characters and unused code points
    switch (Character.getType(codePoint))
    {
        case Character.CONTROL:     // \p{Cc}
        case Character.FORMAT:      // \p{Cf}
        case Character.PRIVATE_USE: // \p{Co}
        case Character.SURROGATE:   // \p{Cs}
        case Character.UNASSIGNED:  // \p{Cn}
            newString.append('?');
            break;
        default:
            newString.append(Character.toChars(codePoint));
            break;
    }
}
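The same idea can be packaged as a reusable method. The following is a sketch rather than the answer's own code: CodePointCleaner, isOther and replaceOther are hypothetical names, and it relies on String.codePoints() from Java 8 and later:

```java
final class CodePointCleaner {
    // True for the five general categories that make up \p{C}.
    static boolean isOther(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.PRIVATE_USE:
            case Character.SURROGATE:
            case Character.UNASSIGNED:
                return true;
            default:
                return false;
        }
    }

    // Replaces each "Other" code point with '?', one whole code point at a time,
    // so a valid surrogate pair is never split.
    static String replaceOther(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> out.appendCodePoint(isOther(cp) ? '?' : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceOther("a\u0007b"));   // control character
        System.out.println(replaceOther("a\uD83Db"));   // unpaired surrogate
    }
}
```

A valid pair such as D83D DE00 is read as the single code point U+1F600 and kept, while an unpaired surrogate is reported with type SURROGATE and replaced.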

You may be interested in the Unicode categories "Other, Control" and possibly "Other, Format" (unfortunately the latter seems to contain both unprintable and printable characters).

In Java regular expressions you can check for them using \p{Cc} and \p{Cf} respectively.
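A brief sketch of the difference between the two categories; the class and method names are hypothetical. U+200B (ZERO WIDTH SPACE) is an invisible format character, while U+0600 (ARABIC NUMBER SIGN) is one example of a format character that does render:

```java
class FormatCategoryDemo {
    // Removes only "Other, Control" characters.
    static String stripControls(String s) {
        return s.replaceAll("\\p{Cc}", "");
    }

    // Removes "Other, Control" and "Other, Format" characters.
    static String stripControlsAndFormat(String s) {
        return s.replaceAll("[\\p{Cc}\\p{Cf}]", "");
    }

    public static void main(String[] args) {
        String sample = "a\u0007b\u200Bc";
        System.out.println(stripControls(sample).length());          // zero width space kept
        System.out.println(stripControlsAndFormat(sample).length()); // both removed
        // Both code points below are reported as FORMAT by Character.getType.
        System.out.println(Character.getType(0x200B) == Character.FORMAT);
        System.out.println(Character.getType(0x0600) == Character.FORMAT);
    }
}
```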

The methods below may serve your goal:

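// Removes every character outside the ASCII range U+0000-U+007F.
public static String removeNonAscii(String str)
{
    return str.replaceAll("[^\\x00-\\x7F]", "");
}

// Removes every character in the Unicode category "Other" (\p{C}):
// control, format, surrogate, private-use and unassigned.
public static String removeNonPrintable(String str)
{
    return str.replaceAll("[\\p{C}]", "");
}

// Like removeNonPrintable, but leaves surrogates (\p{Cs}) alone.
public static String removeSomeControlChar(String str)
{
    return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}

// \p{C} already covers \r, \n and \t, so the second pass is only a safeguard.
public static String removeFullControlChar(String str)
{
    return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
}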

I have used this simple function for this:

import java.util.regex.Pattern;

// Matches anything outside the printable ASCII range (space through tilde)
private static final Pattern pattern = Pattern.compile("[^ -~]");

private static String cleanTheText(String text) {
    // Remove every match in one pass, not just occurrences of the first one found
    return pattern.matcher(text).replaceAll("");
}

Hope this is useful.

Based on the answers by Op De Cirkel and noackjr, the following is what I do for general string cleaning: 1. trim leading and trailing whitespace, 2. dos2unix, 3. mac2unix, 4. remove all "invisible Unicode characters" except whitespace:

myString.trim().replaceAll("\r\n", "\n").replaceAll("\r", "\n").replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "")

Tested with the Scala REPL.
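For Java callers, the four steps could be wrapped roughly as follows. StringCleaner and clean are hypothetical names; the regular expressions are the ones listed above, and \s is assumed to have its default ASCII-only meaning:

```java
class StringCleaner {
    static String clean(String s) {
        return s.trim()                      // 1. trim leading and trailing whitespace
                .replaceAll("\r\n", "\n")    // 2. dos2unix
                .replaceAll("\r", "\n")      // 3. mac2unix
                // 4. drop control, format, private-use and unassigned
                //    code points unless they are whitespace
                .replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "");
    }

    public static void main(String[] args) {
        String cleaned = clean("  a\r\nb\rc\u200Bd\t e  ");
        // Line endings are normalised, the zero width space is gone,
        // and the tab survives because it counts as whitespace.
        System.out.println(cleaned.equals("a\nb\ncd\t e"));
    }
}
```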

I suggest handling characters outside the BMP as below. As written, the method substitutes "?" for each supplementary code point; append nothing in that branch to remove them instead:

private String removeNonBMPCharacters(final String input) {
    StringBuilder strBuilder = new StringBuilder();
    input.codePoints().forEach((i) -> {
        if (Character.isSupplementaryCodePoint(i)) {
            strBuilder.append("?");
        } else {
            strBuilder.append(Character.toChars(i));
        }
    });
    return strBuilder.toString();
}
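A variant that drops supplementary code points entirely can be sketched with a stream filter. BmpFilter and removeSupplementary are hypothetical names, not part of the answer:

```java
class BmpFilter {
    // Keeps only code points inside the Basic Multilingual Plane.
    static String removeSupplementary(String input) {
        return input.codePoints()
                .filter(cp -> !Character.isSupplementaryCodePoint(cp))
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // The surrogate pair D83D DE00 encodes U+1F600 and is removed as a unit.
        System.out.println(removeSupplementary("a\uD83D\uDE00b"));
    }
}
```

Note that this filters by plane, not by printability: many supplementary characters are printable, and BMP control characters pass through untouched.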

A version with multi-language support:

public static String cleanUnprintableChars(String text, boolean multilanguage)
{
    // keeps Latin-1 (U+0000-U+00FF) when multilanguage is true, otherwise ASCII only
    String regex = multilanguage ? "[^\\x00-\\xFF]" : "[^\\x00-\\x7F]";
    text = text.replaceAll(regex, "");

    // erases the ASCII control characters except CR, LF and TAB
    text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

    // removes the remaining Unicode "Other" characters, again sparing CR, LF and TAB
    text = text.replaceAll("[\\p{C}&&[^\r\n\t]]", "");

    return text.trim();
}

I reworked the code from "Extract digits from a string in Java" to handle phone numbers such as +9 (987) 124124:

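import java.nio.CharBuffer;

// Keeps characters in the ASCII range '(' (40) through '9' (57): the digits
// plus ( ) * + , - . / so that phone punctuation such as '+' survives.
public static String stripNonDigitsV2(CharSequence input) {
    if (input == null)
        return null;
    if (input.length() == 0)
        return "";

    char[] result = new char[input.length()];
    int cursor = 0;
    CharBuffer buffer = CharBuffer.wrap(input);
    int i = 0;
    while (i < buffer.length()) {
        char chr = buffer.get(i);
        if (chr == 'u') {
            // Skip the 'u' and the next four characters, as in a literal
            // backslash-u escape left in the text
            i = i + 5;
            if (i >= buffer.length())
                break; // nothing left after the skipped characters
            chr = buffer.get(i);
        }

        if (chr > 39 && chr < 58)
            result[cursor++] = chr;
        i = i + 1;
    }

    return new String(result, 0, cursor);
}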
