How to remove high-ASCII characters such as ®, ©, ™ from a String in Java

Question

I want to detect and remove high-ASCII characters like ®, ©, ™ from a String in Java. Is there any open-source library that can do this?

Answer 1

If you need to remove all non-US-ASCII characters (i.e. those outside 0x00-0x7F), you can do something like this:

s = s.replaceAll("[^\\x00-\\x7f]", "");

If you need to filter many strings, it would be better to use a precompiled pattern:

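import java.util.regex.Pattern;
// Compile once and reuse; a Pattern is immutable and safe to share.
private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7f]");
...
// Matcher.replaceAll needs the replacement string; "" deletes every match.
s = NON_ASCII.matcher(s).replaceAll("");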

And if it's really performance-critical, perhaps Alex Nikolaenkov's suggestion would be better.
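As a rough illustration of the precompiled-pattern approach, here is a minimal self-contained sketch. The class name `NonAsciiStripper`, the method name `strip`, and the sample input are assumptions made up for this example; only the regular expression comes from the answer above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class NonAsciiStripper {
    // Matches any character outside the US-ASCII range 0x00-0x7F.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    // Returns a copy of s with every non-ASCII character removed.
    static String strip(String s) {
        return NON_ASCII.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // \u00ae is the registered sign, \u2122 the trade mark sign,
        // \u00a9 the copyright sign.
        String input = "Acme\u00ae Widgets\u2122, \u00a92011";
        System.out.println(strip(input)); // prints "Acme Widgets, 2011"
    }
}
```

Note that ASCII control characters such as tab and newline fall inside 0x00-0x7F, so this pattern leaves them in place.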

Answer 2

I think you can easily filter the string by hand by checking the code of each character. If a character fits your requirements, append it to a StringBuilder, and call toString() on the builder at the end.

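public static String filter(String str) {
    // Pre-size the builder; the result can never be longer than the input.
    StringBuilder filtered = new StringBuilder(str.length());
    for (int i = 0; i < str.length(); i++) {
        char current = str.charAt(i);
        // Keep printable ASCII only (space through '~'); control characters are dropped too.
        if (current >= 0x20 && current <= 0x7e) {
            filtered.append(current);
        }
    }

    return filtered.toString();
}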
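The range check in this loop (0x20 to 0x7e) is stricter than the regular expression in Answer 1: it also discards tabs, newlines, and other control characters. The sketch below makes the "fits your requirements" test a parameter so the two behaviours can be compared side by side. The class name `CharFilterDemo` and the two-argument `filter` overload are hypothetical and only serve this illustration.

```java
import java.util.function.IntPredicate;

// Hypothetical demo class; the loop mirrors the hand-written filter above.
class CharFilterDemo {
    // Keeps only the characters accepted by the given predicate.
    static String filter(String str, IntPredicate keep) {
        StringBuilder filtered = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char current = str.charAt(i);
            if (keep.test(current)) {
                filtered.append(current);
            }
        }
        return filtered.toString();
    }

    public static void main(String[] args) {
        // Contains a tab, an accented letter, a registered sign and a newline.
        String input = "Line 1\tcaf\u00e9\u00ae\nLine 2";

        // Printable ASCII only: the tab and the newline are removed as well.
        IntPredicate printableAscii = c -> c >= 0x20 && c <= 0x7e;
        // Full US-ASCII range: the tab and the newline survive.
        IntPredicate anyAscii = c -> c <= 0x7f;

        System.out.println(filter(input, printableAscii));
        System.out.println(filter(input, anyAscii));
    }
}
```

Which predicate is appropriate depends on whether line breaks and tabs in the input carry meaning.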

Answer 3

A nice way to do this is to use Google Guava's CharMatcher:

// Guava 19 and later; older releases exposed this matcher as the CharMatcher.ASCII constant.
String newString = CharMatcher.ascii().retainFrom(string);

newString will contain only the ASCII characters (code point < 128) from the original string.

This reads more naturally than a regular expression, which can take more effort for later readers of your code to understand.

Answer 4

I understand that you need to delete characters such as ç, ã, Ã, but for anybody who needs to convert ç, ã, Ã to c, a, A, have a look at this piece of code:

Example code:

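import java.text.Normalizer;

final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
System.out.println(
    Normalizer
        // NFD splits each accented letter into a base letter plus combining marks.
        .normalize(input, Normalizer.Form.NFD)
        // Dropping non-ASCII then removes the marks and leaves the base letters.
        .replaceAll("[^\\p{ASCII}]", "")
);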

Output:

This is a funky String
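For the symbols in the original question, the choice of normalization form matters. The registered and copyright signs have no Unicode decomposition, so they are simply removed, while the trade mark sign has a compatibility decomposition to the letters "TM" that only the NFKD form applies. The following sketch compares NFD and NFKD; the class name `NormalizeDemo`, the method name `toAscii`, and the sample input are assumptions for this example.

```java
import java.text.Normalizer;

// Hypothetical demo class comparing two normalization forms.
class NormalizeDemo {
    // Decomposes the input with the given form, then drops every non-ASCII character.
    static String toAscii(String input, Normalizer.Form form) {
        return Normalizer.normalize(input, form).replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // "Caf" + e-acute + registered sign + trade mark sign
        String input = "Caf\u00e9\u00ae\u2122";

        // Canonical decomposition: the accent is folded away, both symbols are dropped.
        System.out.println(toAscii(input, Normalizer.Form.NFD));  // prints "Cafe"
        // Compatibility decomposition: the trade mark sign becomes the letters "TM" first.
        System.out.println(toAscii(input, Normalizer.Form.NFKD)); // prints "CafeTM"
    }
}
```

If the goal is to delete the symbols outright, as in the question, NFD (or no normalization at all) is the safer choice; NFKD is only useful when a plain-text stand-in such as "TM" is wanted.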

4 answers

Answer 1: 31, accepted, 2011-02-15 19:19:24
Answer 2: 16, 2011-02-15 19:20:05
Answer 3: 5, 2011-02-15 19:24:00
Answer 4: 2, 2016-01-16 13:23:40
