
Efficiently removing specific characters (some punctuation) from Strings in Java?

In Java, what is the most efficient way of removing given characters from a String? Currently, I have this code:

private static String processWord(String x) {
    String tmp;

    tmp = x.toLowerCase();
    tmp = tmp.replace(",", "");
    tmp = tmp.replace(".", "");
    tmp = tmp.replace(";", "");
    tmp = tmp.replace("!", "");
    tmp = tmp.replace("?", "");
    tmp = tmp.replace("(", "");
    tmp = tmp.replace(")", "");
    tmp = tmp.replace("{", "");
    tmp = tmp.replace("}", "");
    tmp = tmp.replace("[", "");
    tmp = tmp.replace("]", "");
    tmp = tmp.replace("<", "");
    tmp = tmp.replace(">", "");
    tmp = tmp.replace("%", "");

    return tmp;
}

Would it be faster if I used some sort of StringBuilder, or a regex, or maybe something else? Yes, I know: profile it and see, but I hope someone can provide an answer off the top of their head, as this is a common task.

Although \p{Punct} matches a broader range of characters than the question asks for, it does allow a shorter replacement expression:

tmp = tmp.replaceAll("\\p{Punct}+", "");
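
To make "broader" concrete, here is a small sketch (the class name and sample sentence are made up for illustration). In Java, \p{Punct} is the POSIX punctuation class, so it also strips characters such as apostrophes and hyphens that the question's code leaves alone:

```java
final class PunctDemo {
    // Removes every run of characters in Java's POSIX \p{Punct} class.
    static String stripPunct(String s) {
        return s.replaceAll("\\p{Punct}+", "");
    }

    public static void main(String[] args) {
        // The apostrophe and the hyphen disappear along with ( ) and !
        System.out.println(stripPunct("It's a well-known (fact)!")); // prints "Its a wellknown fact"
    }
}
```

If apostrophes or hyphens inside words matter to you, an explicit character list is the safer choice.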

Here's a late answer, just for fun.

In cases like this, I would suggest aiming for readability over speed. Of course you can be super-readable but too slow, as in this super-concise version:

private static String processWord(String x) {
    return x.replaceAll("[\\[\\](){},.;!?<>%]", "");
}

This is slow because every time you call this method, the regex is compiled again. So you can pre-compile the regex.

private static final Pattern UNDESIRABLES = Pattern.compile("[\\[\\](){},.;!?<>%]");

private static String processWord(String x) {
    return UNDESIRABLES.matcher(x).replaceAll("");
}

This should be fast enough for most purposes, assuming the JVM's regex engine optimizes the character class lookup. This is the solution I would use, personally.
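
For reference, a self-contained sketch of the precompiled version. The class name WordCleaner is made up, and the toLowerCase() call is restored from the question's code because the snippets above omit it. The square brackets are escaped because, in Java, an unescaped [ inside a character class opens a nested class:

```java
import java.util.regex.Pattern;

final class WordCleaner {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern UNDESIRABLES =
            Pattern.compile("[\\[\\](){},.;!?<>%]");

    // Lower-cases the word (as the question's code does) and strips the listed characters.
    static String processWord(String x) {
        return UNDESIRABLES.matcher(x.toLowerCase()).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(processWord("Hello, [World]!")); // prints "hello world"
    }
}
```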

Now without profiling, I wouldn't know whether you could do better by making your own character (actually codepoint) lookup table:

private static final boolean[] CHARS_TO_KEEP = new boolean[Character.MAX_CODE_POINT + 1]; // size is an assumption: one entry per code point

Fill this once and then iterate, making your resulting string. I'll leave the code to you. :)
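
One possible way to fill in that idea, as a rough sketch rather than the answer's own code: since every character in the question is ASCII, a 128-entry table marking the characters to remove is enough, and anything outside the table is kept. The names LookupTableCleaner and REMOVE are hypothetical, and the table is inverted relative to CHARS_TO_KEEP to keep it small:

```java
final class LookupTableCleaner {
    // Hypothetical table: REMOVE[c] is true for the ASCII characters to strip.
    private static final boolean[] REMOVE = new boolean[128];

    static {
        // Filled once, when the class is loaded.
        for (char c : ",.;!?(){}[]<>%".toCharArray()) {
            REMOVE[c] = true;
        }
    }

    static String processWord(String x) {
        StringBuilder sb = new StringBuilder(x.length());
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            // Characters outside the table (code 128 and above) are always kept.
            if (c >= REMOVE.length || !REMOVE[c]) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(processWord("f,i.l;t!e?r(e)d {s}t[r]i<n>g%")); // prints "filtered string"
    }
}
```

Each character costs a single array lookup regardless of how many characters are on the removal list. Whether that beats the precompiled regex in practice is something only a profiler can tell you.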

Again, I wouldn't dive into this kind of optimization. The code has become too hard to read. Is performance that much of a concern? Also remember that modern languages are JITted and after warming up they will perform better, so use a good profiler.

One thing that should be mentioned is that the example in the original question is highly non-performant because you are creating a whole bunch of temporary strings! Unless a compiler optimizes all that away, that particular solution will perform the worst.

You could do something like this:

static String RemovePunct(String input) 
{
    char[] output = new char[input.length()];
    int i = 0;

    for (char ch : input.toCharArray())
    {
        if (Character.isLetterOrDigit(ch) || Character.isWhitespace(ch)) 
        {
            output[i++] = ch;
        }        
    }

    return new String(output, 0, i);
}

// ...

String s = RemovePunct("This is (a) test string.");

This will likely perform better than using regular expressions, if you find them too slow for your needs.

However, it could get messy fast if you have a long, distinct list of special characters you'd like to remove. In this case regular expressions are easier to handle.

http://ideone.com/mS8Irl
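
If you want the same single-pass loop but with an explicit removal list instead of Character.isLetterOrDigit, one option is to keep the unwanted characters in a constant string and test membership with String.indexOf. This is only a sketch; ExplicitListCleaner, UNWANTED and removeListed are made-up names:

```java
final class ExplicitListCleaner {
    // The exact characters from the question; extend this string to remove more.
    private static final String UNWANTED = ",.;!?(){}[]<>%";

    static String removeListed(String input) {
        char[] output = new char[input.length()];
        int n = 0;
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            // indexOf returns -1 when ch is not in the unwanted list.
            if (UNWANTED.indexOf(ch) < 0) {
                output[n++] = ch;
            }
        }
        return new String(output, 0, n);
    }

    public static void main(String[] args) {
        System.out.println(removeListed("This is (a) test string.")); // prints "This is a test string"
    }
}
```

Unlike the isLetterOrDigit check, this keeps characters such as _ and - that are not on the list, which matches the behavior of the question's code more closely.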

Strings are immutable, so it's not a good idea to modify them repeatedly: every change creates a new String. Try using a StringBuilder instead and build the result with its methods. And if you can express what you are trying to do as a regex, that will usually be the simpler solution.

Use String#replaceAll(String regex, String replacement) like this:

tmp = tmp.replaceAll("[,.;!?(){}\\[\\]<>%]", "");

System.out.println(
   "f,i.l;t!e?r(e)d {s}t[r]i<n>g%".replaceAll(
                   "[,.;!?(){}\\[\\]<>%]", "")); // prints "filtered string"

Right now your code iterates over all characters of tmp once for each character you want to remove, so it performs roughly
(number of characters in tmp) x (number of characters you want to remove) comparisons.

To optimize your code you could use short circuit OR || and do something like

StringBuilder sb = new StringBuilder();
for (char c : tmp.toCharArray()) {
    if (!(c == ',' || c == '.' || c == ';' || c == '!' || c == '?'
            || c == '(' || c == ')' || c == '{' || c == '}' || c == '['
            || c == ']' || c == '<' || c == '>' || c == '%'))
        sb.append(c);
}
tmp = sb.toString();

or like this

StringBuilder sb = new StringBuilder();
char[] badChars = ",.;!?(){}[]<>%".toCharArray();

outer: 
for (char strChar : tmp.toCharArray()) {
    for (char badChar : badChars) {
        if (badChar == strChar)
            continue outer;// we skip `strChar` since it is bad character
    }
    sb.append(strChar);
}
tmp = sb.toString();

This way you still iterate over every character of tmp, but the number of comparisons for a character drops when a match is found early: % is tested last, so it needs all the comparisons, while , is matched on the very first one.


If I am not mistaken, a regex character class ( [...] ) works in a similar way, so you could also try it this way:

Pattern p = Pattern.compile("[,.;!?(){}\\[\\]<>%]"); //store it somewhere so 
                                         //you wont need to compile it again
tmp = p.matcher(tmp).replaceAll("");

You can do this:

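tmp = tmp.replaceAll("\\W", "");

This removes punctuation, but note that \W matches every non-word character, so whitespace is removed as well, while underscores are kept. The result must be assigned back because Strings are immutable.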
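
A quick sketch of what \W actually strips (class name and sample input are made up). By default, Java's \W means "anything other than a-z, A-Z, 0-9 and _", so spaces and non-ASCII letters go too:

```java
final class NonWordDemo {
    // Removes every character that is not an ASCII letter, digit or underscore.
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        // The comma, the spaces and the exclamation mark are removed; the underscore stays.
        System.out.println(stripNonWord("Hello, big_world 42!")); // prints "Hellobig_world42"
    }
}
```

If words need to stay separated, or the text contains accented letters, an explicit character class is a better fit than \W.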


 