
Regex to trim special characters from the given string

I have extracted data from a source, and it is now a set of tokens. These tokens contain junk or special characters at the end, and sometimes at the beginning. For example, I have the following set:

  • manufactured traffic
  • (devices
  • traffic-calming)
  • traffic-
  • synthetic,
  • artificial turf.)

After cleaning, the data should look like this, respectively:

  • manufactured traffic
  • devices
  • traffic-calming
  • traffic
  • synthetic
  • artificial turf

To clean this set of strings, I implemented the method below, which works properly (see it on regex101.com):

public Filter filterSpecialCharacters() {
    String regex = "^([^a-z0-9A-Z]*)([a-z0-9A-Z])(.*)([a-z0-9A-Z])([^a-z0-9A-Z]*)$";
    set = set
        .stream()
        .map(str -> str.replaceAll(regex, "$2$3$4"))
        .collect(Collectors.toSet());
    return this;
}

However, I am still not satisfied with the regex I am using, because I have a large data set. I would like to know whether there is a better option.

I would use \\p{Punct} to remove all of this punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

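// Lazy quantifier (*?) so that a trailing '-' is left for \p{Punct}* to strip, e.g. in "traffic-"
String regex = "^\\p{Punct}*([a-z0-9A-Z -]*?)\\p{Punct}*$";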
set = set.stream()
        .map(str -> str.replaceAll(regex, "$1"))
        .collect(Collectors.toSet());

=>[synthetic, devices, traffic-calming, manufactured traffic , artificial turf]

See the "Summary of regular-expression constructs" in the java.util.regex.Pattern documentation.
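
Note that the pattern above only matches when the middle of the token consists solely of letters, digits, spaces and hyphens; a token such as "(it's)" would come back unchanged. A minimal sketch of a variant that strips \p{Punct} only at the two edges and leaves the middle alone is shown below. The class and method names (PunctTrim, trim) are made up for illustration.

```java
import java.util.regex.Pattern;

class PunctTrim {
    // Either a run of punctuation at the start, or a run of punctuation at the end.
    // \p{Punct} covers ASCII punctuation only, and whitespace at the edges is NOT removed.
    private static final Pattern EDGE_PUNCT = Pattern.compile("^\\p{Punct}+|\\p{Punct}+$");

    static String trim(String s) {
        return EDGE_PUNCT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String[] tokens = {"manufactured traffic", "(devices", "traffic-calming)",
                           "traffic-", "synthetic,", "artificial turf.)", "(it's)"};
        for (String token : tokens) {
            System.out.println(token + " -> " + trim(token));
        }
    }
}
```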


Or, as @Ted Hopp mentioned in a comment, you can use two map steps: the first removes special characters from the beginning, and the second removes them from the end:

set = set.stream()
        .map(str -> str.replaceFirst("^[^a-z0-9A-Z]*", ""))
        .map(str -> str.replaceFirst("[^a-z0-9A-Z]*$", ""))
        .collect(Collectors.toSet());
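
Since the data set is large, it may also help to compile the two patterns once, because String.replaceFirst compiles its regex on every call. The following is only a sketch under that assumption; PrecompiledTrim, trim and trimAll are hypothetical names, and I have not measured the difference.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class PrecompiledTrim {
    // Compiled once and reused for every token.
    private static final Pattern LEADING = Pattern.compile("^[^a-z0-9A-Z]+");
    private static final Pattern TRAILING = Pattern.compile("[^a-z0-9A-Z]+$");

    // Strip non-alphanumeric characters from the start, then from the end.
    static String trim(String s) {
        String withoutLeading = LEADING.matcher(s).replaceFirst("");
        return TRAILING.matcher(withoutLeading).replaceFirst("");
    }

    static Set<String> trimAll(Set<String> tokens) {
        return tokens.stream()
                .map(PrecompiledTrim::trim)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add("manufactured traffic");
        tokens.add("(devices");
        tokens.add("traffic-calming)");
        tokens.add("traffic-");
        tokens.add("synthetic,");
        tokens.add("artificial turf.)");
        System.out.println(trimAll(tokens));
    }
}
```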

You can do it with a single regex whose middle group is lazy (non-greedy), and it works the same way every time.

Globally find: (?m)^[^a-z0-9A-Z\\r\\n]*(.*?)[^a-z0-9A-Z\\r\\n]*$
Replace with: $1

https://regex101.com/r/tGFbLm/1

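 (?m)                          # Multi-line mode
 ^                             # BOL
 [^a-z0-9A-Z\r\n]*             # Leading non-alphanumerics (line breaks excluded)
 ( .*? )                       # (1), Lazy content to write back
 [^a-z0-9A-Z\r\n]*             # Trailing non-alphanumerics (line breaks excluded)
 $                             # EOL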
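
In Java, the same expression could be applied to a whole block of text, one token per line, roughly as in this sketch. MultilineTrim and trimLines are hypothetical names, and the input is assumed to use \n as the line separator.

```java
import java.util.regex.Pattern;

class MultilineTrim {
    // Same expression as above, written as a Java string literal.
    private static final Pattern TRIM =
            Pattern.compile("(?m)^[^a-z0-9A-Z\\r\\n]*(.*?)[^a-z0-9A-Z\\r\\n]*$");

    // Replace every line with its lazily captured middle part (group 1).
    static String trimLines(String text) {
        return TRIM.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "manufactured traffic", "(devices", "traffic-calming)",
                "traffic-", "synthetic,", "artificial turf.)");
        System.out.println(trimLines(text));
    }
}
```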

Don't use regex for this kind of simple trim. Parse the string and trim it yourself. The code is longer, but it should be faster than regex.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public static List<String> filterSpecialCharacters(List<String> input) {
    Iterator<String> it = input.iterator();
    List<String> output = new ArrayList<String>();
    // For all strings in the List
    while (it.hasNext()) {
        String s = it.next();
        int endIndex = s.length() - 1;
        // Get the last index of alpha numeric char
        for (int i = endIndex; i >= 0; i--) {
            if (isAlphaNumeric(s.charAt(i))) {
                endIndex = i;
                break;
            }
        }
        StringBuilder out = new StringBuilder();
        boolean startCopying = false;
        // Parse the string till the last index of alpha numeric char
        for (int i = 0; i <= endIndex; i++) {
            // Skip leading non-alphanumeric chars
            if (!startCopying && !isAlphaNumeric(s.charAt(i))) {
                continue;
            }
            // Start copying to the output buffer from the first alphanumeric char onward
            else {
                startCopying = true;
                out.append(s.charAt(i));
            }
        }
        // Add the trimmed string to the output list.
        output.add(out.toString());
    }

    return output;
}

// Update this method to include any characters that you don't want to trim
private static boolean isAlphaNumeric(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}

Please test this code to see whether it meets your requirements. I found it to be almost 10 times faster than the regex trims used in the other answers. Also, if performance is important to you, I recommend iterating over the Set with an Iterator instead of using the stream/map/collect functions.
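
A more compact variant of the same idea is to find the first and last alphanumeric positions and take a substring, instead of copying characters into a StringBuilder. This is only a sketch: IndexTrim, trim and trimAll are hypothetical names, and I have not benchmarked it against the version above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IndexTrim {
    private static boolean isAlphaNumeric(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    static String trim(String s) {
        int start = 0;
        int end = s.length(); // exclusive
        // Move start forward past leading non-alphanumeric chars
        while (start < end && !isAlphaNumeric(s.charAt(start))) {
            start++;
        }
        // Move end backward past trailing non-alphanumeric chars
        while (end > start && !isAlphaNumeric(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    // Accepts any Iterable, so a Set or a List can be passed directly.
    static List<String> trimAll(Iterable<String> input) {
        List<String> output = new ArrayList<>();
        for (String s : input) {
            output.add(trim(s));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("manufactured traffic", "(devices",
                "traffic-calming)", "traffic-", "synthetic,", "artificial turf.)");
        System.out.println(trimAll(tokens));
    }
}
```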


 