
Regex to trim special characters from the given string

I have extracted data from a source, and it is now a set of tokens. These tokens contain junk or special characters at the end and sometimes at the beginning. For example, I have the following set:

  • manufactured traffic
  • (devices
  • traffic-calming)
  • traffic-
  • synthetic,
  • artificial turf.)

After cleaning, the data should look like this, respectively:

  • manufactured traffic
  • devices
  • traffic-calming
  • traffic
  • synthetic
  • artificial turf

To clean this set of strings, I implemented the method below, which works correctly (see regex101.com):

public Filter filterSpecialCharacters() {
    String regex = "^([^a-z0-9A-Z]*)([a-z0-9A-Z])(.*)([a-z0-9A-Z])([^a-z0-9A-Z]*)$";
    set = set
        .stream()
        .map(str -> str.replaceAll(regex, "$2$3$4"))
        .collect(Collectors.toSet());
    return this;
}

But I am still not satisfied with the regex I am using, because I have a large data set. I would like to know whether there is a better option.

I would use \\p{Punct} to remove all of this punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

String regex = "^\\p{Punct}*([a-z0-9A-Z -]*)\\p{Punct}*$";
set = set.stream()
        .map(str -> str.replaceAll(regex, "$1"))
        .collect(Collectors.toSet());

=> [synthetic, devices, traffic-calming, manufactured traffic, artificial turf]

Take a look at the Summary of regular-expression constructs in the java.util.regex.Pattern documentation.
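One thing to watch with the pattern above: its capture group allows `-`, so a token such as `traffic-` keeps its trailing hyphen. A minimal sketch of one way around that, assuming it is acceptable to strip only ASCII punctuation (what `\p{Punct}` matches by default) from the two ends, is to match the edges instead of the middle. The class and method names here (`PunctTrim`, `trim`, `trimAll`) are made up for illustration:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PunctTrim {
    // Hypothetical helper pattern: a run of ASCII punctuation at the very start
    // OR at the very end of the token. Compiled once and reused.
    private static final Pattern EDGE_PUNCT =
            Pattern.compile("^\\p{Punct}+|\\p{Punct}+$");

    // Removes leading and trailing punctuation; inner punctuation is kept.
    static String trim(String token) {
        return EDGE_PUNCT.matcher(token).replaceAll("");
    }

    // Applies trim to every token, keeping the original iteration order.
    static Set<String> trimAll(Set<String> tokens) {
        return tokens.stream()
                .map(PunctTrim::trim)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    public static void main(String[] args) {
        Set<String> tokens = new LinkedHashSet<>(Arrays.asList(
                "manufactured traffic", "(devices", "traffic-calming)",
                "traffic-", "synthetic,", "artificial turf.)"));
        // prints [manufactured traffic, devices, traffic-calming, traffic, synthetic, artificial turf]
        System.out.println(trimAll(tokens));
    }
}
```

Whitespace is not punctuation, so leading or trailing spaces would survive this version; add `String.trim()` if that matters for your data.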


Or, as @Ted Hopp mentioned in a comment, you can use two map steps: the first removes special characters from the beginning and the second removes them from the end:

set = set.stream()
        .map(str -> str.replaceFirst("^[^a-z0-9A-Z]*", ""))
        .map(str -> str.replaceFirst("[^a-z0-9A-Z]*$", ""))
        .collect(Collectors.toSet());
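Since the data set is large, it may be worth noting that `String.replaceFirst` compiles its regex on every call. A rough sketch of the same two-step trim with the patterns compiled once is below; `EdgeTrim` and its members are hypothetical names, and `+` is used instead of `*` only to avoid empty matches:

```java
import java.util.regex.Pattern;

final class EdgeTrim {
    // Compiled once, reused for every token (names are illustrative only).
    private static final Pattern LEADING = Pattern.compile("^[^a-z0-9A-Z]+");
    private static final Pattern TRAILING = Pattern.compile("[^a-z0-9A-Z]+$");

    // Strips non-alphanumeric characters from the start, then from the end.
    static String trim(String token) {
        String withoutLeading = LEADING.matcher(token).replaceFirst("");
        return TRAILING.matcher(withoutLeading).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(trim("artificial turf.)"));
        System.out.println(trim("(devices"));
    }
}
```

It can be dropped into the stream as `.map(EdgeTrim::trim)`. Keep in mind that the character class is ASCII-only, so accented letters at the edges would be stripped as well.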

You can do it with a single passive regex that works the same every time.

Globally find: (?m)^[^a-z0-9A-Z\\r\\n]*(.*?)[^a-z0-9A-Z\\r\\n]*$
Replace with: $1

https://regex101.com/r/tGFbLm/1

 (?m)                          # Multi-line mode
 ^                             # BOL
 [^a-z0-9A-Z\r\n]*             # Leading non-alphanumerics (not line breaks)
 ( .*? )                       # (1), Passive content to write back
 [^a-z0-9A-Z\r\n]*             # Trailing non-alphanumerics (not line breaks)
 $                             # EOL
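For reference, here is a small sketch of how that find/replace might look in Java when the tokens sit in one newline-separated string. The class and method names (`MultiLineTrim`, `trimLines`) are assumptions for the example, not part of the answer above:

```java
import java.util.regex.Pattern;

final class MultiLineTrim {
    // Same expression as above, written as a Java string literal.
    private static final Pattern EDGES =
            Pattern.compile("(?m)^[^a-z0-9A-Z\\r\\n]*(.*?)[^a-z0-9A-Z\\r\\n]*$");

    // Trims every line of the input, writing back only capture group 1.
    static String trimLines(String text) {
        return EDGES.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "(devices\ntraffic-\nartificial turf.)";
        System.out.println(trimLines(text));
    }
}
```

For a `Set<String>` of separate tokens the `(?m)` flag is not strictly needed, since each token is matched on its own.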

Don't use regex for this kind of simple trim. Parse the string and trim it. The code is longer, but it is surely faster than regex.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public static List<String> filterSpecialCharacters(List<String> input) {
    Iterator<String> it = input.iterator();
    List<String> output = new ArrayList<>();
    // For all strings in the List
    while (it.hasNext()) {
        String s = it.next();
        int endIndex = s.length() - 1;
        // Get the last index of alpha numeric char
        for (int i = endIndex; i >= 0; i--) {
            if (isAlphaNumeric(s.charAt(i))) {
                endIndex = i;
                break;
            }
        }
        StringBuilder out = new StringBuilder();
        boolean startCopying = false;
        // Parse the string till the last index of alpha numeric char
        for (int i = 0; i <= endIndex; i++) {
            // Skip leading non-alphanumeric characters
            if (!startCopying && !isAlphaNumeric(s.charAt(i))) {
                continue;
            }
            // Start copying to the output buffer from the first alphanumeric character onward
            else {
                startCopying = true;
                out.append(s.charAt(i));
            }
        }
        // Add the trimmed string to the output list.
        output.add(out.toString());
    }

    return output;
}

// Update this method with the characters that you don't want to trim
private static boolean isAlphaNumeric(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}

Please test this code to see whether it satisfies your requirements. In my tests it was almost 10 times faster than the regex trims used in the other answers. Also, if performance is important to you, I recommend iterating over the Set with an Iterator instead of using the stream/map/collect functions.
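The same idea can also be written with two indices and a single `substring` call, which avoids appending character by character. This is only a sketch under the same ASCII-only assumption as `isAlphaNumeric` above; `IndexTrim`, `trim`, and `trimAll` are illustrative names, and I have not benchmarked this variant:

```java
import java.util.HashSet;
import java.util.Set;

final class IndexTrim {
    // Returns the part of s between its first and last alphanumeric character,
    // or an empty string if s contains none.
    static String trim(String s) {
        int start = 0;
        int end = s.length(); // exclusive
        while (start < end && !isAlphaNumeric(s.charAt(start))) {
            start++;
        }
        while (end > start && !isAlphaNumeric(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    // Plain loop over the Set; trimmed strings go into a new Set.
    static Set<String> trimAll(Set<String> tokens) {
        Set<String> result = new HashSet<>();
        for (String token : tokens) {
            result.add(trim(token));
        }
        return result;
    }

    private static boolean isAlphaNumeric(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        System.out.println(trim("artificial turf.)"));
        System.out.println(trim("traffic-"));
    }
}
```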
