How to filter string for unwanted characters using regex?

Basically, I am wondering if there is a handy class or method to filter a String for unwanted characters. The output of the method should be the 'cleaned' String. For example:

String dirtyString = "This contains spaces which are not allowed";

String result = cleaner.getCleanedString(dirtyString);

Expecting result would be:

"Thiscontainsspaceswhicharenotallowed"

A better example:

String reallyDirty = " this*is#a*&very_dirty&String";

String result = cleaner.getCleanedString(reallyDirty);

I expect the result to be:

"thisisaverydirtyString"

This is because I let the cleaner know that ' ', '*', '#', '&' and '_' are dirty characters. I could solve it with a whitelist or blacklist array of chars, but I don't want to reinvent the wheel.

I was wondering if there is already something that can 'clean' strings using a regex, so that I don't have to write it myself.

Addition: If you think cleaning a String could be done differently or better, I'm all ears as well, of course.

Another addition: It is not only for spaces, but for any kind of character.

Edit, based on your update:

dirtyString.replaceAll("[^a-zA-Z0-9]","")
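If the dirty characters are known up front, as in the question, the pattern can also be built from that list instead of being written by hand. The following is only a minimal sketch: `RegexCleaner` is a hypothetical class (not part of the JDK or any library), and `getCleanedString` is simply the method name used in the question. Each character is passed through `Pattern.quote` so that regex metacharacters such as `*` are treated literally.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; the class name is an assumption.
final class RegexCleaner {
    private final Pattern dirty;

    // Every character in dirtyChars will be removed from cleaned strings.
    RegexCleaner(String dirtyChars) {
        if (dirtyChars.isEmpty()) {
            throw new IllegalArgumentException("dirtyChars must not be empty");
        }
        // Quote each character and join with '|', so '*', '&' etc. lose their regex meaning.
        String alternation = dirtyChars.codePoints()
                .mapToObj(cp -> Pattern.quote(new String(Character.toChars(cp))))
                .collect(Collectors.joining("|"));
        this.dirty = Pattern.compile(alternation);
    }

    // Returns the input with every dirty character removed.
    String getCleanedString(String input) {
        return dirty.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        RegexCleaner cleaner = new RegexCleaner(" *#&_");
        // prints thisisaverydirtyString
        System.out.println(cleaner.getCleanedString(" this*is#a*&very_dirty&String"));
    }
}
```

Compiling the pattern once in the constructor avoids recompiling it on every call, which `String.replaceAll` would otherwise do.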

If you're using Guava in your project (and if you're not, I believe you should consider it), the CharMatcher class handles this very nicely:

Your first example might be:

result = CharMatcher.whitespace().removeFrom(dirtyString); // the constant CharMatcher.WHITESPACE in Guava versions before 19

while your second might be:

result = CharMatcher.anyOf(" *#&_").removeFrom(dirtyString);
// or alternatively
result = CharMatcher.noneOf(" *#&_").retainFrom(dirtyString);

or if you want to be more flexible with whitespace (tabs etc), you can combine them rather than writing your own:

CharMatcher illegal = CharMatcher.whitespace().or(CharMatcher.anyOf("*#&_")); // CharMatcher.WHITESPACE in older Guava versions
result = illegal.removeFrom(dirtyString);

or you might instead specify legal characters, which depending on your requirements might be:

// Pick one of the following definitions; they are alternatives, not a sequence.
// (Older Guava versions expose these as the constants JAVA_LETTER and ASCII.)
CharMatcher legal = CharMatcher.javaLetter(); // based on Unicode char class
// or: only letters which are also ASCII, as in your examples
CharMatcher legal = CharMatcher.ascii().and(CharMatcher.javaLetter());
// or: lowercase only
CharMatcher legal = CharMatcher.inRange('a', 'z');
// or: either case
CharMatcher legal = CharMatcher.inRange('a', 'z').or(CharMatcher.inRange('A', 'Z'));

followed by retainFrom(dirtyString) as above.

Very nice, powerful API.

Use replaceAll

This will do it:

String dirtyString = "This contains spaces which are not allowed";
String result = dirtyString.replaceAll("\\s", "");

and works by replacing all whitespace with 'nothing'.

String resultString = subjectString.replaceAll("\\P{L}+", "");

This will replace every non-letter character with nothing.
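The difference between `\P{L}` and an ASCII-only class such as `[^a-zA-Z]` only shows up with non-ASCII letters. The comparison below is a small sketch to illustrate that point; the class and method names are hypothetical, and the sample text uses Unicode escapes (`\u00f6` for ö, `\u00df` for ß) so that it does not depend on the source file encoding.

```java
// Hypothetical demo class; the names are illustrative only.
final class LetterFilterDemo {

    // Keeps ASCII letters only; accented letters are removed as well.
    static String asciiLettersOnly(String s) {
        return s.replaceAll("[^a-zA-Z]", "");
    }

    // Keeps every character that Unicode classifies as a letter (category L).
    static String unicodeLettersOnly(String s) {
        return s.replaceAll("\\P{L}+", "");
    }

    public static void main(String[] args) {
        String input = "Gr\u00f6\u00dfe: 42 cm"; // "Größe: 42 cm"
        System.out.println(asciiLettersOnly(input));   // prints Grecm
        System.out.println(unicodeLettersOnly(input)); // prints Größecm
    }
}
```

Which of the two is appropriate depends on whether letters outside a-z and A-Z count as clean in your application.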

I also prefer the whitelisting approach. You never know what input will come along, and there seem to be more encodings out there than characters. This way you control exactly what gets through:

public String convert(String s) {
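  // StringUtils is presumably org.apache.commons.lang3.StringUtils (Apache Commons Lang).
  // The quote and backtick now sit inside the character class, so they are whitelisted too.
  s = StringUtils.removePattern(s, "[^A-Za-zäöüÄÖÜß?!$,. 0-9\\-\\+\\*\\?=&%\\$§\"\\!\\^#:;,_²³°\\[\\]\\{\\}<>\\|~'`]");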
  return s.trim();
}

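This covers the German umlauts and the punctuation found on a German keyboard; I think I picked them all. Accented letters from other languages, such as French ones, are not in the list above and would have to be added. Feel free to omit special characters like < and > to prevent code injection.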

Filter code points

Regex is not the only avenue to your goal. You can get the code point integer number for each character in your string, then filter out those not considered a letter in Unicode.

The String#codePoints method returns an IntStream, a stream of int primitive values, one per code point.

The Character class can tell us if the character assigned to each of those code point numbers in Unicode is considered a letter, as opposed to whitespace, digits, punctuation, and so on.

Those code points passing our test are converted back to a String by way of the StringBuilder class.

String input = " this*is#a*&very_dirty&String" ; 
String onlyLetters = 
        input 
        .codePoints()
        .filter(
            codePoint -> Character.isLetter( codePoint ) 
        )
        .collect(               
            StringBuilder :: new ,        
            StringBuilder :: appendCodePoint , 
            StringBuilder :: append    
        )        
        .toString() 
;

See this code run live at Ideone.com.

thisisaverydirtyString
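The same code-point technique can also work from an explicit list of dirty characters, as the question asks, rather than from Character.isLetter. This is a minimal sketch under that assumption; `CodePointCleaner` and `removeAll` are hypothetical names, not JDK APIs.

```java
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the JDK.
final class CodePointCleaner {

    // Removes from input every code point that occurs in dirtyChars.
    static String removeAll(String input, String dirtyChars) {
        // Collect the dirty code points into a set for fast lookup.
        Set<Integer> dirty = dirtyChars.codePoints()
                .boxed()
                .collect(Collectors.toSet());
        return input.codePoints()
                .filter(cp -> !dirty.contains(cp))
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // prints thisisaverydirtyString
        System.out.println(removeAll(" this*is#a*&very_dirty&String", " *#&_"));
    }
}
```

Because no regex is involved, none of the dirty characters needs escaping.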
