
Extracting both the matching and the non-matching parts of a string with a regex

I have a String such as abc3a de'f gHi?jk and I want to split it into the substrings abc3a , de'f , gHi , ? and jk . In other words, I want both the substrings made up of characters that match the regular expression [a-zA-Z0-9'] and the substrings made up of characters that do not match it. A way to tell whether each resulting substring is a match or not would be a plus.

Thanks!

You can use this regex:

"[a-zA-Z0-9']+|[^a-zA-Z0-9' ]+"

This gives:

["abc3a", "de'f", "gHi", "?", "jk"]

Online Demo: http://regex101.com/r/xS0qG4

Java code:

Pattern p = Pattern.compile("[a-zA-Z0-9']+|[^a-zA-Z0-9' ]+");
Matcher m = p.matcher("abc3a de'f gHi?jk");
while (m.find())
    System.out.println(m.group());

Output:

abc3a
de'f
gHi
?
jk
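
The question also asks how to tell whether each substring is a match or not. One possible way, sketched here as an assumption rather than as part of the answer above, is to wrap the two alternatives of the same regex in named groups and check which one took part in the match. The class name LabelledTokens, the method label and the group names match and miss are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelledTokens {
    // Same two alternatives as the regex above, each wrapped in a named group.
    // The names "match" and "miss" are made up for this illustration.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<match>[a-zA-Z0-9']+)|(?<miss>[^a-zA-Z0-9' ]+)");

    // Returns every token prefixed with the name of the alternative that matched it.
    static List<String> label(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // group("match") is null when the second alternative matched instead.
            String kind = m.group("match") != null ? "match" : "miss";
            out.add(kind + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : label("abc3a de'f gHi?jk")) {
            System.out.println(token);
        }
    }
}
```

Spaces are in neither alternative, so find() simply skips over them, just as in the loop above.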
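
Alternatively, capture the matching and the non-matching runs in two separate groups, which also tells you which kind each substring is:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HelloWorld {

    public static void main(String[] args) {
        // Group 1: run of allowed characters; group 2: run of everything else.
        // Either group may be empty, so each one is length-checked below.
        Pattern pattern = Pattern.compile("([a-zA-Z0-9']*)?([^a-zA-Z0-9']*)?");
        String str = "abc3a de'f gHi?jk";
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            if (matcher.group(1).length() > 0)
                System.out.println("Match:" + matcher.group(1));
            if (matcher.group(2).length() > 0)
                System.out.println("Miss: `" + matcher.group(2) + "`");
        }
    }
}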

Output:

Match:abc3a
Miss: ` `
Match:de'f
Miss: ` `
Match:gHi
Miss: `?`
Match:jk

If you don't want whitespace reported, exclude it from the second group:

Pattern pattern = Pattern.compile("([a-zA-Z0-9']*)?([^a-zA-Z0-9'\\s]*)?");

Output:

Match:abc3a
Match:de'f
Match:gHi
Miss: `?`
Match:jk

You can also do this with String.split:

myString.split("\\s+|(?<=[a-zA-Z0-9'])(?=[^a-zA-Z0-9'\\s])|(?<=[^a-zA-Z0-9'\\s])(?=[a-zA-Z0-9'])")

This splits at every boundary between a run of characters in that set and a run of characters outside it.

In the first boundary alternative, the lookbehind (?<=...) requires the preceding character to be in the set, while the lookahead (?=...) requires the next character to be a non-whitespace character outside the set. The second boundary alternative is the mirror image.

The \\s+ alternative is not a boundary match; it matches a run of whitespace characters. This has the effect of removing whitespace from the result entirely.

The | lets the split happen either at a boundary or at a run of whitespace.

Since the lookbehind and lookahead are both positive, the boundaries cannot match at the start or end of the string, so there is no need to filter empty strings out of the result unless the string starts with whitespace (String.split already drops trailing empty strings).
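
As a quick check of the behaviour described above, here is a small sketch that runs the same split expression on the sample string and on a string with leading whitespace. The class name BoundarySplit, the constant BOUNDARY and the helper tokens are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

public class BoundarySplit {
    // The split expression from above, unchanged, broken into its three alternatives.
    static final String BOUNDARY =
            "\\s+"
            + "|(?<=[a-zA-Z0-9'])(?=[^a-zA-Z0-9'\\s])"
            + "|(?<=[^a-zA-Z0-9'\\s])(?=[a-zA-Z0-9'])";

    static String[] tokens(String s) {
        return s.split(BOUNDARY);
    }

    public static void main(String[] args) {
        // The sample string from the question.
        System.out.println(Arrays.toString(tokens("abc3a de'f gHi?jk")));
        // Leading whitespace is consumed by \s+ and leaves an empty first element.
        System.out.println(Arrays.toString(tokens(" abc?")));
    }
}
```

If input may start with whitespace, calling trim() on it before splitting is one simple way to avoid the empty first element.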

You can also split on zero-width lookarounds:

    // Requires: import java.util.ArrayList;
    private static String[] splitString(final String s) {
        // Split before and after every character outside [a-zA-Z0-9'].
        final String[] arr = s.split("(?=[^a-zA-Z0-9'])|(?<=[^a-zA-Z0-9'])");
        final ArrayList<String> strings = new ArrayList<String>(arr.length);
        for (final String str : arr) {
            // Skip pieces that consist only of whitespace.
            if (!"".equals(str.trim())) {
                strings.add(str);
            }
        }
        return strings.toArray(new String[strings.size()]);
    }

(?=xxx) means that xxx follows this position, and (?<=xxx) means that xxx precedes it.

Since you do not want all-whitespace pieces in the result, you need to filter the array returned by split.
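
To make the need for the filter concrete, the following sketch shows the pieces before and after filtering. It uses the same split expression; the class name LookaroundSplit and the helpers rawSplit and withoutBlanks are hypothetical names for this illustration, not part of the answer's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LookaroundSplit {
    // Zero-width split points before and after every character outside the set.
    static final String AROUND_NON_MATCH = "(?=[^a-zA-Z0-9'])|(?<=[^a-zA-Z0-9'])";

    // The unfiltered result of split: whitespace pieces are still present.
    static String[] rawSplit(String s) {
        return s.split(AROUND_NON_MATCH);
    }

    // The same pieces with all-whitespace entries dropped.
    static List<String> withoutBlanks(String s) {
        List<String> kept = new ArrayList<>();
        for (String part : rawSplit(s)) {
            if (!part.trim().isEmpty()) {
                kept.add(part);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String input = "abc3a de'f gHi?jk";
        System.out.println(Arrays.toString(rawSplit(input)));
        System.out.println(withoutBlanks(input));
    }
}
```

One design consequence worth knowing: because this expression splits on both sides of every character outside the set, consecutive non-matching characters such as ?! come back as separate one-character pieces rather than as a single run.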
