Splitting a string with a special regular expression

I'm trying to tokenize a string input, but I can't get my head around how to do it. The idea is to split the string into alphabetical words and non-alphabetical symbols. For example, the string "Test, ( abc)" would be split into ["Test", ",", "(", "abc", ")"].

Right now I use this regular expression: "(?<=[a-zA-Z])(?=[^a-zA-Z])", but it doesn't do what I want.
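To make the problem concrete, here is a minimal sketch of what that pattern produces (the class and method names are only illustrative). It matches just the boundary where a letter is followed by a non-letter, so nothing is split in front of a word and the whitespace stays inside the pieces:

```java
class LookaroundSplitDemo {
    // Splits only at positions where a letter is immediately followed by a non-letter.
    static String[] splitAfterLetters(String input) {
        return input.split("(?<=[a-zA-Z])(?=[^a-zA-Z])");
    }

    public static void main(String[] args) {
        // Prints three pieces, one per line, in quotes:
        // "Test"
        // ", ( abc"
        // ")"
        for (String part : splitAfterLetters("Test, ( abc)")) {
            System.out.println("\"" + part + "\"");
        }
    }
}
```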

Any ideas what else I could use?

I see that you want to group the letters (like Test and abc) but not the non-alphabetical characters. I also see that you do not want the space character to show up in the result. For this I would use "(\\w+|\\W)" after removing all spaces from the string to match.

Sample code

String str = "Test, ( abc)";
str = str.replaceAll(" ",""); // in case you do not want space as separate char.
Pattern pattern = Pattern.compile("(\\w+|\\W)");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group());
}

Output

Test
,
(
abc
)

I hope this answers your question.

Try this:

String s = "I want to walk my dog, and why not?";
Pattern pattern = Pattern.compile("\\w+|[^\\w\\s]"); // [^\\w\\s] = one non-word, non-space character, so spaces are not printed as tokens
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group());
}

Outputs:

I
want
to
walk
my
dog
,
and
why
not
?

\\w matches word characters ([A-Za-z0-9_]), so \\w+ returns each word as one token, while each punctuation character is returned as a separate single-character token.

(Taken from: here )
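Since \\w also matches digits and the underscore, a letters-only variant may be closer to what the question asks for. The following is a minimal sketch under that assumption; the class name `LetterTokenizer`, the method `tokenize`, and the pattern itself are illustrative choices, not part of the answers here. It collects the tokens into a list and skips whitespace inside the pattern instead of stripping it beforehand:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterTokenizer {
    // A run of ASCII letters, or one character that is neither a letter nor whitespace.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z]+|[^a-zA-Z\\s]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Test, ( abc)")); // [Test, ,, (, abc, )]
    }
}
```

With this pattern, digits and underscores are treated like any other symbol, so "a1_b" becomes the four tokens a, 1, _ and b.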

Try this:

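// Needs java.util.ArrayList, java.util.regex.Matcher and java.util.regex.Pattern.
public static ArrayList<String> res(String a) {
    ArrayList<String> strs = new ArrayList<>();
    // A run of word characters, or a run of non-word characters.
    Pattern run = Pattern.compile("\\w+|\\W+");
    // Split on whitespace first, then scan each chunk from left to right.
    // Scanning keeps the original order; splitting a chunk such as "(abc)"
    // once on \W+ and once on \w+ would emit "abc" before "(" and ")".
    for (String token : a.split("\\s+")) {
        Matcher m = run.matcher(token);
        while (m.find()) {
            strs.add(m.group());
        }
    }
    return strs;
}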

I guess in the simplest form, split using

"(?<=[a-zA-Z])(?=[^\\sa-zA-Z])|(?<=[^\\sa-zA-Z])(?=[a-zA-Z])|\\s+"

Explained

    (?<= [a-zA-Z] )               # letter behind
    (?= [^\sa-zA-Z] )             # not a letter or whitespace ahead
 |                              # or,
    (?<= [^\sa-zA-Z] )            # not a letter or whitespace behind
    (?= [a-zA-Z] )                # letter ahead
 |                              # or,
    \s+                           # whitespace (discarded)
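As a quick check, the pattern can be passed straight to String.split. This is a minimal sketch; the class name `BoundarySplitDemo` and the constant `BOUNDARY` are made up for the example:

```java
import java.util.Arrays;

class BoundarySplitDemo {
    static final String BOUNDARY =
        "(?<=[a-zA-Z])(?=[^\\sa-zA-Z])"     // letter behind, symbol ahead
        + "|(?<=[^\\sa-zA-Z])(?=[a-zA-Z])"  // symbol behind, letter ahead
        + "|\\s+";                          // whitespace, consumed by the split

    static String[] tokenize(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Test, ( abc)"))); // [Test, ,, (, abc, )]
    }
}
```

Note that the pattern only splits between letters and symbols, so symbols that touch each other stay in one token: "a()b" yields a, () and b.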
