Tokenize a string containing special characters with a regex

I am trying to find the tokens in a string, which has words, numbers, and special chars. I tried the following code:

String Pattern = "(\\s)+";
String Example = "This `99 is my small \"yy\"  xx`example ";
String[] splitString = (Example.split(Pattern));
System.out.println(splitString.length);
for (String string : splitString) {
    System.out.println(string);
}

And got the following output (tokens shown here joined with ':'):

This:`99:is:my:small:"yy":xx`example:

But what I actually want is this, i.e. I want the special chars to be separate tokens as well:

This:`:99:is:my:small:":yy:":xx:`:example:

I tried to put the special chars inside the pattern, but now the special characters vanished completely:

String Pattern = "(\"|`|\\.|\\s+)";

Output:

This::99:is:my:small::yy::xx:example:

With what pattern will I get my desired output? Or should I try a different approach than using regex?

Instead of splitting, you may use a matching approach: match runs of letters (with or without combining marks), runs of digits, or any single char that is neither a word char nor whitespace. Note that the pattern below, as written, keeps _ inside word tokens rather than treating it as a special char.

Use

"(?U)(?>[^\\W\\d]\\p{M}*+)+|\\d+|[^\\w\\s]"

Details:

  • (?U) - the inline form of the Pattern.UNICODE_CHARACTER_CLASS flag, which makes \\w, \\d and \\s Unicode-aware
  • (?>[^\\W\\d]\\p{M}*+)+ - one or more letters or _, each optionally followed by combining marks
  • | - or
  • \\d+ - one or more digits
  • | - or
  • [^\\w\\s] - any single char that is neither a word char nor whitespace.
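
To make the effect of (?U) concrete, here is a minimal sketch that runs the same alternation with and without the flag on a string containing a non-ASCII letter. The class name UnicodeFlagDemo, the helper tokens and the sample input are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeFlagDemo {
    // Same alternation as in the answer, once with and once without (?U).
    static final Pattern WITH_U =
            Pattern.compile("(?U)(?>[^\\W\\d]\\p{M}*+)+|\\d+|[^\\w\\s]");
    static final Pattern WITHOUT_U =
            Pattern.compile("(?>[^\\W\\d]\\p{M}*+)+|\\d+|[^\\w\\s]");

    // Hypothetical helper: collect every match of p in s, in order.
    static List<String> tokens(Pattern p, String s) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "caf\u00E9 42!"; // "café 42!" with a precomposed é

        // With (?U), é counts as a word char, so the word stays whole:
        // [café, 42, !]
        System.out.println(tokens(WITH_U, s));

        // Without (?U), \w is ASCII-only, so é falls into the
        // "single special char" branch: [caf, é, 42, !]
        System.out.println(tokens(WITHOUT_U, s));
    }
}
```

If the input is known to be plain ASCII, the flag makes no difference to the result; it matters as soon as accented or non-Latin letters appear.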

Java demo:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This `99 is my small \"yy\"  xx`example_and_more ";
Pattern ptrn = Pattern.compile("(?U)(?>[^\\W\\d]\\p{M}*+)+|\\d+|[^\\w\\s]");
List<String> res = new ArrayList<>();
Matcher matcher = ptrn.matcher(str);
// Collect each match instead of splitting, so the special chars are kept as tokens
while (matcher.find()) {
    res.add(matcher.group());
}
System.out.println(res);
// => [This, `, 99, is, my, small, ", yy, ", xx, `, example_and_more]
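
If _ should come out as a separate token instead, one possible variant (an assumption on top of the answer, not the pattern shown above) is to exclude _ from the word class and add it as its own alternative. The sketch below also joins the tokens with ':' to mirror the output format used in the question; the class name UnderscoreTokenizer and the method tokenize are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnderscoreTokenizer {
    // Variant pattern (assumption): "_" is removed from the letter class
    // and matched separately as a single-character token.
    static final Pattern TOKEN =
            Pattern.compile("(?U)(?>[^\\W\\d_]\\p{M}*+)+|\\d+|[^\\w\\s]|_");

    // Collect every token in order of appearance; whitespace is skipped
    // because no alternative matches it.
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "This `99 is my small \"yy\"  xx`example_and_more ";
        System.out.println(String.join(":", tokenize(str)));
        // => This:`:99:is:my:small:":yy:":xx:`:example:_:and:_:more
    }
}
```

Unlike the split-based output in the question, String.join adds no trailing ':' after the last token.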
