I'm trying to tokenize a string input, but I can't get my head around how to do it. The idea is to split the string into alphabetical words and non-alphabetical symbols. For example, the string "Test, ( abc)" would be split into ["Test", ",", "(", "abc", ")"].
Right now I use this regular expression: "(?<=[a-zA-Z])(?=[^a-zA-Z])"
but it doesn't do what I want.
Any ideas what else I could use?
I see that you want to group the letters (like Test and abc) but not the non-alphabetical characters. I also see that you do not want space characters in the result. For this I would use "(\\w+|\\W)" after removing all spaces from the string to match. Note that \w also matches digits and underscores, not only letters.
Sample code
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "Test, ( abc)";
str = str.replaceAll(" ", ""); // drop spaces so they are not emitted as tokens
Pattern pattern = Pattern.compile("(\\w+|\\W)"); // a run of word characters, or one non-word character
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group());
}
Output
Test
,
(
abc
)
I hope this answers your question.
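As a variation on the approach above, here is a minimal sketch that restricts words to ASCII letters and skips whitespace inside the pattern itself, so no replaceAll step is needed. The class name LetterTokenizer, the method tokenize, and the pattern "[a-zA-Z]+|[^a-zA-Z\\s]" are illustrative choices, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterTokenizer {
    // Either a run of ASCII letters, or one character that is
    // neither an ASCII letter nor whitespace (hypothetical pattern).
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z]+|[^a-zA-Z\\s]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group()); // tokens are collected in input order
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Test, ( abc)")); // prints [Test, ,, (, abc, )]
    }
}
```

With this pattern digits and underscores count as symbols and come out one character at a time; if they should be part of words, \w can be used in place of [a-zA-Z].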
Try this:
String s = "I want to walk my dog, and why not?";
Pattern pattern = Pattern.compile("(\\w+|\\W)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group());
}
Outputs:
I
want
to
walk
my
dog
,
and
why
not
?
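\w matches word characters ([A-Za-z0-9_]) and \W matches any single character that is not one, so words and punctuation come out as separate tokens. Since \W also matches whitespace, each space is matched as a token too; those whitespace-only lines are not shown in the output above.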
(Taken from: here )
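To make the whitespace behaviour of this pattern visible, the following sketch collects the matches into a list and then drops the whitespace-only ones. The class and method names (WhitespaceTokenDemo, rawTokens, withoutWhitespace) are made up for illustration; only the pattern comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceTokenDemo {
    // The pattern from the answer: a run of word characters, or one non-word character.
    private static final Pattern ANSWER_PATTERN = Pattern.compile("(\\w+|\\W)");

    // Every match, including the single-space tokens matched by \W.
    static List<String> rawTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = ANSWER_PATTERN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    // Keeps only tokens that contain something other than whitespace.
    static List<String> withoutWhitespace(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!token.trim().isEmpty()) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> raw = rawTokens("I want to walk my dog, and why not?");
        System.out.println(raw.size());                    // prints 19 (8 of them are spaces)
        System.out.println(withoutWhitespace(raw).size()); // prints 11
    }
}
```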
Try this:
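import java.util.ArrayList;

public static ArrayList<String> res(String a) {
    String[] tokens = a.split("\\s+"); // split on whitespace first
    ArrayList<String> strs = new ArrayList<>();
    for (String token : tokens) {
        String[] alpha = token.split("\\W+");    // word-character runs in this chunk
        String[] nonAlpha = token.split("\\w+"); // non-word runs in this chunk
        // Note: all words of a chunk are added before its symbols, so a chunk
        // such as "(abc)" yields abc, (, ) rather than (, abc, ).
        for (String str : alpha) {
            if (!str.isEmpty()) strs.add(str);
        }
        for (String str : nonAlpha) {
            if (!str.isEmpty()) strs.add(str);
        }
    }
    return strs;
}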
I guess in the simplest form, split using
"(?<=[a-zA-Z])(?=[^\\sa-zA-Z])|(?<=[^\\sa-zA-Z])(?=[a-zA-Z])|\\s+"
Explained
(?<= [a-zA-Z] )        # letter behind
(?= [^\sa-zA-Z] )      # not letter/whitespace ahead
|                      # or,
(?<= [^\sa-zA-Z] )     # not letter/whitespace behind
(?= [a-zA-Z] )         # letter ahead
|                      # or,
\s+                    # whitespace (discarded)
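A small sketch of how this split expression might be used from Java. The class name SplitTokenizer and the method tokenize are hypothetical, and the trim() call is an added assumption so that leading whitespace does not produce an empty first element.

```java
import java.util.Arrays;
import java.util.List;

class SplitTokenizer {
    // The split expression from the answer, written as a Java string literal.
    private static final String BOUNDARY =
        "(?<=[a-zA-Z])(?=[^\\sa-zA-Z])|(?<=[^\\sa-zA-Z])(?=[a-zA-Z])|\\s+";

    static List<String> tokenize(String input) {
        // trim() is an addition here: without it, leading whitespace would
        // yield an empty string as the first element of the result.
        return Arrays.asList(input.trim().split(BOUNDARY));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Test, ( abc)")); // prints [Test, ,, (, abc, )]
    }
}
```

One thing to be aware of: the expression only splits between letters and symbols, so adjacent symbols stay together. For example, "a((b" is split into a, ((, b.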