How can I split the text below using the split criteria FIRST, NOW, THEN:
String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
The expected result is three sentences, one starting at each keyword.
This code doesn't work, because the button label "NOW CLICK" also contains NOW:
String[] textArray = text.split("FIRST|NOW|THEN");
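To make the problem concrete, here is a small sketch of what the naive split returns. The class and method names (NaiveSplitDemo, naiveSplit) are made up for illustration only. The keywords are consumed as delimiters, and the NOW inside the quotes cuts the second sentence in two.

```java
public class NaiveSplitDemo {
    // Splits on every occurrence of a keyword, including the NOW inside the quotes.
    static String[] naiveSplit(String text) {
        return text.split("FIRST|NOW|THEN");
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
        String[] parts = naiveSplit(text);
        System.out.println(parts.length + " parts"); // 5 parts instead of the expected 3
        for (String part : parts) {
            // Brackets make the leading empty string and stray spaces visible.
            System.out.println("[" + part + "]");
        }
    }
}
```

The first element is an empty string (the text starts with the delimiter FIRST), and the quoted label is broken into `"` and ` CLICK"`.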
If I understand you correctly, you want to split on the keywords FIRST, NOW, and THEN only when they are not inside quotes, and preserve them in the resulting parts. If my guess is correct, then instead of the split method you can use find to iterate over all quotes, words, and whitespace runs.
This lets you add all quotes and whitespace to the result unchanged, and check only the words outside quotations to decide whether you should split on them.
A regex representing such parts can look like Pattern.compile("\"[^\"]*\"|\\S+|\\s+");
IMPORTANT: we need to search for ".." first. Otherwise \\S+ would match "NOW CLICK" as "NOW and CLICK", two separate parts, which would prevent it from being seen as a single quotation. This is why we place the "[^"]*" regex (which represents quotations) at the start of the subregex1|subregex2|subregex3 series.
The regex "[^"]*"|\S+|\s+ allows us to iterate over the text
FIRST i go to the homepage NOW i click on button "NOW CLICK" very quick THEN i will become a text result.
as the following tokens (whitespace tokens omitted):
FIRST
i
go
to
the
homepage
NOW
i
click
on
button
"NOW CLICK"
very
quick
THEN
i
will
become
a
text
result.
Notice that "NOW CLICK" is treated as a single token. Because of that, even if it contains a keyword on which you want to split, the token itself will never be equal to that keyword (it also contains other characters, such as the quotes, or other words). This prevents it from being treated as a delimiter on which the text should be split.
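The effect of the alternation order can be checked with a short sketch. TokenOrderDemo and its tokens helper are hypothetical names used only here; the helper drops whitespace-only tokens so the printed lists stay readable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenOrderDemo {
    // Collects every match of the regex, skipping whitespace-only tokens.
    static List<String> tokens(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            String token = m.group();
            if (!token.trim().isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "click on button \"NOW CLICK\" very quick";
        // Quotation alternative first: the quoted label stays one token.
        System.out.println(tokens("\"[^\"]*\"|\\S+|\\s+", text));
        // prints [click, on, button, "NOW CLICK", very, quick]

        // \S+ first: the quotation is cut at the space inside it.
        System.out.println(tokens("\\S+|\"[^\"]*\"|\\s+", text));
        // prints [click, on, button, "NOW, CLICK", very, quick]
    }
}
```

With the quotation alternative first, the only token containing NOW is "NOW CLICK" (quotes included), which is not equal to the keyword NOW.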
Using this idea we can create code like:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
List<String> keywordsToSplitOn = List.of("FIRST", "NOW", "THEN");
// let's search for quotes ".." | words | whitespace
Pattern p = Pattern.compile("\"[^\"]*\"|\\S+|\\s+");
Matcher m = p.matcher(text);
StringBuilder sb = new StringBuilder();
List<String> result = new ArrayList<>();
while (m.find()) {
    String token = m.group();
    // a keyword outside quotes starts a new part (unless nothing has been collected yet)
    if (keywordsToSplitOn.contains(token) && sb.length() != 0) {
        result.add(sb.toString());
        sb.setLength(0); // clear sb
    }
    sb.append(token);
}
if (sb.length() != 0) { // include the rest of the text after the last keyword
    result.add(sb.toString());
}
result.forEach(System.out::println);
Output:
FIRST i go to the homepage
NOW i click on button "NOW CLICK" very quick
THEN i will become a text result.
You need to use lookaheads, so that the keywords are not consumed by the split, plus a negative lookahead to skip the NOW that is followed by CLICK.
Simply changing the regex in your split method to the following should do it:
String[] textArray = text.split("((?=FIRST)|(?=NOW(?! CLICK))|(?=THEN))");
It may be even better to include a space in each expression, to prevent splitting on, e.g., NOWHERE:
String[] textArray = text.split("((?=FIRST )|(?=NOW (?!CLICK))|(?=THEN ))");
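A quick sketch comparing the two variants on a made-up sentence that contains NOWHERE. The class name LookaheadSplitDemo, the constant names, and the sample text are assumptions for illustration, and the redundant outer parentheses of the regexes are dropped.

```java
public class LookaheadSplitDemo {
    // Variant without trailing spaces: also splits in front of NOWHERE.
    static final String LOOSE = "(?=FIRST)|(?=NOW(?! CLICK))|(?=THEN)";
    // Variant with a trailing space after each keyword.
    static final String SPACED = "(?=FIRST )|(?=NOW (?!CLICK))|(?=THEN )";

    static void show(String regex, String text) {
        String[] parts = text.split(regex);
        System.out.println(parts.length + " parts");
        for (String part : parts) {
            System.out.println("[" + part + "]");
        }
    }

    public static void main(String[] args) {
        String text = "FIRST go NOWHERE NOW click \"NOW CLICK\" THEN done";
        show(LOOSE, text);  // 4 parts: NOWHERE is split off on its own
        show(SPACED, text); // 3 parts: NOWHERE stays inside the FIRST part
    }
}
```

Note that the trailing space only guards the end of a keyword. A word such as KNOW followed by a space would still be split before its NOW; putting \\b in front of each keyword is one way to guard the start as well.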
You may use a Pattern and a Matcher to split the input using groups:
Pattern pattern = Pattern.compile("^(FIRST.*?)(NOW.*?)(THEN.*)$");
String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
Matcher matcher = pattern.matcher(text);
if (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3));
}
Output:
FIRST i go to the homepage
NOW i click on button "NOW CLICK" very quick
THEN i will become a text result.
You could match the following regular expression.
/\bFIRST +(?:(?!\bNOW\b)[^\n])+(?<! )|\bNOW +(?:(?!\bTHEN\b)[^\n])+(?<! )|\bTHEN +.*/
Java's regex engine performs the following operations.
\bFIRST +        : match 'FIRST' preceded by a word boundary, followed by 1+ spaces
(?:              : begin a non-capture group
  (?!\bNOW\b)    : use a negative lookahead to assert that the following chars are not 'NOW'
  [^\n]          : match any char other than a line terminator
)                : end non-capture group
+                : execute non-capture group 1+ times
(?<! )           : use a negative lookbehind to assert that the previous char is not a space
|                : or
\bNOW +          : match 'NOW' preceded by a word boundary, followed by 1+ spaces
(?:              : begin a non-capture group
  (?!\bTHEN\b)   : use a negative lookahead to assert that the following chars are not 'THEN'
  [^\n]          : match any char other than a line terminator
)                : end non-capture group
+                : execute non-capture group 1+ times
(?<! )           : use a negative lookbehind to assert that the previous char is not a space
|                : or
\bTHEN +.*       : match 'THEN' preceded by a word boundary, followed by 1+ spaces, then 0+ chars
This uses a technique called the tempered greedy token solution.
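In Java, this regex could be applied with Matcher.find, roughly as in the sketch below. The class and method names (TemperedTokenDemo, sentences) are hypothetical; the pattern is the one above with the surrounding slashes removed and the backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedTokenDemo {
    static final Pattern SENTENCE = Pattern.compile(
            "\\bFIRST +(?:(?!\\bNOW\\b)[^\\n])+(?<! )"
          + "|\\bNOW +(?:(?!\\bTHEN\\b)[^\\n])+(?<! )"
          + "|\\bTHEN +.*");

    // Returns each match of the pattern as one sentence.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
        // Brackets show that the lookbehind removed the trailing spaces.
        sentences(text).forEach(s -> System.out.println("[" + s + "]"));
    }
}
```

The quoted NOW is not a problem here because, once the NOW alternative has started matching, only THEN is excluded by its lookahead.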
You can use lookaheads together with word boundaries:
public static void main(String[] args) {
    String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
    // " NOW " must be surrounded by spaces, so the quoted "NOW CLICK" is not a split point
    String[] textArray = text.split("(?=FIRST)|(?=\\b NOW \\b)|(?=THEN)");
    for (String s : textArray) {
        System.out.println(s);
    }
}
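Output (the second part starts with a space, because the lookahead matches just before the space that precedes NOW):
FIRST i go to the homepage
 NOW i click on button "NOW CLICK" very quick
THEN i will become a text result.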