Split a string on keywords in Java, ignoring keywords inside a quoted string
How can I split the text below using the split criteria FIRST, NOW, THEN?
String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
The expected result is three sentences, one starting at each keyword.
This code doesn't work because of the button label "NOW CLICK":
String[] textArray = text.split("FIRST|NOW|THEN");
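To make the failure concrete, here is a minimal sketch that runs the same split and prints every part in brackets. The class name NaiveSplitDemo and the helper naiveSplit are made up for illustration only.

```java
import java.util.Arrays;
import java.util.List;

class NaiveSplitDemo {
    static final String TEXT = "FIRST i go to the homepage NOW i click on button "
            + "\"NOW CLICK\" very quick THEN i will become a text result.";

    // Splits exactly like the code in the question.
    static List<String> naiveSplit(String text) {
        return Arrays.asList(text.split("FIRST|NOW|THEN"));
    }

    public static void main(String[] args) {
        // Brackets make empty parts and stray spaces visible.
        for (String part : naiveSplit(TEXT)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

This yields five parts instead of three: an empty leading part (the text starts with a delimiter), the keywords themselves are consumed, and the quoted label is cut in two at its inner NOW.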
If I understand you correctly, you want to split your text on the keywords FIRST, NOW, THEN and preserve them in the resulting parts. If my guess is correct, then instead of the split method you can use find to iterate over all quotes, words, and whitespace runs.
This lets you add all quotes and whitespace to the result and focus only on checking words that are not inside a quotation, to see whether you should split on them.
A regex representing such parts can look like:
Pattern.compile("\"[^\"]*\"|\\S+|\\s+");
IMPORTANT: we need to search for ".." first. Otherwise \\S+ would also match "NOW CLICK" as "NOW and CLICK", two separate parts, which would prevent it from being seen as a single quotation. This is why we place the "[^"]*" regex (which represents quotations) at the start of the subregex1|subregex2|subregex3 series.
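The effect of the alternation order can be checked with a small sketch like the one below. The class AlternationOrderDemo and its tokens helper are hypothetical names, and the second pattern is simply the same alternatives with \\S+ moved to the front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Collects every match of the regex, in order of appearance.
    static List<String> tokens(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "button \"NOW CLICK\" very";
        // Quotation alternative first: the quoted part stays in one piece.
        System.out.println(tokens("\"[^\"]*\"|\\S+|\\s+", text));
        // \S+ first: the quotation is broken at the inner space.
        System.out.println(tokens("\\S+|\"[^\"]*\"|\\s+", text));
    }
}
```

With the quotation alternative first the sample produces five tokens, one of which is the whole "NOW CLICK". With \\S+ first it produces seven, because the quotation is split into "NOW and CLICK".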
This regex will allow us to iterate over the text
FIRST i go to the homepage NOW i click on button "NOW CLICK" very quick THEN i will become a text result.
as the following tokens (the whitespace tokens between them are not shown):
FIRST
i
go
to
the
homepage
NOW
i
click
on
button
"NOW CLICK"
very
quick
THEN
i
will
become
a
text
result.
Notice that "NOW CLICK" will be treated as a single token. Because of that, even if it contains a keyword on which you want to split, it will never be equal to that keyword (because it also contains other characters such as ", or simply other words in the quote). This prevents it from being treated as a delimiter on which the text should be split.
Using this idea we can create code like:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
List<String> keywordsToSplitOn = List.of("FIRST", "NOW", "THEN");
// search for quotes ".." | words | whitespace
Pattern p = Pattern.compile("\"[^\"]*\"|\\S+|\\s+");
Matcher m = p.matcher(text);
StringBuilder sb = new StringBuilder();
List<String> result = new ArrayList<>();
while (m.find()) {
    String token = m.group();
    // a keyword starts a new part, so store what was collected so far
    if (keywordsToSplitOn.contains(token) && sb.length() != 0) {
        result.add(sb.toString());
        sb.delete(0, sb.length()); // clear sb
    }
    sb.append(token);
}
if (sb.length() != 0) { // include rest of text after last keyword
    result.add(sb.toString());
}
result.forEach(System.out::println);
Output:
FIRST i go to the homepage
NOW i click on button "NOW CLICK" very quick
THEN i will become a text result.
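If the same logic is needed in more than one place, it could be wrapped in a method. The following is only a sketch: the names KeywordSplitter, splitKeepingKeywords and flush are invented here, and trimming each part is an extra choice that the code above does not make.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordSplitter {
    // quotes ".." | words | whitespace, quotes first (see the note on ordering)
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\S+|\\s+");

    static List<String> splitKeepingKeywords(String text, Set<String> keywords) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            if (keywords.contains(token)) {
                flush(current, result); // a keyword starts a new part
            }
            current.append(token);
        }
        flush(current, result); // text after the last keyword
        return result;
    }

    // Stores the collected text (trimmed, if not blank) and clears the buffer.
    private static void flush(StringBuilder current, List<String> result) {
        String piece = current.toString().trim();
        if (!piece.isEmpty()) {
            result.add(piece);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String text = "FIRST open NOWHERE page THEN press \"THEN GO\" NOW wait";
        splitKeepingKeywords(text, Set.of("FIRST", "NOW", "THEN"))
                .forEach(System.out::println);
    }
}
```

Because a token must equal a keyword exactly, neither NOWHERE nor the quoted "THEN GO" causes a split in this sample, so it prints three lines.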
You need to use lookaheads: a positive lookahead to split just before each keyword, and a negative lookahead to skip the NOW that is followed by CLICK.
Simply changing the regex in your split method to the following should do it:
String[] textArray = text.split("((?=FIRST)|(?=NOW(?! CLICK))|(?=THEN))");
It may be even better to include a space in each expression to prevent splitting on, e.g., NOWHERE:
String[] textArray = text.split("((?=FIRST )|(?=NOW (?!CLICK))|(?=THEN ))");
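The difference between the two expressions can be seen on a sample that contains NOWHERE. This is a small sketch: the class LookaheadSplitDemo, the constant names LOOSE and STRICT, and the sample sentence are assumptions made for the demonstration. It also assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element.

```java
import java.util.Arrays;
import java.util.List;

class LookaheadSplitDemo {
    // First expression: no space after the keyword.
    static final String LOOSE = "((?=FIRST)|(?=NOW(?! CLICK))|(?=THEN))";
    // Second expression: the keyword must be followed by a space.
    static final String STRICT = "((?=FIRST )|(?=NOW (?!CLICK))|(?=THEN ))";

    static List<String> split(String regex, String text) {
        return Arrays.asList(text.split(regex));
    }

    public static void main(String[] args) {
        String text = "FIRST go NOWHERE NOW click \"NOW CLICK\" THEN stop";
        split(LOOSE, text).forEach(s -> System.out.println("[" + s + "]"));
        System.out.println("---");
        split(STRICT, text).forEach(s -> System.out.println("[" + s + "]"));
    }
}
```

The first expression gives four parts because it also splits before NOWHERE. The second gives the intended three. Both keep the trailing space of each part, so a trim() may still be wanted.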
You may use a Pattern and Matcher to split the input using groups:
Pattern pattern = Pattern.compile("^(FIRST.*?)(NOW.*?)(THEN.*)$");
String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
Matcher matcher = pattern.matcher(text);
if (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3));
}
Output:
FIRST i go to the homepage
NOW i click on button "NOW CLICK" very quick
THEN i will become a text result.
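A possible variation on the group approach uses named groups and trims the captured parts, since each lazy group otherwise keeps the space before the next keyword. The class GroupSplitDemo, the method parts and the group names are illustrative choices, not part of the answer above.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSplitDemo {
    private static final Pattern STEPS =
            Pattern.compile("^(?<first>FIRST.*?)(?<now>NOW.*?)(?<then>THEN.*)$");

    // Returns the three trimmed parts, or an empty list if the text does not match.
    static List<String> parts(String text) {
        Matcher m = STEPS.matcher(text);
        if (!m.matches()) {
            return List.of();
        }
        return List.of(
                m.group("first").trim(),
                m.group("now").trim(),
                m.group("then").trim());
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button "
                + "\"NOW CLICK\" very quick THEN i will become a text result.";
        parts(text).forEach(System.out::println);
    }
}
```

Note that this works for the sample only because the unquoted NOW comes before the quoted one. If a quoted "NOW CLICK" appeared before the real NOW, the lazy first group would end at the quoted keyword.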
You could match the following regular expression.
/\bFIRST +(?:(?!\bNOW\b)[^\n])+(?<! )|\bNOW +(?:(?!\bTHEN\b)[^\n])+(?<! )|\bTHEN +.*/
Java's regex engine performs the following operations.
\bFIRST + : match 'FIRST' preceded by a word boundary,
followed by 1+ spaces
(?: : begin a non-capture group
(?!\bNOW\b) : use a negative lookahead to assert that
the following chars are not 'NOW'
[^\n] : match any char other than a line terminator
) : end non-capture group
+ : execute non-capture group 1+ times
(?<! ) : use negative lookbehind to assert that the
previous char is not a space
| : or
\bNOW + : match 'NOW' preceded by a word boundary,
followed by 1+ spaces
(?: : begin a non-capture group
(?!\bTHEN\b) : use a negative lookahead to assert that
the following chars are not 'THEN'
[^\n] : match any char other than a line terminator
) : end non-capture group
+ : execute non-capture group 1+ times
(?<! ) : use negative lookbehind to assert that the
previous char is not a space
| : or
\bTHEN +.* : match 'THEN' preceded by a word boundary,
followed by 1+ spaces then 0+ chars
This uses a technique called the tempered greedy token solution.
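To use this expression from Java, the surrounding slashes are dropped and every backslash is doubled inside the string literal. The sketch below collects the matches with find; the class TemperedTokenDemo and the method steps are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedTokenDemo {
    // The expression above, with backslashes doubled for a Java string literal.
    private static final Pattern STEP = Pattern.compile(
            "\\bFIRST +(?:(?!\\bNOW\\b)[^\\n])+(?<! )"
            + "|\\bNOW +(?:(?!\\bTHEN\\b)[^\\n])+(?<! )"
            + "|\\bTHEN +.*");

    // Returns every match of the expression, in order.
    static List<String> steps(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = STEP.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button "
                + "\"NOW CLICK\" very quick THEN i will become a text result.";
        steps(text).forEach(System.out::println);
    }
}
```

For the sample text this finds three matches, and the negative lookbehind (?<! ) means the first two do not end in a space.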
You can use these (lookahead and lookbehind):
public class Main {
    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
        // split before each keyword; NOW counts only when surrounded by spaces
        String[] textArray = text.split("(?=FIRST)|(?=\\b NOW \\b)|(?=THEN)");
        for (String s : textArray) {
            System.out.println(s);
        }
    }
}
Output:
FIRST i go to the homepage
NOW i click on button "NOW CLICK" very quick
THEN i will become a text result.
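The expression above relies on \b and literal spaces rather than on an actual lookbehind. As one possible variation (an assumption for illustration, not the code from this answer), a lookbehind for whitespace can be combined with a lookahead for the keyword, followed by a trim of each part. The class LookaroundSplitDemo and the method split are made-up names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class LookaroundSplitDemo {
    // Split where whitespace precedes and "keyword + whitespace" follows.
    private static final String BOUNDARY = "(?<=\\s)(?=(?:FIRST|NOW|THEN)\\s)";

    static List<String> split(String text) {
        return Arrays.stream(text.split(BOUNDARY))
                .map(String::trim) // drop the space left before each boundary
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button "
                + "\"NOW CLICK\" very quick THEN i will become a text result.";
        split(text).forEach(System.out::println);
    }
}
```

The quoted NOW is not a split point here because it is preceded by a quotation mark rather than whitespace. A keyword that follows a space inside a quotation would still be split on, so this is weaker than the token-based approach.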