
Split a string in Java, but not on keywords inside a quoted string

How can I split the text below using the split criteria FIRST, NOW, THEN?

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";

The expected result is three sentences:

  1. FIRST i go to the homepage
  2. NOW i click on button "NOW CLICK" very quick
  3. THEN i will become a text result.

This code doesn't work, because of the button "NOW CLICK":

String[] textArray = text.split("FIRST|NOW|THEN");
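To make the problem concrete, here is a minimal sketch that runs the split above and prints each piece in brackets. The class and method names (NaiveSplitDemo, naiveSplit) are illustrative assumptions only. The keywords are consumed as delimiters, and the NOW inside the quoted button label is treated as a delimiter as well:

```java
public class NaiveSplitDemo {
    static final String TEXT = "FIRST i go to the homepage NOW i click on button "
            + "\"NOW CLICK\" very quick THEN i will become a text result.";

    // Splits on every occurrence of a keyword, wherever it appears.
    static String[] naiveSplit(String text) {
        return text.split("FIRST|NOW|THEN");
    }

    public static void main(String[] args) {
        // Brackets make leading and trailing spaces visible.
        for (String part : naiveSplit(TEXT)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

On Java 8 or later this yields five pieces instead of three: an empty leading string (the text starts with a delimiter), and the second sentence cut in two at the quoted NOW.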

If I understand you correctly, you

  • want to split your text at the keywords FIRST, NOW and THEN and preserve them in the resulting parts,
  • but don't want to split on those keywords if they appear inside quotes.

If my guess is correct, then instead of the split method you can use find to iterate over all

  • quotes,
  • words which are not inside quotes,
  • whitespace.

This lets you add all quotes and whitespace to the result unchanged and check only the words outside quotations to decide whether you should split on them.

A regex representing such parts can look like this: Pattern.compile("\"[^\"]*\"|\\S+|\\s+");

IMPORTANT: the regex must try to match ".." first. Otherwise \\S+ would match "NOW CLICK" as two separate parts, "NOW and CLICK", which would prevent it from being seen as a single quotation. This is why the "[^"]*" regex (which represents quotations) is placed at the start of the subregex1|subregex2|subregex3 series.
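The effect of the alternation order can be checked with a small sketch. The class name AlternationOrderDemo, the helper tokens and the shortened input are assumptions made for illustration; only the two regexes come from the discussion above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {
    // Collects every match of the given regex, in order of appearance.
    static List<String> tokens(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "button \"NOW CLICK\" very";

        // Quotation alternative first: the quoted phrase stays one token.
        System.out.println(tokens("\"[^\"]*\"|\\S+|\\s+", input));

        // Quotation alternative last: \S+ is tried first and breaks the phrase up.
        System.out.println(tokens("\\S+|\\s+|\"[^\"]*\"", input));
    }
}
```

With the quotation alternative first, the input is cut into five tokens and "NOW CLICK" is one of them. With it last, there are seven tokens, including "NOW and CLICK" as separate words.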

This regex will allow us to iterate over the text

FIRST i go to the homepage NOW i click on button "NOW CLICK" very quick THEN i will become a text result.

as the following tokens (each bracketed group is one token):

[FIRST][ ][i][ ][go][ ][to][ ][the][ ][homepage][ ][NOW][ ][i][ ][click][ ][on][ ][button][ ]["NOW CLICK"][ ][very][ ][quick][ ][THEN][ ][i][ ][will][ ][become][ ][a][ ][text][ ][result.]

Notice that "NOW CLICK" is treated as a single token. Because of that, even if it contains a keyword you want to split on, the token itself is never equal to that keyword (it also contains other characters, such as the " marks or the other words in the quote). This prevents it from being treated as a delimiter on which the text should be split.

Using this idea we can write code like this:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
List<String> keywordsToSplitOn = List.of("FIRST", "NOW", "THEN");

// search for quotes ".." | words | whitespace (quotes must come first)
Pattern p = Pattern.compile("\"[^\"]*\"|\\S+|\\s+");
Matcher m = p.matcher(text);

StringBuilder sb = new StringBuilder();
List<String> result = new ArrayList<>();
while (m.find()) {
    String token = m.group();
    // a keyword outside quotes starts a new part; flush what we have so far
    if (keywordsToSplitOn.contains(token) && sb.length() != 0) {
        result.add(sb.toString());
        sb.setLength(0); // clear sb
    }
    sb.append(token);
}
if (sb.length() != 0) { // include the rest of the text after the last keyword
    result.add(sb.toString());
}

result.forEach(System.out::println);

Output:

FIRST i go to the homepage 
NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.

You need to use lookaheads: positive ones to split in front of each keyword while keeping it in the result, and a negative one to skip the NOW in "NOW CLICK".

Simply changing the regex in your split method to the following should do it:

String[] textArray = text.split("((?=FIRST)|(?=NOW(?! CLICK))|(?=THEN))");

It may be even better to include a space in each expression to prevent splitting on, e.g., NOWHERE:

String[] textArray = text.split("((?=FIRST )|(?=NOW (?!CLICK))|(?=THEN ))");
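A short sketch can check both claims: the three expected parts and the NOWHERE case. The class name LookaheadSplitDemo and the second test sentence are assumptions for illustration, and the behaviour described assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element:

```java
public class LookaheadSplitDemo {
    static final String TEXT = "FIRST i go to the homepage NOW i click on button "
            + "\"NOW CLICK\" very quick THEN i will become a text result.";

    // Splits in front of each keyword that is followed by a space,
    // except a NOW that is followed by " CLICK".
    static String[] split(String text) {
        return text.split("((?=FIRST )|(?=NOW (?!CLICK))|(?=THEN ))");
    }

    public static void main(String[] args) {
        // Brackets make the trailing spaces visible.
        for (String part : split(TEXT)) {
            System.out.println("[" + part + "]");
        }
        // NOWHERE is not followed by a space after "NOW", so no split happens there.
        for (String part : split("FIRST go NOWHERE THEN stop")) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Note that the negative lookahead hard-codes the quoted phrase: a different quoted text containing a keyword would need its own exception in the regex.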

You may use a Pattern and a Matcher to split the input using groups:

Pattern pattern = Pattern.compile("^(FIRST.*?)(NOW.*?)(THEN.*)$");

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";

Matcher matcher = pattern.matcher(text);
        
if (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3));
}

Output:

FIRST i go to the homepage 
NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.
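The same idea can be wrapped in a small helper. This is only a sketch: the class name GroupSplitDemo, the method parts, the trimming and the empty-list fallback are assumptions added for illustration, not part of the answer above.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupSplitDemo {
    static final String TEXT = "FIRST i go to the homepage NOW i click on button "
            + "\"NOW CLICK\" very quick THEN i will become a text result.";

    private static final Pattern PARTS =
            Pattern.compile("^(FIRST.*?)(NOW.*?)(THEN.*)$");

    // Returns the three trimmed parts, or an empty list when the text does not
    // consist of a FIRST part, a NOW part and a THEN part in that order.
    static List<String> parts(String text) {
        Matcher m = PARTS.matcher(text);
        if (!m.find()) {
            return List.of();
        }
        return List.of(m.group(1).trim(), m.group(2).trim(), m.group(3).trim());
    }

    public static void main(String[] args) {
        parts(TEXT).forEach(System.out::println);
    }
}
```

The lazy first group stops at the first NOW that lets the rest match. That works here because the quoted NOW comes after the real one; an input such as FIRST press "NOW CLICK" NOW wait THEN done would be cut at the quoted NOW instead.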

You could match the following regular expression.

/\bFIRST +(?:(?!\bNOW\b)[^\n])+(?<! )|\bNOW +(?:(?!\bTHEN\b)[^\n])+(?<! )|\bTHEN +.*/


Java's regex engine performs the following operations.

\bFIRST +      : match 'FIRST' preceded by a word boundary,
                 followed by 1+ spaces
(?:            : begin a non-capture group
  (?!\bNOW\b)  : use a negative lookahead to assert that
                 the following chars are not 'NOW'  
  [^\n]        : match any char other than a line terminator
)              : end non-capture group
+              : execute non-capture group 1+ times
(?<! )         : use negative lookbehind to assert that the
                 previous char is not a space
|              : or
\bNOW +        : match 'NOW' preceded by a word boundary,
                 followed by 1+ spaces
(?:            : begin a non-capture group
  (?!\bTHEN\b) : use a negative lookahead to assert that
                 the following chars are not 'THEN'  
  [^\n]        : match any char other than a line terminator
)              : end non-capture group
+              : execute non-capture group 1+ times
(?<! )         : use negative lookbehind to assert that the
                 previous char is not a space
|              : or
\bTHEN +.*     : match 'THEN' preceded by a word boundary,
                 followed by 1+ spaces then 0+ chars

This uses a technique called the tempered greedy token solution.
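The regular expression above is written in slash-delimited form, so here is a sketch of how it might be applied in Java by collecting every match with find. The class name TemperedTokenDemo and the method sentences are illustrative assumptions; the regex itself is unchanged apart from Java string escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedTokenDemo {
    static final String TEXT = "FIRST i go to the homepage NOW i click on button "
            + "\"NOW CLICK\" very quick THEN i will become a text result.";

    // One alternative per sentence; the first two use a tempered greedy token
    // that stops before the next keyword, then drop any trailing space.
    private static final Pattern SENTENCE = Pattern.compile(
            "\\bFIRST +(?:(?!\\bNOW\\b)[^\\n])+(?<! )"
            + "|\\bNOW +(?:(?!\\bTHEN\\b)[^\\n])+(?<! )"
            + "|\\bTHEN +.*");

    // Returns each match of the pattern, in order of appearance.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        sentences(TEXT).forEach(System.out::println);
    }
}
```

The quoted NOW is not a problem here because the NOW alternative only stops at THEN. A quoted NOW inside the FIRST sentence would still end that sentence early, since the first alternative stops at any standalone NOW.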

You can use lookaheads combined with word boundaries:

public static void main(String[] args) {
    String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
    String[] textArray = text.split("(?=FIRST)|(?=\\b NOW \\b)|(?=THEN)");
    
    for(String s: textArray) {
        System.out.println(s);
    }
}

Output:

FIRST i go to the homepage
 NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.
