Split a string in Java on keywords, but not when a keyword appears inside a quoted string

How can I split the text below using the split criteria FIRST, NOW, THEN?

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";

I expect three sentences:

  1. FIRST i go to the homepage
  2. NOW i click on button "NOW CLICK" very quick
  3. THEN i will become a text result.

This code doesn't work, because it also splits on the NOW inside the button text "NOW CLICK":

String[] textArray = text.split("FIRST|NOW|THEN");

If I understand you correctly, you

  • want to split your text on the keywords FIRST, NOW, THEN and preserve them in the resulting parts,
  • but don't want to split on those keywords if they appear inside quotes.

If my guess is correct, then instead of the split method you can use find to iterate over all

  • quotations,
  • words which are not inside quotes,
  • whitespace runs.

This lets you append every quotation and whitespace run to the result unchanged, and check only the words outside quotations to decide whether to split on them.

A regex representing such parts can look like Pattern.compile("\"[^\"]*\"|\\S+|\\s+");

IMPORTANT: the alternative for quotations must come first. Otherwise \\S+ would match "NOW CLICK" as two separate parts, "NOW and CLICK", so it would never be seen as a single quotation. This is why the "[^"]*" regex (which represents a quotation) is placed at the start of the subregex1|subregex2|subregex3 series.
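To make the ordering point concrete, here is a minimal sketch that tokenizes a shortened sample with the quotation alternative first and then last. The class name TokenOrderDemo, the helper tokens and the sample string are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenOrderDemo {

    // Hypothetical helper: collects every match of the regex, in order of appearance.
    static List<String> tokens(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "button \"NOW CLICK\" very";

        // Quotation alternative first: the whole quotation is one token.
        System.out.println(tokens("\"[^\"]*\"|\\S+|\\s+", text));

        // Quotation alternative last: \S+ matches first and cuts the quotation in two.
        System.out.println(tokens("\\S+|\\s+|\"[^\"]*\"", text));
    }
}
```

With the second ordering the quotation arrives as the two tokens "NOW and CLICK" (plus the space between them), which is exactly the situation the warning above describes.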

This regex lets us iterate over the text

FIRST i go to the homepage NOW i click on button "NOW CLICK" very quick THEN i will become a text result.

as the following tokens (every whitespace run between them is a token too, omitted here):

[FIRST] [i] [go] [to] [the] [homepage] [NOW] [i] [click] [on] [button] ["NOW CLICK"] [very] [quick] [THEN] [i] [will] [become] [a] [text] [result.]

Notice that "NOW CLICK" is treated as a single token. Even if it contains a keyword you want to split on, the token as a whole is never equal to that keyword (it also contains other characters such as ", or other words in the quotation). This prevents it from being treated as a delimiter on which the text should be split.

Using this idea we can create code like:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// the statements below belong inside a method, e.g. main
String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
List<String> keywordsToSplitOn = List.of("FIRST", "NOW", "THEN");

// search for: quotations ".." | words | whitespace runs
Pattern p = Pattern.compile("\"[^\"]*\"|\\S+|\\s+");
Matcher m = p.matcher(text);

StringBuilder sb = new StringBuilder();
List<String> result = new ArrayList<>();
while (m.find()) {
    String token = m.group();
    // a keyword outside quotes starts a new part, so flush what was collected so far
    if (keywordsToSplitOn.contains(token) && sb.length() != 0) {
        result.add(sb.toString());
        sb.setLength(0); // clear sb
    }
    sb.append(token);
}
if (sb.length() != 0) { // include the rest of the text after the last keyword
    result.add(sb.toString());
}

result.forEach(System.out::println);

Output:

FIRST i go to the homepage 
NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.

You need lookaheads: positive lookaheads to split just before each keyword without consuming it, and a negative lookahead to skip a NOW that is followed by CLICK.

Simply changing the regex in your split method to the following should do it:

String[] textArray = text.split("((?=FIRST)|(?=NOW(?! CLICK))|(?=THEN))");

It may be even better to include a space in each expression, to prevent splitting on words that merely start with a keyword, e.g. NOWHERE:

String[] textArray = text.split("((?=FIRST )|(?=NOW (?!CLICK))|(?=THEN ))");
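As a quick check of the difference between the two variants, the sketch below runs both on a made-up sentence containing NOWHERE. The class name, method names and sample text are assumptions for illustration. It also assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element.

```java
import java.util.Arrays;

public class LookaheadSplitDemo {

    // Variant without trailing spaces: also splits before words that start with a keyword.
    static String[] splitLoose(String text) {
        return text.split("(?=FIRST)|(?=NOW(?! CLICK))|(?=THEN)");
    }

    // Variant with a space after each keyword: NOWHERE no longer counts as NOW.
    static String[] splitStrict(String text) {
        return text.split("(?=FIRST )|(?=NOW (?!CLICK))|(?=THEN )");
    }

    public static void main(String[] args) {
        String text = "FIRST go NOWHERE NOW stop THEN end"; // hypothetical sample
        System.out.println(Arrays.toString(splitLoose(text)));
        System.out.println(Arrays.toString(splitStrict(text)));
    }
}
```

One trade-off of the stricter variant: a keyword that is the very last word of the text, with no space after it, is not treated as a split point.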

You may use a Pattern and a Matcher to split the input using capturing groups (this relies on the keywords appearing in the order FIRST, NOW, THEN):

Pattern pattern = Pattern.compile("^(FIRST.*?)(NOW.*?)(THEN.*)$");

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";

Matcher matcher = pattern.matcher(text);
        
if (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3));
}

Output:

FIRST i go to the homepage 
NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.

You could match the following regular expression.

/\bFIRST +(?:(?!\bNOW\b)[^\n])+(?<! )|\bNOW +(?:(?!\bTHEN\b)[^\n])+(?<! )|\bTHEN +.*/

Java's regex engine performs the following operations.

\bFIRST +      : match 'FIRST' preceded by a word boundary,
                 followed by 1+ spaces
(?:            : begin a non-capture group
  (?!\bNOW\b)  : use a negative lookahead to assert that
                 the following chars are not 'NOW'  
  [^\n]        : match any char other than a line terminator
)              : end non-capture group
+              : execute non-capture group 1+ times
(?<! )         : use negative lookbehind to assert that the
                 previous char is not a space
|              : or
\bNOW +        : match 'NOW' preceded by a word boundary,
                 followed by 1+ spaces
(?:            : begin a non-capture group
  (?!\bTHEN\b) : use a negative lookahead to assert that
                 the following chars are not 'THEN'  
  [^\n]        : match any char other than a line terminator
)              : end non-capture group
+              : execute non-capture group 1+ times
(?<! )         : use negative lookbehind to assert that the
                 previous char is not a space
|              : or
\bTHEN +.*     : match 'THEN' preceded by a word boundary,
                 followed by 1+ spaces then 0+ chars

This uses a technique called the tempered greedy token solution.
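The expression above is written in generic /.../ notation. A minimal sketch of how it could be used from Java follows, with the backslashes doubled for a string literal and the delimiting slashes dropped. The class name TemperedTokenDemo and the method sentences are assumed names, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedTokenDemo {

    // Same three alternatives as above, one per keyword.
    private static final Pattern SENTENCE = Pattern.compile(
            "\\bFIRST +(?:(?!\\bNOW\\b)[^\\n])+(?<! )"
            + "|\\bNOW +(?:(?!\\bTHEN\\b)[^\\n])+(?<! )"
            + "|\\bTHEN +.*");

    // Returns each match in order; the (?<! ) lookbehinds drop trailing spaces.
    static List<String> sentences(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
        sentences(text).forEach(System.out::println);
    }
}
```

The quoted NOW does not start a new match here because the second alternative has already consumed it while scanning towards THEN.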

You can use lookaheads combined with word boundaries:

public static void main(String args[]) { 
    String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
    String[] textArray = text.split("(?=FIRST)|(?=\\b NOW \\b)|(?=THEN)");
    
    for(String s: textArray) {
        System.out.println(s);
    }
}

Output:

FIRST i go to the homepage
 NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.
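As the output shows, the second part keeps a leading space (and a trailing one), because the lookahead splits in front of the space before NOW. If that matters, the parts can be trimmed afterwards. This is a small sketch under the same Java 8+ assumption as before; the class name TrimmedSplitDemo and the method parts are illustrative names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TrimmedSplitDemo {

    // Splits with the same lookahead regex, then strips whitespace around each part.
    static List<String> parts(String text) {
        return Arrays.stream(text.split("(?=FIRST)|(?=\\b NOW \\b)|(?=THEN)"))
                .map(String::trim)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
        parts(text).forEach(System.out::println);
    }
}
```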

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.

 