
How to split a string with a Java regex using look-behind?

I read this string from file:

abc | abc (abc\|abc)|def

I want to get an array that includes 3 items:

  1. abc
  2. abc (abc\|abc)
  3. def

How do I write the regex correctly? line.split("(?!<=\\)\\|") doesn't work.

Code:

public class __QuickTester {

    public static void main (String [] args) {

        String test = "abc|abc (abc\\|abc)|def|banana\\|apple|orange";

        // \\\\ becomes \\ <-- String
        // \\ becomes \ <-- In Regex
        String[] result = test.split("(?<!\\\\)\\|");

        for(String part : result) {
            System.out.println(part);
        }
    }
}

Output:

abc
abc (abc\|abc)
def
banana\|apple
orange


Note: In Java source you need \\\\ (4 backslashes) to get \\ (2 backslashes) in the String, and the regex engine then reads those 2 backslashes as a single literal \.
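The two escaping layers in the note above can be checked directly. This is only a small sketch; the class name EscapeLayers and its members are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

class EscapeLayers {
    // Four backslashes in Java source produce a two-character String.
    // The regex engine reads those two characters as one literal backslash.
    static final String REGEX_FOR_BACKSLASH = "\\\\";

    // True when the input consists of exactly one backslash character.
    static boolean isSingleBackslash(String input) {
        return Pattern.matches(REGEX_FOR_BACKSLASH, input);
    }

    public static void main(String[] args) {
        // Length of the String the compiler produces from the literal.
        System.out.println(REGEX_FOR_BACKSLASH.length());
        // "\\" in source is a single backslash at runtime.
        System.out.println(isSingleBackslash("\\"));
    }
}
```

Running main should print 2 and then true: the literal holds two characters, and as a regex it matches a single backslash.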

Try this regex: ([\w()]|(\\|))+

The main problem with your approach is that \ is special in regex and also in String literals. So to create a \ literal you need to escape it twice:

  • in regex: \\
  • in a String literal: "\\\\"

so you would need to write it as split("(?<!\\\\)\\|"). Note also that a negative look-behind is written (?<!...), not (?!<=...).

But this approach has a possible problem: splitting on every | that simply has no \ before it can be error-prone. Because \ is the escape character, a literal \ in the text probably has to be written as \\ ; for instance, to represent c:\foo\bar\ you would probably write c:\\foo\\bar\\ in your text.

So in that case, let's say that you want to split text like

abc|foo\|c:\\bar\\|cde

I assume that you want to split only at these places

abc|foo\|c:\\bar\\|cde
   ^              ^

because

  • in abc|foo the pipe | has no \ before it,
  • in bar\\|cde, although the pipe has a \ before it, we know that this \ wasn't used to escape | but belongs to the \\ pair representing a literal \ (so in general a | preceded by zero or an even number of \ characters is OK to split on).

But a split on each pipe that has no backslash directly before it, like split("(?<!\\\\)\\|"), will not split at bar\\|cde, because the \ before | prevents that split.
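To make that failure concrete, here is a minimal sketch that runs the simple look-behind split on the sample text. The class name NaiveSplit is hypothetical; the regex is the one discussed above.

```java
import java.util.Arrays;

class NaiveSplit {
    // Splits on every | that is not directly preceded by a backslash.
    static String[] split(String text) {
        return text.split("(?<!\\\\)\\|");
    }

    public static void main(String[] args) {
        // Runtime value of this literal: abc|foo\|c:\\bar\\|cde
        String text = "abc|foo\\|c:\\\\bar\\\\|cde";
        System.out.println(Arrays.toString(split(text)));
    }
}
```

This should print [abc, foo\|c:\\bar\\|cde]: only two parts, because the last pipe is preceded by a backslash even though that backslash is itself escaped.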

To solve this problem you could check whether there is an odd number of \ before | (and refuse to split only in that case), but this is hard to do in Java since a look-behind needs to have a bounded width.

A possible solution would be split("(?<!(?<!\\\\)((\\\\){2}){0,1000}\\\\)\\|"), under the assumption that the string will never contain more than 1000 consecutive \ characters, but it seems like overkill.
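For comparison, a sketch of the bounded look-behind variant on the same sample text. The class name BoundedLookBehindSplit is invented for this example, and the regex is copied from the paragraph above, including its {0,1000} bound.

```java
import java.util.Arrays;

class BoundedLookBehindSplit {
    // Refuses to split on a | only when it is preceded by an odd run of backslashes.
    // The inner (?<!\\) anchors the start of the run; {0,1000} keeps the look-behind bounded.
    static String[] split(String text) {
        return text.split("(?<!(?<!\\\\)((\\\\){2}){0,1000}\\\\)\\|");
    }

    public static void main(String[] args) {
        // Runtime value of this literal: abc|foo\|c:\\bar\\|cde
        String text = "abc|foo\\|c:\\\\bar\\\\|cde";
        System.out.println(Arrays.toString(split(text)));
    }
}
```

Here the expected result is three parts, abc, foo\|c:\\bar\\ and cde, since the final pipe follows an even number of backslashes.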

IMO a better solution would be to search for the strings you want to find, instead of searching for the strings you want to split on. And the strings you want to find consist of

  • any character except |
  • any character that is preceded by \ (including |, since the \ simply escapes it).

So our regex, written as a Java String, could look like (\\\\.|[^|])+ (I placed \\\\. first to prevent [^|] from consuming a \ that is meant to escape the following character).

Example:

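import java.util.regex.Matcher;
import java.util.regex.Pattern;

// runtime value: abc|foo\|c:\\bar\\|cde
String text = "abc|foo\\|c:\\\\bar\\\\|cde";
Pattern p = Pattern.compile("(\\\\.|[^|])+");
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group());
}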

Output:

abc
foo\|c:\\bar\\
cde
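The same find-based approach can be applied to the line from the question. The sketch below wraps it in a helper; the class name EscapedPipeFields, the method fields, and the call to trim() are assumptions added here (trim() is used because the question's sample has spaces around the first pipe and the desired items have none).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedPipeFields {
    // One or more of: an escaped character (backslash + any char), or any char other than |.
    private static final Pattern FIELD = Pattern.compile("(\\\\.|[^|])+");

    // Collects every run of text between unescaped pipes.
    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // trim() is an assumption: it drops the spaces around the first pipe.
            result.add(m.group().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Runtime value of this literal: abc | abc (abc\|abc)|def
        System.out.println(fields("abc | abc (abc\\|abc)|def"));
    }
}
```

This should print [abc, abc (abc\|abc), def]. One difference from split worth keeping in mind: because the pattern needs at least one character, an empty field between two adjacent pipes produces no match at all.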
