Splitting a String (especially in Java with java.util.regex or something else)

Does anyone know how to split a string on a character taking into account its escape sequence?

For example, if the character is ':' and the escape character is '\', "a:b" is split into two parts ("a" and "b"), whereas "a\:b" is not split at all.

I think this is hard (impossible?) to do with regular expressions.

Thank you in advance,

Kedar

(?<=^|[^\\]): gets you close, but doesn't handle escaped backslashes. (That's the literal regex; of course you have to escape the backslashes in it again to get it into a Java string.)
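To make the limitation concrete, here is a minimal sketch of that first pattern written as a Java string literal. The class name SimpleLookbehindDemo and its split helper are made up for illustration; only Pattern.split from java.util.regex is relied on.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SimpleLookbehindDemo {
    // ':' preceded by the start of the input or by any non-backslash character.
    // Every backslash of the regex is doubled again for the Java string literal.
    static final Pattern SIMPLE = Pattern.compile("(?<=^|[^\\\\]):");

    // Hypothetical helper, not part of the original answer.
    static String[] split(String text) {
        return SIMPLE.split(text);
    }

    public static void main(String[] args) {
        // a:b   : plain delimiter, split into two pieces
        System.out.println(Arrays.toString(split("a:b")));
        // a\:b  : escaped delimiter, correctly left in one piece
        System.out.println(Arrays.toString(split("a\\:b")));
        // a\\:b : escaped backslash followed by a real delimiter;
        //         this pattern leaves it unsplit, which is the gap described above
        System.out.println(Arrays.toString(split("a\\\\:b")));
    }
}
```

The third input is the case that needs the "even number of backslashes" check.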

(?<=(^|[^\\])(\\\\)*): How about that? I think that should match any ':' that is preceded by an even number of backslashes.

Edit: don't vote this up. MizardX's solution is better :)

Since Java supports variable-length look-behinds (as long as they are finite), you could do it like this:

import java.util.regex.*;

public class RegexTest {
    public static void main(String[] argv) {

        Pattern p = Pattern.compile("(?<=(?<!\\\\)(?:\\\\\\\\){0,10}):");

        String text = "foo:bar\\:baz\\\\:qux\\\\\\:quux\\\\\\\\:corge";

        String[] parts = p.split(text);

        System.out.printf("Input string: %s\n", text);
        for (int i = 0; i < parts.length; i++) {
            System.out.printf("Part %d: %s\n", i+1, parts[i]);
        }

    }
}
  • (?<=(?<!\\\\)(?:\\\\\\\\){0,10}) looks behind for an even number of backslashes (including zero, up to a maximum of 10 pairs, i.e. 20 backslashes).

Output:

Input string: foo:bar\:baz\\:qux\\\:quux\\\\:corge
Part 1: foo
Part 2: bar\:baz\\
Part 3: qux\\\:quux\\\\
Part 4: corge
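Two edge cases of the split approach are worth knowing about. The sketch below reuses the same pattern; the class name SplitLimitsDemo and the withBackslashes helper are hypothetical. It shows that Pattern.split(text) drops trailing empty pieces unless a negative limit is passed, and that the {0,10} bound stops recognising the delimiter once more than 20 backslashes precede it.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitsDemo {
    // Same pattern as in RegexTest: {0,10} allows at most 10 backslash pairs.
    static final Pattern DELIM =
        Pattern.compile("(?<=(?<!\\\\)(?:\\\\\\\\){0,10}):");

    // Hypothetical helper: builds "x", then n backslashes, then ":y".
    static String withBackslashes(int n) {
        StringBuilder sb = new StringBuilder("x");
        for (int i = 0; i < n; i++) {
            sb.append('\\');
        }
        return sb.append(":y").toString();
    }

    public static void main(String[] args) {
        // split(text) removes trailing empty strings; split(text, -1) keeps them.
        System.out.println(Arrays.toString(DELIM.split("a:b:")));
        System.out.println(Arrays.toString(DELIM.split("a:b:", -1)));
        // 20 backslashes (10 pairs) are within the bound, so the ':' is a delimiter.
        System.out.println(DELIM.split(withBackslashes(20)).length);
        // 22 backslashes are also an even count but exceed the bound: no split.
        System.out.println(DELIM.split(withBackslashes(22)).length);
    }
}
```

If inputs with longer backslash runs are possible, the bound can be raised, at the cost of more look-behind work per candidate ':'.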

Another way would be to match the parts themselves, instead of splitting at the delimiters.

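// Additionally requires: import java.util.LinkedList; import java.util.List;
// `text` is the same input string as in the first example.
Pattern p2 = Pattern.compile("(?<=\\A|\\G:)((?:\\\\.|[^:\\\\])*)");
List<String> parts2 = new LinkedList<String>();
Matcher m = p2.matcher(text);
while (m.find()) {
    parts2.add(m.group(1)); // each match is one piece, escape sequences left intact
}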

The strange syntax stems from the need to handle empty pieces at the start and end of the string. When a match spans exactly zero characters, the next attempt starts one character past the end of it. If it didn't, it would match another empty string, and another, ad infinitum…

  • (?<=\\A|\\G:) looks behind for either the start of the string (the first piece) or the end of the previous match followed by the separator. If we used a consuming group, (?:\\A|\\G:), it would fail when the first piece is empty (input starts with a separator), because the next attempt would start past the separator it needs to consume.
  • \\\\. matches any escaped character (a backslash plus the character after it).
  • [^:\\\\] matches any single character that is neither the delimiter nor a backslash, i.e. a character outside an escape sequence (\\\\. has already consumed both characters of any escape).
  • ((?:\\\\.|[^:\\\\])*) captures all characters up to the first non-escaped delimiter into capture group 1.
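As a self-contained sketch of the matching approach, the class below wraps the same pattern in a method and runs it on an input with empty pieces at the start, in the middle and at the end. The names MatchPiecesDemo and pieces are made up for illustration, and ArrayList is used instead of LinkedList purely as a matter of taste.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchPiecesDemo {
    // Same pattern as p2: a piece starts at the beginning of the input or right
    // after the ':' that follows the previous match, and runs up to the next
    // unescaped ':'.
    static final Pattern PIECE =
        Pattern.compile("(?<=\\A|\\G:)((?:\\\\.|[^:\\\\])*)");

    // Hypothetical helper: collects capture group 1 of every match.
    static List<String> pieces(String text) {
        List<String> result = new ArrayList<String>();
        Matcher m = PIECE.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Empty pieces at the start, in the middle and at the end are all kept.
        System.out.println(pieces(":a::b:"));
        // Escaped delimiters and escaped backslashes stay inside their piece.
        System.out.println(pieces("foo:bar\\:baz\\\\:qux"));
    }
}
```

For ":a::b:" this yields five pieces: an empty string, "a", an empty string, "b", and another empty string. Unlike Pattern.split with its default limit, the trailing empty piece is not dropped.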
