
Splitting a String (especially in Java with java.util.regex or something else)

Does anyone know how to split a string on a character while taking its escape sequence into account?

For example, if the character is ':', "a:b" is split into two parts ("a" and "b"), whereas "a\:b" is not split at all.

I think this is hard (impossible?) to do with regular expressions.

Thank you in advance,

Kedar

(?<=^|[^\\]): gets you close, but it doesn't handle escaped backslashes. (That is the literal regex; you have to escape the backslashes again to put it into a Java string.)

(?<=(^|[^\\])(\\\\)*): How about that? I think it should match any ':' that is preceded by an even number of backslashes.

Edit: don't vote this up. MizardX's solution is better :)

Since Java supports variable-length look-behinds (as long as their length is bounded), you could do it like this:

import java.util.regex.*;

public class RegexTest {
    public static void main(String[] argv) {

        Pattern p = Pattern.compile("(?<=(?<!\\\\)(?:\\\\\\\\){0,10}):");

        String text = "foo:bar\\:baz\\\\:qux\\\\\\:quux\\\\\\\\:corge";

        String[] parts = p.split(text);

        System.out.printf("Input string: %s\n", text);
        for (int i = 0; i < parts.length; i++) {
            System.out.printf("Part %d: %s\n", i+1, parts[i]);
        }

    }
}
  • (?<=(?<!\\\\)(?:\\\\\\\\){0,10}) (written here with Java string escaping, as in the code above) looks behind for an even number of backslashes: zero or more pairs, up to a maximum of 10 pairs. The inner (?<!\\\\) makes sure no further backslash precedes those pairs.

Output:

Input string: foo:bar\:baz\\:qux\\\:quux\\\\:corge
Part 1: foo
Part 2: bar\:baz\\
Part 3: qux\\\:quux\\\\
Part 4: corge
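
The same look-behind can be wrapped in a small helper so that the delimiter is not hard-coded. This is only a sketch under a few assumptions: the class name EscapedSplit and the method splitUnescaped are made up for illustration, the delimiter is assumed not to contain a backslash, and the limit of 10 backslash pairs is carried over from the pattern above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class EscapedSplit {
    // Hypothetical helper: splits 'text' at every 'delimiter' that is preceded
    // by an even number of backslashes (zero to ten pairs).
    static String[] splitUnescaped(String text, String delimiter) {
        Pattern p = Pattern.compile(
            "(?<=(?<!\\\\)(?:\\\\\\\\){0,10})" + Pattern.quote(delimiter));
        // A limit of -1 keeps trailing empty pieces, which split(text) would drop.
        return p.split(text, -1);
    }

    public static void main(String[] args) {
        // The Java literal below is the text foo:bar\:baz\\:qux
        String text = "foo:bar\\:baz\\\\:qux";
        System.out.println(Arrays.toString(splitUnescaped(text, ":")));
    }
}
```

Pattern.quote is used so that a delimiter such as '.' or '|' is treated literally rather than as regex syntax.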

Another way would be to match the parts themselves, instead of splitting at the delimiters.

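// Needs java.util.List and java.util.LinkedList in addition to java.util.regex.*;
// 'text' is the input string from the first example.
Pattern p2 = Pattern.compile("(?<=\\A|\\G:)((?:\\\\.|[^:\\\\])*)");
List<String> parts2 = new LinkedList<String>();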
Matcher m = p2.matcher(text);
while (m.find()) {
    parts2.add(m.group(1));
}

The strange syntax is there because the pattern has to handle empty pieces at the start and end of the string. When a match spans exactly zero characters, the next attempt starts one character past the end of it. If it didn't, it would match another empty string, and another, ad infinitum…
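
The behaviour after a zero-length match can be observed directly. The following is a minimal sketch (the class EmptyMatchDemo and the method matchStarts are illustrative names, not part of the answer) that records where each match of a pattern starts:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchDemo {
    // Collects the start offset of every match of 'regex' in 'input'.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "x*" can only match the empty string in "ab", so every match is
        // zero-length and each new attempt starts one character further on.
        System.out.println(matchStarts("x*", "ab")); // prints [0, 1, 2]
    }
}
```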

  • (?<=\\A|\\G:) looks behind for either the start of the string (the first piece), or the end of the previous match followed by the separator. If we used (?:\\A|\\G:) instead, it would fail when the first piece is empty (input starts with a separator): after the empty first match the next attempt starts past the separator, so \\G: could no longer consume it.
  • \\\\. matches any escaped character, that is, a backslash together with the character after it.
  • [^:\\\\] matches any character that is neither the delimiter nor a backslash, so it only matches characters outside an escape sequence (\\\\. has already consumed both characters of each escape sequence).
  • ((?:\\\\.|[^:\\\\])*) captures all characters up to the first non-escaped delimiter into capture group 1.
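
For reference, the matching approach can be assembled into a small runnable class. This is a sketch only: the class name MatchPieces and the method pieces are made up for illustration, and an ArrayList is used where the snippet above uses a LinkedList (either works).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchPieces {
    // Same pattern as above: a piece must follow the start of the input or
    // the previous match plus one ':' separator.
    private static final Pattern PIECE =
        Pattern.compile("(?<=\\A|\\G:)((?:\\\\.|[^:\\\\])*)");

    // Returns the pieces between unescaped ':' separators, including empty ones.
    static List<String> pieces(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = PIECE.matcher(text);
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(pieces("foo:bar\\:baz")); // the text foo:bar\:baz
        System.out.println(pieces(":a::b:"));        // empty pieces are kept
    }
}
```

Unlike split(text), which drops trailing empty strings, this keeps the empty pieces at both ends of ":a::b:".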
