Regex matching unescaped commas in Java
Problem description
I am trying to split a string into separate strings with the split() method that the String class provides. The documentation tells me that it splits around matches of its argument, which is a regular expression. The delimiter I use is a comma, but commas can also be escaped. The escape character I use is a forward slash / (chosen to keep things simple: a backslash would need additional escaping in both Java string literals and regular expressions).
For instance, the input might be this:
a,b/,b//,c///,//,d///,
And the output should be:
a
b,b/
c/,/
d/,
So, the string should be split at each comma, unless that comma is preceded by an odd number of slashes (1, 3, 5, 7, ...), because that would mean that the comma is escaped.
Possible solutions
My initial guess would be to split it like this:
String[] strings = longString.split("(?<![^/](//)*/),");
but that is not allowed because Java doesn't allow unbounded repetition inside a look-behind group. I could limit the repetition to, say, 2000 by replacing the * with {0,2000}:
String[] strings = longString.split("(?<![^/](//){0,2000}/),");
but that still puts constraints on the input. So I decided to take the repetition out of the look-behind group, and came up with this:
String[] strings = longString.split("(?<!/)(?:(//)*),");
However, its output is the following list of strings:
a
b,b (the final slash is lacking in the output)
c/, (the final slash is lacking in the output)
d/,
Why are those slashes omitted in the 2nd and 3rd strings, and how can I solve it (in Java)?
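On the "why": split() throws away everything the pattern matches, and in the last pattern the (//)* group sits outside the look-behind, so any slash pairs directly in front of the comma become part of the delimiter. A minimal sketch that lists what the pattern actually matches (the class and method names DelimiterDemo and delimiters are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterDemo {
    // Collects every substring that the given regex would treat as a delimiter.
    static List<String> delimiters(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "a,b/,b//,c///,//,d///,";
        // Print each delimiter in brackets so the consumed slashes are visible.
        for (String d : delimiters("(?<!/)(?:(//)*),", input)) {
            System.out.println("[" + d + "]");
        }
    }
}
```

For the sample input the delimiters are a bare comma followed by two occurrences of //, (two slashes plus the comma), which is why the pieces before those two commas lose their trailing slashes.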
You can achieve the split using a positive look-behind for an even number of slashes preceding the comma:
String[] strings = longString.split("(?<=[^/](//){0,999999999}),");
But to display the output you want, you need a further step of removing the remaining escapes:
String longString = "a,b/,b//,c///,//,d///,";
String[] strings = longString.split("(?<=[^/](//){0,999999999}),");
for (String s : strings) {
    System.out.println(s.replaceAll("/(.)", "$1"));
}
Output:
a
b,b/
c/,/
d/,
You are pretty close. To overcome the look-behind error you can use this workaround, which bounds the repetition inside the look-behind:
String[] strings = longString.split("(?<![^/](//){0,99}/),");
If you don't mind another regex-based method, I suggest using .matcher:
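// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("(?:[^,/]+|/.)+");
String test = "a,b/,b//,c///,//,d///,";
Matcher matcher = pattern.matcher(test);
while (matcher.find()) {
    // each match is one field; strip the escape slashes before printing
    System.out.println(matcher.group().replaceAll("/(.)", "$1"));
}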
Output:
a
b,b/
c/,/
d/,
This method matches everything except the delimiting commas (the reverse of splitting, in a sense). The advantage is that it doesn't rely on lookarounds.
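One thing to keep in mind with the matching approach: because the pattern needs at least one character per match, an empty field between two adjacent commas produces no match at all. A small sketch, assuming you want the fields collected into a list (UnescapedFields and fields are hypothetical names, not part of any library):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnescapedFields {
    // One field = a run of ordinary characters and/or slash-escaped characters.
    private static final Pattern FIELD = Pattern.compile("(?:[^,/]+|/.)+");

    // Returns the non-empty fields of the input with the escape slashes removed.
    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            result.add(m.group().replaceAll("/(.)", "$1"));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("a,b/,b//,c///,//,d///,"));
        // Two fields only: the empty field between the commas is skipped.
        System.out.println(fields("a,,b").size());
    }
}
```

If empty fields matter for your data, a character-by-character scan is probably the simpler route.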
I love regexes, but wouldn't it be easy to write the code manually here, i.e.
boolean escaped = false;
for (int i = 0, len = s.length(); i < len; i++) {
    switch (s.charAt(i)) {
        case '/':
            // a slash toggles the escape state, so "//" cancels itself out
            escaped = !escaped;
            break;
        case ',':
            if (!escaped) {
                // found a segment, do something with it
            }
            // Fallthrough!
        default:
            escaped = false;
    }
}
// handle last segment
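Filled in, the manual scan might look like the sketch below. It is only one way to complete the idea: it uses if/else instead of the switch, unescapes while scanning, and assumes that a dangling slash at the very end of the input should be kept literally. EscapedCommaSplitter is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedCommaSplitter {
    // Splits on unescaped commas; '/' escapes whatever character follows it.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (escaped) {
                current.append(c);          // escaped character is taken literally
                escaped = false;
            } else if (c == '/') {
                escaped = true;             // drop the escape slash itself
            } else if (c == ',') {
                parts.add(current.toString());
                current.setLength(0);       // start the next segment
            } else {
                current.append(c);
            }
        }
        if (escaped) {
            current.append('/');            // assumption: keep a dangling trailing slash
        }
        parts.add(current.toString());      // handle last segment
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("a,b/,b//,c///,//,d///,")) {
            System.out.println(part);
        }
    }
}
```

Unlike String.split(), this version keeps empty segments, including a trailing one, so "a,,b" yields three parts.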