Split comma separated string with quotes and commas within quotes and escaped quotes within quotes
I searched as far as page 3 of Google's results for this problem, but there seems to be no proper solution.
The following string
"zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\'polo'"
should be split by comma in Java. The quotes can be double or single quotes. I tried the following regex:
,(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)
but because of the escaped quote in 'marc o\'polo' it fails...
Can somebody help me out?
Code to try it out:
// requires: import java.util.regex.Pattern;
// The backslash must be doubled in the Java literal so that it is actually part of the string
String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
Pattern COMMA_PATTERN = Pattern.compile(",(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)");
String[] splits = COMMA_PATTERN.split(checkString);
for (String split : splits) {
    System.out.println(split);
}
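To see why the lookahead breaks, it helps to reproduce the failure on a shorter input. The regex only splits at a comma that is followed by an even number of quote characters, and an escaped quote makes that count odd. The sketch below is a minimal illustration; the class name `QuoteParityDemo`, the `split` wrapper and the shortened sample strings are made up for this example.

```java
import java.util.regex.Pattern;

public class QuoteParityDemo {
    // The lookahead from the question: a comma is a split point only if an
    // even number of quote characters follows it up to the end of the input.
    static final Pattern COMMA_PATTERN =
            Pattern.compile(",(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)");

    static String[] split(String input) {
        return COMMA_PATTERN.split(input);
    }

    public static void main(String[] args) {
        // No escaped quote: the quote count stays balanced, so the commas
        // between fields are found and the comma inside 'b, c' is skipped.
        for (String s : split("a,'b, c','de'")) {
            System.out.println("[" + s + "]");
        }
        // Escaped quote in the last field: the parity flips, so the field
        // separators are skipped and the comma inside 'b, c' is split instead.
        for (String s : split("a,'b, c','d\\'e'")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

The regex has no notion of a backslash, so it counts the escaped quote like any other quote.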
You can do it like this:
// requires: java.util.ArrayList, java.util.List, java.util.regex.Matcher, java.util.regex.Pattern
List<String> result = new ArrayList<String>();
Pattern p = Pattern.compile("(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+", Pattern.DOTALL);
Matcher m = p.matcher(checkString);
while (m.find()) {
    result.add(m.group());
}
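Note that this pattern matches whole fields rather than separators, so quoted fields come back with their surrounding quotes and backslashes intact. If you want the bare values, one possible post-processing step is sketched below. The class `RegexSplitDemo` and the helper `stripQuotes` are hypothetical names for this example, and the sample input is a shortened version of the string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSplitDemo {
    // Same pattern as in the answer above.
    private static final Pattern FIELD = Pattern.compile(
            "(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+",
            Pattern.DOTALL);

    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            result.add(stripQuotes(m.group()));
        }
        return result;
    }

    // Hypothetical helper: removes one pair of matching outer quotes and
    // turns every backslash-escaped character into the character itself.
    static String stripQuotes(String field) {
        if (field.length() >= 2) {
            char first = field.charAt(0);
            char last = field.charAt(field.length() - 1);
            if ((first == '\'' || first == '"') && first == last) {
                String inner = field.substring(1, field.length() - 1);
                return inner.replaceAll("\\\\(.)", "$1");
            }
        }
        return field;
    }

    public static void main(String[] args) {
        String checkString = "zhg,'asdf, saflk','marc o\\'polo'";
        for (String field : fields(checkString)) {
            System.out.println(field);
        }
    }
}
```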
Splitting CSV with regex is not the right solution, which is probably why you are struggling to find one with split/csv/regex search terms.
Using a dedicated library with a state machine is typically the best solution. There are a number of them.
What I can say is that regex and CSV get very, very complicated relatively quickly (as you have discovered), and that for performance reasons alone, a 'raw' parser is better.
If you are parsing CSV (or something very similar), then using one of the established frameworks is normally a good idea, as they cover most corner cases and are tested by a wider audience through usage in different projects.
If, however, libraries are not an option, you could go with something like this:
import java.util.ArrayList;
import java.util.List;

public class Curios {

    public static void main(String[] args) {
        String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
        List<String> result = splitValues(checkString);
        System.out.println(result);
        System.out.println(splitValues("zhg\\,wi\\'mö,'astor wohnideen','multistore 2002',\"yo\\\"nza\",'asdf, saflk\\\\','marc o\\'polo',"));
    }

    public static List<String> splitValues(String checkString) {
        List<String> result = new ArrayList<String>();

        // Used for reporting errors and detecting quotes
        int startOfValue = 0;
        // Used to mark the next character as being escaped
        boolean charEscaped = false;
        // Is the current value quoted?
        boolean quoted = false;
        // Quote-character in use (only valid when quoted == true)
        char quote = '\0';
        // All characters read from current value
        final StringBuilder currentValue = new StringBuilder();

        for (int i = 0; i < checkString.length(); i++) {
            final char charAt = checkString.charAt(i);
            if (i == startOfValue && !quoted) {
                // We have not yet decided if this is a quoted value, but we are right at the beginning of the next value
                if (charAt == '\'' || charAt == '"') {
                    // This will be a quoted String
                    quote = charAt;
                    quoted = true;
                    startOfValue++;
                    continue;
                }
            }
            if (!charEscaped) {
                if (charAt == '\\') {
                    charEscaped = true;
                } else if (quoted && charAt == quote) {
                    if (i + 1 == checkString.length()) {
                        // So we don't throw an exception
                        quoted = false;
                        // Last value will be added to result outside loop
                        break;
                    } else if (checkString.charAt(i + 1) == ',') {
                        // Ensure we don't parse , again
                        i++;
                        // Add the value to the result
                        result.add(currentValue.toString());
                        // Prepare for next value
                        currentValue.setLength(0);
                        startOfValue = i + 1;
                        quoted = false;
                    } else {
                        throw new IllegalStateException(String.format(
                                "Value was quoted with %s but prematurely terminated at position %d " +
                                "maybe a \\ is missing before this %s or a , after? " +
                                "Value up to this point: \"%s\"",
                                quote, i, quote, checkString.substring(startOfValue, i + 1)));
                    }
                } else if (!quoted && charAt == ',') {
                    // Add the value to the result
                    result.add(currentValue.toString());
                    // Prepare for next value
                    currentValue.setLength(0);
                    startOfValue = i + 1;
                } else {
                    // a boring character
                    currentValue.append(charAt);
                }
            } else {
                // So we don't forget to reset for next char...
                charEscaped = false;
                // Here we can do interpolations
                switch (charAt) {
                    case 'n':
                        currentValue.append('\n');
                        break;
                    case 'r':
                        currentValue.append('\r');
                        break;
                    case 't':
                        currentValue.append('\t');
                        break;
                    default:
                        currentValue.append(charAt);
                }
            }
        }

        if (charEscaped) {
            throw new IllegalStateException("Input ended with a stray \\");
        } else if (quoted) {
            throw new IllegalStateException("Last value was quoted with " + quote + " but there is no terminating quote.");
        }

        // Add the last value to the result
        result.add(currentValue.toString());
        return result;
    }
}
Why not simply a regular expression?
Regular expressions don't understand nesting very well. While the regular expression by Casimir certainly does a good job, the differences between quoted and unquoted values are easier to model in some form of state machine. You can see how difficult it was to ensure you don't accidentally match an escaped or quoted "," with a regex. Also, since you are already evaluating every character, it is easy to interpret escape sequences like \n.
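To make the state-machine idea more explicit, here is a compact sketch that names the states with an enum instead of using boolean flags. It is not the code from above: the class `StateMachineSplit`, the `State` names and the `split` method are invented for illustration, and it deliberately takes every escaped character literally rather than interpreting sequences like \n.

```java
import java.util.ArrayList;
import java.util.List;

public class StateMachineSplit {
    // Hypothetical state names for this sketch.
    enum State { START_OF_VALUE, UNQUOTED, QUOTED, AFTER_QUOTE }

    static List<String> split(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder value = new StringBuilder();
        State state = State.START_OF_VALUE;
        char quote = '\0'; // only meaningful in QUOTED and AFTER_QUOTE

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);

            // A backslash makes the next character literal, except directly
            // after a closing quote, where only a comma is allowed.
            if (c == '\\' && state != State.AFTER_QUOTE) {
                if (++i == input.length()) {
                    throw new IllegalStateException("Input ended with a stray \\");
                }
                value.append(input.charAt(i));
                if (state == State.START_OF_VALUE) {
                    state = State.UNQUOTED;
                }
                continue;
            }

            switch (state) {
                case START_OF_VALUE:
                    if (c == '\'' || c == '"') {
                        quote = c;
                        state = State.QUOTED;
                    } else if (c == ',') {
                        result.add(""); // empty value
                    } else {
                        value.append(c);
                        state = State.UNQUOTED;
                    }
                    break;
                case UNQUOTED:
                    if (c == ',') {
                        result.add(value.toString());
                        value.setLength(0);
                        state = State.START_OF_VALUE;
                    } else {
                        value.append(c);
                    }
                    break;
                case QUOTED:
                    if (c == quote) {
                        state = State.AFTER_QUOTE;
                    } else {
                        value.append(c); // commas and the other quote type are plain text here
                    }
                    break;
                case AFTER_QUOTE:
                    if (c != ',') {
                        throw new IllegalStateException("Expected ',' at position " + i);
                    }
                    result.add(value.toString());
                    value.setLength(0);
                    state = State.START_OF_VALUE;
                    break;
            }
        }

        if (state == State.QUOTED) {
            throw new IllegalStateException("Missing closing " + quote);
        }
        result.add(value.toString()); // last value
        return result;
    }
}
```

Each rule ("a comma inside quotes is text", "a backslash protects the next character") lives in exactly one branch, which is what makes this easier to extend than a single pattern.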
What to watch out for?
The implementation above interprets the escape sequences \n, \r, \t and \\ like most C-style language interpreters, while reading \x as x (this can easily be changed).
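The escape handling on its own can be sketched as a small standalone method, which also shows where you would change the behavior for unknown sequences. The class `EscapeDemo` and the method `unescape` are hypothetical names for this example; the rules mirror the switch statement in the code above.

```java
public class EscapeDemo {
    // Interprets \n, \r and \t; any other escaped character (including a
    // second backslash or a quote) is copied through as itself.
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        boolean escaped = false;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (escaped) {
                escaped = false;
                switch (c) {
                    case 'n': out.append('\n'); break;
                    case 'r': out.append('\r'); break;
                    case 't': out.append('\t'); break;
                    // Change this branch to reject or keep unknown sequences instead.
                    default:  out.append(c);
                }
            } else if (c == '\\') {
                escaped = true;
            } else {
                out.append(c);
            }
        }
        if (escaped) {
            throw new IllegalStateException("Input ended with a stray \\");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("marc o\\'polo"));
        System.out.println(unescape("tab\\there"));
    }
}
```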