Split comma separated string with quotes and commas within quotes and escaped quotes within quotes

I searched as far as page 3 of the Google results for this problem, but there seems to be no proper solution.

The following string

"zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\'polo'"

should be split on commas in Java. The quotes can be double or single quotes. I tried the following regex

,(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)

but because of the escaped quote in 'marc o\'polo' it fails...

Can somebody help me out?

Code to try it out:

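import java.util.regex.Pattern;

// \\' puts a literal backslash before the quote, matching the string shown above
String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
// Split on commas that are followed by an even number of quote characters
Pattern COMMA_PATTERN = Pattern.compile(",(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)");
String[] splits = COMMA_PATTERN.split(checkString);
for (String split : splits) {
  System.out.println(split);
}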
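
To make the failure concrete, here is a minimal sketch that runs the same pattern on two short inputs. The class name QuoteCountDemo and the sample strings are made up for illustration. The lookahead only counts quote characters after each comma, so an escaped quote makes the count odd and every comma before it is no longer treated as a separator.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question
final class QuoteCountDemo {
    // Same pattern as in the question
    static final Pattern COMMA_PATTERN =
            Pattern.compile(",(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)");

    static String[] split(String input) {
        return COMMA_PATTERN.split(input);
    }

    public static void main(String[] args) {
        // Balanced quotes: the comma inside 'b, c' is left alone
        System.out.println(Arrays.toString(split("a,'b, c',d")));   // [a, 'b, c', d]
        // Escaped quote: three quote characters follow the first comma,
        // so the lookahead rejects it and only the last comma splits
        System.out.println(Arrays.toString(split("a,'b\\'c',d")));  // [a,'b\'c', d]
    }
}
```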

You can do it like this:

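import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> result = new ArrayList<String>();

// Each match is built from runs of unquoted characters, quoted strings
// (with backslash escapes), or an empty field between two separators
Pattern p = Pattern.compile("(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+", Pattern.DOTALL);
Matcher m = p.matcher(checkString);

while (m.find()) {
    result.add(m.group());
}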
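
Note that m.group() returns each field as it appears in the input, so quoted fields keep their surrounding quotes and backslashes. If you want the bare values, a small post-processing step is needed. The following is only a sketch: the class RegexFieldDemo and the helper unquote are hypothetical names, and the helper assumes that a backslash simply escapes the next character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from the answer above
final class RegexFieldDemo {
    static final Pattern FIELD = Pattern.compile(
            "(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+",
            Pattern.DOTALL);

    // Collects the raw matches, quotes and backslashes included
    static List<String> rawFields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Assumption: strip one pair of matching outer quotes, then treat
    // backslash + character as that character
    static String unquote(String field) {
        if (field.length() >= 2) {
            char first = field.charAt(0);
            char last = field.charAt(field.length() - 1);
            if ((first == '\'' || first == '"') && first == last) {
                field = field.substring(1, field.length() - 1);
            }
        }
        return field.replaceAll("(?s)\\\\(.)", "$1");
    }

    public static void main(String[] args) {
        for (String raw : rawFields("a,'b, c','d\\'e'")) {
            System.out.println(raw + " -> " + unquote(raw));
        }
    }
}
```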

Splitting CSV with regex is not the right solution... which is probably why you are struggling to find one with split/csv/regex search terms.

Using a dedicated library with a state machine is typically the best solution. There are a number of them:

  • This closed question seems relevant: https://stackoverflow.com/questions/12410538/which-is-the-best-csv-parser-in-java
  • I have used opencsv in the past, and I believe the Apache CSV tool is good too. I am sure there are others. I am specifically not linking any library because you should do your own research on what to use.
  • I have been involved in a number of commercial projects where the CSV parser was custom-built, but I see no reason why that should still be done.

What I can say is that regex and CSV get very, very complicated relatively quickly (as you have discovered), and that for performance reasons alone, a 'raw' parser is better.

If you are parsing CSV (or something very similar), then using one of the established frameworks is normally a good idea, as they cover most corner cases and are tested by a wider audience through usage in different projects.

If, however, libraries are not an option, you could go with something like this:

import java.util.ArrayList;
import java.util.List;

public class Curios {

    public static void main(String[] args) {
        String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
        List<String> result = splitValues(checkString);
        System.out.println(result);

        System.out.println(splitValues("zhg\\,wi\\'mö,'astor wohnideen','multistore 2002',\"yo\\\"nza\",'asdf, saflk\\\\','marc o\\'polo',"));
    }

    public static List<String> splitValues(String checkString) {
        List<String> result = new ArrayList<String>();

        // Used for reporting errors and detecting quotes
        int startOfValue = 0;
        // Used to mark the next character as being escaped
        boolean charEscaped = false;
        // Is the current value quoted?
        boolean quoted = false;
        // Quote-character in use (only valid when quoted == true)
        char quote = '\0';
        // All characters read from current value
        final StringBuilder currentValue = new StringBuilder();

        for (int i = 0; i < checkString.length(); i++) {
            final char charAt = checkString.charAt(i);
            if (i == startOfValue && !quoted) {
                // We have not yet decided if this is a quoted value, but we are right at the beginning of the next value
                if (charAt == '\'' || charAt == '"') {
                    // This will be a quoted String
                    quote = charAt;
                    quoted = true;
                    startOfValue++;
                    continue;
                }
            }
            if (!charEscaped) {
                if (charAt == '\\') {
                    charEscaped = true;
                } else if (quoted && charAt == quote) {
                    if (i + 1 == checkString.length()) {
                        // So we don't throw an exception
                        quoted = false;
                        // Last value will be added to result outside loop
                        break;
                    } else if (checkString.charAt(i + 1) == ',') {
                        // Ensure we don't parse , again
                        i++;
                        // Add the value to the result
                        result.add(currentValue.toString());
                        // Prepare for next value
                        currentValue.setLength(0);
                        startOfValue = i + 1;
                        quoted = false;
                    } else {
                        throw new IllegalStateException(String.format(
                                "Value was quoted with %s but prematurely terminated at position %d " +
                                        "maybe a \\ is missing before this %s or a , after? " +
                                        "Value up to this point: \"%s\"",
                                quote, i, quote, checkString.substring(startOfValue, i + 1)));
                    }
                } else if (!quoted && charAt == ',') {
                    // Add the value to the result
                    result.add(currentValue.toString());
                    // Prepare for next value
                    currentValue.setLength(0);
                    startOfValue = i + 1;
                } else {
                    // a boring character
                    currentValue.append(charAt);
                }
            } else {
                // So we don't forget to reset for next char...
                charEscaped = false;
                // Here we can do interpolations
                switch (charAt) {
                    case 'n':
                        currentValue.append('\n');
                        break;
                    case 'r':
                        currentValue.append('\r');
                        break;
                    case 't':
                        currentValue.append('\t');
                        break;
                    default:
                        currentValue.append(charAt);
                }
            }
        }
        if(charEscaped) {
            throw new IllegalStateException("Input ended with a stray \\");
        } else if (quoted) {
            throw new IllegalStateException("Last value was quoted with "+quote+" but there is no terminating quote.");
        }

        // Add the last value to the result
        result.add(currentValue.toString());

        return result;
    }

}

Why not simply a regular expression?

Regular expressions don't understand nesting very well. While the regular expression by Casimir certainly does a good job, the differences between quoted and unquoted values are easier to model as some form of state machine. You can see how difficult it was to ensure you don't accidentally match an escaped or quoted comma. Also, since you are already evaluating every character, it is easy to interpret escape sequences like \n.

What to watch out for?

  • My function was not written for whitespace around values (this can be changed)
  • My function will interpret the escape sequences \n, \r, \t, \\ like most C-style language interpreters, while reading \x as x (this can easily be changed)
  • My function accepts quotes and escapes inside unquoted values (this can easily be changed)
  • I did only a few tests and tried my best to achieve good memory management and timing, but you will need to see if it fits your needs.
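
As an illustration of how the first point could be changed, here is a rough sketch of a variant with explicit states that tolerates whitespace around values. It is not the function above: the names LenientSplitter, State and escaped are made up, it treats a backslash as "take the next character literally" without the \n, \r, \t interpretation, and it trims unquoted values, which is an assumption about what you want.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical whitespace-tolerant variant, written as an explicit state machine
final class LenientSplitter {
    private enum State { START, UNQUOTED, QUOTED, AFTER_QUOTE }

    static List<String> split(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder value = new StringBuilder();
        State state = State.START;
        char quote = '\0';

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (state == State.START) {
                if (Character.isWhitespace(c)) {
                    continue;                    // skip whitespace before a value
                }
                if (c == '\'' || c == '"') {
                    quote = c;                   // remember which quote opened the value
                    state = State.QUOTED;
                    continue;
                }
                state = State.UNQUOTED;          // c is handled by the switch below
            }
            switch (state) {
                case UNQUOTED:
                    if (c == ',') {
                        result.add(value.toString().trim());
                        value.setLength(0);
                        state = State.START;
                    } else if (c == '\\') {
                        value.append(escaped(input, ++i));
                    } else {
                        value.append(c);
                    }
                    break;
                case QUOTED:
                    if (c == quote) {
                        state = State.AFTER_QUOTE;
                    } else if (c == '\\') {
                        value.append(escaped(input, ++i));
                    } else {
                        value.append(c);
                    }
                    break;
                case AFTER_QUOTE:
                    if (c == ',') {
                        result.add(value.toString());
                        value.setLength(0);
                        state = State.START;
                    } else if (!Character.isWhitespace(c)) {
                        throw new IllegalStateException(
                                "Unexpected '" + c + "' after closing quote at position " + i);
                    }
                    break;
                default:
                    throw new AssertionError("unreachable state: " + state);
            }
        }
        if (state == State.QUOTED) {
            throw new IllegalStateException("Last value was quoted with " + quote
                    + " but there is no terminating quote.");
        }
        // Add the last value (empty if the input ended right after a comma)
        result.add(state == State.UNQUOTED ? value.toString().trim() : value.toString());
        return result;
    }

    // Returns the character following a backslash, or fails on a stray backslash
    private static char escaped(String input, int index) {
        if (index >= input.length()) {
            throw new IllegalStateException("Input ended with a stray \\");
        }
        return input.charAt(index);
    }
}
```

With this sketch, LenientSplitter.split("a , 'b, c' ,f") yields the three values a, "b, c" and f. Whitespace inside quotes is preserved, while whitespace around unquoted values is dropped.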
