
Split comma separated string with quotes and commas within quotes and escaped quotes within quotes

I searched Google as far as page 3 for this problem, but there seems to be no proper solution.

The following string

"zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\'polo'"

should be split on commas in Java. The quotes can be double or single. I tried the following regex:

,(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)

but it fails because of the escaped quote in 'marc o\'polo'.

Can somebody help me out?

Code for tryout:

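// Note: \\' is needed here; a plain \' in a Java literal is just a quote, with no backslash in the string
String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";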
Pattern COMMA_PATTERN = Pattern.compile(",(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)");
String[] splits = COMMA_PATTERN.split(checkString);
for (String split : splits) {
  System.out.println(split);
}
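
The failure can be reproduced in isolation. The following is a minimal sketch, not part of the original code: the class name `LookaheadSplitDemo` and the `split` helper are made up for illustration, and `\u00f6` stands for the `ö` in the sample. The lookahead only accepts a comma when an even number of quote characters follows it up to the end of the input, and it knows nothing about backslashes, so the escaped quote is counted like any other quote.

```java
import java.util.regex.Pattern;

class LookaheadSplitDemo {
    // Same pattern as above: a comma followed by an even number of quote characters
    static final Pattern COMMA_PATTERN =
            Pattern.compile(",(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)");

    static String[] split(String input) {
        return COMMA_PATTERN.split(input);
    }

    public static void main(String[] args) {
        // \\' in Java source is a backslash followed by a quote in the actual string
        String checkString =
                "zhg,wim\u00f6,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
        for (String part : split(checkString)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

The sample contains nine quote characters in total. Every real separator is followed by an odd number of them, so none is accepted, while the comma inside 'asdf, saflk' is followed by four and is accepted. The input is therefore cut into two pieces at exactly the wrong place.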

You can do it like this:

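import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> result = new ArrayList<String>();

// One match per field: either unquoted text, or a quoted run in which \x escapes are skipped over,
// or an empty field between two separators
Pattern p = Pattern.compile("(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+", Pattern.DOTALL);
Matcher m = p.matcher(checkString);

while(m.find()) {
    // m.group() is the field as written, including its surrounding quotes and any backslashes
    result.add(m.group());
}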
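
For trying this out, here is a self-contained sketch that wraps the same pattern in a small class. The names `RegexFieldDemo` and `fields` are made up for illustration, and `\u00f6` stands for the `ö` in the sample string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFieldDemo {
    // The pattern from the answer, unchanged
    static final Pattern FIELD = Pattern.compile(
            "(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+",
            Pattern.DOTALL);

    // Collects every match of the pattern; each match is one field
    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String checkString =
                "zhg,wim\u00f6,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
        for (String field : fields(checkString)) {
            System.out.println(field);
        }
    }
}
```

On the sample string this finds seven fields. Note that the matches are returned as written: quoted fields still carry their quotes, and 'marc o\'polo' still contains the backslash, so removing quotes and resolving escapes would be a separate step.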

Splitting CSV with regex is not the right solution... which is probably why you are struggling to find one with split/csv/regex search terms.

Using a dedicated library with a state machine is typically the best solution. There are a number of them:

  • This closed question seems relevant: https://stackoverflow.com/questions/12410538/which-is-the-best-csv-parser-in-java
  • I have used opencsv in the past, and I believe the Apache CSV tool is good too. I am sure there are others. I am deliberately not linking any library because you should do your own research on what to use.
  • I have been involved in a number of commercial projects where the CSV parser was custom-built, but I see no reason why that should still be done.

What I can say is that regex and CSV get very, very complicated relatively quickly (as you have discovered), and that for performance reasons alone, a 'raw' parser is better.

If you are parsing CSV (or something very similar), then using one of the established frameworks is normally a good idea, as they cover most corner cases and are tested by a wider audience through usage in different projects.

If, however, libraries are not an option, you could go with something like this:

import java.util.ArrayList;
import java.util.List;

public class Curios {

    public static void main(String[] args) {
        String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
        List<String> result = splitValues(checkString);
        System.out.println(result);

        System.out.println(splitValues("zhg\\,wi\\'mö,'astor wohnideen','multistore 2002',\"yo\\\"nza\",'asdf, saflk\\\\','marc o\\'polo',"));
    }

    public static List<String> splitValues(String checkString) {
        List<String> result = new ArrayList<String>();

        // Used for reporting errors and detecting quotes
        int startOfValue = 0;
        // Used to mark the next character as being escaped
        boolean charEscaped = false;
        // Is the current value quoted?
        boolean quoted = false;
        // Quote-character in use (only valid when quoted == true)
        char quote = '\0';
        // All characters read from current value
        final StringBuilder currentValue = new StringBuilder();

        for (int i = 0; i < checkString.length(); i++) {
            final char charAt = checkString.charAt(i);
            if (i == startOfValue && !quoted) {
                // We have not yet decided if this is a quoted value, but we are right at the beginning of the next value
                if (charAt == '\'' || charAt == '"') {
                    // This will be a quoted String
                    quote = charAt;
                    quoted = true;
                    startOfValue++;
                    continue;
                }
            }
            if (!charEscaped) {
                if (charAt == '\\') {
                    charEscaped = true;
                } else if (quoted && charAt == quote) {
                    if (i + 1 == checkString.length()) {
                        // So we don't throw an exception
                        quoted = false;
                        // Last value will be added to result outside loop
                        break;
                    } else if (checkString.charAt(i + 1) == ',') {
                        // Ensure we don't parse , again
                        i++;
                        // Add the value to the result
                        result.add(currentValue.toString());
                        // Prepare for next value
                        currentValue.setLength(0);
                        startOfValue = i + 1;
                        quoted = false;
                    } else {
                        throw new IllegalStateException(String.format(
                                "Value was quoted with %s but prematurely terminated at position %d " +
                                        "maybe a \\ is missing before this %s or a , after? " +
                                        "Value up to this point: \"%s\"",
                                quote, i, quote, checkString.substring(startOfValue, i + 1)));
                    }
                } else if (!quoted && charAt == ',') {
                    // Add the value to the result
                    result.add(currentValue.toString());
                    // Prepare for next value
                    currentValue.setLength(0);
                    startOfValue = i + 1;
                } else {
                    // a boring character
                    currentValue.append(charAt);
                }
            } else {
                // So we don't forget to reset for next char...
                charEscaped = false;
                // Here we can do interpolations
                switch (charAt) {
                    case 'n':
                        currentValue.append('\n');
                        break;
                    case 'r':
                        currentValue.append('\r');
                        break;
                    case 't':
                        currentValue.append('\t');
                        break;
                    default:
                        currentValue.append(charAt);
                }
            }
        }
        if(charEscaped) {
            throw new IllegalStateException("Input ended with a stray \\");
        } else if (quoted) {
            throw new IllegalStateException("Last value was quoted with "+quote+" but there is no terminating quote.");
        }

        // Add the last value to the result
        result.add(currentValue.toString());

        return result;
    }

}

Why not simply a regular expression?

Regular expressions don't handle nesting very well. While the regular expression by Casimir certainly does a good job, the differences between quoted and unquoted values are easier to model in some form of state machine. You have seen how difficult it is to ensure you don't accidentally match an escaped or quoted comma. Also, since you are already evaluating every character, it is easy to interpret escape sequences like \n.
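
The last point can be shown on its own. This is a minimal sketch, assuming the same escape rules as the function above; the names `EscapeDemo` and `unescape` are made up for illustration. A single boolean flag is all the state needed to resolve escapes while walking the characters once.

```java
class EscapeDemo {
    // Hypothetical helper: resolves backslash escapes in a single pass over the input
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        boolean escaped = false;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (escaped) {
                // The previous character was a backslash: interpret this one
                escaped = false;
                switch (c) {
                    case 'n': out.append('\n'); break;
                    case 'r': out.append('\r'); break;
                    case 't': out.append('\t'); break;
                    default:  out.append(c); // \x is read as x, so \' gives ' and \\ gives \
                }
            } else if (c == '\\') {
                // Remember the backslash, emit nothing yet
                escaped = true;
            } else {
                out.append(c);
            }
        }
        if (escaped) {
            throw new IllegalStateException("Input ended with a stray \\");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("marc o\\'polo"));
    }
}
```

Doing the same inside a regex-based split would need a second pass over every matched field.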

What to watch out for?

  • My function was not written for white-space around values (this can be changed)
  • My function will interpret the escape sequences \n, \r, \t and \\ like most C-style language interpreters, while reading \x as x (this can easily be changed)
  • My function accepts quotes and escapes inside unquoted values (this can easily be changed)
  • I ran only a few tests and tried my best to keep memory use and run time reasonable, but you will need to check whether it fits your needs.
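
As an illustration of the first point, here is one way the white-space handling could be changed. This is a simplified sketch and not the function above: the names `LenientSplitter` and `splitTrimmed` are made up, it reads any \x as x without interpreting \n, \r or \t, and it assumes white-space around a value is never significant.

```java
import java.util.ArrayList;
import java.util.List;

class LenientSplitter {
    // Hypothetical variant: ignores white-space before and after each value
    static List<String> splitTrimmed(String input) {
        List<String> result = new ArrayList<>();
        final int n = input.length();
        int i = 0;
        while (true) {
            // Skip white-space in front of the value
            while (i < n && Character.isWhitespace(input.charAt(i))) i++;
            StringBuilder value = new StringBuilder();
            if (i < n && (input.charAt(i) == '\'' || input.charAt(i) == '"')) {
                char quote = input.charAt(i++);
                boolean closed = false;
                while (i < n && !closed) {
                    char c = input.charAt(i++);
                    if (c == '\\') {
                        if (i == n) throw new IllegalStateException("Input ended with a stray \\");
                        value.append(input.charAt(i++)); // escaped character, taken literally
                    } else if (c == quote) {
                        closed = true;
                    } else {
                        value.append(c);
                    }
                }
                if (!closed) throw new IllegalStateException("Missing closing " + quote);
                // Skip white-space between the closing quote and the separator
                while (i < n && Character.isWhitespace(input.charAt(i))) i++;
                if (i < n && input.charAt(i) != ',') {
                    throw new IllegalStateException("Expected , at position " + i);
                }
            } else {
                while (i < n && input.charAt(i) != ',') {
                    char c = input.charAt(i++);
                    if (c == '\\') {
                        if (i == n) throw new IllegalStateException("Input ended with a stray \\");
                        c = input.charAt(i++); // escaped character, taken literally
                    }
                    value.append(c);
                }
                // Drop trailing white-space of an unquoted value
                int end = value.length();
                while (end > 0 && Character.isWhitespace(value.charAt(end - 1))) end--;
                value.setLength(end);
            }
            result.add(value.toString());
            if (i >= n) break;
            i++; // consume the separator
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitTrimmed(" zhg , 'asdf, saflk' ,  'marc o\\'polo' "));
    }
}
```

White-space inside quotes is kept, so a value that needs leading or trailing blanks has to be quoted.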
