Tokenize CSV line escape double quotes

I have a CSV line that is comma delimited:

1000102257,b,N,159999,3,4545656,4,,,,"6,125% NORDRHEIN-WESTF.LA.SCHA.R.239 21.12. "18"

Tokens that contain the comma delimiter (,) as content are double-quoted to escape it.

As you can see, the last token is enclosed in double quotes, but another double quote appears inside it ("18), which breaks the tokenizing:

"6,125% NORDRHEIN-WESTF.LA.SCHA.R.239 21.12. "18"

This is my code to split the line into tokens:

public static void main(String[] args) {
    // Split on a comma only if an even number of double quotes follows it,
    // i.e. the comma is assumed to sit outside any quoted token.
    final String csvSplitterEscapingQuotes = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";
    String strLine = "1000102257,b,N,159999,3,4545656,4,,,,\"6,125% NORDRHEIN-WESTF.LA.SCHA.R.239 21.12. \"18\"";
    String[] tokens = strLine.split(csvSplitterEscapingQuotes, -1);
}

How can I escape the middle double quotes that are inside a quoted token?
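
To make the failure concrete, here is a small self-contained sketch of the same split. The class name SplitFailureDemo and the helper split are made up for illustration; the regex is the one from the question, written with plain \" escapes. The lookahead accepts a comma only when an even number of double quotes follows it. The sample line contains three quotes, so every comma before the quoted token is rejected and only the comma inside the quotes is used as a separator.

```java
class SplitFailureDemo {
    // Same idea as the regex in the question: a comma counts as a separator
    // only if an even number of double quotes follows it up to the end of the line.
    static String[] split(String line) {
        final String csvSplitterEscapingQuotes = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";
        return line.split(csvSplitterEscapingQuotes, -1);
    }

    public static void main(String[] args) {
        String strLine = "1000102257,b,N,159999,3,4545656,4,,,,"
                + "\"6,125% NORDRHEIN-WESTF.LA.SCHA.R.239 21.12. \"18\"";
        String[] tokens = split(strLine);
        System.out.println(tokens.length + " tokens");
        for (String token : tokens) {
            System.out.println("[" + token + "]");
        }
    }
}
```

On the sample line this should yield two tokens instead of eleven: everything up to "6, and then the rest of the quoted text.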

Don't parse CSV yourself; use a library. Even a format as simple as CSV has nuances: fields may or may not be quoted, the file may or may not have a header, and so on. Besides, you have to test and maintain any code you've written, so writing less code and reusing libraries is a good thing.

There are plenty of CSV libraries for Java:

IMHO, the first two are the most popular.

Here is an example for Apache Commons CSV:

import java.io.FileReader;
import java.io.Reader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

final Reader in = new FileReader("counties.csv");
final Iterable<CSVRecord> records = CSVFormat.DEFAULT.parse(in);

for (final CSVRecord record : records) { // Simply iterate over the records with a for-each loop. All the parsing is handled for you
    String populationByIndex = record.get(7); // Indexes are zero-based
    String populationByName = record.get("population"); // Or, if your file has headers and the format is set up to read them, look the column up by name

    … // Do whatever you want with the population
}

Look how easy it is. And it will be similar with other parsers.

Just ignore the double quote that does not follow a comma or a line break

This regex, shown here without Java string escaping:

(".*"|[^,"]+|(?<=,)(?=,))

matches the fields between the commas while leaving the comma inside the quotes alone. Here's how it works:

(                          // Start the match
 ".*"                      // Greedily match anything in quotes
     |[^,"]+               // Or, greedily match anything that isn't a comma or quote
            |(?<=,)(?=,)   // Or, look behind for a comma and ahead for a comma
                           //    (the empty match)
                        )  // End match.

Of course, this won't match empty fields at the beginning or end of a comma-delimited string, but you can add an extra bit:

|^(?=,)           // At the beginning, look forward for a comma
       |(?<=,)$   // Look back for a comma, and at the end

So the whole pattern is:

(".*"|[^,"]+|(?<=,)(?=,))|^(?=,)|(?<=,)$
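
In Java the pattern has to be escaped as a string literal and, because it matches fields rather than separators, applied with Matcher.find() instead of String.split(). A minimal sketch, where the class name QuoteAwareTokenizer and the method tokenize are hypothetical names chosen for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareTokenizer {
    // The whole pattern from above, escaped for a Java string literal.
    private static final Pattern FIELD = Pattern.compile(
            "(\".*\"|[^,\"]+|(?<=,)(?=,))|^(?=,)|(?<=,)$");

    // Collects every match of the pattern; each match is one field.
    // Quoted fields keep their surrounding quotes.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = FIELD.matcher(line);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String strLine = "1000102257,b,N,159999,3,4545656,4,,,,"
                + "\"6,125% NORDRHEIN-WESTF.LA.SCHA.R.239 21.12. \"18\"";
        for (String token : tokenize(strLine)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

On the line from the question this should give eleven tokens, with the three empty fields preserved and the quoted text as the last one. One caveat: ".*" is greedy, so on a line with more than one quoted field it matches from the first quote to the last one and swallows everything in between.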

But as @madhead says, unless this is homework, use a library!
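
For the homework case, the rule from the top of this answer can also be written as a small hand-rolled scanner instead of a regex. This is only a sketch under one assumption: inside a quoted field, a double quote closes the field only when a comma or the end of the line follows it, and any other quote is kept as content. The names LenientCsvLine and split are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class LenientCsvLine {
    // Splits a single line. A quote at the start of a field opens it; inside a
    // quoted field a quote closes it only if a comma or the end of the line follows.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean fieldStart = true;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            boolean isLast = i == line.length() - 1;
            if (inQuotes) {
                if (c == '"' && (isLast || line.charAt(i + 1) == ',')) {
                    inQuotes = false;       // real closing quote, dropped from the value
                } else {
                    current.append(c);      // commas and stray quotes are content
                }
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
                fieldStart = true;
            } else if (c == '"' && fieldStart) {
                inQuotes = true;            // opening quote, dropped from the value
                fieldStart = false;
            } else {
                current.append(c);
                fieldStart = false;
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```

Unlike the regex, this strips the enclosing quotes, so the last field of the sample line comes out as 6,125% NORDRHEIN-WESTF.LA.SCHA.R.239 21.12. "18. It still breaks when the content itself contains a quote directly followed by a comma, and it does not treat a doubled quote as an escape, which is one more reason to prefer a library for real data.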
