
How to parse a comma separated line (CSV) with some items in quotation marks?

I am trying to parse a comma separated string using:

val array = input.split(",")

Then I noticed that some input lines have a "," inside quotation marks:

data0, "data1", data2, data3, "data4-1, data4-2, data4-3", data5

*Note that the data is not very clean: some fields are inside quotation marks while others are not.


How do I split such a line into:

array(0) = data0
array(1) = data1
array(2) = data2
array(3) = data3
array(4) = data4-1, data4-2, data4-3
array(5) = data5

As per my comments:

Parsing CSV files is notoriously tricky because of how quotes behave, and because commas and quotes can themselves appear inside quoted values. I recommend pulling in a well-regarded library that deals robustly with all the edge cases.

Options you could consider include scala-csv and traversable-csv, or a Java library such as opencsv.

Otherwise, if you don't want to or can't use a library, you could look at this SO answer or this SO answer to see how others have tackled roll-your-own CSV parsers.
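
For the roll-your-own route, a minimal sketch of a quote-aware splitter in Java might look like the following. It is not taken from either linked answer. The class name CsvLineSplitter and the decision to trim whitespace around unquoted fields are assumptions, chosen to match the sample line in the question. It handles a single line only, so quoted fields that span line breaks are out of scope.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a small state machine that splits one CSV line on commas.
class CsvLineSplitter {

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;   // true while between an opening and a closing quote
        boolean wasQuoted = false;  // true once the current field has had a quoted part

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    current.append(c);               // commas are ordinary characters here
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"');             // "" inside quotes is one literal quote
                    i++;
                } else {
                    inQuotes = false;                // closing quote
                }
            } else if (c == ',') {
                // Field boundary: quoted fields are kept verbatim, unquoted ones trimmed.
                fields.add(wasQuoted ? current.toString() : current.toString().trim());
                current.setLength(0);
                wasQuoted = false;
            } else if (c == '"' && current.toString().trim().isEmpty()) {
                current.setLength(0);                // drop whitespace before the opening quote
                inQuotes = true;
                wasQuoted = true;
            } else if (!(wasQuoted && Character.isWhitespace(c))) {
                current.append(c);                   // skip whitespace after a closing quote
            }
        }
        // The last field has no trailing comma, so add it explicitly.
        fields.add(wasQuoted ? current.toString() : current.toString().trim());
        return fields;
    }
}
```

Called on the line from the question, CsvLineSplitter.split should return the six fields data0, data1, data2, data3, "data4-1, data4-2, data4-3" (as one field, without the quotes) and data5. Malformed input such as an unterminated quote is accepted leniently rather than rejected, which is one more reason to prefer a library for real data.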

I would recommend using a CSV library to parse CSV data - the format is a mess and painful to get right.

I would suggest kantan.csv, mainly because I'm the author but also because it lets you go a bit further than turning a CSV stream into a list of arrays of strings. Take, for example, the following input:

1,Foo,2.0
2,Bar,false

Using kantan.csv, you can write:

import java.io.File
import kantan.csv.ops._

new File("path/to/csv").asUnsafeCsvRows[(Int, String, Either[Float, Boolean])](',', false)

Calling toList on the result will yield:

List((1,Foo,Left(2.0)), (2,Bar,Right(false)))

Note how the last column is either a float or a boolean, and how this is captured in the type of each element of the iterator.

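Below is my solution for parsing a CSV row:

// Splits on every ';', so this assumes the delimiter never occurs inside a quoted field.
String[] res = row.split(";");
for (int i = 0; i < res.length; i++) {
    res[i] = deQuotes(res[i]); // strip surrounding quotes and unescape "" (see deQuotes)
}
return res;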

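Remove the surrounding quotes with a regular expression:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Captures everything between a leading and a trailing double quote.
static final Pattern PATTERN_DE_QUOTES = Pattern.compile("^\"(.*)\"$");

static String deQuotes(String s) {
    Matcher matcher = PATTERN_DE_QUOTES.matcher(s);
    if (matcher.find()) {
        // Inside a quoted field, "" stands for one literal quote.
        return matcher.group(1).replace("\"\"", "\"");
    }
    return s;
}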

I hope it will help you.
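
A plain split on the delimiter also cuts through any delimiter that sits inside a quoted field. One common workaround, sketched here as an assumption rather than as part of the answer above, is to split only on delimiters that are followed by an even number of double quotes. The class name QuoteAwareSplit is hypothetical, and the delimiter is kept as ';' to match the code above.

```java
import java.util.regex.Pattern;

// Hypothetical helper: splits on ';' only when the ';' is outside double quotes.
class QuoteAwareSplit {

    // The lookahead succeeds only if the rest of the row contains an even
    // number of '"' characters, which means the ';' is not inside a quoted field.
    private static final Pattern DELIMITER =
            Pattern.compile(";(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    static String[] split(String row) {
        String[] fields = DELIMITER.split(row, -1); // -1 keeps trailing empty fields
        for (int i = 0; i < fields.length; i++) {
            String f = fields[i].trim();
            if (f.length() >= 2 && f.startsWith("\"") && f.endsWith("\"")) {
                // Strip the surrounding quotes and unescape doubled quotes.
                f = f.substring(1, f.length() - 1).replace("\"\"", "\"");
            }
            fields[i] = f;
        }
        return fields;
    }
}
```

This relies on the quotes in the row being balanced. It also rescans the remainder of the row for every candidate delimiter, so it is better suited to short lines than to very long ones.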

You can actually split that line with a regular expression.

val s = """data0, "data1", data2, data3, "data4-1, data4-2, data4-3", data5"""

"""((".*?")|('.*?')|[^"',]+)+""".r.findAllIn(s).foreach(println)

By the way, any library that can parse CSV files can also parse a single CSV line: just wrap the string in a StringReader.
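
The regular expression above returns each match as it appears in the input, so the leading spaces and the surrounding quotes are still attached, and empty fields (two commas in a row) produce no match at all. As a rough sketch, the same pattern can be used from Java with a small clean-up step. The class name RegexFieldFinder and the trimming rules are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds fields with the same pattern, then tidies each match.
class RegexFieldFinder {

    // Same expression as above: runs of double-quoted text, single-quoted text,
    // or characters that are neither quotes nor commas.
    private static final Pattern FIELD =
            Pattern.compile("((\".*?\")|('.*?')|[^\"',]+)+");

    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String f = m.group().trim(); // drop the whitespace around the field
            boolean doubleQuoted = f.startsWith("\"") && f.endsWith("\"");
            boolean singleQuoted = f.startsWith("'") && f.endsWith("'");
            if (f.length() >= 2 && (doubleQuoted || singleQuoted)) {
                f = f.substring(1, f.length() - 1); // drop the surrounding quotes
            }
            result.add(f);
        }
        return result;
    }
}
```

For the sample line in the question this should give the six fields without quotes or leading spaces. Because empty fields are skipped, column positions can shift on rows that contain them, so this is only suitable when every field is known to be non-empty.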
