I am trying to parse the following line into a data array:
"John,Doe","123 Main St","Brown Eyes"
I want the data array to look like this:
data(0) = John,Doe
data(1) = 123 Main St
data(2) = Brown Eyes
I used the following CSV parser, found on a website:
import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  override protected val whiteSpace = """[ \t]""".r

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]]     = rep1sep(field, COMMA)
  def field: Parser[String]            = escaped | nonescaped

  def escaped: Parser[String] =
    (DQUOTE ~> ((TXT | COMMA | CR | LF | DQUOTE2)*) <~ DQUOTE) ^^ { case ls => ls.mkString("") }
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}
But all the spaces get trimmed; the data array actually looks like:
data(0) = John,Doe
data(1) = 123MainSt
data(2) = BrownEyes
How do I stop the CSV parser from removing this whitespace? Thanks!
Your code says to take a sequence of escaped or nonescaped tokens and join them with no intervening space:
...* ^^ { case ls => ls.mkString("") }
Per the docs for RegexParsers, try turning off skipWhitespace:
override val skipWhitespace = false
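Applied to the parser from the question, a minimal sketch might look like this (it needs the scala-parser-combinators module on the classpath; `rep(...)` stands in for the postfix `*` to avoid the postfix-operator warning, and `skipWhitespace` is overridden without `protected`, since it is public in `RegexParsers`):

```scala
import scala.util.parsing.combinator._

// Same grammar as in the question, but with whitespace skipping disabled
// so that spaces inside fields survive parsing.
object CSV extends RegexParsers {
  override val skipWhitespace = false

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ (_ => "\"")   // a doubled quote unescapes to one quote
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]]     = rep1sep(field, COMMA)
  def field: Parser[String]            = escaped | nonescaped

  def escaped: Parser[String] =
    (DQUOTE ~> rep(TXT | COMMA | CR | LF | DQUOTE2) <~ DQUOTE) ^^ (_.mkString)
  def nonescaped: Parser[String] = rep(TXT) ^^ (_.mkString)

  def parse(s: String): List[List[String]] = parseAll(file, s) match {
    case Success(res, _) => res
    case _               => Nil
  }
}
```

With this change, `CSV.parse("\"John,Doe\",\"123 Main St\",\"Brown Eyes\"")` should return `List(List("John,Doe", "123 Main St", "Brown Eyes"))`, spaces included.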
Is there a particular reason for hand-writing a CSV decoder instead of using one of the many existing, well-tested ones, like OpenCSV or the Jackson CSV module? It should be much simpler to use an existing lib, and you wouldn't bump into various issues in trying to unescape quotes, trim (or not) spaces, and so on.
The precise answer to your question was given by Robert Starling: set skipWhitespace to false.
The answer to the question "how do I parse CSV reliably?", which I'm assuming is what you really want to know, is "use a dedicated library".
You can use one of the Java ones - opencsv, commons-csv, jackson-csv, univocity... or one of the Scala ones - product-collections, purecsv, kantan.csv...
Don't write your own without a good reason - I wrote tabulate because I needed better type handling than was available at the time - and if you do, don't use one of the Scala parser combinator libraries: they load the whole data as a string in memory before parsing, which doesn't scale at all when your data starts growing.
If you must write your own and want to use a parser combinator library (because, let's face it, it's a fun problem and those libraries are cool), consider fastparse instead, or parboiled, which are both of a higher quality than the standard Scala one.
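As a sketch of the library route, here is one of the Java options called from Scala (assuming Apache Commons CSV on the classpath and Scala 2.13+; the object name `CsvLibDemo` is just for illustration):

```scala
import java.io.StringReader

import org.apache.commons.csv.CSVFormat
import scala.jdk.CollectionConverters._

object CsvLibDemo extends App {
  val line = "\"John,Doe\",\"123 Main St\",\"Brown Eyes\""

  // CSVFormat.DEFAULT follows RFC 4180: quoted fields, embedded commas
  // and doubled quotes are all handled, and whitespace is preserved.
  val record = CSVFormat.DEFAULT.parse(new StringReader(line)).getRecords.asScala.head
  val data   = record.iterator().asScala.toVector

  println(data) // Vector(John,Doe, 123 Main St, Brown Eyes)
}
```

The same few lines get you quote unescaping and whitespace handling for free, which is the point of the recommendation above.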
You can do this job in one line:

line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")

The regex is from here; apply it to every line of your file. It splits only on commas that sit outside a pair of quotes.
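Applied to the sample line from the question, a small self-contained sketch (the object name `SplitDemo` is illustrative; note that `split` keeps the surrounding quotes, so they still have to be stripped, and doubled `""` escapes are not unescaped by this approach):

```scala
object SplitDemo extends App {
  val line = "\"John,Doe\",\"123 Main St\",\"Brown Eyes\""

  // Split on commas followed by an even number of quotes up to the end of
  // the line, i.e. commas that are outside any quoted field.
  val splitter = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"

  val data = line.split(splitter).map(_.stripPrefix("\"").stripSuffix("\""))

  data.foreach(println)
  // John,Doe
  // 123 Main St
  // Brown Eyes
}
```

The lookahead is what keeps the comma inside "John,Doe" from splitting the field: from that comma, an odd number of quotes remains ahead, so the assertion fails.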
Code to parse the CSV file:

scala> scala.io.Source.fromFile("toto.csv").getLines.toList.map(_.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))