
Scala CSV parser removed spaces

I am trying to parse the following line into a data array:

"John,Doe","123 Main St","Brown Eyes"

I wanted to have an array data like below:

data(0) = John,Doe
data(1) = 123 Main St
data(2) = Brown Eyes

I used the following CSV parser from a website:

import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  override protected val whiteSpace = """[ \t]""".r  // treat spaces and tabs (but not newlines) as skippable whitespace

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]] = rep1sep(field, COMMA)
  def field: Parser[String] = (escaped|nonescaped)
  def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}

However, all the spaces are trimmed. The resulting data array actually looks like this:

data(0) = John,Doe
data(1) = 123MainSt
data(2) = BrownEyes

How do I keep the CSV parser from removing this whitespace? Thanks!

Your code takes the sequence of tokens inside each escaped or nonescaped field and joins them with no intervening space:

...* ^^ { case ls => ls.mkString("") }

Per the docs for RegexParsers,

  • The parsing methods call the method skipWhitespace (defaults to true) and, if true, skip any whitespace before each parser is called.
  • Protected val whiteSpace returns a regex that identifies whitespace.
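
In other words, the framework consumes the spaces before each token parser even runs, so mkString never sees them. Here is a minimal reproduction of that behaviour (a sketch, not part of the original answer), using a single-character token like the question's TXT:

import scala.util.parsing.combinator._

object Demo extends RegexParsers {
  // same single-character token as the question; it would match a space,
  // but the space is skipped by the framework before TXT is ever tried
  def TXT = "[^\",\r\n]".r
  def chars: Parser[String] = rep(TXT) ^^ (_.mkString(""))
}

Demo.parseAll(Demo.chars, "123 Main St")
// parses successfully, but the result is "123MainSt": the spaces were
// skipped before each TXT application, not joined into the output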

Try turning off skipWhitespace:

override val skipWhitespace = false
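
Applied to the parser from the question, the change looks like this (a sketch: only skipWhitespace is new, and the postfix * calls are written as rep(...), which is otherwise equivalent):

import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  override val skipWhitespace = false  // keep spaces and tabs inside fields

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]] = rep1sep(field, COMMA)
  def field: Parser[String] = escaped | nonescaped
  def escaped: Parser[String] =
    (DQUOTE ~> rep(TXT | COMMA | CR | LF | DQUOTE2) <~ DQUOTE) ^^ (_.mkString(""))
  def nonescaped: Parser[String] = rep(TXT) ^^ (_.mkString(""))

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}

CSV.parse("\"John,Doe\",\"123 Main St\",\"Brown Eyes\"")
// List(List(John,Doe, 123 Main St, Brown Eyes)) -- spaces preserved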

Is there a particular reason for hand-writing a CSV decoder instead of using one of the many existing, well-tested ones, like OpenCSV or the Jackson CSV module? It would be much simpler to use an existing library, and you wouldn't bump into the various issues of unescaping quotes, trimming (or not trimming) spaces, and so on.
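
For instance, here is a minimal sketch with OpenCSV (assuming a recent com.opencsv artifact on the classpath; the sample line is the one from the question, and this snippet is not part of the original answer):

import com.opencsv.CSVReader
import java.io.StringReader

// OpenCSV handles the quoting and the embedded comma for you
val reader = new CSVReader(new StringReader("\"John,Doe\",\"123 Main St\",\"Brown Eyes\""))
val data: Array[String] = reader.readNext()
// data(0) == "John,Doe", data(1) == "123 Main St", data(2) == "Brown Eyes"
reader.close()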

The precise answer to your question was given by Robert Starling: set skipWhitespace to false.

The answer to the question "how do I parse CSV reliably?", which I'm assuming is what you really want to know, is "use a dedicated library".

You can use one of the Java ones - opencsv, commons-csv, jackson-csv, univocity... or one of the Scala ones - product-collections, purecsv, kantan.csv...
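
As a sketch of what one of the Scala options looks like, kantan.csv can decode each row straight into a List[String] (the imports below assume a recent kantan.csv version and are not from the original answer):

import kantan.csv._        // core types such as ReadResult and rfc
import kantan.csv.ops._    // adds asCsvReader to strings, files, etc.

val raw = "\"John,Doe\",\"123 Main St\",\"Brown Eyes\""
// rfc is the RFC 4180 configuration; each row decodes to a List[String]
val rows: List[ReadResult[List[String]]] =
  raw.asCsvReader[List[String]](rfc).toList
// successful rows wrap List("John,Doe", "123 Main St", "Brown Eyes")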

Don't write your own without a good reason - I wrote tabulate because I needed better type handling than was available at the time - and if you do, don't use one of the Scala parser combinator libraries: they load the whole input into memory as a string before parsing, which doesn't scale at all once your data starts growing.

If you must write your own and want to use a parser combinator library (because, let's face it, it's a fun problem and those libraries are cool), consider fastparse instead, or parboiled, which are both of a higher quality than the standard Scala one.

You can do this job in one line:

line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")

The regex comes from here; apply it to every line of your file. It splits only on the commas that fall outside a pair of quotes.

Code to parse the CSV file:

scala> scala.io.Source.fromFile("toto.csv").getLines.toList.map(_.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))
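
Note that splitting this way keeps the surrounding double quotes on each field. A small follow-up sketch (assuming fields contain no escaped quotes; not part of the original answer) that strips them:

val line = "\"John,Doe\",\"123 Main St\",\"Brown Eyes\""
val data = line
  .split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
  .map(_.stripPrefix("\"").stripSuffix("\""))
// data(0) == "John,Doe", data(1) == "123 Main St", data(2) == "Brown Eyes"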
