
Scala CSV parser removed spaces

I am trying to parse the following line into a data array:

"John,Doe","123 Main St","Brown Eyes"

I wanted to have an array data like below:

data(0) = John,Doe
data(1) = 123 Main St
data(2) = Brown Eyes

I used the following CSV parser from a website:

import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  override protected val whiteSpace = """[ \t]""".r  // treat spaces and tabs (but not newlines) as skippable whitespace

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]] = rep1sep(field, COMMA)
  def field: Parser[String] = (escaped|nonescaped)
  def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}

However, all the spaces are trimmed. The resulting data array actually looks like this:

data(0) = John,Doe
data(1) = 123MainSt
data(2) = BrownEyes

How do I keep the CSV parser from removing this whitespace? Thanks!

Your code takes the sequence of tokens inside each escaped or nonescaped field and joins them with no intervening space:

...* ^^ { case ls => ls.mkString("") }

Per the docs for RegexParsers,

  • The parsing methods call the method skipWhitespace (defaults to true) and, if true, skip any whitespace before each parser is called.
  • Protected val whiteSpace returns a regex that identifies whitespace.
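
In other words, the framework consumes the spaces before each token parser even runs, so mkString never sees them. Here is a minimal reproduction of that behaviour (a sketch, not part of the original answer), using a single-character token like the question's TXT:

import scala.util.parsing.combinator._

object Demo extends RegexParsers {
  // same single-character token as the question; it would match a space,
  // but the space is skipped by the framework before TXT is ever tried
  def TXT = "[^\",\r\n]".r
  def chars: Parser[String] = rep(TXT) ^^ (_.mkString(""))
}

Demo.parseAll(Demo.chars, "123 Main St")
// parses successfully, but the result is "123MainSt": the spaces were
// skipped before each TXT application, not joined into the output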

Try turning off skipWhitespace:

override val skipWhitespace = false
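
Applied to the parser from the question, the change looks like this (a sketch: only skipWhitespace is new, and the postfix * calls are written as rep(...), which is otherwise equivalent):

import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  override val skipWhitespace = false  // keep spaces and tabs inside fields

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]] = rep1sep(field, COMMA)
  def field: Parser[String] = escaped | nonescaped
  def escaped: Parser[String] =
    (DQUOTE ~> rep(TXT | COMMA | CR | LF | DQUOTE2) <~ DQUOTE) ^^ (_.mkString(""))
  def nonescaped: Parser[String] = rep(TXT) ^^ (_.mkString(""))

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}

CSV.parse("\"John,Doe\",\"123 Main St\",\"Brown Eyes\"")
// List(List(John,Doe, 123 Main St, Brown Eyes)) -- spaces preserved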

Is there a particular reason for hand-writing a CSV decoder instead of using one of the many existing, well-tested ones, like OpenCSV or the Jackson CSV module? It would be much simpler to use an existing library, and you wouldn't bump into the various issues of unescaping quotes, trimming (or not trimming) spaces, and so on.
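
For instance, here is a minimal sketch with OpenCSV (assuming a recent com.opencsv artifact on the classpath; the sample line is the one from the question, and this snippet is not part of the original answer):

import com.opencsv.CSVReader
import java.io.StringReader

// OpenCSV handles the quoting and the embedded comma for you
val reader = new CSVReader(new StringReader("\"John,Doe\",\"123 Main St\",\"Brown Eyes\""))
val data: Array[String] = reader.readNext()
// data(0) == "John,Doe", data(1) == "123 Main St", data(2) == "Brown Eyes"
reader.close()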

The precise answer to your question was given by Robert Starling: set skipWhitespace to false.

The answer to the question "how do I parse CSV reliably?", which I'm assuming is what you really want to know, is "use a dedicated library".

You can use one of the Java ones - opencsv, commons-csv, jackson-csv, univocity... or one of the Scala ones - product-collections, purecsv, kantan.csv...
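
As a sketch of what one of the Scala options looks like, kantan.csv can decode each row straight into a List[String] (the imports below assume a recent kantan.csv version and are not from the original answer):

import kantan.csv._        // core types such as ReadResult and rfc
import kantan.csv.ops._    // adds asCsvReader to strings, files, etc.

val raw = "\"John,Doe\",\"123 Main St\",\"Brown Eyes\""
// rfc is the RFC 4180 configuration; each row decodes to a List[String]
val rows: List[ReadResult[List[String]]] =
  raw.asCsvReader[List[String]](rfc).toList
// successful rows wrap List("John,Doe", "123 Main St", "Brown Eyes")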

Don't write your own without a good reason - I wrote tabulate because I needed better type handling than was available at the time - and if you do, don't use one of the Scala parser combinator libraries: they load the whole input into memory as a string before parsing, which doesn't scale at all once your data starts growing.

If you must write your own and want to use a parser combinator library (because, let's face it, it's a fun problem and those libraries are cool), consider fastparse instead, or parboiled, which are both of a higher quality than the standard Scala one.

You can do this job in one line:

line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")

The regex comes from here; apply it to every line of your file. It splits only on the commas that fall outside a pair of quotes.

Code to parse the CSV file:

scala> scala.io.Source.fromFile("toto.csv").getLines.toList.map(_.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))
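
Note that splitting this way keeps the surrounding double quotes on each field. A small follow-up sketch (assuming fields contain no escaped quotes; not part of the original answer) that strips them:

val line = "\"John,Doe\",\"123 Main St\",\"Brown Eyes\""
val data = line
  .split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
  .map(_.stripPrefix("\"").stripSuffix("\""))
// data(0) == "John,Doe", data(1) == "123 Main St", data(2) == "Brown Eyes"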
