Scala CSV parser with comments
First of all, credit where it is due: this code is based on the solution from "Use Scala parser combinator to parse CSV files".
The CSV files I want to parse can contain comments, i.e. lines starting with #. To avoid confusion: despite the name, the files are tab-separated. There are further constraints that would make the parser a lot simpler, but since I am completely new to Scala I thought it best to stay as close to the (working) original as possible.
The problem is that I get a type mismatch. Obviously the regex for a comment does not yield a list. I was hoping that Scala would interpret a comment as a one-element list, but this is not the case.
So how do I need to modify my code so that it can handle these comment lines? And closely related: is there an elegant way to query the parser result, so that in myfunc I can write something like
if (isComment(a)) continue
So here is the actual code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.util.parsing.combinator._

object MyParser extends RegexParsers {
  override val skipWhitespace = false // meaningful spaces in CSV

  def COMMA = ","
  def TAB = "\t"
  def DQUOTE = "\""
  def HASHTAG = "#"
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" } // combine 2 dquotes into 1
  def CRLF = "\r\n" | "\n"
  def TXT = "[^\",\r\n]".r
  def SPACES = "[ ]+".r

  def file: Parser[List[List[String]]] = repsep((comment|record), CRLF) <~ (CRLF?)
  def comment: Parser[List[String]] = HASHTAG<~TXT
  def record: Parser[List[String]] = "[^#]".r<~repsep(field, TAB)
  def field: Parser[String] = escaped|nonescaped

  def escaped: Parser[String] = {
    ((SPACES?)~>DQUOTE~>((TXT|COMMA|CRLF|DQUOTE2)*)<~DQUOTE<~(SPACES?)) ^^ {
      case ls => ls.mkString("")
    }
  }

  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def applyParser(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case e => throw new Exception(e.toString)
  }

  def myfunc( a: (String, String)) = {
    val parserResult = applyParser(a._2)
    println("APPLY PARSER FOR " + a._1)
    for( a <- parserResult ){
      a.foreach { println }
    }
  }

  def main(args: Array[String]) {
    val filesPath = "/home/user/test/*.txt"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.wholeTextFiles(filesPath).cache()
    logData.foreach( x => myfunc(x))
  }
}
Since the parser for comment and the parser for record are "or-ed" together, they must be of the same type.
You need to make the following changes:
def comment: Parser[List[String]] = HASHTAG <~ "[^\r\n]*".r ^^^ List() // TXT alone would match only a single character
By using ^^^ we are converting the result of the parser (which is the result returned by the HASHTAG parser) to an empty List.
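To make the difference between ^^ and ^^^ concrete, here is a minimal sketch. It assumes the scala-parser-combinators library is on the classpath; the object name ResultMapping and the parsers upper and skipped are hypothetical names chosen for illustration only.

```scala
import scala.util.parsing.combinator.RegexParsers

object ResultMapping extends RegexParsers {
  // Whitespace is significant here, as in the CSV parser.
  override val skipWhitespace = false

  // ^^ feeds the parsed value into a function.
  val upper: Parser[String] = "[a-z]+".r ^^ (s => s.toUpperCase)

  // ^^^ discards the parsed value and returns a fixed result instead.
  // Assumption: a comment is '#' followed by the rest of the line.
  val skipped: Parser[List[String]] = ("#" <~ "[^\r\n]*".r) ^^^ List.empty[String]
}
```

With ^^^ the text of the comment is lost, which is fine as long as comments only need to be skipped.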
Also change:
def record: Parser[List[String]] = repsep(field, TAB)
Note that because the comment and record parsers are or-ed and comment comes first, a row that begins with "#" will be parsed by the comment parser.
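The effect of the ordering can be sketched as follows. This is an illustration under stated assumptions, not the parser from the question: OrderedChoice, anyLine, commentFirst and recordFirst are hypothetical names, and anyLine simply accepts everything up to the end of the line.

```scala
import scala.util.parsing.combinator.RegexParsers

object OrderedChoice extends RegexParsers {
  override val skipWhitespace = false

  // Matches '#' followed by the rest of the line.
  val comment: Parser[String] = "#" ~> "[^\r\n]*".r ^^ (t => "comment:" + t)
  // Matches any line, including one that starts with '#'.
  val anyLine: Parser[String] = "[^\r\n]*".r ^^ (t => "record:" + t)

  // | tries the left parser first and falls back to the right one on failure.
  val commentFirst: Parser[String] = comment | anyLine
  val recordFirst: Parser[String] = anyLine | comment
}
```

Since anyLine also accepts a line starting with "#", recordFirst never reaches its comment alternative, whereas commentFirst only falls back to anyLine for lines that do not start with "#".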
Edit:
In order to keep the comment text as part of the parser's output (say, if you want to print it later), and because alternatives combined with | must share a common result type, you can do the following:
Define the following classes:
trait Line
case class Comment(text: String) extends Line
case class Record(elements: List[String]) extends Line
Now define the comment, record and file parsers as follows:
// TXT matches a single character, so the rest of the comment line is matched with a regex
val comment: Parser[Comment] = "#" ~> "[^\r\n]*".r ^^ Comment
val record: Parser[Line] = repsep(field, TAB) ^^ Record
val file: Parser[List[Line]] = repsep(comment | record, CRLF) <~ (CRLF?)
Now you can define the printing function myfunc:
def myfunc( a: (String, String)) = {
  parseAll(file, a._2).map { lines =>
    lines.foreach {
      case Comment(t) => println(s"This is a comment: $t")
      case Record(elems) => println(s"This is a record: ${elems.mkString(",")}")
    }
  }
}
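Putting the pieces together, the following is a self-contained sketch of the approach without Spark. It rests on several assumptions that are not in the original code: fields are unquoted and end at the next tab or line break, a comment runs to the end of its line, and the names TsvParser, eol, parse and records are hypothetical. It requires the scala-parser-combinators library.

```scala
import scala.util.parsing.combinator.RegexParsers

sealed trait Line
final case class Comment(text: String) extends Line
final case class Record(elements: List[String]) extends Line

object TsvParser extends RegexParsers {
  // Tabs and line breaks are significant, so do not skip whitespace.
  override val skipWhitespace = false

  private val eol: Parser[String] = "\r\n" | "\n"
  // Assumption: a comment is '#' followed by the rest of the line.
  private val comment: Parser[Line] = "#" ~> "[^\r\n]*".r ^^ (t => Comment(t))
  // Assumption: fields are unquoted and end at the next tab or line break.
  private val field: Parser[String] = "[^\t\r\n]*".r
  private val record: Parser[Line] = repsep(field, "\t") ^^ (fs => Record(fs))
  private val file: Parser[List[Line]] = repsep(comment | record, eol) <~ opt(eol)

  // Left carries the parser's error message, Right the parsed lines.
  def parse(input: String): Either[String, List[Line]] =
    parseAll(file, input) match {
      case Success(lines, _) => Right(lines)
      case failure: NoSuccess => Left(failure.msg)
    }

  // Drops comments and keeps only the fields of each record.
  def records(lines: List[Line]): List[List[String]] =
    lines.collect { case Record(fields) => fields }
}
```

Calling TsvParser.parse on the text of one file and then TsvParser.records on the result leaves only the data rows. The collect call plays the role of the if (isComment(a)) continue check from the question, without needing a separate isComment predicate.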