Using parboiled2 to parse multiple lines instead of a String

I would like to use parboiled2 to parse multiple CSV lines, one at a time, instead of a single CSV String. The result would be something like:

val parser = new CSVRecordParser(fieldSeparator)
io.Source.fromFile("my-file").getLines().map(line => parser.record.run(line))

where CSVRecordParser is my parboiled2 parser for CSV records. The problem is that, from what I've tried, I cannot do this, because parboiled2 parsers require the input in the constructor, not in the run method. So I can either create a new parser for each line, which does not seem good, or find a way to pass a new input to the same parser for every line. I tried to hack the parser a bit by making the input a variable and wrapping the parser in another object:

object CSVRecordParser {

  private object CSVRecordParserWrapper extends Parser with StringBuilding {

    val textBase = CharPredicate.Printable -- '"'
    val qTextData = textBase ++ "\r\n"

    var input: ParserInput = _
    var fieldDelimiter: Char = _

    def record = rule { zeroOrMore(field).separatedBy(fieldDelimiter) ~> (Seq[String] _) }
    def field = rule { quotedField | unquotedField }
    def quotedField = rule {
      '"' ~ clearSB() ~ zeroOrMore((qTextData | '"' ~ '"') ~ appendSB()) ~ '"' ~ ows ~ push(sb.toString)
    }
    def unquotedField = rule { capture(zeroOrMore(textData)) }
    def textData = textBase -- fieldDelimiter

    def ows = rule { zeroOrMore(' ') }
  }

  def parse(input: ParserInput, fieldDelimiter: Char): Result[Seq[String]] = {
    CSVRecordParserWrapper.input = input
    CSVRecordParserWrapper.fieldDelimiter = fieldDelimiter
    wrapTry(CSVRecordParserWrapper.record.run())
  }
}

and then I just call CSVRecordParser.parse(input, separator) whenever I want to parse a line. Besides being horrible, this doesn't work: I often get strange errors related to previous usages of the parser. I know this is not the way a parser should be written with parboiled2, so I was wondering what the best way is to achieve this with the library.

Why not add an end-of-record rule to the parser?

def EOR = rule { "\r\n" | "\n" }

def record = rule { zeroOrMore(field).separatedBy(fieldDelimiter) ~ EOR ~> (Seq[String] _) }

Then you can pass in as many lines as you want as a single input, with each record terminated by a line break.
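
To make the idea concrete, here is a minimal sketch of a parser that takes the whole text at once. The class name CSVFileParser and the file rule are hypothetical additions, not part of the question or of parboiled2; the field rules are copied from the question. It assumes every record, including the last one, ends with a line break.

```scala
import org.parboiled2._

// Hypothetical multi-record parser: the whole text is passed to the constructor.
class CSVFileParser(val input: ParserInput, fieldDelimiter: Char)
    extends Parser with StringBuilding {

  val textBase  = CharPredicate.Printable -- '"'
  val qTextData = textBase ++ "\r\n"
  val textData  = textBase -- fieldDelimiter

  // Assumed top-level rule: any number of terminated records, then end of input.
  def file = rule { zeroOrMore(record) ~ EOI }

  // One record: delimiter-separated fields followed by the end-of-record rule.
  def record = rule { zeroOrMore(field).separatedBy(fieldDelimiter) ~ EOR }

  def field = rule { quotedField | unquotedField }
  def quotedField = rule {
    '"' ~ clearSB() ~ zeroOrMore((qTextData | '"' ~ '"') ~ appendSB()) ~ '"' ~ ows ~ push(sb.toString)
  }
  def unquotedField = rule { capture(zeroOrMore(textData)) }

  def EOR = rule { "\r\n" | "\n" }
  def ows = rule { zeroOrMore(' ') }
}

object CSVFileParserDemo {
  // With the default delivery scheme, run() returns a scala.util.Try.
  val parsed = new CSVFileParser("a,b\nc,d\n", ',').file.run()
}
```

The trade-off is that the whole input has to be available as one ParserInput, for example by reading the file into a String first, so this is not line-by-line streaming.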

I've parsed CSV files of over 1 million records with parboiled2, in a project that requires high speed and low resource usage, and I find that instantiating a new parser for each line works well.

I tried this approach after noticing that the parboiled2 README describes the parsers as extremely lightweight.

I have not even needed to raise the JVM memory or heap limits from their defaults. Instantiating a parser for each line works very well.
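
A minimal sketch of that approach, assuming the question's rules are moved into an ordinary class whose constructor takes the input. The trailing EOI in record and the helper parseLines are my own additions for illustration, not something taken from the question.

```scala
import org.parboiled2._
import scala.util.Try

// The question's rules as a normal parboiled2 parser: input is a constructor argument.
class CSVRecordParser(val input: ParserInput, fieldDelimiter: Char)
    extends Parser with StringBuilding {

  val textBase  = CharPredicate.Printable -- '"'
  val qTextData = textBase ++ "\r\n"
  val textData  = textBase -- fieldDelimiter

  // EOI is an assumption added here so that a line must be consumed completely.
  def record = rule { zeroOrMore(field).separatedBy(fieldDelimiter) ~ EOI }

  def field = rule { quotedField | unquotedField }
  def quotedField = rule {
    '"' ~ clearSB() ~ zeroOrMore((qTextData | '"' ~ '"') ~ appendSB()) ~ '"' ~ ows ~ push(sb.toString)
  }
  def unquotedField = rule { capture(zeroOrMore(textData)) }

  def ows = rule { zeroOrMore(' ') }
}

object CSVLines {
  // Hypothetical helper: builds a fresh parser for every line, so no state is shared.
  def parseLines(lines: Iterator[String], fieldDelimiter: Char): Iterator[Try[Seq[String]]] =
    lines.map(line => new CSVRecordParser(line, fieldDelimiter).record.run())
}
```

With this in place, the call from the question becomes something like CSVLines.parseLines(io.Source.fromFile("my-file").getLines(), fieldSeparator). Because each parser instance lives only for one line, errors cannot leak from one line into the next, which is what went wrong with the shared singleton.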
