
Why is Scala's combinator parsing slow when parsing large files? What can I do?

I need to parse files that have millions of lines. I noticed that my combinator parser gets slower and slower as it parses more and more lines. The problem seems to be in Scala's `rep` or regex parsers, because this behaviour occurs even for the simple example parser shown below:

def file: Parser[Int] = rep(line) ^^^ { 1 }  // a file is a repetition of lines

def line: Parser[Int] = """(?m)^.*$""".r ^^^ { 0 } // reads a line and returns 0

When I try to parse a file with millions of lines of equal length with this simple parser, in the beginning it parses 46 lines/ms. After 370,000 lines, the speed drops to 20 lines/ms. After 840,000 lines, it drops to 10 lines/ms. After 1,790,000 lines, 5 lines/ms...

My questions are:

  • Why does this happen?

  • What can I do to prevent this?

This is probably a result of the change in Java 7u6 where `String.substring` no longer shares the original string's backing array. So big strings get copied over and over, causing lots and lots of memory churn (among other things). As you increase the amount of stuff you've parsed (I'm assuming you're storing at least some of it), the garbage collector has more and more work to do, so creating all that extra garbage carries a steeper and steeper penalty.

There is a ticket to fix the memory usage, and code from Zach Moazeni there that lets you wrap your strings inside a construct that creates substrings properly (which you can pass into the parser in place of strings).
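The idea behind that fix can be sketched as a `CharSequence` view that shares the underlying string and implements `subSequence` by adjusting offsets rather than copying. The class name below is my own illustration, not the actual code from the ticket:

```scala
// A CharSequence view over a String that never copies on subSequence.
// Illustrative sketch only; the code on the ticket may differ in detail.
class SharedCharSequence(s: String, start: Int, end: Int) extends CharSequence {
  def this(s: String) = this(s, 0, s.length)
  def length: Int = end - start
  def charAt(i: Int): Char = s.charAt(start + i)
  def subSequence(from: Int, until: Int): CharSequence =
    new SharedCharSequence(s, start + from, start + until) // shares s, no copy
  override def toString: String = s.substring(start, end)
}

val seq = new SharedCharSequence("hello world")
val sub = seq.subSequence(6, 11) // O(1), no character data copied
println(sub) // world
```

Because the parser's regex machinery only needs a `CharSequence`, passing such a wrapper avoids the repeated O(n) substring copies.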

This won't necessarily change the overall result that parsing eventually slows down, but it should help reduce the overall time.

Also, I wouldn't advise making a file be a repetition of lines. You're making the parser keep track of the entire file when it really need not. I'd feed it in a line at a time. (And then, if the lines are short, you may not need the above fix.)
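A minimal sketch of the line-at-a-time approach, assuming a `line` parser like the one in the question (the object and method names are my own; `scala.util.parsing.combinator` is a separate module in recent Scala versions):

```scala
import scala.util.parsing.combinator.RegexParsers

object LineParser extends RegexParsers {
  // Same shape as the question's line parser, applied to one line at a time
  // instead of wrapping the whole file in rep(line).
  def line: Parser[Int] = """.*""".r ^^^ { 0 }

  // Parses each line independently, so the parser never sees the whole file.
  def parseFile(lines: Iterator[String]): Int = {
    var count = 0
    for (l <- lines) {
      parseAll(line, l) match {
        case Success(_, _) => count += 1
        case failure       => sys.error(failure.toString)
      }
    }
    count
  }
}

// Usage: stream the file so memory stays flat regardless of file size.
// val n = LineParser.parseFile(scala.io.Source.fromFile("big.txt").getLines())
```

Each call to `parseAll` works on a short, fresh string, so neither the substring copying nor the accumulated `rep` results can grow with file size.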

