
ANTLR4 listener for CSV grammar causes OutOfMemoryError for big files

I have a relatively simple ANTLR4 grammar for CSV files that may contain a header line and then only data lines whose values are separated by spaces. The values are as follows: Double Double Int String Date Time, where Date is in yyyy-mm-dd format and Time is in hh:mm:ss.xxx format.

This resulted in the following grammar:

grammar CSVData;

start       :   (headerline | dataline) (NL dataline)* ;

headerline  :   STRING (' ' STRING)* ;
dataline    :   FLOAT ' ' FLOAT ' ' INT ' ' STRING ' ' DAY ' ' TIME ; //lat lon floor hid day time

NL          :   '\r'? '\n' ;
DAY         :   INT '-' INT '-' INT ; //yyyy-mm-dd
TIME        :   INT ':' INT ':' INT '.' INT ; //hh:mm:ss.xxx
INT         :   DIGIT+ ;
FLOAT       :   '-'? DIGIT* '.' DIGIT+ ;
STRING      :   LETTER (LETTER | DIGIT | SPECIALCHAR)* | (DIGIT | SPECIALCHAR)+ LETTER (LETTER | DIGIT | SPECIALCHAR)* ;

fragment LETTER     :   [A-Za-z] ;
fragment DIGIT      :   [0-9] ;
fragment SPECIALCHAR:   [_:] ;

In my Java application I use a listener that extends CSVDataBaseListener and only overrides the enterDataline(CSVDataParser.DatalineContext ctx) method. There I simply fetch the tokens and create one object for every line.
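
Stripped down, that listener looks roughly like the sketch below. LocationRecord is just a placeholder for my own per-line value class, and the main method only shows how everything is wired up; the token accessors (FLOAT(0), INT(), DAY(), ...) are the ones ANTLR generates for the dataline rule above.

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DatalineListener extends CSVDataBaseListener {

    // LocationRecord is a placeholder for my own per-line value class.
    private final List<LocationRecord> records = new ArrayList<>();

    @Override
    public void enterDataline(CSVDataParser.DatalineContext ctx) {
        // The generated context has one accessor per named token in the rule;
        // FLOAT occurs twice, so it is accessed by index.
        double lat   = Double.parseDouble(ctx.FLOAT(0).getText());
        double lon   = Double.parseDouble(ctx.FLOAT(1).getText());
        int    floor = Integer.parseInt(ctx.INT().getText());
        String hid   = ctx.STRING().getText();
        records.add(new LocationRecord(lat, lon, floor, hid,
                ctx.DAY().getText(), ctx.TIME().getText()));
    }

    public static void main(String[] args) throws IOException {
        // The whole file is buffered by ANTLRInputStream before lexing starts.
        CSVDataLexer lexer = new CSVDataLexer(
                new ANTLRInputStream(new FileInputStream(args[0])));
        CSVDataParser parser = new CSVDataParser(new CommonTokenStream(lexer));
        DatalineListener listener = new DatalineListener();
        new ParseTreeWalker().walk(listener, parser.start());
        System.out.println("lines: " + listener.records.size());
    }
}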

When loading a file of 10 MB this all works as intended. But when I try to load a file of 110 MB, my application causes an OutOfMemoryError: GC overhead limit exceeded. I'm running my application with 1 GB of RAM, so in my opinion the file size shouldn't be a problem.

I also tried writing a simple parser in plain Java that uses String.split(" "). This parser works as intended, also for the 110 MB input file.
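
For comparison, the hand-written version boils down to something like this sketch (LocationRecord again stands in for my value class; the header check is only a crude heuristic based on the grammar above, since a data line always starts with a FLOAT containing a dot):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SplitParser {

    public static List<LocationRecord> parse(String fileName) throws IOException {
        List<LocationRecord> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(" ");
                // Skip the optional header line (and anything malformed):
                // data lines have exactly six fields and start with a float.
                if (parts.length != 6 || !parts[0].contains(".")) {
                    continue;
                }
                records.add(new LocationRecord(
                        Double.parseDouble(parts[0]),
                        Double.parseDouble(parts[1]),
                        Integer.parseInt(parts[2]),
                        parts[3], parts[4], parts[5]));
            }
        }
        return records;
    }
}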

To get an estimate of the size of the objects I create, I simply serialized my objects as suggested in this answer. The resulting size for the 110 MB input file was 86,513,392 bytes, which is far from exhausting the 1 GB of RAM.
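
The measurement is essentially the following sketch: serialize the whole object graph into a byte array and count the bytes, which only gives a rough lower bound on the real heap footprint (LocationRecord has to implement Serializable for this to work):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SizeEstimate {

    // Rough estimate: the serialized size of the object graph,
    // not its actual footprint on the heap.
    public static long serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.size();
    }

    // Usage, e.g.: serializedSize(new ArrayList<>(records));
}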

So I'd like to know why ANTLR needs so much RAM for such a simple grammar. Is there any way to improve my grammar so that ANTLR uses less memory?

EDIT

I did some deeper memory analysis by loading a file with 1 million lines (approx. 77 MB on disk). For every single line my grammar finds 12 tokens (the six values per line plus five spaces and one newline). This could be stripped down to six tokens per line if the grammar ignored whitespace, but that's still a lot worse than writing a parser yourself.

For 1 million input lines the memory dumps had the following sizes:

  • My grammar above: 1,926 MB
  • The grammar finding six tokens per line: 1,591 MB
  • My self-written parser: 415 MB

So having fewer tokens also results in less memory being used, but for simple grammars like this I'd still recommend writing your own parser: it isn't that hard anyway, and you save a lot of the memory that would otherwise go to ANTLR's overhead.

ANSWER

According to your grammar, I'm going to assume that your input uses ASCII characters. If you store the file on disk as UTF-8, then simply loading the file into the ANTLRInputStream, which uses UTF-16, will consume 220 MB (roughly 110 million characters at two bytes each). In addition to that you'll have overhead of approximately 48 bytes per CommonToken (last I checked), along with overhead from the DFA cache and the ParserRuleContext instances.

The only way to get an accurate picture of the memory used by a Java application is through a profiler, and in 64-bit mode not all profilers properly account for compressed OOP object storage (YourKit does, though). The first thing to try is simply increasing the allowed heap size (for example via the -Xmx JVM option). Once you know the specific data structure(s) using the memory, you can target that area for reduction.
