What causes ANTLR to create a huge token stream and run out of memory?

Question

One of our web applications regularly dies because it runs out of memory. The sparse data we gathered from memory dumps suggests there is an issue in our ANTLR parsing implementation. What we see is an ANTLR token stream containing more than a million items. The input text that causes this has yet to be found.

Is it possible that this is somehow related to a zero-width item being matched? Could there be another issue in the grammar that results in excessive memory usage?

Here is the current grammar we use:

grammar AdvancedQueries;

options {
  language = Java;
  output = AST;
  ASTLabelType=CommonTree;
}

tokens {
FOR;
END;
FIELDSEARCH;
TARGETFIELD;
RELATION;
NOTNODE;
ANDNODE;
NEARDISTANCE;
OUTOFPLACE;
}

@header {
package de.bsmo.fast.parsing;
}

@lexer::header {
package de.bsmo.fast.parsing;
}

startExpression  : orEx;

expressionLevel4    
: LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression | outofplace;

expressionLevel3    
: (fieldExpression) | expressionLevel4 ;

expressionLevel2    
: (nearExpression) | expressionLevel3 ;

expressionLevel1    
: (countExpression) | expressionLevel2 ;


notEx   : NOT^? a=expressionLevel1 ;

andEx   : (notEx        -> notEx)
(AND? a=notEx -> ^(ANDNODE $andEx $a))*;

orEx    : andEx (OR^  andEx)*;

countExpression  : COUNT LPARENTHESIS countSub RPARENTHESIS RELATION NUMBERS -> ^(COUNT countSub RELATION NUMBERS);

countSub 
    :   orEx;

nearExpression  : NEAR LPARENTHESIS (WORD|PHRASE) MULTIPLESEPERATOR (WORD|PHRASE) MULTIPLESEPERATOR NUMBERS RPARENTHESIS -> ^(NEAR WORD* PHRASE* ^(NEARDISTANCE NUMBERS));

fieldExpression : WORD PROPERTYSEPERATOR fieldSub  -> ^(FIELDSEARCH ^(TARGETFIELD WORD) fieldSub );

fieldSub 
    :   WORD | PHRASE | LPARENTHESIS! orEx RPARENTHESIS!;  

atomicExpression 
: WORD
| PHRASE
| NUMBERS
;

//Out-of-place elements may appear in the parseable input but need to be omitted from the output later.
//Those unwanted elements are captured here.
//MULTIPLESEPERATOR captures an unwanted ","
outofplace
: MULTIPLESEPERATOR -> ^(OUTOFPLACE ^(MULTIPLESEPERATOR));

fragment NUMBER : ('0'..'9');
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'?');
fragment QUOTE     : ('"');
fragment LESSTHEN : '<';
fragment MORETHEN: '>';
fragment EQUAL: '=';
fragment SPACE     : ('\u0009'|'\u0020'|'\u000C'|'\u00A0');

fragment WORDMATTER:  ('!'|'0'..'9'|'\u0023'..'\u0027'|'*'|'+'|'\u002D'..'\u0039'|'\u003F'..'\u007E'|'\u00A1'..'\uFFFE');

LPARENTHESIS : '(';
RPARENTHESIS : ')';

AND    : ('A'|'a')('N'|'n')('D'|'d');
OR     : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT    : ('N'|'n')('O'|'o')('T'|'t');
COUNT:('C'|'c')('O'|'o')('U'|'u')('N'|'n')('T'|'t');
NEAR:('N'|'n')('E'|'e')('A'|'a')('R'|'r');
PROPERTYSEPERATOR : ':';
MULTIPLESEPERATOR : ',';

WS     : (SPACE) { $channel=HIDDEN; };
NUMBERS : (NUMBER)+;
RELATION : (LESSTHEN | MORETHEN)? EQUAL // '<=', '>=', or '='
 | (LESSTHEN | MORETHEN);        // '<' or '>'
PHRASE : (QUOTE)(.)*(QUOTE);
WORD   : WORDMATTER* ;

Answer 1

The most common cause of this is a token that can have length 0. An infinite number of such tokens can be matched between any two other tokens in the input. Defining a token like this now results in a compiler warning in ANTLR 4.

The following rule can match the empty string:

WORD : WORDMATTER*;

Perhaps you meant to use the following instead?

WORD : WORDMATTER+;
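
To illustrate why a zero-width token can fill memory, here is a minimal sketch in plain Java. It is a toy simulation, not ANTLR's actual lexer: the class `ZeroWidthTokenDemo`, the methods `matchWordChars` and `tokenize`, and the `maxTokens` safety cap are all hypothetical names made up for this example, and "word characters" are simplified to letters and digits. The flag `allowEmptyWord` stands in for the difference between `WORDMATTER*` and `WORDMATTER+`.

```java
import java.util.ArrayList;
import java.util.List;

public class ZeroWidthTokenDemo {

    // Counts how many consecutive word characters start at pos
    // (simplified to letters and digits for this toy example).
    static int matchWordChars(String input, int pos) {
        int end = pos;
        while (end < input.length() && Character.isLetterOrDigit(input.charAt(end))) {
            end++;
        }
        return end - pos;
    }

    // Toy lexer loop. allowEmptyWord = true models WORD : WORDMATTER* ;
    // allowEmptyWord = false models WORD : WORDMATTER+ ;
    // maxTokens is a safety cap added only so that the demo terminates.
    static List<String> tokenize(String input, boolean allowEmptyWord, int maxTokens) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length() && tokens.size() < maxTokens) {
            int len = matchWordChars(input, pos);
            if (len > 0 || allowEmptyWord) {
                // With a zero-width match this adds "" and pos does not move,
                // so the loop emits empty tokens until the cap is reached.
                tokens.add(input.substring(pos, pos + len));
                pos += len;
            } else {
                // No rule matches: skip the character (a real lexer would report an error).
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // ';' is used here as a character that the word rule cannot consume.
        System.out.println(tokenize("foo ; bar", false, 1000).size()); // prints 2
        System.out.println(tokenize("foo ; bar", true, 1000).size());  // prints 1000
    }
}
```

With the `+` variant the loop always makes progress and produces two tokens. With the `*` variant it stops advancing at the first character the word rule cannot consume and keeps emitting empty tokens, which is consistent with a token stream of more than a million items. Without a cap, only available memory bounds the list.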
