What does cause Antlr to create a large tokenstream resulting in out of memory

Question

One of our web applications regulary dies because its out of memory. The sparse data we gathered from memory dumps suggests there is an issue in our antlr parsing implementation. What we see is a antlr tokenstream containing more than a million items. The input text which causes this has yet to be found.

Is it possible this is somehow related to an zero width item beeing matched? Could there be another issue in the grammer resulting in excessive memory usage?

Here is the current grammar we use:

grammar AdvancedQueries;

options {
  language = Java;
  output = AST;
  ASTLabelType=CommonTree;
}

tokens {
FOR;
END;
FIELDSEARCH;
TARGETFIELD;
RELATION;
NOTNODE;
ANDNODE;
NEARDISTANCE;
OUTOFPLACE;
}

@header {
package de.bsmo.fast.parsing;
}

@lexer::header {
package de.bsmo.fast.parsing;
}

startExpression  : orEx;

expressionLevel4    
: LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression | outofplace;

expressionLevel3    
: (fieldExpression) | expressionLevel4 ;

expressionLevel2    
: (nearExpression) | expressionLevel3 ;

expressionLevel1    
: (countExpression) | expressionLevel2 ;


notEx   : NOT^? a=expressionLevel1 ;

andEx   : (notEx        -> notEx)
(AND? a=notEx -> ^(ANDNODE $andEx $a))*;

orEx    : andEx (OR^  andEx)*;

countExpression  : COUNT LPARENTHESIS countSub RPARENTHESIS RELATION NUMBERS -> ^(COUNT countSub RELATION NUMBERS);

countSub 
    :   orEx;

nearExpression  : NEAR LPARENTHESIS (WORD|PHRASE) MULTIPLESEPERATOR (WORD|PHRASE) MULTIPLESEPERATOR NUMBERS RPARENTHESIS -> ^(NEAR WORD* PHRASE* ^(NEARDISTANCE NUMBERS));

fieldExpression : WORD PROPERTYSEPERATOR fieldSub  -> ^(FIELDSEARCH ^(TARGETFIELD WORD) fieldSub );

fieldSub 
    :   WORD | PHRASE | LPARENTHESIS! orEx RPARENTHESIS!;  

atomicExpression 
: WORD
| PHRASE
| NUMBERS
;

//Out of place are elements captured that may be in the parseable input but need to be ommited from output later
//Those unwanted elements are captured here.
//MULTIPLESEPERATOR capture unwanted "," 
outofplace
: MULTIPLESEPERATOR -> ^(OUTOFPLACE ^(MULTIPLESEPERATOR));

fragment NUMBER : ('0'..'9');
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'?');
fragment QUOTE     : ('"');
fragment LESSTHEN : '<';
fragment MORETHEN: '>';
fragment EQUAL: '=';
fragment SPACE     : ('\u0009'|'\u0020'|'\u000C'|'\u00A0');

fragment WORDMATTER:  ('!'|'0'..'9'|'\u0023'..'\u0027'|'*'|'+'|'\u002D'..'\u0039'|'\u003F'..'\u007E'|'\u00A1'..'\uFFFE');

LPARENTHESIS : '(';
RPARENTHESIS : ')';

AND    : ('A'|'a')('N'|'n')('D'|'d');
OR     : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT    : ('N'|'n')('O'|'o')('T'|'t');
COUNT:('C'|'c')('O'|'o')('U'|'u')('N'|'n')('T'|'t');
NEAR:('N'|'n')('E'|'e')('A'|'a')('R'|'r');
PROPERTYSEPERATOR : ':';
MULTIPLESEPERATOR : ',';

WS     : (SPACE) { $channel=HIDDEN; };
NUMBERS : (NUMBER)+;
RELATION : (LESSTHEN | MORETHEN)? EQUAL // '<=', '>=', or '='
 | (LESSTHEN | MORETHEN);        // '<' or '>'
PHRASE : (QUOTE)(.)*(QUOTE);
WORD   : WORDMATTER* ;

Answer 1

The most common cause of this is a token that can have length 0. There can be an infinite number of such a token between any two other tokens in the file. Defining a token like this now results in a compiler warning in ANTLR 4.

The following rule can match the empty string:

WORD : WORDMATTER*;

Perhaps you meant to use the following instead?

WORD : WORDMATTER+;

What does cause Antlr to create a large tokenstream resulting in out of memory

Question

1 answers

solution1
1 ACCPTED 2013-05-02 12:58:45

What does cause Antlr to create a large tokenstream resulting in out of memory

Question

1 answers

solution1 1 ACCPTED 2013-05-02 12:58:45

solution1
1 ACCPTED 2013-05-02 12:58:45