Why is my ANTLR lexer Java class "code too large"?

Question

This is my lexer in ANTLR (sorry for the long file):

lexer grammar SqlServerDialectLexer;
/* T-SQL words */
AND: 'AND';
BIGINT: 'BIGINT';
BIT: 'BIT';
CASE: 'CASE';
CHAR: 'CHAR';
COUNT: 'COUNT';
CREATE: 'CREATE';
CURRENT_TIMESTAMP: 'CURRENT_TIMESTAMP';
DATETIME: 'DATETIME';
DECLARE: 'DECLARE';
ELSE: 'ELSE';
END: 'END';
FLOAT: 'FLOAT';
FROM: 'FROM';
GO: 'GO';
IMAGE: 'IMAGE';
INNER: 'INNER';
INSERT: 'INSERT';
INT: 'INT';
INTO: 'INTO';
IS: 'IS';
JOIN: 'JOIN';
NOT: 'NOT';
NULL: 'NULL';
NUMERIC: 'NUMERIC';
NVARCHAR: 'NVARCHAR';
ON: 'ON';
OR: 'OR';
SELECT: 'SELECT';
SET: 'SET';
SMALLINT: 'SMALLINT';
TABLE: 'TABLE';
THEN: 'THEN';
TINYINT: 'TINYINT';
UPDATE: 'UPDATE';
USE: 'USE';
VALUES: 'VALUES';
VARCHAR: 'VARCHAR';
WHEN: 'WHEN';
WHERE: 'WHERE';

QUOTE: '\'' { textMode = !textMode; };
QUOTED: {textMode}?=> ~('\'')*;

EQUALS: '=';
NOT_EQUALS: '!=';
SEMICOLON: ';';
COMMA: ',';
OPEN: '(';
CLOSE: ')';
VARIABLE: '@' NAME;
NAME:
    ( LETTER | '#' | '_' ) ( LETTER | NUMBER | '#' | '_' | '.' )*
    ;
NUMBER: DIGIT+;

fragment LETTER: 'a'..'z' | 'A'..'Z';
fragment DIGIT: '0'..'9';
SPACE
    :
    ( ' ' | '\t' | '\n' | '\r' )+
    { skip(); }
    ;

JDK 1.6 says "code too large" and can't compile the generated lexer. Why, and how can I solve the problem?

Answer 1

Actually, I wouldn't say this is a big grammar, so there must be a reason why it doesn't produce reasonably sized code.

I think the problem is directly related to this rule:

QUOTED: {textMode}?=> ~('\'')*;

Is there any particular reason why you want the QUOTED part as a separate token, rather than combining it with the quotes, as Bart did in his grammar? Combining them would also make the textMode variable obsolete.

Dropping the QUOTE rule and replacing QUOTED with

QUOTED: '\'' (~'\'')* '\'';

will most probably solve the problem, even without splitting the grammar.
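To illustrate why the single-rule form needs no textMode flag, here is a hand-written Java sketch of what QUOTED: '\'' (~'\'')* '\'' matches. The names QuotedMatcher and matchQuoted are made up for this example, and this is not the code ANTLR generates; it only mirrors the rule's logic.

```java
// Illustrative hand-written equivalent of: QUOTED: '\'' (~'\'')* '\'';
// Hypothetical helper, not ANTLR output.
final class QuotedMatcher {

    /**
     * Returns the index just past the closing quote of a literal that
     * starts at 'start', or -1 if no complete literal starts there.
     */
    static int matchQuoted(String input, int start) {
        if (start >= input.length() || input.charAt(start) != '\'') {
            return -1; // no opening quote at this position
        }
        int i = start + 1;
        while (i < input.length() && input.charAt(i) != '\'') {
            i++; // consume (~'\'')* : anything that is not a quote
        }
        // A closing quote is required; otherwise the literal is unterminated.
        return i < input.length() ? i + 1 : -1;
    }

    public static void main(String[] args) {
        String sql = "SET @x = 'hello world';";
        int start = sql.indexOf('\'');
        int end = matchQuoted(sql, start);
        System.out.println(sql.substring(start, end)); // prints 'hello world'
    }
}
```

The whole literal, quotes included, is recognized in one pass, so no state has to be carried between tokens.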

Answer 2

Divide your grammar into several composite grammars. Be careful what you place where. For example, you don't want to place the NAME rule in your top grammar and the keywords in an imported grammar: NAME would then take precedence and prevent the keywords from being matched.
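The precedence issue can be sketched with a toy model in Java. This is a simplification that assumes the common lexer convention of "longest match wins, and on a tie the rule listed first wins"; it is not ANTLR's actual prediction algorithm, and RuleOrderDemo, rules and firstToken are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of rule precedence; NOT how ANTLR is implemented.
final class RuleOrderDemo {

    // Regex counterpart of the NAME rule in the grammar above.
    static final String NAME_REGEX = "[A-Za-z#_][A-Za-z0-9#_.]*";

    /** Builds an ordered rule table from (name, regex) pairs. */
    static Map<String, Pattern> rules(String... nameAndRegex) {
        Map<String, Pattern> table = new LinkedHashMap<>();
        for (int i = 0; i < nameAndRegex.length; i += 2) {
            table.put(nameAndRegex[i], Pattern.compile(nameAndRegex[i + 1]));
        }
        return table;
    }

    /** Returns the name of the rule that wins at the start of input, or null. */
    static String firstToken(Map<String, Pattern> rules, String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strict '>' keeps the earlier rule when two matches are equally long.
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Pattern> keywordFirst = rules("SELECT", "SELECT", "NAME", NAME_REGEX);
        Map<String, Pattern> nameFirst = rules("NAME", NAME_REGEX, "SELECT", "SELECT");
        System.out.println(firstToken(keywordFirst, "SELECT")); // prints SELECT
        System.out.println(firstToken(nameFirst, "SELECT"));    // prints NAME
    }
}
```

With NAME ahead of the keyword, the keyword rule can never win, which is the situation to avoid when deciding which rules go into the imported grammar.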

The following split works:

A.g

lexer grammar A;

SELECT: 'SELECT';
SET: 'SET';
SMALLINT: 'SMALLINT';
TABLE: 'TABLE';
THEN: 'THEN';
TINYINT: 'TINYINT';
UPDATE: 'UPDATE';
USE: 'USE';
VALUES: 'VALUES';
VARCHAR: 'VARCHAR';
WHEN: 'WHEN';
WHERE: 'WHERE';

QUOTED: '\'' ('\'\'' | ~'\'')* '\'';

EQUALS: '=';
NOT_EQUALS: '!=';
SEMICOLON: ';';
COMMA: ',';
OPEN: '(';
CLOSE: ')';
VARIABLE: '@' NAME;
NAME:
    ( LETTER | '#' | '_' ) ( LETTER | NUMBER | '#' | '_' | '.' )*
    ;
NUMBER: DIGIT+;

fragment LETTER: 'a'..'z' | 'A'..'Z';
fragment DIGIT: '0'..'9';
SPACE
    :
    ( ' ' | '\t' | '\n' | '\r' )+
    { skip(); }
    ;

SqlServerDialectLexer.g

lexer grammar SqlServerDialectLexer;

import A;

AND: 'AND';
BIGINT: 'BIGINT';
BIT: 'BIT';
CASE: 'CASE';
CHAR: 'CHAR';
COUNT: 'COUNT';
CREATE: 'CREATE';
CURRENT_TIMESTAMP: 'CURRENT_TIMESTAMP';
DATETIME: 'DATETIME';
DECLARE: 'DECLARE';
ELSE: 'ELSE';
END: 'END';
FLOAT: 'FLOAT';
FROM: 'FROM';
GO: 'GO';
IMAGE: 'IMAGE';
INNER: 'INNER';
INSERT: 'INSERT';
INT: 'INT';
INTO: 'INTO';
IS: 'IS';
JOIN: 'JOIN';
NOT: 'NOT';
NULL: 'NULL';
NUMERIC: 'NUMERIC';
NVARCHAR: 'NVARCHAR';
ON: 'ON';
OR: 'OR';

And it compiles fine:

java -cp antlr-3.3.jar org.antlr.Tool SqlServerDialectLexer.g 
javac -cp antlr-3.3.jar *.java

As you can see, invoking org.antlr.Tool on your top-level lexer is enough: ANTLR automatically generates classes for the imported grammar(s). If you have more grammars to import, do it like this:

import A, B, C;

EDIT

Gunther is correct: changing the QUOTED rule is enough. I'll leave my answer up, though, because when you add more keywords or quite a few parser rules (inevitable with SQL grammars), you'll most probably run into the "code too large" error again. In that case, you can use my proposed solution.

If you're going to accept an answer, please accept Gunther's.

Answer 3

Hmm. I don't suppose you can break that down further into separate files with import statements?

Apparently someone wrote a post-processor to split things up automatically, but I haven't tried it.

3 answers

Answer 1: 6, accepted, 2011-06-09 11:27:39
Answer 2: 5, 2011-06-08 20:12:47
Answer 3: 0, 2011-06-08 19:27:58
