Simple Antlr3 Token parsing

While I'm somewhat comforted by the number of questions about Antlr grammars (it's not just me trying to shave this yak-shaped thing), I haven't found a question or answer that comes close to helping with my issue.

I'm using Antlr 3.3 with a combined lexer/parser grammar.

I'm using gUnit to help prove the grammar, along with some jUnit tests; this is where the fun begins.

I have a simple config file I want to parse:

identifier foobar {
port=8080
stub plusone.google.com {
        status-code = 206
        header = []
        body = []
  }
 }

I'm having trouble parsing the "identifier" (foobar in this example). Valid names I want to allow are:

foobar
foo-bar
foo_bar
foobar2
foo-bar2
foo_bar2
3foobar
_foo-bar3

and so on. A valid name can therefore use the characters 'a'..'z', 'A'..'Z', '0'..'9', '_' and '-'.

The grammar I've arrived at is this (note this isn't the full grammar, just the portion pertinent to this question):

fragment HYPHEN : '-' ;

fragment UNDERSCORE : '_' ;

fragment DIGIT  : '0'..'9' ;

fragment LETTER : 'a'..'z' |'A'..'Z' ;

fragment NUMBER : DIGIT+ ;

fragment WORD : LETTER+  ;

IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;

and the corresponding gUnit test:

IDENTIFIER:
"foobar" OK
"foo_bar" OK
"foo-bar" OK
"foobar1" OK
"foobar12" OK
"foo-bar2" OK
"foo_bar2" OK
"foo-bar-2" OK
"foo-bar_2" OK
"5foobar" OK
"f_2-a" OK
"aA0_" OK
// no "funny chars"
"foo@bar" FAIL
// not with whitespace
"foo bar" FAIL

Of the gUnit tests, only "5foobar" fails. I've managed to parse the difficult stuff, and yet the seemingly simple task of parsing an identifier has beaten me.

Can anyone point out where I'm going wrong? How can I match without being greedy?

Many thanks in advance.

-- UPDATE --

I changed the grammar as per Bart's answer, to this:

IDENTIFIER : ('0'..'9'| 'a'..'z'|'A'..'Z' | '_'|'-') ('_'|'-'|'a'..'z'|'A'..'Z'|'0'..'9')* ;

This fixed the failing gUnit test, but broke an unrelated jUnit test that tests the "port" parameter. The following grammar deals with the "port=8080" element of the config snippet above:

configurationStatement[MiddlemanConfiguration config]
    :   PORT EQ port=NUMBER { config.setConfigurationPort(Integer.parseInt(port.getText())); }
    |   def=proxyDefinition { config.add(def); }
    ;

The message I get is:

mismatched input '8080' expecting NUMBER

where NUMBER is defined as NUMBER : ('0'..'9')+ ;

Moving the rule for NUMBER above the IDENTIFIER rule fixed this issue.
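A plausible reading of this fix is that the text 8080 is matched by both NUMBER and the updated IDENTIFIER rule, and that the rule listed first wins the tie. The Java sketch below is only a simplified model of that behaviour, not how ANTLR's generated lexer works internally: it tries regex versions of the two rules in declaration order and reports the first one that matches the whole text. The class and method names are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical model: "first listed rule wins" for a single token's text.
// This approximates the observed behaviour; it is not ANTLR's algorithm.
final class RuleOrderModel {
    // NUMBER : ('0'..'9')+ ;
    static final Pattern NUMBER = Pattern.compile("[0-9]+");
    // Updated IDENTIFIER rule from the question, written as a regex.
    static final Pattern IDENTIFIER = Pattern.compile("[0-9a-zA-Z_-][_a-zA-Z0-9-]*");

    // Returns the name of the first rule, in declaration order,
    // that matches the whole text, or null if none does.
    static String classify(Map<String, Pattern> rulesInOrder, String text) {
        for (Map.Entry<String, Pattern> rule : rulesInOrder.entrySet()) {
            if (rule.getValue().matcher(text).matches()) {
                return rule.getKey();
            }
        }
        return null;
    }

    // IDENTIFIER declared before NUMBER (the failing layout).
    static Map<String, Pattern> identifierFirst() {
        Map<String, Pattern> rules = new LinkedHashMap<String, Pattern>();
        rules.put("IDENTIFIER", IDENTIFIER);
        rules.put("NUMBER", NUMBER);
        return rules;
    }

    // NUMBER declared before IDENTIFIER (the layout that fixed the test).
    static Map<String, Pattern> numberFirst() {
        Map<String, Pattern> rules = new LinkedHashMap<String, Pattern>();
        rules.put("NUMBER", NUMBER);
        rules.put("IDENTIFIER", IDENTIFIER);
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(classify(identifierFirst(), "8080"));
        System.out.println(classify(numberFirst(), "8080"));
        System.out.println(classify(numberFirst(), "foo-bar2"));
    }
}
```

Under this model, a name made up only of digits is reported as NUMBER once NUMBER comes first, so it would no longer be usable as an identifier.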

IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;

is equivalent to:

IDENTIFIER 
 : DIGIT 
 | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
 ;

So an IDENTIFIER is either a single DIGIT, or starts with a LETTER followed by (LETTER | DIGIT | HYPHEN | UNDERSCORE)*.
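The grouping can be checked outside ANTLR. The sketch below is only a model: it assumes the lexer rule behaves like the equivalent java.util.regex pattern applied to the whole token text, where | likewise has the lowest precedence. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper: models the original IDENTIFIER rule as a regex.
// This approximates the lexer rule; it is not ANTLR itself.
final class OriginalIdentifierRule {
    // DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
    // The '|' splits the whole rule into two alternatives.
    static final Pattern ORIGINAL =
            Pattern.compile("[0-9]|[a-zA-Z][a-zA-Z0-9_-]*");

    // True when the whole text matches one of the two alternatives.
    static boolean matches(String text) {
        return ORIGINAL.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"foobar", "foo-bar2", "5", "5foobar"}) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```

In this model a lone digit such as 5 is accepted by the first alternative, while 5foobar fits neither alternative, which is consistent with the one failing gUnit case.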

You probably meant:

IDENTIFIER 
 : (DIGIT | LETTER | UNDERSCORE) (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
 ;

However, that also allows 3---3 as a valid IDENTIFIER. Is that correct?
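To see what the suggested rule accepts, here is a rough Java sketch that models it as a regular expression over the whole token text. It also includes one possible stricter variant that rejects runs of hyphens; that variant is an assumption added for illustration and is not part of the suggested rule, and all class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: regex models of the IDENTIFIER rule, not ANTLR itself.
final class SuggestedIdentifierRule {
    // (DIGIT | LETTER | UNDERSCORE) (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
    static final Pattern SUGGESTED =
            Pattern.compile("[0-9a-zA-Z_][0-9a-zA-Z_-]*");

    // Assumed stricter variant (not from the answer): a hyphen may only appear
    // as a single separator between runs of letters, digits and underscores.
    static final Pattern STRICT =
            Pattern.compile("[0-9a-zA-Z_]+(?:-[0-9a-zA-Z_]+)*");

    static boolean suggested(String text) {
        return SUGGESTED.matcher(text).matches();
    }

    static boolean strict(String text) {
        return STRICT.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"5foobar", "_foo-bar3", "foo-bar-2", "3---3", "foo@bar"};
        for (String s : samples) {
            System.out.println(s + " suggested=" + suggested(s) + " strict=" + strict(s));
        }
    }
}
```

Whether the stricter form is wanted depends on which names the config format should allow; it also rejects a trailing hyphen such as foo-.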
