
JavaCC: Matching an empty string

I am having trouble with ambiguous tokens. My grammar defines two productions: a numeric constant of the form 2e3 or 100e1, and identifiers of the form abc or uvw123.

The problem is that e1 is a valid identifier, but it can also form part of a numeric constant. For example, if my input consists of 2e3, it will be tokenized as a number followed by an identifier (2 + e3), which is not what I want.

I could match numeric constants by writing a more general regex that includes the e, instead of leaving that to a grammar production, but then the token image would need further parsing to separate the integer and exponent parts, which is not what I want.

I have attempted to solve this problem by using tokenizer states. Because an identifier cannot begin with a digit, a digit must indicate the beginning of a numeric constant, so I transition to STATE_NUMBER. In this state I define an e token for the exponent part of the numeric constant. I then have a "catch everything else" token, intended to transition back to the DEFAULT state. In the default state, an e would be matched by the identifier regex.

TOKEN : {
  < digit_sequence: (["0"-"9"])+ > : STATE_NUMBER
}

<STATE_NUMBER> TOKEN : {
  < exponent_prefix: "e" >
}

<STATE_NUMBER> MORE : {
  < end_number: ~[] > : DEFAULT
}

TOKEN : {
  < identifier: ["a"-"z"] (["0"-"9","a"-"z"])* >
}

This does not work as expected. The character matched by the MORE token appears to be discarded instead of becoming the first character of an identifier.

I'd like to know how to write a proper grammar for this. I would prefer not to use any inline Java code.

The problem is that < end_number: ~[] > : DEFAULT matches any single character that is not an e, so it always consumes a character before switching state. What you want to match instead is the empty string. Try

< end_number: "" > : DEFAULT

I think the following will work.

TOKEN : {
  < (["0"-"9"])+ > : STATE_NUMBER0
}

<STATE_NUMBER0> TOKEN : {
  < "e" > : STATE_NUMBER1
}

<STATE_NUMBER0> MORE : {
  < number_without_exponent: "" > : DEFAULT
}

<STATE_NUMBER1> TOKEN : {
  < number_with_exponent: (["0"-"9"])+ > : DEFAULT
}

This makes 123e an error, as is 123edf. If you don't want these to be errors, you can get away with one fewer state.

TOKEN : {
  < (["0"-"9"])+ > : STATE_NUMBER
}

<STATE_NUMBER> TOKEN : {
  < number_with_exponent: "e" (["0"-"9"])+ > : DEFAULT
}

<STATE_NUMBER> MORE : {
  < number_without_exponent: "" > : DEFAULT
}

This makes 123e a number_without_exponent followed by an identifier, "e". If you'd prefer that it be just a number_without_exponent, change the last + to a *.
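
To see how the state switching behaves without running JavaCC, the two-state version can be imitated by a small hand-written scanner. This is only a rough sketch, not the code JavaCC generates: the class name LexStateSketch, the tokenize method, and the label "digits" for the unnamed digit token are all made up for illustration, and the identifier rule is taken from the question's grammar. The point it shows is that the empty-string alternative switches back to DEFAULT without consuming any input, so the next character is still available to the identifier rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner that imitates the two-state grammar.
// It is NOT what JavaCC generates; it only mirrors the state logic.
class LexStateSketch {
    enum State { DEFAULT, STATE_NUMBER }

    static boolean isDigit(char c)  { return c >= '0' && c <= '9'; }
    static boolean isLetter(char c) { return c >= 'a' && c <= 'z'; }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            int end = pos;
            if (state == State.STATE_NUMBER) {
                // Longest match first: "e" (["0"-"9"])+
                if (c == 'e') {
                    end = pos + 1;
                    while (end < input.length() && isDigit(input.charAt(end))) end++;
                }
                if (end > pos + 1) { // "e" followed by at least one digit
                    tokens.add("number_with_exponent(" + input.substring(pos, end) + ")");
                    pos = end;
                }
                // Otherwise this is the empty-string match: nothing is consumed.
                // Both alternatives return to DEFAULT.
                state = State.DEFAULT;
            } else if (isDigit(c)) {
                // (["0"-"9"])+ then switch to STATE_NUMBER
                while (end < input.length() && isDigit(input.charAt(end))) end++;
                tokens.add("digits(" + input.substring(pos, end) + ")");
                pos = end;
                state = State.STATE_NUMBER;
            } else if (isLetter(c)) {
                // ["a"-"z"] (["0"-"9","a"-"z"])*
                end = pos + 1;
                while (end < input.length()
                        && (isLetter(input.charAt(end)) || isDigit(input.charAt(end)))) {
                    end++;
                }
                tokens.add("identifier(" + input.substring(pos, end) + ")");
                pos = end;
            } else {
                throw new IllegalArgumentException("Lexical error at position " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"2e3", "123e", "uvw123"}) {
            System.out.println(s + " -> " + tokenize(s));
        }
    }
}
```

In this sketch, 2e3 comes out as a digit token followed by a number_with_exponent token, while 123e comes out as a digit token followed by the identifier e, which matches the behaviour described for the two-state grammar.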


 