JavaCC: Matching an empty string

I am having trouble with ambiguous tokens. My grammar defines two productions: a numeric constant of the form 2e3 or 100e1, and identifiers of the form abc or uvw123.

The problem is that e1 is a valid identifier, but it can also form part of a numeric constant. So, for example, if my input consists of 2e3, it will be tokenized as a number followed by an identifier (2 + e3), which is not what I want.

I could match numeric constants by writing a more general regex that includes the e, instead of leaving that to a grammar production, but then the token value/image would need further parsing to separate the integer and exponent parts, which is not what I want.
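For comparison, here is a minimal sketch of the post-processing that the single-regex approach would require. It assumes the token image has already been matched by a pattern like digits, e, digits; the class and method names are hypothetical and not part of JavaCC.

```java
final class NumberImageSketch {
    // Splits an image such as "100e1" into {integer part, exponent part}.
    // Assumes the lexer already guaranteed the shape <digits> "e" <digits>.
    static long[] splitImage(String image) {
        int e = image.indexOf('e');
        if (e < 0) {
            throw new IllegalArgumentException("no exponent marker in: " + image);
        }
        long integerPart = Long.parseLong(image.substring(0, e));
        long exponentPart = Long.parseLong(image.substring(e + 1));
        return new long[] { integerPart, exponentPart };
    }

    public static void main(String[] args) {
        long[] parts = splitImage("100e1");
        System.out.println(parts[0] + " and " + parts[1]);
    }
}
```

This is exactly the kind of extra Java step I would like to avoid by having the tokenizer deliver the parts as separate tokens.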

I have attempted to solve this problem by using tokenizer states. Because an identifier cannot begin with a digit, a digit must indicate the beginning of a numeric constant, and so I transition to STATE_NUMBER. In this state I define an e token to refer to the exponent part of the numeric constant. I then have a "catch everything else" token, with the intention of transitioning back to the DEFAULT state. In the default state, an e would be matched by the identifier regex.

TOKEN : {
  < digit_sequence: (["0"-"9"])+ > : STATE_NUMBER
}

<STATE_NUMBER> TOKEN : {
  < exponent_prefix: "e" >
}

<STATE_NUMBER> MORE : {
  < end_number: ~[] > : DEFAULT
}

TOKEN : {
  < identifier: ["a"-"z"] (["0"-"9","a"-"z"])* >
}

This does not work as expected. The character matched by the MORE token appears to be discarded instead of becoming the first character of an identifier.

I'd like to know how to write a proper grammar for this. I would prefer it if I did not have to use any inline Java code.

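The problem is that < end_number: ~[] > : DEFAULT matches, and therefore consumes, any single character that is not an e. What you want to match instead is the empty string, so that no input is consumed when the lexer switches back to DEFAULT. Try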

< end_number: "" > : DEFAULT

I think the following will work.

TOKEN : {
  < (["0"-"9"])+ > : STATE_NUMBER0
}

<STATE_NUMBER0> TOKEN : {
  < "e" > : STATE_NUMBER1
}

<STATE_NUMBER0> MORE : {
  < number_without_exponent: "" > : DEFAULT
}

<STATE_NUMBER1> TOKEN : {
   < number_with_exponent: (["0"-"9"])+ > : DEFAULT
}

This makes 123e an error, and likewise 123edf. If you don't want these to be errors, you can get away with one fewer state.

TOKEN : {
  < (["0"-"9"])+ > : STATE_NUMBER
}

<STATE_NUMBER> TOKEN : {
  < number_with_exponent: "e" (["0"-"9"])+ > : DEFAULT
}

<STATE_NUMBER> MORE : {
  < number_without_exponent: "" > : DEFAULT
}

This makes 123e a number_without_exponent followed by an identifier, "e". If you'd prefer that it be just a number_without_exponent, change the last + to a *.
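To make the state switching easier to follow, here is a rough hand-written simulation of the two-state version in plain Java. It is only a sketch of the matching rules, not the token manager that JavaCC generates: the class name, the "digits:" label for the unnamed digit token, and the inclusion of the identifier rule from the question are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class LexerStateSketch {
    private enum State { DEFAULT, STATE_NUMBER }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }
    private static boolean isLower(char c) { return c >= 'a' && c <= 'z'; }

    // Returns the index of the first character at or after 'from' that is
    // neither a digit nor (when allowLetters is true) a lowercase letter.
    private static int skip(String s, int from, boolean allowLetters) {
        int i = from;
        while (i < s.length()
                && (isDigit(s.charAt(i)) || (allowLetters && isLower(s.charAt(i))))) {
            i++;
        }
        return i;
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            if (state == State.STATE_NUMBER) {
                // Rule: "e" (["0"-"9"])+ ; the longer match beats the empty string.
                int end = pos;
                if (input.charAt(pos) == 'e') {
                    end = skip(input, pos + 1, false);
                }
                if (end > pos + 1) {
                    tokens.add("number_with_exponent:" + input.substring(pos, end));
                    pos = end;
                }
                // Otherwise the "" rule applies: nothing is consumed.
                state = State.DEFAULT;
                continue;
            }
            char c = input.charAt(pos);
            if (isDigit(c)) {
                int end = skip(input, pos, false);
                tokens.add("digits:" + input.substring(pos, end));
                pos = end;
                state = State.STATE_NUMBER;
            } else if (isLower(c)) {
                int end = skip(input, pos, true);
                tokens.add("identifier:" + input.substring(pos, end));
                pos = end;
            } else {
                throw new IllegalArgumentException("unexpected character at " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("2e3"));   // [digits:2, number_with_exponent:e3]
        System.out.println(tokenize("123e"));  // [digits:123, identifier:e]
        System.out.println(tokenize("e1"));    // [identifier:e1]
    }
}
```

Because the empty-string rule consumes no input, the e in 123e is still available to the identifier rule once the simulated lexer is back in DEFAULT, which is the behavior described above.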
