JavaCC: Matching an empty string

Question

I am having trouble with ambiguous tokens. My grammar defines two productions: numeric constants of the form 2e3 or 100e1, and identifiers of the form abc or uvw123.

The problem is that e1 is a valid identifier, but also constitutes part of a numeric constant. So for example, if my input consists of 2e3, it will be tokenized as a number followed by an identifier (2 + e3), which is not what I want.

I could match numeric constants by writing a more general regex that includes the e, instead of leaving that to a grammar production, but then the token value/image would require further parsing to separate the integer and exponent parts, which is not what I want.
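
A minimal sketch of that single-token alternative, for comparison. The token name number is hypothetical and does not appear in the grammar in this question; the identifier rule is the one given further down.

```
// Sketch only: "number" is a hypothetical token name.
TOKEN : {
  // Digits with an optional exponent part, matched as one token.
  < number: (["0"-"9"])+ ( "e" (["0"-"9"])+ )? >
| < identifier: ["a"-"z"] (["0"-"9","a"-"z"])* >
}
```

With a rule like this, 2e3 arrives as a single token whose image is the whole string 2e3, so the integer and exponent parts still have to be split apart afterwards.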

I have attempted to solve this problem by using tokenizer states. Because an identifier cannot begin with a digit, a digit must indicate the beginning of a numeric constant, and so I transition to STATE_NUMBER. In this state I define an e token to refer to the exponent part of the numeric constant. I then have a "catch everything else" token, with the intention of transitioning back to the DEFAULT state. In the default state, an e would be matched by the identifier regex.

TOKEN : {
  < digit_sequence: (["0"-"9"])+ > : STATE_NUMBER
}

<STATE_NUMBER> TOKEN : {
  < exponent_prefix: "e" >
}

<STATE_NUMBER> MORE : {
  < end_number: ~[] > : DEFAULT
}

TOKEN : {
  < identifier: ["a"-"z"] (["0"-"9","a"-"z"])* >
}

This does not work as expected. The character matched by the MORE token appears to be discarded instead of becoming the first character of an identifier.

I'd like to know how to write a proper grammar for this. I would prefer it if I did not have to use any inline Java code.

Answer 1

The problem is that < end_number: ~[] > : DEFAULT matches any single character that is not an e, so that character is consumed. What you want to match instead is the empty string. Try

< end_number: "" > : DEFAULT

I think the following will work.

TOKEN : {
  < (["0"-"9"])+ > : STATE_NUMBER0
}

<STATE_NUMBER0> TOKEN : {
  < "e" > : STATE_NUMBER1
}

<STATE_NUMBER0> MORE : {
  < number_without_exponent: "" > : DEFAULT
}

<STATE_NUMBER1> TOKEN : {
  < number_with_exponent: (["0"-"9"])+ > : DEFAULT
}

This makes 123e an error, and likewise 123edf. If you don't want these to be errors, you can get away with one fewer state.

TOKEN : {
  < (["0"-"9"])+ > : STATE_NUMBER
}

<STATE_NUMBER> TOKEN : {
  < number_with_exponent: "e" (["0"-"9"])+ > : DEFAULT
}

<STATE_NUMBER> MORE : {
  < number_without_exponent: "" > : DEFAULT
}

This makes 123e a number_without_exponent followed by an identifier, "e". If you'd prefer that it be just a number_without_exponent, change the last + to a *.
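
Spelled out, the variant from the last sentence would presumably look like this. It is only a sketch: the rule is the one from the block above with the final + changed to a *.

```
<STATE_NUMBER> TOKEN : {
  // "e" followed by zero or more digits, so a bare "e" after the digits
  // is consumed here instead of being left over for the identifier rule.
  < number_with_exponent: "e" (["0"-"9"])* > : DEFAULT
}
```

Because the tokenizer prefers the longest match, this rule wins over the empty-string rule whenever an e follows the digits.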

solution1 0 ACCEPTED 2015-03-04 13:07:52