
Is there a way to make ANTLR4 use enums for generated tokens?

In ANTLR4 the generated lexer in Java contains a public field for each token where the type of the field is a simple 'int'. Is there a reason why ANTLR4 does not use enums instead, or is there an option to make it use enums?

This is a simplified example off the top of my head

x.g4

A: 'a';
B: 'b';

XLexer.java

public class XLexer extends Lexer{
   public static final int A = 1, B = 2;
}

I would prefer for XLexer to instead contain

public class XLexer extends Lexer{
  public static enum Token{
    A(1), B(2)
  }
}

This is useful for debugging when dumping tokens. Right now the token name is not printed; only the integer representation is shown.

[@-1,0:0='a',<1>,1:0]

A more readable version would have <A> instead of <1>

[@-1,0:0='a',<A>,1:0]

To convert an int token type to its symbolic name, simply use

String tokenName = YourLexer.VOCABULARY.getSymbolicName(type);

Here is my current workaround. I create a custom token and provide a TokenFactory to the XLexer via

lexer.setTokenFactory(new MyTokenFactory());

And I override the toString() method in my token class.

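import org.antlr.v4.runtime.CommonToken;

// Token is an interface in ANTLR4, so the custom token extends CommonToken
public class MyToken extends CommonToken{
  public MyToken(int type, String text){
    super(type, text);
  }

  @Override
  public String toString(){
    StringBuilder out = new StringBuilder();

    out.append("[");
    out.append("'").append(getText()).append("'");
    out.append(" type ").append(getName()); //getName() is implemented by this class

    // end column is derived from the start column plus the text length
    int start = getCharPositionInLine();
    int end = start + getText().length();
    out.append(" at ").append(getLine()).append(":").append(start).append("-").append(end);
    out.append("]");

    return out.toString();
  }
}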

Instead of showing the integer for the type, the class uses getName() to convert the integer to a string.

// inside the token class
private String getName(){
  switch (getType()){
    case XLexer.A: return "A";
    case XLexer.B: return "B";
    default: throw new RuntimeException("unknown token " + getType());
  }
}

This produces the following output

['A' type A at 1:5-6]

This solution is somewhat brittle in that getName() has to be updated to remain in sync with the current tokens defined by the g4 file. There is no way to enforce this property, as the compiler cannot know if all the token types are handled in the switch inside getName().

ANTLR4 uses ints instead of enums for simplicity and performance.

For debugging purposes, you can change the string representation of tokens as follows:

  • Create your own token implementation by extending CommonToken. Define the toString() method as you like.

  • Create a TokenFactory implementation that returns tokens of your custom type.

  • Set the token factory on both the lexer and the parser.

EDIT, addressing the problem you mentioned in your answer.

To avoid keeping token names in sync with the .g4 file manually, you can build a mapping from XLexer dynamically using reflection.
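
A minimal sketch of that reflection idea, using only the JDK. The XLexer class here is a stand-in that mimics the constants from the example in the question, and TokenNames and fromConstants are hypothetical names chosen for illustration; neither is part of ANTLR. It assumes every public static final int on the class is a token type. A real generated lexer may declare other int constants as well (lexer modes, for example), so some filtering would likely be needed.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Stand-in for the generated lexer (assumption): only the token constants matter here.
class XLexer {
    public static final int A = 1, B = 2;
}

// Hypothetical helper, not part of ANTLR.
final class TokenNames {
    private TokenNames() {}

    // Maps each public static final int declared on the class to its field name.
    static Map<Integer, String> fromConstants(Class<?> lexerClass) {
        Map<Integer, String> names = new TreeMap<>();
        for (Field field : lexerClass.getDeclaredFields()) {
            int mods = field.getModifiers();
            boolean isConstant = Modifier.isPublic(mods)
                    && Modifier.isStatic(mods)
                    && Modifier.isFinal(mods);
            if (isConstant && field.getType() == int.class) {
                try {
                    // If two constants share a value, the first one seen is kept.
                    names.putIfAbsent(field.getInt(null), field.getName());
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("cannot read " + field.getName(), e);
                }
            }
        }
        return names;
    }
}
```

With a map like this built once (for instance in a static field), getName() could look up getType() in the map instead of using a hand-written switch, so new tokens in the .g4 file are picked up after regenerating the lexer.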
