
In Python 2.7 ANTLR4, extract tokens from a parser rule and store them in a list

In my grammar I validate boolean expressions that look something like this:

((foo == true) && (bar != false) || (qux == norf))

I obtain the string from ANTLR4's context object by calling getText():

def enterBody(self, ctx):
    expression = ctx.condition.getText() # condition here being shorthand for a grammar rule (`condition=expr`)

However, the string is returned whole (i.e. with no spaces between the individual tokens) and I have no way of telling where one token ends and the next begins:

((foo==true)&&(bar!=false)||(qux==norf))

Ideally, I would like it stored in a list in the following format:

['(', '(', 'foo', '==', 'true', ')', '&&', '(', 'bar', '!=', 'false', ')', '||', '(', 'qux', '==', 'norf', ')', ')']

The ANTLR4 Python documentation is rather sparse and I'm not sure if there's a method that accomplishes this.

The Python runtime is very similar to the Java runtime, so you can consult the Java documentation; most likely the same method exists in Python. Or browse the source code, it is pretty easy to read.

You're asking for a flat list of strings, but the whole point of a parser is to avoid that, so it is most likely not what you actually need. Make sure you understand the parse tree and how listeners work. Basically, you should work with the tree, not with a flat list. What you are probably looking for is ParserRuleContext.getChildren(). You can use it to access all child nodes:

def enterBody(self, ctx):
    print(list(ctx.getChildren()))
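
If you really do want that flat list but still go through the parser, a small recursive walk over the parse tree collects the terminal nodes in order. This is only a sketch: leaf_texts is a hypothetical helper, and ctx.condition follows the labelled rule from the question:

from antlr4.tree.Tree import TerminalNode

def leaf_texts(node):
    # Return the text of every terminal (leaf) node under `node`, left to right
    if isinstance(node, TerminalNode):
        return [node.getText()]
    texts = []
    for i in range(node.getChildCount()):
        texts += leaf_texts(node.getChild(i))
    return texts

# Inside the listener from the question:
# def enterBody(self, ctx):
#     print(leaf_texts(ctx.condition))
#     # -> ['(', '(', 'foo', '==', 'true', ')', '&&', ...]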

Even more likely, you want to access a specific type of child node for some action. Take a look at the parser that ANTLR generates for you. You will find a bunch of *Context classes which contain methods for accessing every type of sub-node. For example, the ctx parameter of the enterBody() method is an instance of BodyContext, and you can use all of its methods to access its child nodes of a specific type.
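
For instance, with a rule shaped like body : condition=expr ; (the exact rule is an assumption based on the question), the condition= label gives the generated BodyContext a typed attribute, so you can work with that sub-tree directly instead of flattening everything to text:

def enterBody(self, ctx):
    # ctx is a BodyContext; the `condition=` label exposes the matched
    # sub-rule context (an ExprContext under this assumed grammar).
    condition = ctx.condition
    print(type(condition).__name__, condition.getChildCount(), condition.getText())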

UPD: If your grammar only defines a boolean expression and you use it only for validation and tokenization, you don't need the parser at all. Just use the lexer to get a list of all tokens:

import antlr4
from BooleanLexer import BooleanLexer  # lexer module generated by ANTLR from the grammar

input_stream = antlr4.FileStream('input.txt')

# Instantiate and run the generated lexer
lexer = BooleanLexer(input_stream)
tokens = antlr4.CommonTokenStream(lexer)

# Read all tokens until EOF
tokens.fill()

# Print tokens as text (the EOF token is stripped from the end)
print([token.text for token in tokens.tokens][:-1])
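
Assuming whitespace is skipped or sent to a hidden channel in your lexer rules, this prints each token as a separate string, e.g. ['(', '(', 'foo', '==', 'true', ')', '&&', ...], which is the list format asked for above.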
