
Parse C++ with Python

I'm trying to parse C++ using Python. I generated the parser with ANTLR for Python, and now I want to visit the tree and gather some information.

  • Is there any way to dump the ANTLR tree as an AST in JSON format?
  • I am trying to trace function calls. I was expecting something like CallExpr, but I couldn't find anything like it in the generated parser files.

This is the grammar file I'm using: https://github.com/antlr/grammars-v4/blob/master/cpp/CPP14.g4

I ran the following command to generate the C++ parser: java -jar antlr-4.8-complete.jar -Dlanguage=Python3 ./CPP14.g4 -visitor

and this is the very basic code I have:

import sys
import os
from antlr4 import *
from CPP14Lexer import *
from CPP14Parser import *
from CPP14Visitor import *



class TREEVisitor(CPP14Visitor):
    def __init__(self):
        pass


    def visitExpressionstatement(self, ctx):
        print(ctx.getText())
        return self.visitChildren(ctx)



if __name__ == '__main__':
    input_stream = FileStream(sys.argv[1])
    cpplex = CPP14Lexer(input_stream)
    commtokstream = CommonTokenStream(cpplex)
    cpparser = CPP14Parser(commtokstream)

    tree = cpparser.translationunit()
    # Syntax errors are only counted while parsing, so report them afterwards.
    print("parse errors: {}".format(cpparser._syntaxErrors))

    tv = TREEVisitor()
    tv.visit(tree)

and the input file I'm trying to parse:

#include <iostream>

using namespace std;


int foo(int i, int i2)
{
    return i * i2;
}

int main(int argc, char *argv[])
{
    cout << "test" << endl;
    foo(1, 3);
    return 0;
}

Thanks

Function calls are recognised by the postfixexpression rule:

postfixexpression
   : primaryexpression
   | postfixexpression '[' expression ']'
   | postfixexpression '[' bracedinitlist ']'
   | postfixexpression '(' expressionlist? ')'   // <---- this alternative!
   | simpletypespecifier '(' expressionlist? ')'
   | typenamespecifier '(' expressionlist? ')'
   | simpletypespecifier bracedinitlist
   | typenamespecifier bracedinitlist
   | postfixexpression '.' Template? idexpression
   | postfixexpression '->' Template? idexpression
   | postfixexpression '.' pseudodestructorname
   | postfixexpression '->' pseudodestructorname
   | postfixexpression '++'
   | postfixexpression '--'
   | Dynamic_cast '<' thetypeid '>' '(' expression ')'
   | Static_cast '<' thetypeid '>' '(' expression ')'
   | Reinterpret_cast '<' thetypeid '>' '(' expression ')'
   | Const_cast '<' thetypeid '>' '(' expression ')'
   | typeidofthetypeid '(' expression ')'
   | typeidofthetypeid '(' thetypeid ')'
   ;

So if you add this to your visitor:

def visitPostfixexpression(self, ctx:CPP14Parser.PostfixexpressionContext):
    print(ctx.getText())
    return self.visitChildren(ctx)

With that in place, the call foo(1,3) gets printed. Note that it will now print a lot more than function calls, since the rule matches much more than that. To narrow it down, you could label the alternatives:

postfixexpression
   : primaryexpression                                     #otherPostfixexpression
   | postfixexpression '[' expression ']'                  #otherPostfixexpression
   | postfixexpression '[' bracedinitlist ']'              #otherPostfixexpression
   | postfixexpression '(' expressionlist? ')'             #functionCallPostfixexpression
   | simpletypespecifier '(' expressionlist? ')'           #otherPostfixexpression
   | typenamespecifier '(' expressionlist? ')'             #otherPostfixexpression
   | simpletypespecifier bracedinitlist                    #otherPostfixexpression
   | typenamespecifier bracedinitlist                      #otherPostfixexpression
   | postfixexpression '.' Template? idexpression          #otherPostfixexpression
   | postfixexpression '->' Template? idexpression         #otherPostfixexpression
   | postfixexpression '.' pseudodestructorname            #otherPostfixexpression
   | postfixexpression '->' pseudodestructorname           #otherPostfixexpression
   | postfixexpression '++'                                #otherPostfixexpression
   | postfixexpression '--'                                #otherPostfixexpression
   | Dynamic_cast '<' thetypeid '>' '(' expression ')'     #otherPostfixexpression
   | Static_cast '<' thetypeid '>' '(' expression ')'      #otherPostfixexpression
   | Reinterpret_cast '<' thetypeid '>' '(' expression ')' #otherPostfixexpression
   | Const_cast '<' thetypeid '>' '(' expression ')'       #otherPostfixexpression
   | typeidofthetypeid '(' expression ')'                  #otherPostfixexpression
   | typeidofthetypeid '(' thetypeid ')'                   #otherPostfixexpression
   ;

After regenerating the parser from the labeled grammar, you can then do:

def visitFunctionCallPostfixexpression(self, ctx:CPP14Parser.FunctionCallPostfixexpressionContext):
    print(ctx.getText())
    return self.visitChildren(ctx)

and then only foo(1,3) gets printed (note that you might want to label more alternatives inside the postfixexpression rule as functionCallPostfixexpression).
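If regenerating the parser from a modified grammar is not convenient, the same filtering can probably be approximated inside visitPostfixexpression by inspecting the children of the context. The sketch below is only an illustration under stated assumptions: is_function_call is a hypothetical helper (not something ANTLR generates), and the two small classes are stand-ins that mimic getChildCount, getChild and getText so the snippet runs without the generated parser. It relies on the observation that, in the unlabeled rule, only the function-call alternative starts with a nested postfixexpression followed by '('.

```python
def is_function_call(ctx):
    """Heuristic: does ctx look like  postfixexpression '(' expressionlist? ')' ?"""
    n = ctx.getChildCount()
    if n not in (3, 4):  # with or without the optional expressionlist
        return False
    first = ctx.getChild(0)
    return (
        type(first).__name__ == "PostfixexpressionContext"
        and ctx.getChild(1).getText() == "("
        and ctx.getChild(n - 1).getText() == ")"
    )


# --- Stand-ins for illustration only; real nodes come from the generated parser ---
class Leaf:
    """Mimics a terminal node."""
    def __init__(self, text):
        self._text = text

    def getChildCount(self):
        return 0

    def getText(self):
        return self._text


class PostfixexpressionContext:
    """Mimics the generated context class of the same name."""
    def __init__(self, *children):
        self._children = list(children)

    def getChildCount(self):
        return len(self._children)

    def getChild(self, i):
        return self._children[i]

    def getText(self):
        return "".join(c.getText() for c in self._children)


callee = PostfixexpressionContext(Leaf("foo"))
call = PostfixexpressionContext(callee, Leaf("("), Leaf("1,3"), Leaf(")"))
index = PostfixexpressionContext(callee, Leaf("["), Leaf("0"), Leaf("]"))

print(is_function_call(call))   # True
print(is_function_call(index))  # False
```

In the real visitor, the helper would be called at the top of visitPostfixexpression, printing ctx.getText() only when it returns True and then returning self.visitChildren(ctx) as before. Labeling the alternatives remains the cleaner option when the grammar can be changed.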

Is there any way to dump the ANTLR tree as an AST in JSON format?

No.

But you could easily create something yourself, of course: the object returned by each parser rule method, such as translationunit, contains the entire subtree matched by that rule. A quick and dirty example:

import antlr4
from antlr4.tree.Tree import TerminalNodeImpl
import json

# generated by ANTLR from CPP14.g4
from CPP14Lexer import CPP14Lexer
from CPP14Parser import CPP14Parser


def to_dict(root):
    obj = {}
    _fill(obj, root)
    return obj


def _fill(obj, node):

    if isinstance(node, TerminalNodeImpl):
        obj["type"] = node.symbol.type
        obj["text"] = node.getText()
        return

    class_name = type(node).__name__.replace('Context', '')
    rule_name = '{}{}'.format(class_name[0].lower(), class_name[1:])
    arr = []
    obj[rule_name] = arr

    # children is None for a rule that matched nothing
    for child_node in node.children or []:
        child_obj = {}
        arr.append(child_obj)
        _fill(child_obj, child_node)


if __name__ == '__main__':
    source = """
        #include <iostream>

        using namespace std;

        int foo(int i, int i2)
        {
            return i * i2;
        }

        int main(int argc, char *argv[])
        {
            cout << "test" << endl;
            foo(1, 3);
            return 0;
        }
        """
    lexer = CPP14Lexer(antlr4.InputStream(source))
    parser = CPP14Parser(antlr4.CommonTokenStream(lexer))
    tree = parser.translationunit()
    tree_dict = to_dict(tree)
    json_str = json.dumps(tree_dict, indent=2)
    print(json_str)
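Once the tree is a plain dict, it can be queried with ordinary Python instead of a visitor. The following is a minimal sketch that assumes the shape produced by to_dict above: a rule node is {rule_name: [children]} and a terminal is {"type": ..., "text": ...}. The names iter_rule and leaf_text are hypothetical helpers, and the sample dict is a hand-written fragment with made-up token type numbers, not real parser output.

```python
def iter_rule(obj, rule_name):
    """Yield the child list of every node named rule_name, depth-first."""
    for key, value in obj.items():
        if isinstance(value, list):  # rule node; terminals hold scalars only
            if key == rule_name:
                yield value
            for child in value:
                yield from iter_rule(child, rule_name)


def leaf_text(obj):
    """Concatenate the terminal text below obj, similar to ctx.getText()."""
    if "text" in obj:  # terminal node
        return obj["text"]
    return "".join(leaf_text(c) for children in obj.values() for c in children)


# Hand-written fragment in the same shape as to_dict's output (token types invented).
sample = {
    "postfixexpression": [
        {"postfixexpression": [{"primaryexpression": [{"type": 1, "text": "foo"}]}]},
        {"type": 2, "text": "("},
        {"expressionlist": [
            {"type": 3, "text": "1"},
            {"type": 4, "text": ","},
            {"type": 3, "text": "3"},
        ]},
        {"type": 5, "text": ")"},
    ]
}

for children in iter_rule(sample, "postfixexpression"):
    print(leaf_text({"postfixexpression": children}))
```

The terminal check assumes that no grammar rule is itself named text, which appears to hold for CPP14.g4 but would need revisiting for other grammars.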

