Building an interpreter: designing an AST

I am writing an interpreter for a language of my own design that is similar to Python. I understand that this is no small task, and I don't expect it to work very well or do much, but I would like it to have some basic functionality (variables, functions, loops, if statements, etc.).

Currently the interpreter takes a file and splits it into a list of tokens, and I am now ready to turn these tokens into an AST. I intend to do this with a recursive descent parser, which I believe I understand, but here is the problem. Let's say I have the following input:

1 + 2 * 3

This should evaluate to 7 because, under BIDMAS, the multiplication is done first:

2 * 3 = 6

and the addition is done afterwards:

1 + 6 = 7

I know how to get this order, since I have a simple grammar, but I do not know how to store the result as an AST. To simplify things for the answers, let's assume this is the only input you will receive and that the grammar is:

program = add
add = mul {"+" mul}
mul = NUM {"*" NUM}

So basically, how do you design the data structure(s) to store an AST?

PS: I am doing this in C.

Disclaimer: This representation is subjective and is only meant to illustrate.

Fundamentally, your AST will be constructed like a binary tree, where each AST node is a C structure that holds both a "left" and a "right" pointer. The other elements of an AST node typically depend on context, for example whether the node is a variable declaration, a function definition, or an expression inside a function. For the example you cited, the rough tree would look like this:

   +
 /   \
1     *
      /\
     2  3 

Substituting your AST construct for the nodes of 1 + (2 * 3) above would give something similar to:

                 -----------------
                | type: ADDOP   |
                | left: struct* |
                | right: struct*|
                -----------------
              /                   \
             /                     \
 (ADDOP left points to)         (ADDOP right points to)
------------------------       --------------------------  
| type: NUMLITERAL     |       | type: MULTOP           |
| value: 1             |       | left: struct*          |
| left: struct* (null) |       | right: struct*         |
| right: struct*(null) |       --------------------------
------------------------              /               \
                                     /                 \

                    (MULTOP left points to)         (MULTOP right points to)
                    ------------------------       --------------------------  
                    | type: NUMLITERAL     |       | type: NUMLITERAL       |
                    | value: 2             |       | value: 3               |
                    | left: struct* (null) |       | left: struct* (null)   |
                    | right: struct*(null) |       | right: struct* (null)  |
                    ------------------------       --------------------------

I assume that you know enough about C to malloc nodes and assign the left/right pointers.
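As a rough illustration of the diagram above, the node could be declared along the following lines. This is only a sketch: the names `AstNode`, `NodeType`, `ast_num`, `ast_binop`, `ast_free` and `build_example` are made up for this example, and a real interpreter would likely need more node types and fields.

```c
#include <stdlib.h>

/* Node kinds mirror the labels used in the diagram (assumed names). */
typedef enum { NUMLITERAL, ADDOP, MULTOP } NodeType;

typedef struct AstNode {
    NodeType type;
    int value;               /* used only when type == NUMLITERAL */
    struct AstNode *left;    /* NULL for literals */
    struct AstNode *right;   /* NULL for literals */
} AstNode;

/* Allocate a leaf node holding a number; returns NULL if malloc fails. */
AstNode *ast_num(int value)
{
    AstNode *n = malloc(sizeof *n);
    if (!n) return NULL;
    n->type = NUMLITERAL;
    n->value = value;
    n->left = n->right = NULL;
    return n;
}

/* Allocate an operator node that takes ownership of both children. */
AstNode *ast_binop(NodeType type, AstNode *left, AstNode *right)
{
    AstNode *n = malloc(sizeof *n);
    if (!n) return NULL;
    n->type = type;
    n->value = 0;
    n->left = left;
    n->right = right;
    return n;
}

/* Free a whole tree, children first. */
void ast_free(AstNode *n)
{
    if (!n) return;
    ast_free(n->left);
    ast_free(n->right);
    free(n);
}

/* Build the tree for 1 + (2 * 3) by hand, bottom-up.
   For brevity the children are not checked for NULL here. */
AstNode *build_example(void)
{
    AstNode *mul = ast_binop(MULTOP, ast_num(2), ast_num(3));
    return ast_binop(ADDOP, ast_num(1), mul);
}
```

Calling `build_example()` returns the ADDOP root; its `left` is the literal 1 and its `right` is the MULTOP node, exactly as in the boxes above. A single `value` field is the simplest choice; a union keyed on `type` is a common alternative once nodes start carrying different payloads.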

The remaining step is a post-order traversal of the tree, either to evaluate the expression and produce a result or to emit the intermediate code or machine code for a compiled result. Either choice brings with it a massive amount of thinking and planning on your part.
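To make the post-order idea concrete, here is a minimal sketch of both options for a tree that only contains numbers, additions and multiplications. The struct layout and the names `ast_eval` and `ast_emit` are assumptions for illustration, and the "PUSH/ADD/MUL" output is an invented stack-machine notation, not any particular intermediate language.

```c
#include <stdio.h>

typedef enum { NUMLITERAL, ADDOP, MULTOP } NodeType;

typedef struct AstNode {
    NodeType type;
    int value;                      /* meaningful only for NUMLITERAL */
    struct AstNode *left, *right;
} AstNode;

/* Post-order evaluation: left subtree, then right subtree,
   then combine the two results at the current node. */
int ast_eval(const AstNode *n)
{
    if (n->type == NUMLITERAL)
        return n->value;
    int l = ast_eval(n->left);
    int r = ast_eval(n->right);
    return n->type == ADDOP ? l + r : l * r;
}

/* Post-order emission: same walk, but it prints instructions for a
   hypothetical stack machine instead of computing the value. */
void ast_emit(const AstNode *n, FILE *out)
{
    if (n->type == NUMLITERAL) {
        fprintf(out, "PUSH %d\n", n->value);
        return;
    }
    ast_emit(n->left, out);
    ast_emit(n->right, out);
    fputs(n->type == ADDOP ? "ADD\n" : "MUL\n", out);
}
```

For the tree of 1 + (2 * 3), `ast_eval` returns 7 and `ast_emit` prints PUSH 1, PUSH 2, PUSH 3, MUL, ADD, one per line. Both functions visit the children before the parent, which is what makes the multiplication happen before the addition.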

BTW: As noted, AST nodes will typically carry attributes based on the level of abstraction you want to represent. Also note that a typical compiler may use multiple ASTs for different reasons. Yep, more reading/studying on your part.

Note: This illustrates the data structure for an AST, but @mikeb's answer is rock solid on how to get the string "1 + 2 * 3" into the nodes of such a structure.
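Since the question mentions a recursive descent parser, here is one possible sketch of how the three grammar rules could fill such nodes directly, with one function per rule. It reads characters from a string rather than from a token list, and every name in it (`parse_program`, `parse_add`, `parse_mul`, `parse_num`, `new_node`, `ast_free`) is invented for this example.

```c
#include <ctype.h>
#include <stdlib.h>

typedef enum { NUMLITERAL, ADDOP, MULTOP } NodeType;

typedef struct AstNode {
    NodeType type;
    int value;                     /* meaningful only for NUMLITERAL */
    struct AstNode *left, *right;
} AstNode;

static AstNode *new_node(NodeType type, int value, AstNode *l, AstNode *r)
{
    AstNode *n = malloc(sizeof *n);
    if (!n) abort();               /* sketch: no out-of-memory recovery */
    n->type = type; n->value = value; n->left = l; n->right = r;
    return n;
}

void ast_free(AstNode *n)
{
    if (!n) return;
    ast_free(n->left);
    ast_free(n->right);
    free(n);
}

static void skip_spaces(const char **p)
{
    while (isspace((unsigned char)**p)) (*p)++;
}

/* NUM: one or more decimal digits (overflow is not checked). */
static AstNode *parse_num(const char **p)
{
    skip_spaces(p);
    if (!isdigit((unsigned char)**p)) return NULL;
    int v = 0;
    while (isdigit((unsigned char)**p)) v = v * 10 + (*(*p)++ - '0');
    return new_node(NUMLITERAL, v, NULL, NULL);
}

/* mul = NUM {"*" NUM} */
static AstNode *parse_mul(const char **p)
{
    AstNode *left = parse_num(p);
    while (left) {
        skip_spaces(p);
        if (**p != '*') break;
        (*p)++;
        AstNode *right = parse_num(p);
        if (!right) { ast_free(left); return NULL; }
        left = new_node(MULTOP, 0, left, right);   /* left-associative */
    }
    return left;
}

/* add = mul {"+" mul} */
static AstNode *parse_add(const char **p)
{
    AstNode *left = parse_mul(p);
    while (left) {
        skip_spaces(p);
        if (**p != '+') break;
        (*p)++;
        AstNode *right = parse_mul(p);
        if (!right) { ast_free(left); return NULL; }
        left = new_node(ADDOP, 0, left, right);    /* left-associative */
    }
    return left;
}

/* program = add, and nothing but whitespace may follow.
   Returns NULL on a syntax error. */
AstNode *parse_program(const char *src)
{
    AstNode *root = parse_add(&src);
    if (!root) return NULL;
    skip_spaces(&src);
    if (*src != '\0') { ast_free(root); return NULL; }
    return root;
}
```

`parse_program("1 + 2 * 3")` yields an ADDOP root whose right child is the MULTOP node. Precedence falls out of the call structure: `parse_add` asks `parse_mul` for each operand, so a whole multiplication is already built into a subtree before the addition node wraps it.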

I'd use the "shunting-yard" algorithm: https://en.wikipedia.org/wiki/Shunting-yard_algorithm

There is pseudocode there too.

FTA (from the article):

In computer science, the shunting-yard algorithm is a method for parsing mathematical expressions specified in infix notation. It can be used to produce either a postfix notation string, also known as Reverse Polish notation (RPN), or an abstract syntax tree (AST). The algorithm was invented by Edsger Dijkstra and named the "shunting yard" algorithm because its operation resembles that of a railroad shunting yard. Dijkstra first described the shunting-yard algorithm in the Mathematisch Centrum report MR 34/61.

Like the evaluation of RPN, the shunting-yard algorithm is stack-based. Infix expressions are the form of mathematical notation most people are used to, for instance "3+4" or "3+4*(2−1)". For the conversion there are two text variables (strings), the input and the output. There is also a stack that holds operators not yet added to the output queue. To convert, the program reads each symbol in order and does something based on that symbol. The result for the above examples would be "3 4 +" or "3 4 2 1 - * +".

The shunting-yard algorithm was later generalized into operator-precedence parsing.
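The article mentions that the algorithm can produce an AST instead of a postfix string. A minimal sketch of that variant follows: the output queue is replaced by a stack of subtrees, and each operator that would have been moved to the output instead joins the top two subtrees. It handles only non-negative integers, `+`, `*` and parentheses; the names `Node`, `shunting_yard`, `tree_eval`, `tree_free` and the `MAX_STACK` limit are assumptions for this example. It also does not reject every malformed input (for instance, it accepts "2 3 +").

```c
#include <ctype.h>
#include <stdlib.h>

#define MAX_STACK 64   /* assumed upper bound for this sketch */

/* op is '+' or '*' for operators and 0 for a number leaf. */
typedef struct Node {
    char op;
    int value;
    struct Node *left, *right;
} Node;

static Node *mk(char op, int value, Node *l, Node *r)
{
    Node *n = malloc(sizeof *n);
    if (!n) abort();               /* sketch: no out-of-memory recovery */
    n->op = op; n->value = value; n->left = l; n->right = r;
    return n;
}

void tree_free(Node *n)
{
    if (!n) return;
    tree_free(n->left);
    tree_free(n->right);
    free(n);
}

int tree_eval(const Node *n)
{
    if (n->op == 0) return n->value;
    int l = tree_eval(n->left), r = tree_eval(n->right);
    return n->op == '+' ? l + r : l * r;
}

static int prec(char op) { return op == '*' ? 2 : op == '+' ? 1 : 0; }

/* Pop two operand subtrees and push them back joined under op. */
static int reduce(Node **out, int *n_out, char op)
{
    if (*n_out < 2) return 0;
    Node *r = out[--*n_out];
    Node *l = out[--*n_out];
    out[(*n_out)++] = mk(op, 0, l, r);
    return 1;
}

/* Returns the root of the tree, or NULL on malformed input. */
Node *shunting_yard(const char *s)
{
    Node *out[MAX_STACK];   /* operand stack: finished subtrees */
    char ops[MAX_STACK];    /* operator stack: pending operators */
    int n_out = 0, n_ops = 0;

    while (*s) {
        unsigned char c = (unsigned char)*s;
        if (isspace(c)) {
            s++;
        } else if (isdigit(c)) {
            int v = 0;
            while (isdigit((unsigned char)*s)) v = v * 10 + (*s++ - '0');
            if (n_out == MAX_STACK) goto fail;
            out[n_out++] = mk(0, v, NULL, NULL);
        } else if (c == '+' || c == '*') {
            /* Both operators are left-associative: reduce while the
               stacked operator binds at least as tightly. */
            while (n_ops && prec(ops[n_ops - 1]) >= prec(*s))
                if (!reduce(out, &n_out, ops[--n_ops])) goto fail;
            if (n_ops == MAX_STACK) goto fail;
            ops[n_ops++] = *s++;
        } else if (c == '(') {
            if (n_ops == MAX_STACK) goto fail;
            ops[n_ops++] = *s++;
        } else if (c == ')') {
            while (n_ops && ops[n_ops - 1] != '(')
                if (!reduce(out, &n_out, ops[--n_ops])) goto fail;
            if (!n_ops) goto fail;          /* unmatched ')' */
            n_ops--;                        /* discard the '(' */
            s++;
        } else {
            goto fail;                      /* unknown character */
        }
    }
    while (n_ops) {
        char op = ops[--n_ops];
        if (op == '(' || !reduce(out, &n_out, op)) goto fail;
    }
    if (n_out == 1) return out[0];
fail:
    while (n_out) tree_free(out[--n_out]);
    return NULL;
}
```

For "1 + 2 * 3" the root returned is the `+` node with the `*` subtree on its right, and `tree_eval` gives 7; for "(1 + 2) * 3" the root is `*` and the value is 9. Unlike the longer program below, which bumps precedence for parentheses, this sketch pushes `(` onto the operator stack as in the usual textbook description.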

Here is the code, since it was pointed out that this does not show how to store it (and if you don't like C, take your pick from http://rosettacode.org/wiki/Parsing/Shunting-yard_algorithm ):

#include <sys/types.h>
#include <regex.h>
#include <stdio.h>

typedef struct {
    const char *s;
    int len, prec, assoc;
} str_tok_t;

typedef struct {
    const char * str;
    int assoc, prec;
    regex_t re;
} pat_t;

enum assoc { A_NONE, A_L, A_R };
pat_t pat_eos = {"", A_NONE, 0};

pat_t pat_ops[] = {
    {"^\\)",    A_NONE, -1},
    {"^\\*\\*", A_R, 3},
    {"^\\^",    A_R, 3},
    {"^\\*",    A_L, 2},
    {"^/",      A_L, 2},
    {"^\\+",    A_L, 1},
    {"^-",      A_L, 1},
    {0}
};

pat_t pat_arg[] = {
    {"^[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"},
    {"^[a-zA-Z_][a-zA-Z_0-9]*"},
    {"^\\(", A_L, -1},
    {0}
};

str_tok_t stack[256]; /* assume these are big enough */
str_tok_t queue[256];
int l_queue, l_stack;
#define qpush(x) queue[l_queue++] = x
#define spush(x) stack[l_stack++] = x
#define spop()   stack[--l_stack]

void display(const char *s)
{
    int i;
    printf("\033[1;1H\033[JText | %s", s);
    printf("\nStack| ");
    for (i = 0; i < l_stack; i++)
        printf("%.*s ", stack[i].len, stack[i].s); // uses C99 format strings
    printf("\nQueue| ");
    for (i = 0; i < l_queue; i++)
        printf("%.*s ", queue[i].len, queue[i].s);
    puts("\n\n<press enter>");
    getchar();
}

int prec_booster;

#define fail(s1, s2) {fprintf(stderr, "[Error %s] %s\n", s1, s2); return 0;}

int init(void)
{
    int i;
    pat_t *p;

    for (i = 0, p = pat_ops; p[i].str; i++)
        if (regcomp(&(p[i].re), p[i].str, REG_NEWLINE|REG_EXTENDED))
            fail("comp", p[i].str);

    for (i = 0, p = pat_arg; p[i].str; i++)
        if (regcomp(&(p[i].re), p[i].str, REG_NEWLINE|REG_EXTENDED))
            fail("comp", p[i].str);

    return 1;
}

pat_t* match(const char *s, pat_t *p, str_tok_t * t, const char **e)
{
    int i;
    regmatch_t m;

    while (*s == ' ') s++;
    *e = s;

    if (!*s) return &pat_eos;

    for (i = 0; p[i].str; i++) {
        if (regexec(&(p[i].re), s, 1, &m, REG_NOTEOL))
            continue;
        t->s = s;
        *e = s + (t->len = m.rm_eo - m.rm_so);
        return p + i;
    }
    return 0;
}

int parse(const char *s) {
    pat_t *p;
    str_tok_t *t, tok;

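    /* Reset all state, including the operator stack: a failed parse can
       leave entries on it that would otherwise leak into the next call. */
    prec_booster = l_queue = l_stack = 0;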
    display(s);
    while (*s) {
        p = match(s, pat_arg, &tok, &s);
        if (!p || p == &pat_eos) fail("parse arg", s);

        /* Odd logic here. Don't actually stack the parens: don't need to. */
        if (p->prec == -1) {
            prec_booster += 100;
            continue;
        }
        qpush(tok);
        display(s);

re_op:      p = match(s, pat_ops, &tok, &s);
        if (!p) fail("parse op", s);

        tok.assoc = p->assoc;
        tok.prec = p->prec;

        if (p->prec > 0)
            tok.prec = p->prec + prec_booster;
        else if (p->prec == -1) {
            if (prec_booster < 100)
                fail("unmatched )", s);
            tok.prec = prec_booster;
        }

        while (l_stack) {
            t = stack + l_stack - 1;
            if (!(t->prec == tok.prec && t->assoc == A_L)
                    && t->prec <= tok.prec)
                break;
            qpush(spop());
            display(s);
        }

        if (p->prec == -1) {
            prec_booster -= 100;
            goto re_op;
        }

        if (!p->prec) {
            display(s);
            if (prec_booster)
                fail("unmatched (", s);
            return 1;
        }

        spush(tok);
        display(s);
    }

    return 1;
}

int main()
{
    int i;
    const char *tests[] = { 
        "3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3",    /* RC mandated: OK */
        "123",                  /* OK */
        "3+4 * 2 / ( 1 - 5 ) ^ 2 ^ 3.14",   /* OK */
        "(((((((1+2+3**(4 + 5))))))",       /* bad parens */
        "a^(b + c/d * .1e5)!",          /* unknown op */
        "(1**2)**3",                /* OK */
        0
    };

    if (!init()) return 1;
    for (i = 0; tests[i]; i++) {
        printf("Testing string `%s'   <enter>\n", tests[i]);
        getchar();

        printf("string `%s': %s\n\n", tests[i],
            parse(tests[i]) ? "Ok" : "Error");
    }

    return 0;
}
