
How to implement a BNF grammar tree for parsing input in Go?

The grammar for the type language is as follows:

TYPE ::= TYPEVAR | PRIMITIVE_TYPE | FUNCTYPE | LISTTYPE;
PRIMITIVE_TYPE ::= 'int' | 'float' | 'long' | 'string';
TYPEVAR ::= '`' VARNAME; // Note, the character is a backwards apostrophe!
VARNAME ::= [a-zA-Z][a-zA-Z0-9]*; // Initial letter, then can have numbers
FUNCTYPE ::= '(' ARGLIST ')' '->' TYPE | '(' ')' '->' TYPE;
ARGLIST ::= TYPE ',' ARGLIST | TYPE;
LISTTYPE ::= '[' TYPE ']';

My input is a single TYPE.

For example, if I input (int,int)->float, this is valid. If I input ([int], int), it is not a valid type.

I need to parse input from the keyboard and decide whether it is valid under this grammar (for later type inference). However, I don't know how to build this grammar in Go or how to parse the input byte by byte. Is there any hint or similar implementation? That would be really helpful.

For your purposes, the grammar of types looks simple enough that you should be able to write a recursive descent parser that roughly matches the shape of your grammar.

As a concrete example, let's say that we're recognizing a similar language.

TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'
ARGLIST ::= TYPE ARGLIST | TYPE

It's not exactly the same as your original problem, but you should be able to see the similarities.

A recursive descent parser consists of one function for each production rule.

func ParseType(???) error {
    ???
}

func ParsePrimitiveType(???) error {
    ???
}

func ParseTupleType(???) error {
    ???
}

func ParseArgList(???) error {
    ???
}

where ??? marks the things we don't quite know how to fill in until we get there. For now, we'll at least say that we get an error if we can't parse.

The input into each of the functions is some stream of tokens. In our case, those tokens consist of sequences of:

 "int"
 "("
 ")"

and we can imagine a Stream might be something that satisfies:

type Stream interface {
    Peek() string  // peek at next token, stay where we are
    Next() string  // pick next token, move forward
}

to let us walk sequentially through the token stream.

A lexer is responsible for taking something like a string or io.Reader and producing this stream of string tokens. Lexers are fairly easy to write: you can imagine just using regexps or something similar to break a string into tokens.
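As a rough illustration of the regexp approach, here is a minimal lexer sketch for the token set of the question's grammar. The names Lex and tokenPattern are hypothetical, and the choice to emit a type variable such as `a as a single token is an assumption of this sketch, not something the grammar dictates.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenPattern (hypothetical name) matches one token at the start of the
// remaining input: the arrow, a punctuation character, a type variable
// (backtick plus name), or a bare word such as "int".
var tokenPattern = regexp.MustCompile("^(->|[()\\[\\],]|`[a-zA-Z][a-zA-Z0-9]*|[a-zA-Z][a-zA-Z0-9]*)")

// Lex splits src into tokens, skipping whitespace between them.
// It does not check that bare words are valid primitive types;
// that is left to the parser.
func Lex(src string) ([]string, error) {
	var tokens []string
	rest := strings.TrimSpace(src)
	for rest != "" {
		tok := tokenPattern.FindString(rest)
		if tok == "" {
			return nil, fmt.Errorf("lex: unexpected input at %q", rest)
		}
		tokens = append(tokens, tok)
		rest = strings.TrimSpace(rest[len(tok):])
	}
	return tokens, nil
}

func main() {
	fmt.Println(Lex("(int, [`a]) -> float"))
	fmt.Println(Lex("int $")) // "$" is not part of the language
}
```

The resulting slice can back a Stream directly, with Peek looking at the first element and Next dropping it.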

Assuming we have a token stream, a parser just needs to deal with that stream and a very limited set of possibilities. As mentioned before, each production rule corresponds to a parsing function. Within a production rule, each alternative is a conditional branch. If the grammar is particularly simple (as yours is!), we can figure out which conditional branch to take.

For example, let's look at TYPE and its corresponding ParseType function:

TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'

How might this correspond to the definition of ParseType?

The production says that there are two possibilities: it can be either (1) primitive or (2) tuple. We can peek at the token stream: if we see "int", then we know it's primitive. If we see a "(", then the only possibility is that it's a tuple type, so we can call the tuple type parser function and let it do the dirty work.

It's important to note: if we see neither a "(" nor an "int", then something has gone horribly wrong! We know this just from looking at the grammar. We can see that every type must parse from something that FIRST starts with one of those two tokens.

Ok, let's write the code.

func ParseType(s Stream) error {
    peeked := s.Peek()
    if peeked == "int" {
        return ParsePrimitiveType(s)
    }
    if peeked == "(" {
        return ParseTupleType(s)
    }
    return fmt.Errorf("ParseType on %#v", peeked)
}

Parsing PRIMITIVETYPE and TUPLETYPE is equally direct.

func ParsePrimitiveType(s Stream) error {
    next := s.Next()
    if next == "int" {
        return nil
    }
    return fmt.Errorf("ParsePrimitiveType on %#v", next)
}

func ParseTupleType(s Stream) error {
    lparen := s.Next()
    if lparen != "(" {
        return fmt.Errorf("ParseTupleType on %#v", lparen)
    }

    err := ParseArgList(s)
    if err != nil {
        return err
    }

    rparen := s.Next()
    if rparen != ")" {
        return fmt.Errorf("ParseTupleType on %#v", rparen)
    }

    return nil
}

The only one that might cause some issues is the parser for argument lists. Let's look at the rule.

ARGLIST ::= TYPE ARGLIST | TYPE

If we try to write the function ParseArgList, we might get stuck because we don't yet know which choice to make. Do we go for the first or the second alternative?

Well, let's at least parse out the part that's common to both alternatives: the TYPE part.

func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }

    /// ... FILL ME IN.  Do we call ParseArgList() again, or stop?
}

So we've parsed the prefix. If it was the second case, we're done. But what if it was the first case? Then we'd still have to read an additional list of types.

Ah, but if we are continuing to read additional types, then the stream must FIRST start with another type. And we know that all types FIRST start with either "int" or "(". So we can peek at the stream. Our decision between the first and the second alternative hinges on just this!

func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }

    peeked := s.Peek()
    if peeked == "int" || peeked == "(" {
        // alternative 1
        return ParseArgList(s)
    }
    // alternative 2
    return nil
}

Believe it or not, that's pretty much all we need. Here is working code.

package main

import "fmt"

type Stream interface {
    Peek() string
    Next() string
}

type TokenSlice []string

func (s *TokenSlice) Peek() string {
    return (*s)[0]
}

func (s *TokenSlice) Next() string {
    result := (*s)[0]
    *s = (*s)[1:]
    return result
}

func ParseType(s Stream) error {
    peeked := s.Peek()
    if peeked == "int" {
        return ParsePrimitiveType(s)
    }
    if peeked == "(" {
        return ParseTupleType(s)
    }
    return fmt.Errorf("ParseType on %#v", peeked)
}

func ParsePrimitiveType(s Stream) error {
    next := s.Next()
    if next == "int" {
        return nil
    }
    return fmt.Errorf("ParsePrimitiveType on %#v", next)
}

func ParseTupleType(s Stream) error {
    lparen := s.Next()
    if lparen != "(" {
        return fmt.Errorf("ParseTupleType on %#v", lparen)
    }

    err := ParseArgList(s)
    if err != nil {
        return err
    }

    rparen := s.Next()
    if rparen != ")" {
        return fmt.Errorf("ParseTupleType on %#v", rparen)
    }

    return nil
}

func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }

    peeked := s.Peek()
    if peeked == "int" || peeked == "(" {
        // alternative 1
        return ParseArgList(s)
    }
    // alternative 2
    return nil
}

func main() {
    fmt.Println(ParseType(&TokenSlice{"int"}))
    fmt.Println(ParseType(&TokenSlice{"(", "int", ")"}))
    fmt.Println(ParseType(&TokenSlice{"(", "int", "int", ")"}))
    fmt.Println(ParseType(&TokenSlice{"(", "(", "int", ")", "(", "int", ")", ")"}))

    // Should show error:
    fmt.Println(ParseType(&TokenSlice{"(", ")"}))
}

This is a toy parser, of course, because it does not handle certain kinds of errors very well (like premature end of input), and tokens should include not only their textual content but also their source location for good error reporting. For your own purposes, you'll also want to expand the parsers so that they don't just return error, but also some kind of useful result from the parse.
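To make those limitations concrete, here is one possible sketch of the same toy grammar where each parse function returns a tree node, the stream reports end of input as "" instead of panicking, and errors carry a token index as a stand-in for a real source location. The Node type, the tokens type, Parse, and errUnexpectedEOF are hypothetical names introduced for this illustration, and ARGLIST is written as a loop rather than a recursive call.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical parse-tree node: Kind is "int" or "tuple",
// and Children holds the element types of a tuple.
type Node struct {
	Kind     string
	Children []*Node
}

// tokens is a slice-backed stream. Peek and Next return "" at end of
// input instead of indexing past the slice.
type tokens struct {
	items []string
	pos   int
}

func (t *tokens) Peek() string {
	if t.pos >= len(t.items) {
		return ""
	}
	return t.items[t.pos]
}

func (t *tokens) Next() string {
	tok := t.Peek()
	if tok != "" {
		t.pos++
	}
	return tok
}

var errUnexpectedEOF = errors.New("unexpected end of input")

// parseType handles TYPE ::= PRIMITIVETYPE | TUPLETYPE.
func parseType(t *tokens) (*Node, error) {
	switch t.Peek() {
	case "int":
		t.Next()
		return &Node{Kind: "int"}, nil
	case "(":
		return parseTuple(t)
	case "":
		return nil, errUnexpectedEOF
	}
	return nil, fmt.Errorf("token %d: unexpected %q", t.pos, t.Peek())
}

// parseTuple handles TUPLETYPE ::= '(' ARGLIST ')', with ARGLIST as a loop.
func parseTuple(t *tokens) (*Node, error) {
	t.Next() // consume "(", which the caller has already peeked at
	node := &Node{Kind: "tuple"}
	for {
		child, err := parseType(t)
		if err != nil {
			return nil, err
		}
		node.Children = append(node.Children, child)
		// Continue only if the next token can start another TYPE.
		if next := t.Peek(); next != "int" && next != "(" {
			break
		}
	}
	switch tok := t.Next(); tok {
	case ")":
		return node, nil
	case "":
		return nil, errUnexpectedEOF
	default:
		return nil, fmt.Errorf("token %d: expected \")\", got %q", t.pos-1, tok)
	}
}

// Parse parses a whole token slice and rejects trailing tokens.
func Parse(items []string) (*Node, error) {
	t := &tokens{items: items}
	node, err := parseType(t)
	if err != nil {
		return nil, err
	}
	if t.Peek() != "" {
		return nil, fmt.Errorf("token %d: unexpected trailing %q", t.pos, t.Peek())
	}
	return node, nil
}

func main() {
	node, err := Parse([]string{"(", "int", "(", "int", ")", ")"})
	fmt.Println(len(node.Children), err)
	_, err = Parse([]string{"(", "int"}) // input ends before ")"
	fmt.Println(err)
}
```

Using "" as the end-of-input marker works here only because no real token is the empty string; a fuller design would likely use a dedicated token type with line and column fields.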

This answer is just a sketch of how recursive descent parsers work. But you should really read a good compiler book to get the details, because you need them. The Dragon Book, for example, devotes at least a good chapter to writing recursive descent parsers, with plenty of technical detail. In particular, you want to know about the concept of FIRST sets (which I hinted at), because you'll need to understand them to choose which alternative is appropriate when writing each of your parser functions.
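Applied to the grammar in the question, the FIRST-set idea might look like the following recognizer sketch. It assumes the input has already been lexed into string tokens, that a type variable arrives as a single token such as "`a" whose name the lexer has already checked, and that "->" is one token. The parser type and the Valid function are hypothetical names for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// parser walks a pre-lexed token slice; peek returns "" at end of input.
type parser struct {
	toks []string
	pos  int
}

func (p *parser) peek() string {
	if p.pos < len(p.toks) {
		return p.toks[p.pos]
	}
	return ""
}

// expect consumes the next token if it equals want, else reports an error.
func (p *parser) expect(want string) error {
	if got := p.peek(); got != want {
		return fmt.Errorf("expected %q, got %q", want, got)
	}
	p.pos++
	return nil
}

var primitives = map[string]bool{"int": true, "float": true, "long": true, "string": true}

// parseType dispatches on the FIRST set of each alternative of TYPE.
func (p *parser) parseType() error {
	tok := p.peek()
	switch {
	case primitives[tok]: // FIRST(PRIMITIVE_TYPE)
		p.pos++
		return nil
	case strings.HasPrefix(tok, "`"): // FIRST(TYPEVAR), name assumed checked by the lexer
		p.pos++
		return nil
	case tok == "[": // FIRST(LISTTYPE)
		p.pos++
		if err := p.parseType(); err != nil {
			return err
		}
		return p.expect("]")
	case tok == "(": // FIRST(FUNCTYPE)
		return p.parseFuncType()
	}
	return fmt.Errorf("expected a type, got %q", tok)
}

// parseFuncType handles both FUNCTYPE alternatives; they differ only in
// whether ")" follows "(" immediately.
func (p *parser) parseFuncType() error {
	if err := p.expect("("); err != nil {
		return err
	}
	if p.peek() != ")" {
		// ARGLIST ::= TYPE ',' ARGLIST | TYPE, written as a loop.
		for {
			if err := p.parseType(); err != nil {
				return err
			}
			if p.peek() != "," {
				break
			}
			p.pos++ // consume ","
		}
	}
	if err := p.expect(")"); err != nil {
		return err
	}
	if err := p.expect("->"); err != nil {
		return err
	}
	return p.parseType()
}

// Valid reports whether toks form exactly one TYPE with nothing left over.
func Valid(toks []string) bool {
	p := &parser{toks: toks}
	return p.parseType() == nil && p.pos == len(p.toks)
}

func main() {
	fmt.Println(Valid([]string{"(", "int", ",", "int", ")", "->", "float"}))
	fmt.Println(Valid([]string{"(", "[", "int", "]", ",", "int", ")"}))
}
```

Because the four alternatives of TYPE start with disjoint sets of tokens, a single peek is enough to pick a branch, and the comma makes the ARGLIST decision even simpler than in the toy grammar.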

