
How to implement a BNF grammar tree for parsing input in Go?

The grammar for the type language is as follows:

TYPE ::= TYPEVAR | PRIMITIVE_TYPE | FUNCTYPE | LISTTYPE;
PRIMITIVE_TYPE ::= 'int' | 'float' | 'long' | 'string';
TYPEVAR ::= '`' VARNAME; // Note, the character is a backwards apostrophe!
VARNAME ::= [a-zA-Z][a-zA-Z0-9]*; // Initial letter, then can have numbers
FUNCTYPE ::= '(' ARGLIST ')' -> TYPE | '(' ')' -> TYPE;
ARGLIST ::= TYPE ',' ARGLIST | TYPE;
LISTTYPE ::= '[' TYPE ']';

My input is a TYPE.

For example, the input (int,int)->float is valid. The input ( [int] , int) is not a type under this grammar, so it is invalid.

I need to parse input from the keyboard and decide whether it is valid under this grammar (for later type inference). However, I don't know how to build this grammar in Go or how to parse the input byte by byte. Are there any hints or similar implementations? That would be really helpful.

For your purposes, the grammar of types looks simple enough that you should be able to write a recursive descent parser that roughly matches the shape of your grammar.

As a concrete example, let's say that we're recognizing a similar language.

TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'
ARGLIST ::= TYPE ARGLIST | TYPE

This is not exactly the same as your original problem, but you should be able to see the similarities.

A recursive descent parser consists of one function per production rule.

func ParseType(???) error {
    ???
}

func ParsePrimitiveType(???) error {
    ???
}

func ParseTupleType(???) error {
    ???
}

func ParseArgList(???) error {
    ???
}

Here ??? marks the parts we don't know how to fill in yet. For now, we'll at least say that each function returns an error if it can't parse.

The input into each of the functions is some stream of tokens. In our case, those tokens consist of sequences of:

 "int"
 "("
 ")"

and we can imagine a Stream might be something that satisfies:

type Stream interface {
    Peek() string  // peek at next token, stay where we are
    Next() string  // pick next token, move forward
}

to let us walk sequentially through the token stream.

A lexer is responsible for taking something like a string or io.Reader and producing this stream of string tokens. Lexers are fairly easy to write: you can imagine just using regexps or something similar to break a string into tokens.

Assuming we have a token stream, a parser just needs to deal with that stream and a very limited set of possibilities. As mentioned before, each production rule corresponds to a parsing function. Within a production rule, each alternative is a conditional branch. If the grammar is particularly simple (as yours is!), we can figure out which branch to take just by peeking at the next token.

For example, let's look at TYPE and its corresponding ParseType function:

TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'

How might this correspond to the definition of ParseType?

The production says that there are two possibilities: the type is either (1) primitive or (2) a tuple. We can peek at the token stream: if we see "int", then we know it's primitive. If we see a "(", then the only possibility is a tuple type, so we can call the tuple-type parser function and let it do the dirty work.

It's important to note: if we see neither a "(" nor an "int", then something has gone horribly wrong! We know this just from looking at the grammar: every type must FIRST start with one of those two tokens.

Ok, let's write the code.

func ParseType(s Stream) error {
    peeked := s.Peek()
    if peeked == "int" {
        return ParsePrimitiveType(s)
    }
    if peeked == "(" {
        return ParseTupleType(s)
    }
    return fmt.Errorf("ParseType on %#v", peeked)
}

Parsing PRIMITIVETYPE and TUPLETYPE is equally direct.

func ParsePrimitiveType(s Stream) error {
    next := s.Next()
    if next == "int" {
        return nil
    }
    return fmt.Errorf("ParsePrimitiveType on %#v", next)
}

func ParseTupleType(s Stream) error {
    lparen := s.Next()
    if lparen != "(" {
        return fmt.Errorf("ParseTupleType on %#v", lparen)
    }

    err := ParseArgList(s)
    if err != nil {
        return err
    }

    rparen := s.Next()
    if rparen != ")" {
        return fmt.Errorf("ParseTupleType on %#v", rparen)
    }

    return nil
}

The only one that might cause some issues is the parser for argument lists. Let's look at the rule.

ARGLIST ::= TYPE ARGLIST | TYPE

If we try to write the function ParseArgList, we might get stuck because we don't yet know which choice to make. Do we go for the first alternative or the second?

Well, let's at least parse out the part that's common to both alternatives: the TYPE part.

func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }

    /// ... FILL ME IN.  Do we call ParseArgList() again, or stop?
}

So we've parsed the prefix. If this was the second alternative, we're done. But what if it was the first? Then we'd still have to read more types.

Ah, but if we are continuing to read additional types, then the stream must FIRST start with another type. And we know that all types FIRST start with either "int" or "(". So we can peek at the stream. The decision between the first and the second alternative hinges on just this!

func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }

    peeked := s.Peek()
    if peeked == "int" || peeked == "(" {
        // alternative 1
        return ParseArgList(s)
    }
    // alternative 2
    return nil
}

Believe it or not, that's pretty much all we need. Here is the complete working code.

package main

import "fmt"

type Stream interface {
    Peek() string
    Next() string
}

type TokenSlice []string

func (s *TokenSlice) Peek() string {
    return (*s)[0]
}

func (s *TokenSlice) Next() string {
    result := (*s)[0]
    *s = (*s)[1:]
    return result
}

func ParseType(s Stream) error {
    peeked := s.Peek()
    if peeked == "int" {
        return ParsePrimitiveType(s)
    }
    if peeked == "(" {
        return ParseTupleType(s)
    }
    return fmt.Errorf("ParseType on %#v", peeked)
}

func ParsePrimitiveType(s Stream) error {
    next := s.Next()
    if next == "int" {
        return nil
    }
    return fmt.Errorf("ParsePrimitiveType on %#v", next)
}

func ParseTupleType(s Stream) error {
    lparen := s.Next()
    if lparen != "(" {
        return fmt.Errorf("ParseTupleType on %#v", lparen)
    }

    err := ParseArgList(s)
    if err != nil {
        return err
    }

    rparen := s.Next()
    if rparen != ")" {
        return fmt.Errorf("ParseTupleType on %#v", rparen)
    }

    return nil
}

func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }

    peeked := s.Peek()
    if peeked == "int" || peeked == "(" {
        // alternative 1
        return ParseArgList(s)
    }
    // alternative 2
    return nil
}

func main() {
    fmt.Println(ParseType(&TokenSlice{"int"}))
    fmt.Println(ParseType(&TokenSlice{"(", "int", ")"}))
    fmt.Println(ParseType(&TokenSlice{"(", "int", "int", ")"}))
    fmt.Println(ParseType(&TokenSlice{"(", "(", "int", ")", "(", "int", ")", ")"}))

    // Should show error:
    fmt.Println(ParseType(&TokenSlice{"(", ")"}))
}

This is a toy parser, of course. It does not handle certain kinds of errors well: Peek and Next index the slice without checking its length, so premature end of input (for example, the tokens "(" and "int" with nothing after them) causes a panic, and ParseType does not complain about leftover tokens after a complete type. Tokens should also carry not only their textual content but also their source location, for good error reporting. For your own purposes, you'll also want to expand the parsers so that they return not just an error but also some useful result from the parse.

This answer is just a sketch of how recursive descent parsers work. You should really read a good compiler book to get the details, because you will need them. The Dragon Book, for example, spends at least a good chapter on how to write recursive descent parsers, with plenty of technical detail. In particular, you want to know about the concept of FIRST sets (which I hinted at), because you'll need to understand them to choose the appropriate alternative when writing each of your parser functions.
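To show how the FIRST idea carries over to the grammar in the question, here is a rough sketch of a recognizer for it. Each alternative of TYPE starts with a different token (a backquote for TYPEVAR, a keyword for PRIMITIVE_TYPE, "[" for LISTTYPE, "(" for FUNCTYPE), so one token of lookahead is enough. All names (Validate, parser, accept, tokenPattern, primitives) are invented for this illustration, and it assumes that whitespace between tokens is allowed and ignored, as the example ( [int] , int) in the question suggests.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenPattern matches a type variable, a word, "->", or one punctuation mark.
var tokenPattern = regexp.MustCompile("`[a-zA-Z][a-zA-Z0-9]*|[a-zA-Z]+|->|[()\\[\\],]")

var primitives = map[string]bool{"int": true, "float": true, "long": true, "string": true}

type parser struct {
	tokens []string
	pos    int // index of the next unread token
}

// peek returns the next token, or "" at end of input.
func (p *parser) peek() string {
	if p.pos >= len(p.tokens) {
		return ""
	}
	return p.tokens[p.pos]
}

// accept consumes the next token and returns true only if it equals want.
func (p *parser) accept(want string) bool {
	if p.peek() != want {
		return false
	}
	p.pos++
	return true
}

// parseType chooses an alternative of TYPE from the FIRST token alone.
func (p *parser) parseType() bool {
	tok := p.peek()
	switch {
	case strings.HasPrefix(tok, "`") || primitives[tok]: // TYPEVAR or PRIMITIVE_TYPE
		p.pos++
		return true
	case tok == "[": // LISTTYPE
		p.pos++
		return p.parseType() && p.accept("]")
	case tok == "(": // FUNCTYPE
		p.pos++
		// A ")" right away selects the empty-argument alternative.
		if p.peek() != ")" && !p.parseArgList() {
			return false
		}
		return p.accept(")") && p.accept("->") && p.parseType()
	}
	return false
}

// parseArgList handles ARGLIST ::= TYPE ',' ARGLIST | TYPE.
// After a TYPE, a "," selects the first alternative; anything else ends the list.
func (p *parser) parseArgList() bool {
	for {
		if !p.parseType() {
			return false
		}
		if !p.accept(",") {
			return true
		}
	}
}

// Validate reports whether input is exactly one TYPE. On failure the parser
// has stopped at the offending token, so its index goes into the error.
func Validate(input string) error {
	// Whatever is left after removing all tokens must be whitespace.
	if rest := strings.TrimSpace(tokenPattern.ReplaceAllString(input, "")); rest != "" {
		return fmt.Errorf("unexpected characters %q", rest)
	}
	p := &parser{tokens: tokenPattern.FindAllString(input, -1)}
	if !p.parseType() || p.pos != len(p.tokens) {
		return fmt.Errorf("token %d: unexpected %q", p.pos, p.peek())
	}
	return nil
}

func main() {
	for _, in := range []string{"(int,int)->float", "( [int] , int)", "[`a]", "()->`b1"} {
		fmt.Printf("%q: %v\n", in, Validate(in))
	}
}
```

The comma in this ARGLIST makes the choice easier than in the toy grammar: after a TYPE, only a "," can continue the list. This sketch only answers valid or invalid; for type inference, the parse functions would need to return a tree rather than a bool.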
