How to validate input based on CFG?

Consider this grammar:

expr ::= LP expr RP
         | expr PLUS|MINUS expr
         | expr STAR|SLASH expr
         | term

term ::= INTEGER|FLOAT

A context-free grammar is defined as G = (V, Σ, R, S), where (in this case):

V = { expr, term }
Σ = { LP, RP, PLUS, MINUS, STAR, SLASH, INTEGER, FLOAT }
R = //was defined above
S = expr

Now let's define a small class called Parser, whose definition is as follows (code samples are in C++):

class Parser
{
public:
    Parser();
    void Parse();
private:
    void parseRecursive(vector<string> rules, unsigned ruleIndex, unsigned startingTokenIndex, string prevRule);

    bool isTerm(string token);  //returns true if the word is a terminal (written in upper case)
    vector<string> split(...);  //input: string; output: vector of words split by delim

    map<string, vector<string>> ruleNames; //contains grammar definition
    vector<int> tokenList; //our input set of tokens
};

To make it easier to move between rules, every grammar rule is split into two parts: a key (before ::= ) and its alternatives (after ::= ). The grammar above therefore becomes the following map:

std::map<string, vector<string>> ruleNames =
{
    { "expr", {
            "LP expr RP",
            "expr PLUS|MINUS expr",
            "expr STAR|SLASH expr",
            "term"
        }
    },
    { "term", { "INTEGER", "FLOAT" } }
};
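
The question declares split(...) without a body. A minimal sketch of how one rule string from this map could be broken into words, and each word into its | alternatives, is shown below. The names splitOn and splitRule are assumptions made for illustration and do not appear in the original code.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split `text` on a single delimiter character, dropping empty pieces.
std::vector<std::string> splitOn(const std::string& text, char delim)
{
    std::vector<std::string> parts;
    std::istringstream stream(text);
    std::string piece;
    while (std::getline(stream, piece, delim))
    {
        if (!piece.empty())
            parts.push_back(piece);
    }
    return parts;
}

// Turn "expr PLUS|MINUS expr" into { {"expr"}, {"PLUS", "MINUS"}, {"expr"} }:
// the outer vector holds the words of the rule, the inner one the
// alternatives allowed at that position.
std::vector<std::vector<std::string>> splitRule(const std::string& rule)
{
    std::vector<std::vector<std::string>> words;
    for (const std::string& word : splitOn(rule, ' '))
        words.push_back(splitOn(word, '|'));
    return words;
}
```

Calling splitRule on each string stored under a key gives the same nested structure that parseRecursive builds by hand in its first loop.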

For testing purposes, the string (2 + 3) * 4 has been tokenized into the following token list

{ TK_LP, TK_INTEGER, TK_PLUS, TK_INTEGER, TK_RP, TK_STAR, TK_INTEGER }

which is used as the input to Parser .

Now for the hardest part: the algorithm. This is what I had in mind:

1) Take the first rule from the starting symbol's vector ( LP expr RP ) and split it into words.

2) Check whether the first word in the rule is a terminal.

  1. If the word is a terminal, compare it with the first token.
    • If they are equal, increase the token index and move to the next word in the rule.
    • If they are not equal, keep the token index and move to the next rule.
  2. If the word is a non-terminal and was not used in the previous recursion, increase the token index and recurse (passing the new rules and the non-terminal word).

Although I am not sure about this algorithm, I still tried to implement it (unsuccessfully, of course):

1) Outer Parse function that initiates the recursion:

void Parser::Parse()
{
    int startingTokenIndex = 0;
    string word = this->startingSymbol;
    for (int ruleIndex = 0; ruleIndex < this->ruleNames[word].size(); ruleIndex++)
    {
        this->parseRecursive(this->ruleNames[word], ruleIndex, startingTokenIndex, "");
    }
}

2) Recursive function:

void Parser::parseRecursive(vector<string> rules, unsigned ruleIndex, unsigned startingTokenIndex, string prevRule)
{
    printf("%s - %s\n", rules[ruleIndex].c_str(), this->tokenNames[this->tokenList[startingTokenIndex]].c_str());
    vector<string> temp = this->split(rules[ruleIndex], ' ');
    vector<vector<string>> ruleWords;
    bool breakOutter = false;

    for (unsigned wordIndex = 0; wordIndex < temp.size(); wordIndex++)
    {
        ruleWords.push_back(this->split(temp[wordIndex], '|'));
    }

    for (unsigned wordIndex = 0; wordIndex < ruleWords.size(); wordIndex++)
    {
        breakOutter = false;
        for (unsigned subWordIndex = 0; subWordIndex < ruleWords[wordIndex].size(); subWordIndex++)
        {
            string word = ruleWords[wordIndex][subWordIndex];
            if (this->isTerm(word))
            {
                if (this->tokenNames[this->tokenList[startingTokenIndex]] == this->makeToken(word))
                {
                    printf("%s ", word.c_str());
                    startingTokenIndex++;
                } else {
                    breakOutter = true;
                    break;
                }
            } else {
                if (prevRule != word)
                {
                    startingTokenIndex++;
                    this->parseRecursive(this->ruleNames[word], 0, startingTokenIndex, word);
                    prevRule = word;
                }
            }
        }

        if (breakOutter)
            break;
    }
}

What changes should I make to my algorithm to make it work?

The approach depends on whether you are writing a one-off parser by hand or building a compiler-compiler (parser generator). Parser generators mostly use LR parsing, while hand-written parsers mostly use LL. A hand-written LL parser is usually implemented by recursive descent: one function is written for each non-terminal. For example, take this grammar:

S -> S + A | A
A -> a | b

First eliminate the left recursion and left-factor the grammar (LL parsers cannot handle left recursion):

S -> As
s -> + As | epsilon
A -> a | b

A possible implementation:

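/* peek_token () is an assumed helper: it returns the next token
   without consuming it. get_next_token () consumes the token. */
void S (void)
{
    A ();
    s ();
}
void s (void)
{
    /* epsilon alternative: if the lookahead is not '+', consume nothing */
    if (peek_token ()->value != '+')
        return;
    get_next_token ();  /* consume '+' */
    A ();
    s ();
}
void A (void)
{
    token *tok = get_next_token ();
    if (tok->value != 'a' && tok->value != 'b')
        syntax_error ();
}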
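
The same idea can be applied to the grammar from the question. The sketch below is one possible rewrite, not the only one: it assumes the usual precedence (STAR and SLASH bind tighter than PLUS and MINUS), which the original ambiguous grammar does not specify, and it uses loops where the example above uses a tail-recursive non-terminal. The names Validator, product, primary and accept are hypothetical; the token names mirror the TK_ constants from the question.

```cpp
#include <cstddef>
#include <vector>

enum Token { TK_LP, TK_RP, TK_PLUS, TK_MINUS, TK_STAR, TK_SLASH, TK_INTEGER, TK_FLOAT };

// Validates a token list against this rewritten grammar (an assumption,
// chosen to remove left recursion and encode operator precedence):
//   expr    -> product ((PLUS | MINUS) product)*
//   product -> primary ((STAR | SLASH) primary)*
//   primary -> LP expr RP | INTEGER | FLOAT
class Validator
{
public:
    explicit Validator(const std::vector<Token>& input) : tokens(input), pos(0) {}

    // The input is valid only if expr matches and no tokens are left over.
    bool validate()
    {
        pos = 0;
        return expr() && pos == tokens.size();
    }

private:
    // Consume the current token if it equals `expected`.
    bool accept(Token expected)
    {
        if (pos < tokens.size() && tokens[pos] == expected)
        {
            ++pos;
            return true;
        }
        return false;
    }

    bool expr()
    {
        if (!product()) return false;
        while (accept(TK_PLUS) || accept(TK_MINUS))
            if (!product()) return false;
        return true;
    }

    bool product()
    {
        if (!primary()) return false;
        while (accept(TK_STAR) || accept(TK_SLASH))
            if (!primary()) return false;
        return true;
    }

    bool primary()
    {
        if (accept(TK_LP))
            return expr() && accept(TK_RP);
        return accept(TK_INTEGER) || accept(TK_FLOAT);
    }

    std::vector<Token> tokens;
    std::size_t pos;
};
```

With the token list from the question, Validator(tokens).validate() accepts (2 + 3) * 4, while an input such as ( 2 + ) is rejected because product cannot find a primary after the PLUS.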

If you want to add a syntax-directed definition (SDD), inherited attributes are passed in as function arguments and synthesized attributes are returned as return values.
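
As a rough illustration of that attribute flow, the sketch below evaluates sums with the grammar S -> A s, s -> + A s | epsilon. To give A a value, it assumes A is a single decimal digit rather than a or b. The struct name SumParser and the parameter name left are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Grammar (A is assumed to be a single digit so that it has a value):
//   S -> A s
//   s -> '+' A s | epsilon
struct SumParser
{
    explicit SumParser(const std::string& text) : input(text), pos(0) {}

    // A.val is a synthesized attribute: it is returned to the caller.
    int A()
    {
        if (pos >= input.size() || !std::isdigit(static_cast<unsigned char>(input[pos])))
            throw std::runtime_error("syntax error: digit expected");
        return input[pos++] - '0';
    }

    // `left` is an inherited attribute: the sum accumulated so far,
    // passed in as an argument. The return value is the synthesized total.
    int s(int left)
    {
        if (pos >= input.size() || input[pos] != '+')
            return left;        // epsilon: nothing more to add
        ++pos;                  // consume '+'
        int right = A();
        return s(left + right);
    }

    int S()
    {
        int total = s(A());
        if (pos != input.size())
            throw std::runtime_error("syntax error: unexpected trailing input");
        return total;
    }

    std::string input;
    std::size_t pos;
};
```

For example, SumParser("1+2+3").S() walks the input left to right and carries the running sum down through s as its argument.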

Note: do not collect all the tokens up front; fetch them as needed with get_next_token ().
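
One way to fetch tokens on demand is sketched below: a small lexer that scans the source text only when the parser asks for the next token. The class name Lexer, the methods next and peek, and the extra TK_END and TK_ERROR kinds are assumptions for illustration; the other token names follow the question.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum TokenKind { TK_LP, TK_RP, TK_PLUS, TK_MINUS, TK_STAR, TK_SLASH,
                 TK_INTEGER, TK_FLOAT, TK_END, TK_ERROR };

// Produces tokens one at a time from the source text instead of
// building the whole token list up front.
class Lexer
{
public:
    explicit Lexer(const std::string& source) : text(source), pos(0) {}

    // Scan and consume the next token.
    TokenKind next()
    {
        while (pos < text.size() && std::isspace(static_cast<unsigned char>(text[pos])))
            ++pos;
        if (pos >= text.size())
            return TK_END;

        char c = text[pos];
        if (std::isdigit(static_cast<unsigned char>(c)))
            return number();

        ++pos;
        switch (c)
        {
            case '(': return TK_LP;
            case ')': return TK_RP;
            case '+': return TK_PLUS;
            case '-': return TK_MINUS;
            case '*': return TK_STAR;
            case '/': return TK_SLASH;
            default:  return TK_ERROR;
        }
    }

    // Look at the next token without consuming it.
    TokenKind peek()
    {
        std::size_t saved = pos;
        TokenKind kind = next();
        pos = saved;
        return kind;
    }

private:
    // Digits, optionally followed by '.' and any further digits.
    TokenKind number()
    {
        while (pos < text.size() && std::isdigit(static_cast<unsigned char>(text[pos])))
            ++pos;
        if (pos < text.size() && text[pos] == '.')
        {
            ++pos;
            while (pos < text.size() && std::isdigit(static_cast<unsigned char>(text[pos])))
                ++pos;
            return TK_FLOAT;
        }
        return TK_INTEGER;
    }

    std::string text;
    std::size_t pos;
};
```

A recursive descent parser can call peek to choose between alternatives and next to consume, so a syntax error near the start of the input is reported without scanning the rest of it.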
