Consider this grammar:
expr ::= LP expr RP
| expr PLUS|MINUS expr
| expr STAR|SLASH expr
| term
term ::= INTEGER|FLOAT
A context-free grammar is defined as G = (V, Σ, R, S), where (in this case):
V = { expr, term }
Σ = { LP, RP, PLUS, MINUS, STAR, SLASH, INTEGER, FLOAT }
R = the production rules defined above
S = expr
Now let's define a small class called Parser, whose definition is as follows (code samples are in C++):
class Parser
{
public:
    Parser();
    void Parse();
private:
    void parseRecursive(vector<string> rules, unsigned ruleIndex, unsigned startingTokenIndex, string prevRule);
    bool isTerm(string token); // returns true if the token is in upper case
    vector<string> split(...); // input: string; output: vector of words split by delim
    map<string, vector<string>> ruleNames; // contains the grammar definition
    vector<int> tokenList; // our input sequence of tokens
};
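For reference, the two helpers declared above could look roughly like this. This is only a minimal sketch written as free functions: the bodies are assumptions based on the comments in the class declaration, not the actual implementation.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Returns true for a non-empty word that contains no lower-case letters,
// e.g. "LP" or "INTEGER" (assumed convention: terminals are upper case).
bool isTerm(const std::string& word)
{
    if (word.empty())
        return false;
    for (unsigned char c : word)
        if (std::islower(c))
            return false;
    return true;
}

// Splits text on a single-character delimiter and skips empty pieces.
std::vector<std::string> split(const std::string& text, char delim)
{
    std::vector<std::string> parts;
    std::istringstream in(text);
    std::string piece;
    while (std::getline(in, piece, delim))
        if (!piece.empty())
            parts.push_back(piece);
    return parts;
}
```

With these, split("expr PLUS|MINUS expr", ' ') gives three words, and splitting the middle word on '|' gives the two alternatives PLUS and MINUS.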
To make it easier to move between rules, every grammar rule is split into two parts: a key (before ::=) and its alternatives (after ::=). For the grammar above, this gives the following map:
std::map<string, vector<string>> ruleNames =
{
    { "expr", {
            "LP expr RP",
            "expr PLUS|MINUS expr",
            "expr STAR|SLASH expr",
            "term"
        }
    },
    { "term", { "INTEGER", "FLOAT" } }
};
For testing purposes, the string (2 + 3) * 4 has been tokenized into the following sequence
{ TK_LP, TK_INTEGER, TK_PLUS, TK_INTEGER, TK_RP, TK_STAR, TK_INTEGER }
and used as the input for Parser.
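The tokenization step is not shown here, so the following is only a sketch of how such a sequence could be produced. The TokenKind enum and the tokenize function are assumptions; only the TK_ names that appear in the sequence above come from the original, the rest are inferred from the grammar.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token kinds, one per terminal of the grammar.
enum TokenKind { TK_LP, TK_RP, TK_PLUS, TK_MINUS, TK_STAR, TK_SLASH, TK_INTEGER, TK_FLOAT };

// Converts the input text into a list of token kinds (values are dropped).
std::vector<int> tokenize(const std::string& text)
{
    std::vector<int> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isdigit(c)) {
            // Simplification: any run of digits and dots is one number.
            bool isFloat = false;
            while (i < text.size() &&
                   (std::isdigit(static_cast<unsigned char>(text[i])) || text[i] == '.')) {
                if (text[i] == '.') isFloat = true;
                ++i;
            }
            tokens.push_back(isFloat ? TK_FLOAT : TK_INTEGER);
            continue;
        }
        switch (c) {
            case '(': tokens.push_back(TK_LP); break;
            case ')': tokens.push_back(TK_RP); break;
            case '+': tokens.push_back(TK_PLUS); break;
            case '-': tokens.push_back(TK_MINUS); break;
            case '*': tokens.push_back(TK_STAR); break;
            case '/': tokens.push_back(TK_SLASH); break;
            default: throw std::invalid_argument("unexpected character in input");
        }
        ++i;
    }
    return tokens;
}
```

Under these assumptions, tokenize("(2 + 3) * 4") yields the seven token kinds listed above, in the same order.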
Now for the hardest part: the algorithm. From what I understand, it should go something like this:
1) Take the first rule from the starting symbol's vector (LP expr RP) and split it into words.
2) Check whether the first word in the rule is a terminal.
While I am not sure about this algorithm, I still tried to implement it (unsuccessfully, of course):
1) The outer Parse function, which initiates the recursion:
void Parser::Parse()
{
    unsigned startingTokenIndex = 0;
    string word = this->startingSymbol;
    for (unsigned ruleIndex = 0; ruleIndex < this->ruleNames[word].size(); ruleIndex++)
    {
        this->parseRecursive(this->ruleNames[word], ruleIndex, startingTokenIndex, "");
    }
}
2) Recursive function:
void Parser::parseRecursive(vector<string> rules, unsigned ruleIndex, unsigned startingTokenIndex, string prevRule)
{
    printf("%s - %s\n", rules[ruleIndex].c_str(), this->tokenNames[this->tokenList[startingTokenIndex]].c_str());
    vector<string> temp = this->split(rules[ruleIndex], ' ');
    vector<vector<string>> ruleWords;
    bool breakOutter = false;
    for (unsigned wordIndex = 0; wordIndex < temp.size(); wordIndex++)
    {
        ruleWords.push_back(this->split(temp[wordIndex], '|'));
    }
    for (unsigned wordIndex = 0; wordIndex < ruleWords.size(); wordIndex++)
    {
        breakOutter = false;
        for (unsigned subWordIndex = 0; subWordIndex < ruleWords[wordIndex].size(); subWordIndex++)
        {
            string word = ruleWords[wordIndex][subWordIndex];
            if (this->isTerm(word))
            {
                if (this->tokenNames[this->tokenList[startingTokenIndex]] == this->makeToken(word))
                {
                    printf("%s ", word.c_str());
                    startingTokenIndex++;
                } else {
                    breakOutter = true;
                    break;
                }
            } else {
                if (prevRule != word)
                {
                    startingTokenIndex++;
                    this->parseRecursive(this->ruleNames[word], 0, startingTokenIndex, word);
                    prevRule = word;
                }
            }
        }
        if (breakOutter)
            break;
    }
}
What changes should I make to my algorithm to make it work?
Different methods are used depending on whether you want to write a one-off parser by hand or build a compiler-compiler. Compiler-compilers mostly use LR parsing, while hand-written parsers mostly use LL. For LL, a hand-written implementation typically uses recursive descent: for each non-terminal, a function is created that implements it. For example, for the grammar:
S -> S + A | A
A -> a | b
Let us eliminate the left recursion and left-factor the grammar (LL parsers cannot handle left recursion):
S -> A s
s -> + A s | epsilon
A -> a | b
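One possible implementation:
void S (void)
{
    A ();
    s ();
}
void s (void)
{
    /* epsilon alternative: no '+', so the sum ends here.
       The inspected token is consumed either way; that is harmless
       in this grammar because nothing follows s. */
    if (get_next_token ()->value != '+')
        return;
    A ();
    s ();
}
void A (void)
{
    token * tok = get_next_token ();
    if (tok->value != 'a' && tok->value != 'b')
        syntax_error ();
}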
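The same technique can be applied to the expr grammar from the question. That grammar is left-recursive and ambiguous, so the sketch below assumes it is first rewritten into layers, with STAR and SLASH binding tighter than PLUS and MINUS:
expr -> mul ((PLUS | MINUS) mul)*
mul -> primary ((STAR | SLASH) primary)*
primary -> LP expr RP | INTEGER | FLOAT
The class name ExprRecognizer, the helper names, and the TokenKind enum are hypothetical. The code only checks whether the token list is a valid expr; it does not build a tree.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical token kinds, one per terminal of the grammar.
enum TokenKind { TK_LP, TK_RP, TK_PLUS, TK_MINUS, TK_STAR, TK_SLASH, TK_INTEGER, TK_FLOAT };

class ExprRecognizer
{
public:
    explicit ExprRecognizer(std::vector<int> tokens) : tokens_(std::move(tokens)) {}

    // True if the whole token list forms exactly one expr.
    bool accepts()
    {
        pos_ = 0;
        return parseExpr() && pos_ == tokens_.size();
    }

private:
    // Looks at the current token without consuming it.
    bool peekIs(int kind) const { return pos_ < tokens_.size() && tokens_[pos_] == kind; }

    // Consumes the current token only if it has the given kind.
    bool match(int kind)
    {
        if (!peekIs(kind)) return false;
        ++pos_;
        return true;
    }

    // expr -> mul ((PLUS | MINUS) mul)*
    bool parseExpr()
    {
        if (!parseMul()) return false;
        while (match(TK_PLUS) || match(TK_MINUS))
            if (!parseMul()) return false;
        return true;
    }

    // mul -> primary ((STAR | SLASH) primary)*
    bool parseMul()
    {
        if (!parsePrimary()) return false;
        while (match(TK_STAR) || match(TK_SLASH))
            if (!parsePrimary()) return false;
        return true;
    }

    // primary -> LP expr RP | INTEGER | FLOAT
    bool parsePrimary()
    {
        if (match(TK_LP)) return parseExpr() && match(TK_RP);
        return match(TK_INTEGER) || match(TK_FLOAT);
    }

    std::vector<int> tokens_;
    std::size_t pos_ = 0;
};
```

Each function inspects only the current token before deciding which alternative to take, so no backtracking over rule indices is needed. The loops replace the left-recursive alternatives of the original grammar.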
If you want to add a syntax-directed definition (SDD), the inherited attributes are passed in as function arguments and the synthesized attributes are returned as output values.
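As a rough illustration of attributes in a recursive-descent parser, here is a sketch for the grammar S -> A s, s -> + A s | epsilon, A -> a | b that computes a sum. The values (a counts as 1, b counts as 2), the SumParser name, and reading the input directly from a string are all assumptions made for the example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

struct SumParser
{
    std::string input;
    std::size_t pos = 0;

    explicit SumParser(std::string text) : input(std::move(text)) {}

    // A -> a | b
    // Synthesized attribute: the value of the letter, returned to the caller.
    int A()
    {
        if (pos < input.size() && input[pos] == 'a') { ++pos; return 1; }
        if (pos < input.size() && input[pos] == 'b') { ++pos; return 2; }
        throw std::runtime_error("syntax error: expected 'a' or 'b'");
    }

    // s -> + A s | epsilon
    // Inherited attribute: the sum so far, passed in as an argument.
    int s(int sumSoFar)
    {
        if (pos >= input.size() || input[pos] != '+')
            return sumSoFar;           // epsilon: nothing is consumed
        ++pos;                         // consume '+'
        int value = A();
        return s(sumSoFar + value);
    }

    // S -> A s
    int S()
    {
        int first = A();
        int total = s(first);
        if (pos != input.size())
            throw std::runtime_error("syntax error: trailing input");
        return total;
    }
};
```

For example, SumParser("a+b+a").S() returns 4 under these assumed values.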
Note: do not collect all the tokens up front; fetch them as needed with get_next_token ().
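One way to fetch tokens on demand is a small lexer object that scans the input only when the parser asks for the next token. The sketch below is an assumption about how that could look for single-character tokens such as a, b and +; the LazyLexer name and its peek/next methods are hypothetical, with next playing the role of get_next_token.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

class LazyLexer
{
public:
    explicit LazyLexer(std::string text) : text_(std::move(text)) {}

    // Returns the next token without consuming it ('\0' at end of input).
    char peek()
    {
        skipSpaces();
        return pos_ < text_.size() ? text_[pos_] : '\0';
    }

    // Returns the next token and advances past it.
    char next()
    {
        char c = peek();
        if (c != '\0')
            ++pos_;
        return c;
    }

private:
    void skipSpaces()
    {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_])))
            ++pos_;
    }

    std::string text_;
    std::size_t pos_ = 0;
};
```

Having peek alongside next lets a parsing function test for the epsilon alternative without consuming a token that belongs to its caller.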