How to get tokens in Jison?

I'm using Jison for a college project, and I need to make a switch for each recognized token, so I can present to the professor something like:

<identifier, s>
<operator, =>
<identifier, a>
<operator, +>
<identifier, b>

Is there any way to get this done without resorting to writing the regular expressions manually? (I mean, Jison uses regexps internally, but that's not my business.)

What I tried doing is the following:

var lex = parser.lexer,
    token;
lex.setInput('The code to parse');
while (!lex.done) {
    token = lex.next();
}

But the only thing that gets saved in token is a number, and when a symbol is not defined in the grammar, the lexer returns it one character at a time.

Thanks in advance.

(Warning: Some of this answer was derived by examining code generated by jison. Since the interfaces are not well defined, it may not stand the test of time.)

parser.lexer.next() is not part of the documented lexer interface, although the lexical analyzer produced by jison does appear to implement it. Note that it does not produce a token if the input consumed corresponds to a lexical rule which does not produce one (for example, a rule which ignores whitespace). It is better to use the documented interface, parser.lexer.lex(), which always produces a token.
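To make the difference concrete, here is a minimal sketch of how lex() appears to relate to next(), judging from the generated code: it keeps calling next() until something other than "no token" comes back. The makeStubLexer function is a hypothetical hand-written stand-in, not jison's real lexer, and its token names are invented for illustration.

```javascript
// Hypothetical stand-in for a jison-generated lexer; NOT jison's real code.
// It recognizes identifiers and skips whitespace without producing a token.
function makeStubLexer(input) {
    return {
        rest: input,
        yytext: '',
        done: false,
        // Like next(): consumes one lexeme and may return undefined.
        next: function () {
            if (this.rest === '') {
                this.done = true;
                return 'EOF';
            }
            var ws = /^\s+/.exec(this.rest);
            var id = /^[A-Za-z][A-Za-z0-9]*/.exec(this.rest);
            var text = ws ? ws[0] : id ? id[0] : this.rest.charAt(0);
            this.rest = this.rest.slice(text.length);
            this.yytext = text;
            if (ws) {
                return undefined;   // whitespace rule matched, no token produced
            }
            return id ? 'IDENTIFIER' : 'INVALID';
        },
        // Like lex(): keeps calling next() until a token comes back.
        lex: function () {
            var token = this.next();
            while (!token) {
                token = this.next();
            }
            return token;
        }
    };
}

// Same input, once through next() and once through lex().
var viaNext = [], viaLex = [];
var a = makeStubLexer('s a');
while (!a.done) { viaNext.push(a.next()); }
var b = makeStubLexer('s a');
while (!b.done) { viaLex.push(b.lex()); }

console.log(viaNext); // contains an undefined entry for the whitespace
console.log(viaLex);  // contains only tokens
```

With next(), the caller has to filter out the empty results itself; with lex(), every call yields a token.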

Strictly speaking, parser.lexer.lex() is documented as returning the name of a terminal, but for efficiency the lexical analyzers generated by jison will return the internal numerical code for the terminal if jison is able to figure out which terminal the lexical rule will return. So you have a couple of alternatives, if you want to trace the actual names of the terminals recognized:

  1. You can defeat this optimization by avoiding the use of the form return <string>. For example, if you change the lexical rule:

     [A-Za-z][A-Za-z0-9]* { return 'IDENTIFIER'; } 

    to

     [A-Za-z][A-Za-z0-9]* { return '' + 'IDENTIFIER'; } 

    then the generated lexical analyzer will return the string 'IDENTIFIER' rather than some numeric code.

  2. Alternatively, you can use parser.terminals_, which according to the comment at the top of the generated parser has the form terminals_: {associative list: number ==> name}, to look up the terminal name given the token number.
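The lookup in the second alternative can be sketched as a small helper. The exampleTerminals table and the terminalName function are hypothetical names used only for illustration; the real numeric codes depend on your grammar and should be read from parser.terminals_.

```javascript
// Hypothetical table in the shape the generated parser's comment describes:
// numeric code -> terminal name. Real codes depend on the grammar.
var exampleTerminals = { 2: 'error', 5: 'EOF', 6: 'IDENTIFIER', 7: '=', 8: '+' };

// Returns the terminal name for a numeric code. Anything not found in the
// table (for example a name already returned as a string) is passed through.
function terminalName(terminals, token) {
    return Object.prototype.hasOwnProperty.call(terminals, token)
        ? terminals[token]
        : token;
}

console.log(terminalName(exampleTerminals, 6));            // IDENTIFIER
console.log(terminalName(exampleTerminals, 'IDENTIFIER')); // IDENTIFIER
```

Passing unknown values through unchanged means the helper works whether the lexer returns a numeric code or already returns the name as a string.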

To get the source character string associated with the lexeme, use parser.lexer.yytext.

Here's a solution using the second alternative:

/* To reduce confusion, I change 'lex' to 'lexer' */
var lexer = parser.lexer,
    token;
lexer.setInput('The code to parse');
while (!lexer.done) {
    token = lexer.lex();
    /* Look up the token name if necessary */
    if (token in parser.terminals_) {
        token = parser.terminals_[token];
    }
    console.log('<' + token + ', ' + lexer.yytext + '>');
}
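If you would rather collect the pairs than print them right away (for example, to run a switch over the terminal names afterwards), the same loop can be wrapped in a function. This is only a sketch: collectTokens and stubParser are hypothetical names, and the stub merely imitates the parts of a generated parser that the loop touches (lexer.setInput, lexer.lex, lexer.done, lexer.yytext and terminals_). With a real grammar you would pass your generated parser instead.

```javascript
// Runs the lexer to completion and returns [name, text] pairs.
// `parser` is assumed to look like a jison-generated parser.
function collectTokens(parser, source) {
    var lexer = parser.lexer,
        pairs = [],
        token;
    lexer.setInput(source);
    while (!lexer.done) {
        token = lexer.lex();
        /* Look up the token name if necessary */
        if (token in parser.terminals_) {
            token = parser.terminals_[token];
        }
        pairs.push([token, lexer.yytext]);
    }
    return pairs;
}

// Hypothetical stub standing in for a generated parser, for illustration only.
// It splits the input on spaces and returns made-up numeric codes.
var stubParser = {
    terminals_: { 5: 'EOF', 6: 'IDENTIFIER', 7: 'OPERATOR' },
    lexer: {
        setInput: function (s) {
            this.queue = s.split(' ');
            this.done = false;
        },
        lex: function () {
            if (this.queue.length === 0) {
                this.done = true;
                this.yytext = '';
                return 5;
            }
            this.yytext = this.queue.shift();
            return /^[A-Za-z]/.test(this.yytext) ? 6 : 7;
        }
    }
};

collectTokens(stubParser, 's = a + b').forEach(function (pair) {
    console.log('<' + pair[0] + ', ' + pair[1] + '>');
});
```

In this stub the final pair is the end-of-input token with an empty text; whether and how a real lexer reports end of input depends on the grammar, so you may want to filter that entry out before presenting the list.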
