Does the Peg.js engine backstep after a lookahead like regexes do?

According to regular-expressions.info on lookarounds, the engine backsteps after a lookahead:

Let's take one more look inside, to make sure you understand the implications of the lookahead. Let's apply q(?=u)i to quit. The lookahead is now positive and is followed by another token. Again, q matches q and u matches u. Again, the match from the lookahead must be discarded, so the engine steps back from i in the string to u. The lookahead was successful, so the engine continues with i. But i cannot match u. So this match attempt fails. All remaining attempts fail as well, because there are no more q's in the string.
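The quoted behaviour can be checked quickly with Python's `re` module, used here only as a stand-in for "a regex engine" (it is not what regular-expressions.info or Peg.js uses). The variable names are illustrative:

```python
import re

# (?=u) checks that 'u' comes next but does not consume it,
# so the following token 'i' is tried against 'u' and fails.
no_match = re.search(r"q(?=u)i", "quit")

# If the 'u' is consumed explicitly after the lookahead, the match goes through.
match = re.search(r"q(?=u)uit", "quit")

print(no_match)
print(match.group(0))
```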

However, in Peg.js it SEEMS like the engine still moves past the & or !, so that it isn't a lookahead in the same sense as in regexps but rather a decision on consumption. There would then be no backstepping, and therefore no true looking ahead.

Is this the case?

(If so, then certain parses aren't even possible, like this one?)

Lookahead works similarly to how it does in a regex engine.

This rule fails to match quit: the lookahead &'u' succeeds without consuming anything, so the next character to match is still 'u', not 'i'.

word = 'q' &'u' 'i' 't'

This query succeeds:

word = 'q' &'u' 'u' 'i' 't'

This query succeeds:

word = 'q' 'u' 'i' 't'
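To make the three cases above concrete, here is a minimal Python sketch of PEG-style matching. It is not how Peg.js is implemented; the helper names (`lit`, `and_pred`, `seq`, `matches`) are made up for illustration. Each parser takes the input and a position and returns the new position, or None on failure. The point is that the `&` predicate returns the position it started from.

```python
def lit(s):
    """Match the literal string s and advance past it."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def and_pred(p):
    """PEG '&': succeed if p matches here, but consume nothing."""
    def parse(text, pos):
        return pos if p(text, pos) is not None else None
    return parse

def seq(*parsers):
    """Match each parser in order, threading the position through."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def matches(p, text):
    """True only if p consumes the whole input."""
    return p(text, 0) == len(text)

# word = 'q' &'u' 'i' 't'
word_a = seq(lit("q"), and_pred(lit("u")), lit("i"), lit("t"))
# word = 'q' &'u' 'u' 'i' 't'
word_b = seq(lit("q"), and_pred(lit("u")), lit("u"), lit("i"), lit("t"))
# word = 'q' 'u' 'i' 't'
word_c = seq(lit("q"), lit("u"), lit("i"), lit("t"))

print(matches(word_a, "quit"), matches(word_b, "quit"), matches(word_c, "quit"))
```

In this sketch `word_a` fails on quit because, after `and_pred` hands back the unchanged position, `lit("i")` is tried against the 'u'.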

As for your example, try something along these lines; you shouldn't need to use lookaheads at all:

expression
    = termPair ( _ delimiter _ termPair )*

termPair
    = term ('.' term)? ' ' term ('.' term)?

term "term"
    = $([a-z0-9]+)

delimiter "delimiter"
    = "."

_ "whitespace"
    = [ \t\n\r]+

EDIT: Added another example per comments below.

expression
    = first:term rest:delimTerm* { return [first].concat(rest); }

delimTerm
    = delimiter t:term { return t; }

term "term"
    = $((!delimiter [a-z0-9. ])+)

delimiter "delimiter"
    = _ "." _

_ "whitespace"
    = [ \t\n\r]+

EDIT: Added extra explanation of the term expression.

I'll try to break down the term rule, $((!delimiter [a-z0-9. ])+), a bit.

$() returns everything matched inside it as a single string, much like calling [].join('') on the matched pieces.

A single "character" of a term is any character matching [a-z0-9. ]; if we wanted to simplify it, we could use . (any character) instead. Before matching each character we look ahead for a delimiter; if we find one, we stop matching at that point. Since we want multiple characters, we repeat the whole thing with +.

I think it's a common idiom in PEG parsers to move forward this way. I learned the idea from the Treetop documentation for matching a string.
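The same "check for a delimiter, then take one character, and repeat" loop can be written out by hand. The following is a rough Python sketch of the second grammar, not generated parser output; the function names mirror the rule names, and the sample input used in the test is invented for illustration.

```python
import re

WS = re.compile(r"[ \t\n\r]+")      # _ "whitespace"
CHAR = re.compile(r"[a-z0-9. ]")    # one character of a term

def delimiter(text, pos):
    """delimiter = _ "." _ ; returns the position after it, or None."""
    m = WS.match(text, pos)
    if not m or not text.startswith(".", m.end()):
        return None
    m = WS.match(text, m.end() + 1)
    return m.end() if m else None

def term(text, pos):
    """term = $((!delimiter [a-z0-9. ])+) ; returns (string, new_pos) or None."""
    start = pos
    while pos < len(text):
        if delimiter(text, pos) is not None:   # !delimiter failed: stop here
            break
        if not CHAR.match(text, pos):          # not an allowed character
            break
        pos += 1                               # consume exactly one character
    return (text[start:pos], pos) if pos > start else None

def expression(text):
    """expression = first:term rest:delimTerm* ; returns a list of terms or None."""
    result = term(text, 0)
    if result is None:
        return None
    terms, pos = [result[0]], result[1]
    while True:
        after = delimiter(text, pos)
        nxt = term(text, after) if after is not None else None
        if nxt is None:
            break
        terms.append(nxt[0])
        pos = nxt[1]
    return terms if pos == len(text) else None

print(expression("a.b c.d . e.f g"))
```

Note that the delimiter check only peeks: `term` never advances past the delimiter itself, which is left for `expression` to consume.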
