
Parsing context-free languages in a stream of tokens

The problem

Given a context-free grammar with arbitrary rules and a stream of tokens, how can stream fragments that match the grammar be identified effectively?

Example:

Grammar

S -> ASB | AB
A -> a 
B -> b

(So essentially, a number of a's followed by an equal number of b's.)

Stream:

aabaaabbc...

Expected result (positions are 0-based):

  1. Match starting at position 1: ab
  2. Match starting at position 4: aabb

Of course the key is "effectively": without testing too many hopeless candidates for too long. The only thing I know about my data is that, although the grammar is arbitrary, in practice matching sequences will be relatively short (<20 terminals) while the stream itself will be quite long (>10000 terminals).

Ideally I'd also want a syntax tree but that's not too important, because once the fragment is identified, I can run an ordinary parser over it to obtain the tree.

Where should I start? Which type of parser can be adapted to this type of work?

"Arbitrary grammar" makes me suggest you look at wberry's comment.

How complex are these grammars? Is there a manual intervention step?

I'll make an attempt. If I modified your example grammar from:

S -> ASB | AB
A -> a 
B -> b

to include:

S' -> S | GS' | S'GS' | S'G
G -> sigma*

so that G is garbage and S' is many S fragments with garbage in between (I may have been careless with my production rules, but you get the idea), then I think we can solve your problem. You just need a parser that tries the other rules before G. You may have to adapt these production rules to the parser you use, and I can almost guarantee that the rule ordering will need to change from one parser to another. Since most parser libraries separate lexing from parsing, you'll probably need a catch-all lexeme, and G will have to be modified to include all possible lexemes. Depending on your specifics, this might be no more efficient than simply starting a fresh attempt at each position in the stream.
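For comparison, the baseline of starting an attempt at each position in the stream can be sketched as below. This is only an illustration, not a tuned solution: it runs a small Earley-style recognizer from every start position, gives up as soon as no rule can consume the next token, and never looks further than `max_len` tokens (20, taken from the bound in the question). The names `GRAMMAR`, `longest_match` and `find_matches` are made up for this example, and the sketch assumes the grammar has no empty productions.

```python
from typing import Dict, List, Sequence, Set, Tuple

Grammar = Dict[str, List[Tuple[str, ...]]]
Item = Tuple[str, Tuple[str, ...], int, int]  # (lhs, rhs, dot, origin)

# The example grammar from the question; keys are nonterminals.
GRAMMAR: Grammar = {
    "S": [("A", "S", "B"), ("A", "B")],
    "A": [("a",)],
    "B": [("b",)],
}


def _close(chart: List[Set[Item]], k: int, grammar: Grammar) -> None:
    """Apply Earley prediction and completion to chart[k] until it is stable.

    Assumption: no empty productions, so a completed item always has
    origin < k and chart[origin] is already final.
    """
    work = list(chart[k])
    while work:
        lhs, rhs, dot, origin = work.pop()
        if dot < len(rhs):
            sym = rhs[dot]
            # Predict: expand a nonterminal that is expected next.
            new = [(sym, alt, 0, k) for alt in grammar.get(sym, [])]
        else:
            # Complete: advance every item that was waiting for `lhs`.
            new = [(l, r, d + 1, o) for (l, r, d, o) in chart[origin]
                   if d < len(r) and r[d] == lhs]
        for item in new:
            if item not in chart[k]:
                chart[k].add(item)
                work.append(item)


def longest_match(tokens: Sequence[str], start: int, grammar: Grammar,
                  root: str = "S", max_len: int = 20) -> int:
    """Length of the longest match of `root` beginning at `start` (0 if none)."""
    chart: List[Set[Item]] = [{(root, alt, 0, 0) for alt in grammar[root]}]
    _close(chart, 0, grammar)
    best = 0
    for k in range(max_len):
        if start + k >= len(tokens):
            break
        tok = tokens[start + k]
        # Scan: advance items that expect exactly this terminal.
        nxt = {(l, r, d + 1, o) for (l, r, d, o) in chart[k]
               if d < len(r) and r[d] == tok and tok not in grammar}
        if not nxt:
            break  # dead prefix: stop testing this hopeless candidate
        chart.append(nxt)
        _close(chart, k + 1, grammar)
        if any(l == root and d == len(r) and o == 0
               for (l, r, d, o) in chart[k + 1]):
            best = k + 1
    return best


def find_matches(tokens: Sequence[str], grammar: Grammar,
                 root: str = "S", max_len: int = 20):
    """Scan left to right, reporting non-overlapping longest matches."""
    matches, pos = [], 0
    while pos < len(tokens):
        n = longest_match(tokens, pos, grammar, root, max_len)
        if n:
            matches.append((pos, tokens[pos:pos + n]))
            pos += n
        else:
            pos += 1
    return matches


print(find_matches("aabaaabbc", GRAMMAR))  # [(1, 'ab'), (4, 'aabb')]
```

Because each attempt stops at the first token no rule can consume and is capped at `max_len`, the work per start position stays bounded regardless of the stream length. Skipping past a reported match is a design choice; advancing by one instead would also report matches nested inside longer ones.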

That said, assuming my production rules are corrected (both for correctness and for the particular flavor of parser), this should not only match fragments in the stream but also give you a parse tree for the whole stream. You are then only interested in the subtrees rooted at nodes of type S.
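To make the "try S before garbage" idea concrete, here is a minimal hand-written sketch for the example grammar only. It swaps the ambiguous S' rules for a simpler loop, roughly Stream -> (S | G)*, where G is a single garbage token that is accepted only after S has failed at that position (PEG-style ordered choice, which is my substitution, not something a particular parser library provides). The function names `parse_S` and `parse_stream`, the tuple-based tree shape, and the `limit` token budget are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Tree = tuple  # e.g. ("S", ("A", "a"), ("B", "b"))


def parse_S(tokens: Sequence[str], i: int,
            limit: int = 20) -> Optional[Tuple[Tree, int]]:
    """Try S -> A S B, then S -> A B, at index i.

    `limit` is the number of tokens this S may consume; it also bounds
    the recursion depth. Returns (tree, next_index) or None.
    """
    if limit < 2 or i >= len(tokens) or tokens[i] != "a":
        return None
    inner = parse_S(tokens, i + 1, limit - 2)  # first choice: S -> A S B
    if inner is not None:
        tree, j = inner
        if j < len(tokens) and tokens[j] == "b":
            return ("S", ("A", "a"), tree, ("B", "b")), j + 1
    if i + 1 < len(tokens) and tokens[i + 1] == "b":  # second choice: S -> A B
        return ("S", ("A", "a"), ("B", "b")), i + 2
    return None


def parse_stream(tokens: Sequence[str]) -> List[Tuple[int, Tree]]:
    """Parse the whole stream into (position, node) pairs.

    Each node is either an S subtree or a one-token garbage node ("G", tok).
    """
    nodes: List[Tuple[int, Tree]] = []
    i = 0
    while i < len(tokens):
        hit = parse_S(tokens, i)
        if hit is not None:
            tree, j = hit
            nodes.append((i, tree))
            i = j
        else:
            nodes.append((i, ("G", tokens[i])))  # garbage is the last resort
            i += 1
    return nodes


nodes = parse_stream("aabaaabbc")
s_nodes = [(pos, tree) for pos, tree in nodes if tree[0] == "S"]
print([pos for pos, _ in s_nodes])  # [1, 4]
```

Filtering the result for nodes tagged "S" gives the fragments of interest together with their syntax trees; the "G" nodes are the garbage in between.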
