
Regular expression: language generator

Given a regular expression in C#, is there a way to generate a word that is accepted by this regular expression?

For instance, let's consider:

[ab]c*b*

Is there a function that can automatically generate an enumeration like:

a
b
ac
ab
bc
bb
acb
bcb
acc
bcc
...

Obviously, since this list is infinite and its words can be arbitrarily long, the generator would have to be smart enough to output words from the simplest to the most complex without getting trapped in infinite loops.
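As a baseline for the "simplest first" ordering described above, here is a minimal brute-force sketch in Python (chosen only for illustration; the thread itself contains no code). It assumes the alphabet is known up front and simply tests every candidate word, shortest first, so its cost grows exponentially with word length. The function name `words_by_length` is made up for this example.

```python
import re
from itertools import count, islice, product


def words_by_length(pattern, alphabet):
    """Yield every word over `alphabet` that fully matches `pattern`, shortest first."""
    compiled = re.compile(pattern)
    for length in count(0):  # 0, 1, 2, ...; never stops on its own
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            if compiled.fullmatch(word):
                yield word


# The generator is infinite, so take only the first ten words.
first_ten = list(islice(words_by_length(r"[ab]c*b*", "abc"), 10))
print(first_ten)
```

Because the loop over lengths never ends, the caller decides how many words to take; the generator never gets stuck inside a single starred sub-expression.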

I think this would be a useful tool for validating regular expressions. In general it's easy to see that a regular expression accepts the words you planned for it to accept. It's usually much more difficult to see what other words it would accept.

EDIT: This question is not about how to do it, but rather: is there anything out there that I could use to do it in C#?

This isn't even a C#-specific question; I think you can do this with any true regex.

It seems to me that you should be able to tell a generation story for any regex match as just a list of rewrites. In your example, [ab]c*b* can generate acccbbb: that's [ab]c*b* -> ac*b* -> acccb* -> acccbbb. For each operator we can imagine enumerating all the ways it rewrites; then it's just a question of enumerating all combinations of rewrites, which boils down to enumerating all the N-tuples of naturals.

edit: N-tuples of naturals is a glib comparison. But you could imagine essentially performing a breadth-first traversal over rewrite states, outputting each string once all of its operators have been rewritten away.
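A rough Python sketch of that breadth-first traversal over rewrite states might look like the following. It is not a general regex engine: it assumes the pattern has already been flattened by hand into a list of `(choices, starred)` items, which is enough for [ab]c*b* but not for nested groups. The names `PATTERN` and `enumerate_rewrites` are invented for this illustration.

```python
from collections import deque
from itertools import islice

# Hypothetical hand-written form of [ab]c*b*: one (choices, starred) pair per item.
PATTERN = [("ab", False), ("c", True), ("b", True)]


def enumerate_rewrites(items):
    """Breadth-first search over rewrite states (text so far, index of next item)."""
    start = ("", 0)
    queue = deque([start])
    seen = {start}  # patterns such as a*a* can reach the same state twice

    def push(state):
        if state not in seen:
            seen.add(state)
            queue.append(state)

    while queue:
        text, i = queue.popleft()
        if i == len(items):  # every operator has been rewritten away
            yield text
            continue
        choices, starred = items[i]
        if starred:
            push((text, i + 1))  # rewrite x* to nothing
            for ch in choices:
                push((text + ch, i))  # rewrite x* to x x*
        else:
            for ch in choices:
                push((text + ch, i + 1))  # pick one alternative


first_words = list(islice(enumerate_rewrites(PATTERN), 6))
print(first_words)
```

Words come out ordered by the number of rewrites needed rather than strictly by length, and the queue keeps growing for as long as the generator is consumed.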

I don't know how to do this in C#, but in theory yes, it can be done.

You need to convert your regular expression to an NFA or DFA graph, then traverse it with a BFS that keeps track of the current path, adding a new character to the path for each edge and printing the current path whenever an accepting node is hit. Depending on the regular expression at hand, your memory usage can easily grow exponentially.

For example, given the regular expression (a|b)*abb, we can create an NFA graph like the following:

[Figure: NFA for (a|b)*abb]

This NFA graph can be used both to recognize a word and to enumerate all possible words. We do that by traversing the graph nondeterministically; that is, we need to keep track of all possible paths in the graph.

Starting at node 0 we do a BFS, and for each node that has two or more outgoing edges we create a new nondeterministic path. The BFS visits the nodes in the following order:

0, 1, 7, 2, 4, 8, 3, 5, 9, 6, 6, 10, 1, 1, 7, ...

For each node visited, the intermediate path so far is:

  • 0, ""
  • 1, "e"
  • 7, "e"
  • 2, "ee"
  • 4, "ee"
  • 8, "ea"
  • 3, "eea"
  • 5, "eeb"
  • 9, "eab"
  • 6, "eeae"
  • 6, "eebe"
  • 10, "eabb"
  • 1, "eeaee"
  • 1, "eebee"

The "e" symbol is the epsilon letter representing the empty string "", which should be filtered out when printing each word.

By doing a BFS over the graph, we order the words by the number of edges the NFA has to traverse to recognize them. Since the graph contains a cycle, this procedure will never finish.

Each time a nondeterministic path reaches the final node 10, we print the generated string:

  • "abb"
  • "aabb"
  • "babb"
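The walk above can be sketched in Python as follows. The edge table is an assumption: the original figure is not reproduced here, so the edges follow the usual textbook numbering for (a|b)*abb, chosen to be consistent with the paths listed above. The names `NFA`, `EPS` and `enumerate_words` are made up for this example.

```python
from collections import deque
from itertools import islice

EPS = "e"  # stands for the empty string; only safe because the alphabet is {a, b}

# Assumed edges for (a|b)*abb: node -> list of (label, target node).
NFA = {
    0: [(EPS, 1), (EPS, 7)],
    1: [(EPS, 2), (EPS, 4)],
    2: [("a", 3)],
    3: [(EPS, 6)],
    4: [("b", 5)],
    5: [(EPS, 6)],
    6: [(EPS, 1), (EPS, 7)],
    7: [("a", 8)],
    8: [("b", 9)],
    9: [("b", 10)],
    10: [],
}


def enumerate_words(nfa, start, accept):
    """BFS over (node, path) pairs; yields a word whenever `accept` is reached."""
    queue = deque([(start, "")])
    while queue:
        node, path = queue.popleft()
        if node == accept:
            yield path.replace(EPS, "")  # filter out the epsilon markers
        for label, target in nfa[node]:
            queue.append((target, path + label))


first_three = list(islice(enumerate_words(NFA, 0, 10), 3))
print(first_three)
```

The queue holds one entry per live nondeterministic path, which is where the memory growth mentioned earlier comes from. For other NFAs the same word may be reachable along several paths, so a set of already-printed words may be needed to suppress duplicates.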
