
Creating string set from given regular expression in language L

I'm trying to generate a sequence of words over an alphabet (given by the user) according to a regular expression (also given by the user), but I couldn't make it work.

Example scenario 1:

Alphabet = [a,b,c]

Regex = (a+c)b*

Word Count = 6

Words = ["a", "c", "ab", "cb", "abb", "cbb"]

Example scenario 2:

Alphabet = [a,b]

Regex = (a+b)*a

Word Count = 3

Words = ["a", "aa", "ba"]

I tried converting the regex to postfix/infix and going from there, but I couldn't build the engine algorithm.

Basically there are 3 operations:

Union (+)
Concat ()
Closure (*)

I wrote one function per operator type:

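#include <stdio.h>

static int remainingWordCount;

/* "union" is a reserved word in C, so the function is named union_op. */
void union_op(const char *x, char y)
{
    printf("%s\n%c\n", x, y);

    remainingWordCount -= 2;
}

void concat(const char *x, const char *y)
{
    printf("%s%s\n", x, y);
    remainingWordCount--;
}

void closure(const char *x, const char *y)
{
    while (remainingWordCount > 0)
    {
        concat(x, y);
    }
}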

It only works in the most basic scenarios.

So my question is: how can I create a set of strings according to a given regex without using any regex library? Are there any known algorithms for that?

The "brute force" solution is to parse the regular expression into the finite-state machine's state transition graph, with each state having a list of transitions, and each transition having the associated symbol (character) and the next state. You can use a state with no transitions to indicate the terminal state.

Then traverse this graph, remembering the string produced by the transitions. When you reach a terminal state, print the word and decrement the remaining word count; stop when it reaches zero. If different paths through the graph can produce the same word, you also need to remember the words already output, and skip printing/decrementing when a word has been seen before.

Process the paths in sorted order (such that shorter paths come before longer ones with the same prefix, i.e., as per strcmp in C). This avoids getting stuck in a loop, and gives the order you want.

For example, for the regular expression a* (pseudocode):

state[0] = { {'a', 0}, {'\0', 1} };
state[1] = { }; // terminal state
paths = { { .state = 0, .string = "" } }; // set initial state

You start with the only path you have, at state 0, and append to it (separately) both transitions from state 0 to make the new paths:

paths = { { .state = 1, .string = "" },
          { .state = 0, .string = "a" } };

Since the path with the empty string is ordered first (due to the empty string being sorted before any non-empty string), it is handled first, and since it is in a terminal state with no transitions, it prints the empty string and decrements word count. Then you take the other path and add the transitions from state 0 again, ending up with:

paths = { { .state = 1, .string = "a" },
          { .state = 0, .string = "aa" } };

etc.

(Disclaimer: This is completely untested, off the top of my head, and there could be corner cases that I didn't think of. Also be aware that the number of paths will explode for non-trivial regular expressions.)
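A minimal C sketch of the traversal described above, using the same a* state table. The type names, the fixed capacities (MAX_LEN, MAX_PATHS, MAX_TRANS) and the function generate_words are assumptions made for illustration; they are not part of the original answer. The pending paths are kept in a small array and the strcmp-smallest one is picked by linear scan, which is the simplest way to get the sorted processing order.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEN   32   /* longest word, including the NUL, this sketch can hold */
#define MAX_PATHS 64   /* capacity of the pending-path list */
#define MAX_TRANS 4    /* transitions per state */

typedef struct { char symbol; int next; } Transition;          /* '\0' = no symbol */
typedef struct { int count; Transition t[MAX_TRANS]; } State;  /* count 0 = terminal */
typedef struct { int state; char string[MAX_LEN]; } Path;

/* Writes up to max_words distinct words into out[]; returns how many. */
static int generate_words(const State *states, int start, int max_words,
                          char out[][MAX_LEN])
{
    Path paths[MAX_PATHS];
    int npaths = 0, nwords = 0;

    paths[npaths].state = start;
    paths[npaths].string[0] = '\0';
    npaths++;

    while (npaths > 0 && nwords < max_words) {
        /* Pick the pending path whose string sorts first under strcmp. */
        int best = 0;
        for (int i = 1; i < npaths; i++)
            if (strcmp(paths[i].string, paths[best].string) < 0)
                best = i;
        Path cur = paths[best];
        paths[best] = paths[--npaths];           /* remove it from the list */

        const State *s = &states[cur.state];
        if (s->count == 0) {                     /* terminal state: emit the word */
            int seen = 0;
            for (int i = 0; i < nwords; i++)
                if (strcmp(out[i], cur.string) == 0)
                    seen = 1;
            if (!seen)
                strcpy(out[nwords++], cur.string);
            continue;
        }
        for (int i = 0; i < s->count; i++) {     /* extend by every transition */
            size_t len = strlen(cur.string);
            assert(npaths < MAX_PATHS && len + 2 <= MAX_LEN);
            Path next = cur;
            next.state = s->t[i].next;
            if (s->t[i].symbol != '\0') {
                next.string[len] = s->t[i].symbol;
                next.string[len + 1] = '\0';
            }
            paths[npaths++] = next;
        }
    }
    return nwords;
}
```

Called with the two-state table for a* and a word count of 3, this fills the output with the empty string, "a" and "aa". Note that pure strcmp ordering on a regex such as (a+c)b* would keep producing "a", "ab", "abb", ... and never reach "c", so a length-first ordering may be closer to the examples in the question.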

The basic algorithm is simple (if you know how to do all the pieces):

  1. Construct a DFA from the regular expression. (Constructing an NFA is not sufficient because the NFA will produce duplicate strings if the regular expression is not deterministic.) There are several ways of doing this; you'll probably find a longer exposition in your formal languages textbook, if you have one.

  2. Do an ordered walk ("best-first search") of the (infinite) graph generated from the DFA, where each node is a pair (state, prefix) and edges correspond to transitions in the DFA. During the walk, if a node is encountered whose state is accepting, produce its prefix.

That basic algorithm will work for any ordering relationship between strings with the prefix property: any proper prefix of a string is guaranteed to be less than the string. (If that's not the case, it is possible that there is no "least" element in a set of strings. For example, if the ordering relationship puts a string before any prefix of that string, but is otherwise lexicographic, then any string in a* is preceded by the next longer string, which will produce an infinite loop.)

It's important to note that the state in the node is only for convenience; it is computationally redundant because it could be regenerated by passing the prefix through the DFA. As a consequence, the graph never contains two different nodes with the same prefix. The corollary of this observation is that it is not necessary to maintain a set of "seen" nodes, because the successor sets of two distinct nodes are disjoint.

In order to implement the ordered search in step 2, it is necessary to know the least accepted successor of each node in the graph, which is not necessarily the immediate successor with the least prefix.

For example, the lexicographic ("alphabetical") ordering relationship is given by:

$(S_1, P_1) < (S_2, P_2) \iff P_1 <_{\text{lexicographic}} P_2$

In this case, the least accepted successor definitely has the least immediate successor as a prefix, so it is sufficient to order candidates by prefix.

By contrast, with the more common "by length then lexicographic" ordering given by:

$(S_1, P_1) < (S_2, P_2) \iff |P_1| < |P_2| \text{ or } (|P_1| = |P_2| \text{ and } P_1 <_{\text{lexicographic}} P_2)$

you cannot predict the order of the least accepted successor of two nodes simply by looking at their immediate successors. You also need to know the minimum number of symbols required to reach an accepting node (or, equivalently, state). Fortunately, that's easy to precompute using any all-pairs shortest-paths algorithm on the DFA.
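As a rough illustration of step 2 under the "by length then lexicographic" ordering, here is a sketch in C for the question's first example, (a+c)b*. The DFA table is written by hand rather than built from the regex, and all names (delta, accepting, min_dist_to_accept, enumerate, the capacity constants) are assumptions for this example. It also simplifies the general scheme: a plain FIFO queue stands in for a priority queue, which is enough here because expanding nodes in alphabet order keeps each length level sorted, and the precomputed distances are only used to discard prefixes that can never reach an accepting state.

```c
#include <assert.h>
#include <string.h>

#define NSTATES  2
#define NSYMS    3
#define WORD_MAX 16        /* longest word, including the NUL */
#define QCAP     256       /* queue capacity for this sketch */
#define DEAD     (-1)      /* no transition */
#define INF      1000000   /* "cannot reach an accepting state" */

static const char alphabet[NSYMS] = { 'a', 'b', 'c' };   /* must be sorted */
static const int delta[NSTATES][NSYMS] = {
    { 1, DEAD, 1 },        /* state 0: a -> 1, c -> 1 */
    { DEAD, 1, DEAD },     /* state 1: b -> 1 */
};
static const int accepting[NSTATES] = { 0, 1 };

typedef struct { int state; char prefix[WORD_MAX]; } Node;

/* Minimum number of symbols needed to reach an accepting state from
 * each state, found by repeated relaxation over the transitions. */
static void min_dist_to_accept(int dist[NSTATES])
{
    for (int s = 0; s < NSTATES; s++)
        dist[s] = accepting[s] ? 0 : INF;
    for (int round = 0; round < NSTATES; round++)
        for (int s = 0; s < NSTATES; s++)
            for (int k = 0; k < NSYMS; k++) {
                int t = delta[s][k];
                if (t != DEAD && dist[t] + 1 < dist[s])
                    dist[s] = dist[t] + 1;
            }
}

/* Writes up to max_words words into out[], shortest first and
 * alphabetical within each length; returns how many were written. */
static int enumerate(int max_words, char out[][WORD_MAX])
{
    static Node queue[QCAP];
    int dist[NSTATES], head = 0, tail = 0, nwords = 0;

    min_dist_to_accept(dist);
    if (dist[0] >= INF)
        return 0;                              /* the language is empty */
    queue[tail].state = 0;
    queue[tail].prefix[0] = '\0';
    tail++;

    while (head < tail && nwords < max_words) {
        Node cur = queue[head++];
        if (accepting[cur.state])
            strcpy(out[nwords++], cur.prefix);
        size_t len = strlen(cur.prefix);
        for (int k = 0; k < NSYMS; k++) {
            int t = delta[cur.state][k];
            if (t == DEAD || dist[t] >= INF)
                continue;                      /* prune hopeless prefixes */
            assert(tail < QCAP && len + 2 <= WORD_MAX);
            Node next = cur;
            next.state = t;
            next.prefix[len] = alphabet[k];
            next.prefix[len + 1] = '\0';
            queue[tail++] = next;
        }
    }
    return nwords;
}
```

Asking for 6 words yields "a", "c", "ab", "cb", "abb", "cbb", matching example scenario 1. Because the automaton is deterministic, no "seen" set is needed.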

I would recommend using the "iterator" design pattern. I see that you're using C, which is not particularly geared toward object-oriented code, but you can accomplish this by using a structure containing a pointer to a next function, a pointer to a restart function, and a pointer to a "context" object to be passed to those functions, where the nature of the "context" object will depend on the operator.

In something like a, the next function returns "a" the first time and NULL the second time. (The context object keeps track of the "a" and whether it's already been returned.)

In something like ...+..., the next can either exhaust the first ...'s iterator before proceeding to the second, or it can alternate between them, as you prefer. (The context object keeps pointers to the two ...'s iterators, and which one to call next.)

...and so on.
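A small sketch of what such iterators might look like in C, covering only a single literal and the union operator. The names Iter, LitCtx, UnionCtx and the lit_/union_ functions are hypothetical, chosen for this example. The union here takes the "exhaust the first, then the second" option; if the first operand can produce infinitely many words (a closure, for instance), alternating between the two would be needed so that the second operand is not starved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Generic iterator: two function pointers plus an operator-specific context. */
typedef struct {
    const char *(*next)(void *ctx);   /* next word, or NULL when exhausted */
    void (*restart)(void *ctx);       /* rewind to the first word */
    void *ctx;
} Iter;

/* Literal such as "a": yields its text once, then NULL. */
typedef struct { const char *text; int done; } LitCtx;

static const char *lit_next(void *p)
{
    LitCtx *c = p;
    if (c->done)
        return NULL;
    c->done = 1;
    return c->text;
}

static void lit_restart(void *p)
{
    ((LitCtx *)p)->done = 0;
}

/* Union X+Y: drains the left iterator, then the right one. */
typedef struct { Iter *left, *right; int on_right; } UnionCtx;

static const char *union_next(void *p)
{
    UnionCtx *c = p;
    assert(c->left != NULL && c->right != NULL);
    if (!c->on_right) {
        const char *w = c->left->next(c->left->ctx);
        if (w != NULL)
            return w;
        c->on_right = 1;              /* left side is exhausted */
    }
    return c->right->next(c->right->ctx);
}

static void union_restart(void *p)
{
    UnionCtx *c = p;
    c->left->restart(c->left->ctx);
    c->right->restart(c->right->ctx);
    c->on_right = 0;
}
```

For a+c, one would build two literal iterators (for "a" and "c"), wrap them in a UnionCtx, and call next repeatedly: it returns "a", then "c", then NULL, and starts over after restart. Concatenation and closure would follow the same shape, with contexts that own a buffer for the combined word.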

The hardest part is parsing the expression to create all these objects, but it sounds like you're already comfortable with parsing the expression?
