
Creating string set from given regular expression in language L

I'm trying to generate a sequence of words over an alphabet (given by the user) according to a regular expression (also given by the user), but I couldn't make it work.

Example scenario 1:

Alphabet = [a,b,c]

Regex = (a+c)b*

Word Count = 6

Words = ["a", "c", "ab", "cb", "abb", "cbb"]

Example scenario 2:

Alphabet = [a,b]

Regex = (a+b)*a

Word Count = 3

Words = ["a", "aa", "ba"]

I tried converting the regex to postfix/infix and going from there, but I couldn't build the engine algorithm.

Basically there are 3 operations:

Union (+)
Concat ()
Closure (*)

I wrote one function per operator type:

void union(char* x[], char y)
{
    printf("%s\n%c\n", x, y);

    remainingWordCount -= 2;
}

void concat(char* x[], char* y[])
{
    printf("%s%s\n", x, y);
    remainingWordCount--;
}

void closure(char* x[], char* y[])
{
    while (remainingWordCount > 0)
    {
        concat(x, y);
    }
}

It only works in the most basic scenarios.

So my question is: how can I create a set of strings according to a given regex without using any regex library? Are there any known algorithms for that?

The "brute force" solution is to parse the regular expression into the finite-state machine's state transition graph, with each state having a list of transitions, and each transition having the associated symbol (character) and the next state. You can use a state with no transitions to indicate the terminal state.

Then traverse this graph, remembering the string produced by the transitions. When you reach a terminal state, print the word and decrement the remaining word count, and stop when it reaches zero. If different paths through the graph can end up producing the same word, you also need to remember the words already output, and skip the print/decrement when the same word has already been produced.

Process the paths in sorted order (such that shorter paths come before longer ones with the same prefix, i.e., as per strcmp in C). This avoids getting stuck in a loop, and gives the order you want.

For example, for the regular expression a* (pseudocode):

state[0] = { {'a', 0}, {'\0', 1} };
state[1] = { }; // terminal state
paths = { { .state = 0, .string = "" } }; // set initial state

You start with the only path you have, at state 0, and append to it (separately) both transitions from state 0 to make the new paths:

paths = { { .state = 1, .string = "" },
          { .state = 0, .string = "a" } };

Since the path with the empty string is ordered first (the empty string sorts before any non-empty string), it is handled first, and since it is in a terminal state with no transitions, it prints the empty string and decrements the word count. Then you take the other path and add the transitions from state 0 again, ending up with:

paths = { { .state = 1, .string = "a" },
          { .state = 0, .string = "aa" } };

etc.

(Disclaimer: This is completely untested, off the top of my head, and there could be corner cases that I didn't think of. Also be aware that the number of paths will explode for non-trivial regular expressions.)
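The a* walk-through above could be sketched in C roughly as follows. This is only an illustration under stated assumptions: the graph is hand-built rather than parsed from a regex, and the names `generate`, `graph`, `MAX_PATHS` and `MAX_LEN`, the fixed-size pool, and the linear scan for the smallest path are all choices made for this sketch, not part of the answer.

```c
#include <assert.h>
#include <string.h>

#define MAX_PATHS 64   /* arbitrary pool capacity for this sketch */
#define MAX_LEN   32   /* arbitrary maximum word length (incl. '\0') */

/* One outgoing edge; symbol '\0' means "no character is appended". */
struct transition { char symbol; int next; };
struct state      { int count; struct transition t[2]; };
struct path       { int state; char string[MAX_LEN]; };

/* Hand-built graph for a*: state 0 loops on 'a' or moves to state 1.
   State 1 has no transitions, which marks it as the terminal state. */
static const struct state graph[2] = {
    { 2, { { 'a', 0 }, { '\0', 1 } } },
    { 0, { { 0, 0 }, { 0, 0 } } }
};

/* Stores up to `wanted` words in out[][] and returns how many were found. */
int generate(int wanted, char out[][MAX_LEN])
{
    struct path paths[MAX_PATHS] = { { 0, "" } };  /* initial path: state 0, "" */
    int npaths = 1, produced = 0;

    while (produced < wanted && npaths > 0) {
        /* Pick the pending path whose string sorts first under strcmp. */
        int best = 0;
        for (int i = 1; i < npaths; i++)
            if (strcmp(paths[i].string, paths[best].string) < 0)
                best = i;
        struct path cur = paths[best];
        paths[best] = paths[--npaths];             /* remove it from the pool */

        const struct state *s = &graph[cur.state];
        if (s->count == 0) {                       /* terminal state: emit word */
            strcpy(out[produced++], cur.string);
            continue;
        }
        for (int i = 0; i < s->count; i++) {       /* extend by each transition */
            struct path next = cur;
            size_t len = strlen(next.string);
            next.state = s->t[i].next;
            if (s->t[i].symbol != '\0') {
                if (len + 1 >= MAX_LEN)
                    continue;                      /* too long for this sketch */
                next.string[len] = s->t[i].symbol;
                next.string[len + 1] = '\0';
            }
            assert(npaths < MAX_PATHS);            /* sketch-only capacity check */
            paths[npaths++] = next;
        }
    }
    return produced;
}
```

Calling `generate(3, out)` with `char out[3][MAX_LEN]` fills `out` with the empty string, "a" and "aa". Duplicate suppression is left out because this particular graph cannot produce the same word twice. Note also that under a plain strcmp order a graph for (a+b)*a would keep producing a, aa, aaa, ... and never reach ba, so a length-first order is needed to reproduce the question's second scenario.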

The basic algorithm is simple (if you know how to do all the pieces):

  1. Construct a DFA from the regular expression. (Constructing an NFA is not sufficient because the NFA will produce duplicate strings if the regular expression is not deterministic.) The link shows one way of doing this, but there are others; you'll probably find a longer exposition in your formal languages textbook, if you have one.

  2. Do an ordered walk ("best-first search") of the (infinite) graph generated from the DFA, where each node is a pair (state, prefix) and edges correspond to transitions in the DFA. During the walk, if a node is encountered whose state is accepting, produce its prefix.

That basic algorithm will work for any ordering relationship between strings with the prefix property: any proper prefix of a string is guaranteed to be less than the string. (If that's not the case, it is possible that there is no "least" element in a set of strings. For example, if the ordering relationship puts a string before any prefix of that string, but is otherwise lexicographic, then any string in a* is preceded by the next longer string, which will produce an infinite loop.)

It's important to note that the state in the node is only for convenience; it is computationally redundant because it could be regenerated by passing the prefix through the DFA. As a consequence, the graph never contains two different nodes with the same prefix. The corollary of this observation is that it is not necessary to maintain a set of "seen" nodes, because the successor sets of two distinct nodes are disjoint.

In order to implement the ordered search in step 2, it is necessary to know the least accepted successor of each node in the graph, which is not necessarily the immediate successor with the least prefix.

For example, the lexicographic ("alphabetical") ordering relationship is given by:

$(S_1, P_1) < (S_2, P_2) \iff P_1 <_{\text{lexicographic}} P_2$

In this case, the least accepted successor definitely has the least immediate successor as a prefix, so it is sufficient to order candidates by prefix.

By contrast, with the more common "by length then lexicographic" ordering given by:

$(S_1, P_1) < (S_2, P_2) \iff |P_1| < |P_2| \text{ or } (|P_1| = |P_2| \text{ and } P_1 <_{\text{lexicographic}} P_2)$

you cannot predict the order of the least accepted successor of two nodes simply by looking at their immediate successors. You also need to know the minimum number of symbols required to reach an accepting node (or, equivalently, state). Fortunately, that's easy to precompute using any all-pairs shortest-paths algorithm on the DFA.
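As a rough illustration of step 2 with the "by length then lexicographic" ordering, here is a minimal C sketch for the question's second scenario, (a+b)*a. It rests on several assumptions: the DFA table is written by hand rather than built from the regex, and a plain FIFO queue stands in for a priority queue (with symbols expanded in sorted order, a FIFO visits prefixes by length and then lexicographically). Because every state of this small DFA can still reach an accepting state, the shortest-distance precomputation described above is skipped. The names `enumerate`, `delta`, `accepting` and the capacity constants are invented for the sketch.

```c
#include <assert.h>
#include <string.h>

#define NSYM      2
#define MAX_LEN   16     /* arbitrary maximum word length (incl. '\0') */
#define QUEUE_CAP 1024   /* arbitrary queue capacity for this sketch */

static const char alphabet[NSYM + 1] = "ab";   /* must be in sorted order */

/* Hand-built DFA for (a+b)*a: state 1 means "the last symbol read was a". */
static const int delta[2][NSYM] = { { 1, 0 }, { 1, 0 } };
static const int accepting[2]   = { 0, 1 };

struct node { int state; char prefix[MAX_LEN]; };

/* Stores up to `wanted` accepted words in out[][], shortest first,
   and returns how many were found. */
int enumerate(int wanted, char out[][MAX_LEN])
{
    static struct node queue[QUEUE_CAP];
    int head = 0, tail = 0, produced = 0;

    queue[tail].state = 0;                 /* start node: (state 0, "") */
    queue[tail].prefix[0] = '\0';
    tail++;

    while (head < tail && produced < wanted) {
        struct node cur = queue[head++];
        if (accepting[cur.state])          /* accepting state: produce prefix */
            strcpy(out[produced++], cur.prefix);

        size_t len = strlen(cur.prefix);
        if (len + 1 >= MAX_LEN)
            continue;                      /* too long for this sketch */
        for (int i = 0; i < NSYM; i++) {   /* one successor per DFA transition */
            assert(tail < QUEUE_CAP);      /* sketch-only capacity check */
            struct node next = cur;
            next.state = delta[cur.state][i];
            next.prefix[len] = alphabet[i];
            next.prefix[len + 1] = '\0';
            queue[tail++] = next;
        }
    }
    return produced;
}
```

With `char out[3][MAX_LEN]`, `enumerate(3, out)` yields "a", "aa" and "ba", which matches the word list in example scenario 2. No "seen" set is needed because each prefix is reached by exactly one path through a DFA. For a DFA with dead states, pruning nodes that cannot reach acceptance would keep the queue from filling with useless prefixes.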

I would recommend using the "iterator" design pattern. I see that you're using C, which is not particularly geared toward object-oriented code, but you can accomplish this by using a structure containing a pointer to a next function, a pointer to a restart function, and a pointer to a "context" object to be passed to those functions, where the nature of the "context" object will depend on the operator.

In something like a, the next function returns "a" the first time and NULL the second time. (The context object keeps track of the "a" and whether it's already been returned.)

In something like ...+..., next can either exhaust the first ...'s iterator before proceeding to the second, or it can alternate between them, as you prefer. (The context object keeps pointers to the two ...'s iterators, and which one to call next.)

...and so on.
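A minimal sketch of that structure in C, covering only the single-symbol and union cases described above, might look like this. All names here (`struct iterator`, `literal_ctx`, `union_ctx`, `make_literal`, `make_union`) are hypothetical, the union variant shown is the "exhaust the first, then the second" option, and the caller is assumed to own the context objects.

```c
#include <assert.h>
#include <stddef.h>

/* Generic iterator: two function pointers plus an operator-specific context. */
struct iterator {
    const char *(*next)(void *ctx);   /* next word, or NULL when exhausted */
    void (*restart)(void *ctx);       /* rewind to the first word */
    void *ctx;
};

/* Single symbol such as a: yields its text once, then NULL. */
struct literal_ctx { const char *text; int done; };

static const char *literal_next(void *p)
{
    struct literal_ctx *c = p;
    if (c->done)
        return NULL;
    c->done = 1;
    return c->text;
}

static void literal_restart(void *p)
{
    ((struct literal_ctx *)p)->done = 0;
}

struct iterator make_literal(struct literal_ctx *c, const char *text)
{
    assert(c != NULL && text != NULL);
    c->text = text;
    c->done = 0;
    struct iterator it = { literal_next, literal_restart, c };
    return it;
}

/* Union ...+...: exhausts the left iterator, then the right one. */
struct union_ctx { struct iterator left, right; int on_right; };

static const char *union_next(void *p)
{
    struct union_ctx *c = p;
    if (!c->on_right) {
        const char *w = c->left.next(c->left.ctx);
        if (w != NULL)
            return w;
        c->on_right = 1;              /* left side is exhausted */
    }
    return c->right.next(c->right.ctx);
}

static void union_restart(void *p)
{
    struct union_ctx *c = p;
    c->left.restart(c->left.ctx);
    c->right.restart(c->right.ctx);
    c->on_right = 0;
}

struct iterator make_union(struct union_ctx *c,
                           struct iterator left, struct iterator right)
{
    assert(c != NULL);
    c->left = left;
    c->right = right;
    c->on_right = 0;
    struct iterator it = { union_next, union_restart, c };
    return it;
}
```

For a+c, building `make_union(&u, make_literal(&la, "a"), make_literal(&lc, "c"))` and calling `it.next(it.ctx)` repeatedly returns "a", then "c", then NULL. Concatenation and closure would need contexts that also own a buffer for the combined word, and restart is what lets a concatenation replay its right-hand side for each word from the left.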

The hardest part is parsing the expression to create all these objects, but it sounds like you're already comfortable with parsing the expression?
