
Computation of DFA states

Original question: 2013-05-22 17:37:50. Tags: regex, lex, flex-lexer

I want to compute the total number of DFA states for a given regular expression using flex. Which C files or functions would help me do this?

1 answer

If you look in the file generated by flex, the number of entries in yy_accept (and yy_base) will probably give a good indication of the number of states in the generated DFA. If you use the -Cf option, yy_nxt contains the transition function of the DFA, and the number of rows in that table is again the number of states used.
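A minimal sketch of the table-size approach in Python, assuming the generated scanner declares the table with an explicit size, as in `yy_accept[9] =`. The helper name `declared_table_size` is hypothetical, the C type in front of the table name varies between flex versions, and the declared size may include entries that do not correspond to a real DFA state, so treat the result as an indication rather than an exact count.

```python
import re

def declared_table_size(c_source, table_name="yy_accept"):
    """Return the declared length of a flex table, or None if it is not found.

    Looks for a definition such as `yy_accept[9] =` in the text of a
    generated scanner. Tables declared without an explicit first
    dimension are not handled by this sketch.
    """
    pattern = r"\b" + re.escape(table_name) + r"\s*\[\s*(\d+)\s*\]\s*="
    match = re.search(pattern, c_source)
    return int(match.group(1)) if match else None
```

To use it, read the generated file (lex.yy.c by default) into a string and pass that string in; pass "yy_base" as the second argument to cross-check against the other table.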

You may have a different version of flex in which the tables are named differently, but their names will most likely be very similar.

In response to your questions below: the number of states in a DFA can be considered well defined, assuming the DFA has been minimized. The number of transitions, however, is much less well defined.

First, flex has a transition for every input character, because it will ECHO any character that is not part of the defined language. This is implemented by an additional state that handles that case. Using a debugger you could reverse engineer which state this is. Beware that if you use start conditions, there may be several such states. If you want to analyze many regular expressions, you may want to look into other tools, or take the flex sources and go from there.

Second, flex has strategies to minimize the total size of all the tables; the -Cf option instructs it not to apply them. One such optimization is to find equivalence classes of characters and store transitions only per character class. An input character is first translated to its class, which is then used to determine the transition. As a result the number of transitions is much lower, but an additional table (see yy_ec) is required to determine the character class.
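The effect of character equivalence classes can be illustrated with a small Python sketch. It is not flex's actual algorithm: it simply groups characters whose columns in a full transition table are identical, and the names `compress_by_equivalence_class` and `step` are hypothetical. The input `table[state][char]` is assumed to hold the next state for every state and character code.

```python
def compress_by_equivalence_class(table):
    """Group characters with identical transition columns into classes.

    Returns (ec, compact) where ec[char] is the class of a character and
    compact[state][cls] is the next state for that class.
    """
    n_chars = len(table[0])
    class_of_column = {}  # column contents -> class id
    ec = []
    for c in range(n_chars):
        column = tuple(row[c] for row in table)
        if column not in class_of_column:
            class_of_column[column] = len(class_of_column)
        ec.append(class_of_column[column])

    n_classes = len(class_of_column)
    compact = [[None] * n_classes for _ in table]
    for c, cls in enumerate(ec):
        for state, row in enumerate(table):
            compact[state][cls] = row[c]
    return ec, compact


def step(ec, compact, state, char_code):
    """Two-level lookup: character -> class -> next state."""
    return compact[state][ec[char_code]]
```

For a toy DFA whose only pattern is the literal a over a 256-character alphabet, every character other than a has the same column, so two classes suffice and each state stores 2 transitions instead of 256, at the cost of one 256-entry class table.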

The number of transitions is therefore not a well-defined concept. If you are interested in the memory footprint of the scanner, I would look at the size of the scanner's data section. For example, use objdump -h on the lex.yy.o file. The size of the .rodata section gives a fairly accurate estimate of the total size of the tables.
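If you want to collect that figure for several scanners, the objdump step can be scripted. The sketch below assumes the usual `objdump -h` layout, in which each section row starts with an index, the section name and the size in hexadecimal; the function names are hypothetical, and `rodata_size` assumes objdump is on the PATH and that lex.yy.o has already been compiled.

```python
import subprocess

def section_size(objdump_output, section=".rodata"):
    """Return the size in bytes of `section` from `objdump -h` text, or None."""
    for line in objdump_output.splitlines():
        fields = line.split()
        # Section rows are assumed to look like: <index> <name> <size-hex> ...
        if len(fields) >= 3 and fields[0].isdigit() and fields[1] == section:
            return int(fields[2], 16)
    return None


def rodata_size(object_file="lex.yy.o"):
    """Run `objdump -h` on an object file and return its .rodata size."""
    result = subprocess.run(
        ["objdump", "-h", object_file],
        capture_output=True, text=True, check=True,
    )
    return section_size(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parser can be checked against saved output without needing the toolchain.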

You seem to have already found the -v option of flex, which reports the number of states in the DFA in a more verbose form. As to why "a" {} gives 5 states, you can also use the --trace option, which prints the DFA as it is generated. Apparently there is also an End Marker rule; I assume it is used for end-of-file. For each start condition there are two states, one used at the start of a line and one used in the middle of a line. That makes 3 accepting states (one for "a", one for End Marker and one for (.|"\\n")) plus two states for the single start condition.
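To pull the state count out of the -v report automatically, a sketch along these lines may help. It assumes the report contains a summary line of the form `N/M DFA states`, which is how the flex versions I have seen print it; check the output of your own version before relying on it. The function name `dfa_state_count` is hypothetical.

```python
import re

def dfa_state_count(flex_report):
    """Return N from a line like 'N/M DFA states' in a flex -v report, or None."""
    match = re.search(r"(\d+)/(\d+) DFA states", flex_report)
    return int(match.group(1)) if match else None
```

Pass in the captured text of the report; as far as I know flex writes these statistics to standard error rather than standard output, so capture that stream.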

The source file dfa.c is not part of the generated code, but if you feel brave you could of course modify the flex sources to do further analysis of your own. I had a quick look, and code generation does seem to be intertwined with the transformations, which makes flex less modular than one would want in an experimentation platform. Also beware of the K&R-style declarations, which effectively disable any type checking of function arguments.