flex filename.l won't produce lex.yy.c

I want to detect whether any uppercase letter is duplicated in the input. The program should work fine, but when I run the command flex filename.l in the Windows command prompt, it doesn't show any errors or warnings. The command just runs, and I have to wait for it to output lex.yy.c, but that never happens. I've been waiting for about an hour for it to finish. Why does this happen? This is the part of the code which detects duplicates.

// Definition Section
A_ ([B-L]*"A"[B-L]*("A"[B-L]*)+)
B_ ([ACDEFGHIJKL]*"B"[ACDEFGHIJKL]*("B"[ACDEFGHIJKL]*)+)
C_ ([ABDEFGHIJKL]*"C"[ABDEFGHIJKL]*("C"[ABDEFGHIJKL]*)+)
D_ ([ABCEFGHIJKL]*"D"[ABCEFGHIJKL]*("D"[ABCEFGHIJKL]*)+)
E_ ([ABCDFGHIJKL]*"E"[ABCDFGHIJKL]*("E"[ABCDFGHIJKL]*)+)
F_ ([ABCDEGHIJKL]*"F"[ABCDEGHIJKL]*("F"[ABCDEGHIJKL]*)+)
G_ ([ABCDEFHIJKL]*"G"[ABCDEFHIJKL]*("G"[ABCDEFHIJKL]*)+)
H_ ([ABCDEFGIJKL]*"H"[ABCDEFGIJKL]*("H"[ABCDEFGIJKL]*)+)
I_ ([ABCDEFGHJKL]*"I"[ABCDEFGHJKL]*("I"[ABCDEFGHJKL]*)+)
J_ ([ABCDEFGHIKL]*"J"[ABCDEFGHIKL]*("J"[ABCDEFGHIKL]*)+)
K_ ([ABCDEFGHIJL]*"K"[ABCDEFGHIJL]*("K"[ABCDEFGHIJL]*)+)
L_ ([ABCDEFGHIJK]*"L"[ABCDEFGHIJK]*("L"[ABCDEFGHIJK]*)+)
%%//Rule Section
{A_} {printf("Letter 'A' appeared more than once!");}
{B_} {printf("Letter 'B' appeared more than once!");}
{C_} {printf("Letter 'C' appeared more than once!");}
{D_} {printf("Letter 'D' appeared more than once!");}
{E_} {printf("Letter 'E' appeared more than once!");}
{F_} {printf("Letter 'F' appeared more than once!");}
{G_} {printf("Letter 'G' appeared more than once!");}
{H_} {printf("Letter 'H' appeared more than once!");}
{I_} {printf("Letter 'I' appeared more than once!");}
{J_} {printf("Letter 'J' appeared more than once!");}
{K_} {printf("Letter 'K' appeared more than once!");}
{L_} {printf("Letter 'L' appeared more than once!");}

There are really two separate issues here. One is the question you ask (why does flex take so long to compile this scanner), and the other has to do with whether the scanner description accurately reflects your intent. The second question is a bit tricky because you don't give a very precise description of what you intended, but I'll try to consider some plausible possibilities.

To start with, it's useful to describe what your scanner description actually does. That's also a bit tricky, because you have only included a part of your scanner, so I'm going to take the liberty of proposing a complete scanner which might at least illustrate what the intention might have been.

Note: In the code below, the options prevent compiler warnings, remove the need for yywrap, and warn if the rules don't cover all possible inputs. The fallback patterns at the end and the brief definition of main make the scanner runnable. {-} is a Flex extension which computes the set difference between two character classes.

The code I'm going to work with is the following:

  /* My standard 
   */
%option noinput nounput noyywrap nodefault
NOT_A  [A-L]{-}[A]
NOT_B  [A-L]{-}[B]
NOT_C  [A-L]{-}[C]
NOT_D  [A-L]{-}[D]
NOT_E  [A-L]{-}[E]
NOT_F  [A-L]{-}[F]
NOT_G  [A-L]{-}[G]
NOT_H  [A-L]{-}[H]
NOT_I  [A-L]{-}[I]
NOT_J  [A-L]{-}[J]
NOT_K  [A-L]{-}[K]
NOT_L  [A-L]{-}[L]
%%
{NOT_A}*A{NOT_A}*(A{NOT_A}*)+   { puts("Duplicate A"); }
{NOT_B}*B{NOT_B}*(B{NOT_B}*)+   { puts("Duplicate B"); }
{NOT_C}*C{NOT_C}*(C{NOT_C}*)+   { puts("Duplicate C"); }
{NOT_D}*D{NOT_D}*(D{NOT_D}*)+   { puts("Duplicate D"); }
{NOT_E}*E{NOT_E}*(E{NOT_E}*)+   { puts("Duplicate E"); }
{NOT_F}*F{NOT_F}*(F{NOT_F}*)+   { puts("Duplicate F"); }
{NOT_G}*G{NOT_G}*(G{NOT_G}*)+   { puts("Duplicate G"); }
{NOT_H}*H{NOT_H}*(H{NOT_H}*)+   { puts("Duplicate H"); }
{NOT_I}*I{NOT_I}*(I{NOT_I}*)+   { puts("Duplicate I"); }
{NOT_J}*J{NOT_J}*(J{NOT_J}*)+   { puts("Duplicate J"); }
{NOT_K}*K{NOT_K}*(K{NOT_K}*)+   { puts("Duplicate K"); }
{NOT_L}*L{NOT_L}*(L{NOT_L}*)+   { puts("Duplicate L"); }
  /* Match any string consisting of A-L */
[A-L]+                          { puts("No duplicate"); }
  /* Ignore newline or any other character */
.|\n                            ;
%%
int main(void) {
  return yylex();
}

It's important to underline the fact that the point of a scanner is to divide the input into tokens. A scanner is not a general purpose regular expression engine, although it is sometimes possible to push the boundaries a bit. However, if you want to do a task like searching the input for possibly overlapping regular expression matches, ignoring unmatched input, you might find that Flex's sequential match architecture is not a good fit for the problem.

So, in this case, I assumed that the strings of interest are sequences of upper case letters (with a restricted alphabet) and that anything else can be safely ignored. (That means that the input abcABCD993 will be divided into seven tokens, of which six are ignored: the lower case letters at the beginning and the digits at the end. ABCD is considered a token even though it is not visibly delimited from the surrounding text. Fixing that, if it needs to be fixed, is not difficult, but is not relevant to this question.)

Because I added the pattern [A-L]+ at the end, the rule set does match any sequence of upper case letters from the restricted alphabet. But only one rule is allowed to match for any such word. If there is a duplicate letter, one of the first 12 rules will match. The [A-L]+ rule will also match, but since it is at the end, it will only be applied if no duplicate rule matches. Similarly, a token could have more than one duplicate letter (for example, CCLAAL matches the A, C and L duplicate rules). Because any rule which matches will match the entire token, the rule which applies is the first matching rule in the scanner description; the string CCLAAL will trigger the report Duplicate A.

Innocuous though it seems, it is that resolution which causes the exponential blow-up in scanner size. Consider the example input CCLAAL once again. The CC at the beginning immediately puts the pattern {NOT_C}*C{NOT_C}*(C{NOT_C}*)+ into an accepting state, but the match does not terminate at that point; it can be extended. However, while the scanner continues to see whether there is a better match, it can't forget that Duplicate C is a possibility. Had the input been CCLAL, for example, Duplicate C would be the best match rather than Duplicate A. And it's not just full duplicate matches which need to be remembered. In the input ACLLCA, it's not known until the second A is reached that Duplicate A is correct. So at the point that the second L has been read, the scanner must remember not only that Duplicate L is a possible report, but also that a single A or a single C could change the report accordingly.

If we were going to write a program to implement these semantics, the simplest way would be to keep a vector of integers representing the number of occurrences of each of the letters A through L. Such code might look something like this:

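#include <stdio.h>

/* Reads the next word made up of the letters A-L from stdin and reports
 * the first letter (in alphabetical order) which appears more than once.
 * Returns EOF at end of input, 1 if a duplicate was found, 0 otherwise. */
int check_for_duplicate(void) {
  int ch;
  int seen['L' - 'A' + 1] = {0};
  /* Skip anything which is not one of the letters A-L */
  while ((ch = getchar()) != EOF && (ch < 'A' || ch > 'L'))
    continue;
  if (ch == EOF) return EOF;
  seen[ch - 'A'] = 1;
  /* Count the remaining letters of the word */
  while ((ch = getchar()) != EOF && ch >= 'A' && ch <= 'L') {
    ++seen[ch - 'A'];
  }
  /* Put the following character back into the stream */
  if (ch != EOF) ungetc(ch, stdin);
  /* See if a duplicate was found */
  for (size_t i = 0; i < sizeof(seen) / sizeof(*seen); ++i) {
    if (seen[i] >= 2) {
      printf("Duplicate %c\n", (int)('A' + i));
      return 1;
    }
  }
  puts("No duplicates");
  return 0;
}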

In that program, the array seen maintains the state of the scan. Only three of the possible count values for each letter are actually important (zero, one, and two or more), so as a first approximation there are a total of 3^12 = 531,441 possible meaningful scanner states. That doesn't matter to the C program, but it does matter to Flex; the scanner built by Flex is a state machine with no data other than the scanner state.

That seems manageable, and even more so when we note that anything after the first 2 is irrelevant, reducing the count to 3*(2^12 - 1), or 12,285. Unfortunately, Flex is actually dealing with regular expressions, and the reduction to duplicate counts is an abstraction which it can't make. We can see that there are only three relevant counts for each letter, but Flex is limited to noting how far it has progressed in the regular expression for each possible duplicate match. And there are more than three positions in each regular expression.

If you run Flex with the -v option, it will print out a summary of statistics, one of which is the number of "DFA states" which it produced from the ruleset. The scanner Flex generates includes a state transition table whose size is related to the number of states, so that statistic is very important. I ran Flex with different numbers of duplicate match rules, from 2 to 9, and extracted the number of states in each state machine; as might be expected from the above, the state count is exponential in the number of rules:

Rules   States
  2       28
  3       89
  4      306
  5     1063
  6     3656
  7    12405
  8    41566
  9   137795

Based on that progression, in which each additional rule multiplies the state count by a factor of roughly 3.3, it's reasonable to predict that the scanner for 12 letters would have around five million states, and the source file which contains the state transitions could well be measured in gigabytes. So it's not too surprising that Flex takes a while to generate it.
