DFAs vs Regexes when implementing a lexical analyzer?

Original post: 2013-01-19 22:34:05. Tags: regex, compiler-construction, lexical-analysis, dfa

(I'm just learning how to write a compiler, so please correct me if I make any incorrect claims.)

Why would anyone still implement DFAs in code (goto statements, table-driven implementations) when they can simply use regular expressions? As far as I understand, lexical analyzers take in a string of characters and churn out a list of tokens. Those tokens are the terminals of the language's grammar, which makes it possible to describe them with regular expressions. Wouldn't it be easier to just loop over a bunch of regexes, breaking out of the loop as soon as one matches?

1 answer

You're absolutely right that it's easier to write regular expressions than DFAs. However, a good question to think about is:

How do these regex matchers work?

Most fast regex matchers work by compiling the pattern down to some type of automaton (either an NFA or a minimum-state DFA) internally. If you wanted to build a scanner that uses regexes to describe which tokens to match and then loops through all of them, you could absolutely do so, but internally those regexes would probably be compiled to DFAs.
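For concreteness, here is a minimal sketch of the "loop over a bunch of regexes" scanner described in the question. The token names and patterns are made up for illustration. Note that Python's `re` module is a backtracking matcher rather than a DFA, so this shows the looping structure only, not the performance argument.

```python
import re

# Hypothetical token set for illustration; not taken from the original post.
TOKEN_SPECS = [
    ("NUMBER", re.compile(r"\d+")),
    ("IDENT", re.compile(r"[A-Za-z_]\w*")),
    ("OP", re.compile(r"[+\-*/=()]")),
    ("SKIP", re.compile(r"\s+")),
]


def tokenize(text):
    """Scan text by trying each regex in turn at the current position."""
    tokens = []
    pos = 0
    while pos < len(text):
        for kind, pattern in TOKEN_SPECS:
            match = pattern.match(text, pos)
            if match:
                if kind != "SKIP":  # whitespace is consumed but not emitted
                    tokens.append((kind, match.group()))
                pos = match.end()
                break  # the first pattern that matches wins
        else:
            # No pattern matched at this position.
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens
```

With this scheme the order of `TOKEN_SPECS` decides which token wins, and every position may be tested against several patterns before one succeeds.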

It's extremely rare to see anyone actually code up a DFA by hand for scanning or parsing, because it's just so complicated. This is why there are tools like lex or flex, which let you specify the regexes to match and then automatically compile them down to DFAs behind the scenes. That way, you get the best of both worlds: you describe what to match using the nicer notation of regexes, but you get the speed and efficiency of DFAs behind the scenes.
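To give a feel for what coding a DFA by hand involves, here is a rough sketch of a table-driven DFA for the single regex `\d+(\.\d+)?`. The state names and the `char_class` helper are assumptions made for this example. It is not what lex or flex actually emit, only the general shape of a table-driven matcher.

```python
# Hypothetical hand-built DFA for the regex \d+(\.\d+)? (illustrative only).
START, INT, DOT, FRAC = range(4)
ACCEPTING = {INT, FRAC}

# Transition table: (state, character class) -> next state.
# Any pair missing from the table means the DFA rejects.
TRANSITIONS = {
    (START, "digit"): INT,
    (INT, "digit"): INT,
    (INT, "dot"): DOT,
    (DOT, "digit"): FRAC,
    (FRAC, "digit"): FRAC,
}


def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if "0" <= ch <= "9":
        return "digit"
    if ch == ".":
        return "dot"
    return "other"


def matches_number(text):
    """Return True if the whole string matches \\d+(\\.\\d+)?."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING
```

Even for one small pattern, the table has to be worked out state by state, which is the bookkeeping a scanner generator does for you.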

One more important detail about building a giant DFA is that it is possible to build a single DFA that tries matching multiple different regular expressions in parallel. This increases efficiency, since a single pass of that DFA over the string searches for all possible regex matches at once.
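As a rough illustration of the combined-DFA idea, the sketch below hand-builds one small DFA that tracks three made-up patterns at once: the keyword `if`, identifiers `[a-z]+`, and numbers `[0-9]+`. The state numbering and the longest-match rule are assumptions chosen for this example. A generator such as flex would derive the states automatically.

```python
import string

# Hypothetical combined DFA (illustrative) for three token patterns at once:
#   IF = "if", IDENT = [a-z]+, NUMBER = [0-9]+
# State 0 is the start state; each accepting state is labeled with its token.
ACCEPT = {1: "IDENT", 2: "IF", 3: "IDENT", 4: "NUMBER"}


def step(state, ch):
    """Return the next state, or None if the DFA has no transition."""
    if ch in string.ascii_lowercase:
        if state == 0:
            return 1 if ch == "i" else 3  # state 1: seen "i"
        if state == 1:
            return 2 if ch == "f" else 3  # state 2: seen exactly "if"
        if state in (2, 3):
            return 3  # any further letter makes it a plain identifier
    elif ch in string.digits and state in (0, 4):
        return 4
    return None


def scan(text):
    """Tokenize text with one DFA pass per token, keeping the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        state, i = 0, pos
        last_accept = None  # (end position, token kind) of the longest match so far
        while i < len(text):
            state = step(state, text[i])
            if state is None:
                break
            i += 1
            if state in ACCEPT:
                last_accept = (i, ACCEPT[state])
        if last_accept is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        end, kind = last_accept
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens
```

Each character is examined once per token, and the accepting state reached says which pattern matched, so there is no need to retry the input against every regex separately.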

Hope this helps!