DFA vs NFA engines: What is the difference in their capabilities and limitations?

I am looking for a non-technical explanation of the difference between DFA and NFA engines, based on their capabilities and limitations.

Deterministic finite automata (DFAs) and nondeterministic finite automata (NFAs) have exactly the same capabilities and limitations. The only difference is notational convenience.

A finite automaton is a processor that has states and reads input, each input character potentially setting it into another state. For example, a state might be "just read two Cs in a row" or "am starting a word". These are usually used for quick scans of text to find patterns, such as lexical scanning of source code to turn it into tokens.

A deterministic finite automaton is in exactly one state at a time, which makes it straightforward to implement. A nondeterministic finite automaton can be in more than one state at a time: for example, in a language where identifiers can begin with a digit, there might be a state "reading a number" and another state "reading an identifier", and an NFA could be in both at the same time when reading something starting "123". Which state actually applies would depend on whether it encountered something not numeric before the end of the word.

Now, we can express "reading a number or identifier" as a state itself, and suddenly we don't need the NFA. If we express combinations of states in an NFA as states themselves, we've got a DFA with a lot more states than the NFA, but which does the same thing.
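A minimal sketch of that idea (usually called the subset construction), assuming a made-up NFA for the "123" example. The state names, the two symbol classes `digit` and `letter`, and the function name `nfa_to_dfa` are illustrative, not taken from any real lexer:

```python
from itertools import chain

# Hypothetical NFA: a leading digit may start either a number or an
# identifier; letters are only legal inside identifiers.
NFA = {
    ("start", "digit"): {"number", "identifier"},
    ("number", "digit"): {"number"},
    ("identifier", "digit"): {"identifier"},
    ("identifier", "letter"): {"identifier"},
}

def nfa_to_dfa(nfa, start, alphabet):
    """Build a DFA whose states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa = {}
    seen = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            # Union of everything any member state could move to.
            target = frozenset(chain.from_iterable(
                nfa.get((state, symbol), ()) for state in current))
            if not target:
                continue  # no transition: the DFA simply rejects here
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return dfa

dfa = nfa_to_dfa(NFA, "start", ["digit", "letter"])
for (state, symbol), target in dfa.items():
    print(sorted(state), symbol, "->", sorted(target))
```

The combined state `{"number", "identifier"}` is exactly the "reading a number or identifier" state described above. In this tiny case the DFA is no bigger than the NFA, but in general the number of reachable subsets can grow exponentially with the number of NFA states.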

It's a matter of which is easier to read, write, or deal with. DFAs are easier to understand per se, but NFAs are generally smaller.

Here's a non-technical answer from Microsoft:

DFA engines run in linear time because they do not require backtracking (and thus they never test the same character twice). They can also guarantee matching the longest possible string. However, since a DFA engine contains only finite state, it cannot match a pattern with backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.

Traditional NFA engines run so-called "greedy" match backtracking algorithms, testing all possible expansions of a regular expression in a specific order and accepting the first match. Because a traditional NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpression matches and matching backreferences. However, because a traditional NFA backtracks, it can visit exactly the same state multiple times if the state is arrived at over different paths. As a result, it can run exponentially slowly in the worst case. Because a traditional NFA accepts the first match it finds, it can also leave other (possibly longer) matches undiscovered.

POSIX NFA engines are like traditional NFA engines, except that they continue to backtrack until they can guarantee that they have found the longest match possible. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and when using a POSIX NFA you cannot favor a shorter match over a longer one by changing the order of the backtracking search.

Traditional NFA engines are favored by programmers because they are more expressive than either DFA or POSIX NFA engines. Although in the worst case they can run slowly, you can steer them to find matches in linear or polynomial time using patterns that reduce ambiguities and limit backtracking.

[http://msdn.microsoft.com/en-us/library/0yzc2yb0.aspx]
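The "accepts the first match it finds" and "can capture subexpressions" points in the quote can be seen with Python's built-in `re` module, which is a backtracking engine of the traditional kind. This is only a small illustration; the patterns and variable names are made up for the example:

```python
import re

# Alternatives are tried left to right and the first success wins,
# even though the second alternative would give a longer match.
first = re.match(r"a|ab", "ab").group()
print(first)  # a

# Reordering the alternatives changes which match is found.
longer = re.match(r"ab|a", "ab").group()
print(longer)  # ab

# A backreference: \1 must repeat whatever group 1 captured.
captured = re.match(r"(a+)b\1", "aabaa").group(1)
print(captured)  # aa
```

An engine with longest-match semantics would report `ab` for both orderings of the alternation.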

A simple, nontechnical explanation, paraphrased from Jeffrey Friedl's book Mastering Regular Expressions.

CAVEAT:

While this book is generally considered the "regex bible", there appears to be some controversy as to whether the distinction made here between DFA and NFA is actually correct. I'm not a computer scientist, and I don't understand most of the theory behind what really is a "regular" expression, deterministic or not. After the controversy started, I deleted this answer because of this, but since then it has been referenced in comments to other answers. I would be very interested in discussing this further - can it be that Friedl really is wrong? Or did I get Friedl wrong (but I reread that chapter yesterday evening, and it's just like I remembered...)?

Edit: It appears that Friedl and I are indeed wrong. Please check out Eamon's excellent comments below.


Original answer:

A DFA engine steps through the input string character by character and tries (and remembers) all possible ways the regex could match at this point. If it reaches the end of the string, it declares success.

Imagine the string AAB and the regex A*AB. We now step through our string letter by letter.

  1. A:

    • First branch: Can be matched by A*.
    • Second branch: Can be matched by ignoring the A* (zero repetitions are allowed) and using the second A in the regex.
  2. A:

    • First branch: Can be matched by expanding A*.
    • Second branch: Can't be matched by B. Second branch fails. But:
    • Third branch: Can be matched by not expanding A* and using the second A instead.
  3. B:

    • First branch: Can't be matched by expanding A* or by moving on in the regex to the next token A. First branch fails.
    • Third branch: Can be matched. Hooray!

A DFA engine never backtracks in the string.
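The walkthrough above can be sketched by keeping a set of "live branches" and reading each character exactly once. This is only an illustration for the regex A*AB: the state numbers, the `TRANSITIONS` table, and the name `match_text_directed` are made up here, not part of any real engine.

```python
# Hand-built states for the regex A*AB (numbering is arbitrary):
#   0: still inside A*, or about to consume the literal A
#   1: literal A consumed, expecting B
#   2: B consumed, accept
TRANSITIONS = {
    (0, "A"): {0, 1},  # the A is taken by A* (stay) or by the literal A
    (1, "B"): {2},
}
ACCEPT = {2}

def match_text_directed(text):
    """Read each character once, tracking every state still possible."""
    states = {0}
    trace = []
    for ch in text:
        states = {nxt
                  for s in states
                  for nxt in TRANSITIONS.get((s, ch), ())}
        trace.append((ch, sorted(states)))
        if not states:
            break  # every branch has failed
    return bool(states & ACCEPT), trace

ok, trace = match_text_directed("AAB")
print(ok)     # True
print(trace)  # [('A', [0, 1]), ('A', [0, 1]), ('B', [2])]
```

The position in the text only ever moves forward; the failed branches are simply dropped from the set instead of being revisited.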


An NFA engine steps through the regex token by token and tries all possible permutations on the string, backtracking if necessary. If it reaches the end of the regex, it declares success.

Imagine the same string and the same regex as before. We now step through our regex token by token:

  1. A*: Match AA. Remember the backtracking positions 0 (start of string) and 1.
  2. A: Doesn't match. But we have a backtracking position we can return to and try again. The regex engine steps back one character. Now A matches.
  3. B: Matches. End of regex reached (with one backtracking position to spare). Hooray!
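The three steps above can be sketched as a tiny hand-written backtracking matcher. It is hard-coded for the regex A*AB and matches against the whole string; the function name `match_regex_directed` and the `attempts` list are only there for illustration.

```python
def match_regex_directed(text):
    """Match A*AB against all of text, backtracking over A*."""
    # Token 1: A* greedily consumes as many leading A's as it can.
    greedy = 0
    while greedy < len(text) and text[greedy] == "A":
        greedy += 1

    attempts = []
    # Backtrack: hand characters back from A* one at a time and retry.
    for consumed in range(greedy, -1, -1):
        attempts.append(consumed)
        rest = text[consumed:]
        if rest == "AB":  # tokens 2 and 3: literal A, then literal B
            return True, attempts
    return False, attempts

print(match_regex_directed("AAB"))   # (True, [2, 1])
print(match_regex_directed("AAAA"))  # (False, [4, 3, 2, 1, 0])
```

For AAB, the first attempt lets A* keep both A's and fails; the second gives one back and succeeds, leaving position 0 untried. For AAAA, every backtracking position has to be tried before the matcher can give up.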

Both NFAs and DFAs are finite automata, as their names say.

Both can be represented as a starting state, a success (or "accept") state (or set of success states), and a state table listing transitions.

In the state table of a DFA, each <state₀, input> key will transit to one and only one state₁.

In the state table of an NFA, each <state₀, input> will transit to a set of states.

When you take a DFA, reset it to its start state, and give it a sequence of input symbols, you will know exactly what end state it is in and whether that is a success state or not.
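As a small sketch of such a state table, here is a hand-built DFA for the regex A*AB used earlier in this thread. The state numbers and the name `run_dfa` are made up for the example, and missing table entries stand in for an implicit "dead" state:

```python
# Each <state, input> key maps to exactly one next state.
DFA = {
    (0, "A"): 1,  # first A seen
    (1, "A"): 1,  # further A's
    (1, "B"): 2,  # final B
}

def run_dfa(text, start=0, accept=frozenset({2})):
    """Return (end_state, accepted); end_state is None if the DFA got stuck."""
    state = start
    for symbol in text:
        state = DFA.get((state, symbol))
        if state is None:
            return None, False  # no entry: implicit dead state
    return state, state in accept

print(run_dfa("AAB"))  # (2, True)
print(run_dfa("AA"))   # (1, False)
print(run_dfa("BA"))   # (None, False)
```

The same input always produces the same end state, because no step involves a choice.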

When you take an NFA, however, it will, for each input symbol, look up the set of possible result states, and (in theory) randomly, "nondeterministically," select one of them. If there exists a sequence of random selections which leads to one of the success states for that input string, then the NFA is said to succeed for that string. In other words, you are expected to pretend that it magically always selects the right one.
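One way to read "there exists a sequence of selections" in code is to try every choice and succeed if any of them works out. The sketch below does that for a hand-built NFA for A*AB; the table and the name `nfa_accepts` are illustrative assumptions, and a real engine would not necessarily work this way.

```python
# Each <state, input> key maps to a set of possible next states.
NFA = {
    (0, "A"): {0, 1},
    (1, "B"): {2},
}

def nfa_accepts(text, state=0, accept=frozenset({2})):
    """True if some sequence of choices ends in an accept state."""
    if not text:
        return state in accept
    choices = NFA.get((state, text[0]), ())
    # "Magically picking the right one" becomes: does any choice succeed?
    return any(nfa_accepts(text[1:], nxt, accept) for nxt in choices)

print(nfa_accepts("AAB"))  # True
print(nfa_accepts("AAA"))  # False
```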

One early question in computing was whether NFAs were more powerful than DFAs, due to that magic, and the answer turned out to be no, since any NFA can be translated into an equivalent DFA. Their capabilities and limitations are exactly the same.

For those wondering how a real, non-magical NFA engine can "magically" select the right successor state for a given symbol, this page describes the two common approaches.

I find the explanation given in Regular Expressions: The Complete Tutorial by Jan Goyvaerts to be the most usable. See page 7 of this PDF:

https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf

Among other points made on page 7: There are two kinds of regular expression engines: text-directed engines, and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. ...certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines.
