Parallel regex matching with NFA vs DFA? Which one is faster?

Original post: 2016-06-17 23:11:44. Tags: regex, parallel-processing, computer-science, dfa, nfa

I was reading about NFAs and DFAs, and it seems that the most popular and fastest way of implementing a regex matcher is to create an NFA from the regex, convert it to a DFA, minimize that DFA, implement it in any language, and use it.

A DFA is a better choice than an NFA because it has only one transition per input symbol, while an NFA can have many. Thus, a DFA has only one path to follow, while an NFA has many.

But this is the part I do not understand. Why do we have to keep track of NFA states and go back to them, which slows us down? Can't we split into different threads when an input leads to more than one state, and compute each path in parallel? Wouldn't that be faster than a DFA? Or am I missing something?

1 answer

Generally speaking, a DFA is faster, but an NFA is more compact. The size of the NFA is proportional to the size of the regular expression. (Informal proof: each operator node in the regular expression's syntax tree adds only a constant number of new nodes to the NFA graph.) Because each DFA state corresponds to a subset of the NFA states, there are cases when the DFA can be quite large. In the worst case, the size of a DFA is exponential in the size of the regular expression. An example of this is an expression of the form (a|b)*a(a|b)(a|b)...(a|b), where the a is followed by N-1 (a|b) units: it translates to a DFA with 2**N states. The DFA has to remember which of the last N characters were a, so it contains a unique state for every combination of a and b over those N positions. A degenerate DFA could exceed the size of the CPU cache in cases where the data structures required to simulate the equivalent NFA fit into cache.
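To make the size difference concrete, here is a minimal sketch that counts states, assuming the textbook "N-th symbol from the end is a" language, (a|b)*a(a|b)^(N-1). The NFA is written out by hand rather than compiled from a regex, and the names build_nfa and count_dfa_states are made up for this illustration.

```python
from collections import deque


def build_nfa(n):
    """Hand-built NFA for (a|b)*a(a|b)^(n-1).

    States are 0..n, the start state is 0 and the only accepting state is n.
    Returns the transition table as {(state, symbol): set_of_next_states}.
    """
    delta = {
        (0, "a"): {0, 1},  # stay in the loop, or guess this 'a' is n-th from the end
        (0, "b"): {0},
    }
    for i in range(1, n):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta


def count_dfa_states(n):
    """Run the subset construction and count the reachable DFA states."""
    delta = build_nfa(n)
    start = frozenset({0})
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for symbol in "ab":
            target = frozenset(
                t for s in current for t in delta.get((s, symbol), ())
            )
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return len(seen)


for n in range(1, 9):
    # NFA states grow linearly (n + 1), reachable DFA states double each step.
    print(n, n + 1, count_dfa_states(n))
```

The second column is the NFA state count and the third is the number of DFA states the subset construction reaches. For this language minimization does not help, because the DFA really does need to distinguish every combination of the last N symbols.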

There is also somewhat more up-front cost to a DFA, due to the extra construction steps. So tradeoffs apply: will enough data be processed by the matcher to justify building a DFA rather than just running the NFA simulator?

An NFA simulation can entirely avoid touching parts of the regular expression which don't apply to an input at all. For instance, suppose a regex has the form R1|R2, where R1 is very simple and small, and R2 is a huge, complicated beast. Suppose the inputs usually just match R1, and R2 hardly ever applies (no part of it at all, say, because of a mismatching prefix). This influences the tradeoff: compiling to a DFA means everything is compiled, both the simple R1 part and the monstrous R2 part.
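A rough sketch of that effect, assuming a toy NFA in which every alternative is a plain literal string (a long literal stands in for the "monstrous" R2). The simulator keeps a set of active states and advances all of them on each input symbol, so no backtracking or threads are involved. The helper names build_alternation_nfa and simulate are invented for this example, and the match is a whole-string match.

```python
def build_alternation_nfa(words):
    """Epsilon-free NFA for word1|word2|...: one chain of states per word.

    State 0 is the shared start state.
    Returns (delta, accepting, total_number_of_states).
    """
    delta = {}
    accepting = set()
    next_state = 1
    for word in words:
        prev = 0
        for ch in word:
            delta.setdefault((prev, ch), set()).add(next_state)
            prev = next_state
            next_state += 1
        accepting.add(prev)
    return delta, accepting, next_state


def simulate(delta, accepting, text):
    """Advance the whole set of active states one input symbol at a time.

    Returns (matched, touched), where touched is every state that was
    ever active during the run.
    """
    active = {0}
    touched = {0}
    for ch in text:
        active = {t for s in active for t in delta.get((s, ch), ())}
        touched |= active
        if not active:
            break  # no live states left, the input cannot match
    return bool(active & accepting), touched


r1 = "ab"
r2 = "xyz" * 4  # stand-in for a large R2 that shares no prefix with R1
delta, accepting, total = build_alternation_nfa([r1, r2])
matched, touched = simulate(delta, accepting, "ab")
print(matched, len(touched), total)
```

On the input "ab" the simulation only ever activates the start state and the two states of the R1 chain; the states belonging to R2 are never visited. A DFA compiler would have had to process all of them up front.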

Lastly, an implementation doesn't have to be strictly NFA or DFA. An NFA simulator can cache the state sets which it computes. Those cached state sets are equivalent to DFA states and provide a similar benefit to compiling to a DFA. You can think of this as "JIT for the NFA". The cache can be trimmed to some fixed size and subjected to a replacement policy, so that expressions whose complete DFAs would be large can be handled in less memory (and nearly as fast, if the data shows good locality of reference in the cache).
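One way this caching idea might look, as a hedged sketch rather than how any particular regex engine does it: the state-set transition function is memoized with Python's functools.lru_cache, which gives a fixed-size cache with least-recently-used replacement. The NFA below is hand-written for (a|b)*a(a|b)(a|b), and make_matcher is a hypothetical helper name.

```python
from functools import lru_cache


def make_matcher(delta, start, accepting, cache_size=64):
    """Build a whole-string matcher that simulates the NFA lazily.

    Each frozenset of NFA states plays the role of one DFA state. Transitions
    between them are computed on first use and kept in a bounded LRU cache.
    """

    @lru_cache(maxsize=cache_size)
    def step(states, symbol):
        return frozenset(t for s in states for t in delta.get((s, symbol), ()))

    def matches(text):
        states = frozenset({start})
        for symbol in text:
            states = step(states, symbol)
            if not states:
                return False  # dead end: no NFA state survives
        return bool(states & accepting)

    matches.cache_info = step.cache_info  # expose hit/miss counters
    return matches


# NFA for (a|b)*a(a|b)(a|b): "the third symbol from the end is a".
delta = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "a"): {2}, (1, "b"): {2},
    (2, "a"): {3}, (2, "b"): {3},
}
matches = make_matcher(delta, start=0, accepting=frozenset({3}), cache_size=8)

print(matches("abaab"))  # third symbol from the end is 'a'
print(matches("abbbb"))  # third symbol from the end is 'b'
print(matches.cache_info())
```

Repeated or similar inputs hit the cache and cost one dictionary lookup per symbol, much like a DFA table lookup, while cache_size caps memory use regardless of how large the full DFA would be.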