
Efficient algorithm for converting a character set into an NFA/DFA

I'm currently working on a scanner generator. The generator itself already works fine, but when character classes are used the algorithm gets very slow.

The scanner generator produces a scanner for UTF8-encoded files. The full range of characters (0x000000 to 0x10ffff) should be supported.

If I use large character sets, like the any operator '.' or the Unicode property {L}, the NFA (and also the DFA) contains a lot of states (> 10000). So converting the NFA to a DFA and creating the minimal DFA takes a long time (even if the resulting minimal DFA contains only a few states).

Here's my current implementation for creating the character-set part of the NFA:

void CreateNfaPart(int startStateIndex, int endStateIndex, Set<int> characters)
{
    transitions[startStateIndex] = CreateEmptyTransitionsArray();
    foreach (int character in characters) {
        // get the utf8 encoded bytes for the character
        byte[] encoded = EncodingHelper.EncodeCharacter(character);
        int tStartStateIndex = startStateIndex;

        // walk (or create) intermediate states for all but the last byte
        for (int i = 0; i < encoded.Length - 1; i++) {
            int tEndStateIndex = transitions[tStartStateIndex][encoded[i]];
            if (tEndStateIndex == -1) {
                tEndStateIndex = CreateState();
                transitions[tEndStateIndex] = CreateEmptyTransitionsArray();
            }
            transitions[tStartStateIndex][encoded[i]] = tEndStateIndex;
            tStartStateIndex = tEndStateIndex;
        }

        // the last byte leads into the shared end state
        transitions[tStartStateIndex][encoded[encoded.Length - 1]] = endStateIndex;
    }
}

Does anyone know how to implement this function much more efficiently, so that only the necessary states are created?

EDIT:

To be more specific, I need a function like:

List<Set<byte>[]> Convert(Set<int> characters)
{
     ???????
}

A helper function that converts a character (int) to its UTF8-encoded byte[] is defined as:

byte[] EncodeCharacter(int character)
{ ... }
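
For reference, a minimal sketch of what such a helper might do (this is just the standard UTF8 encoding of a code point, not the actual EncodingHelper implementation; surrogate-range validation is omitted):

// Hypothetical stand-in for EncodingHelper.EncodeCharacter: encodes one
// Unicode code point (0x000000..0x10FFFF) into its UTF8 byte sequence.
static byte[] EncodeCharacter(int character)
{
    if (character < 0x80)
        return new[] { (byte)character };
    if (character < 0x800)
        return new[] {
            (byte)(0xC0 | (character >> 6)),
            (byte)(0x80 | (character & 0x3F)) };
    if (character < 0x10000)
        return new[] {
            (byte)(0xE0 | (character >> 12)),
            (byte)(0x80 | ((character >> 6) & 0x3F)),
            (byte)(0x80 | (character & 0x3F)) };
    return new[] {
        (byte)(0xF0 | (character >> 18)),
        (byte)(0x80 | ((character >> 12) & 0x3F)),
        (byte)(0x80 | ((character >> 6) & 0x3F)),
        (byte)(0x80 | (character & 0x3F)) };
}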

I'll clarify what I think you're asking for: to union a set of Unicode codepoints such that you produce a state-minimal DFA where transitions represent UTF8-encoded sequences for those codepoints.

When you say "more efficiently", that could apply to runtime, to memory usage, or to the compactness of the end result. The usual meaning of "minimal" in finite automata refers to using the fewest states to describe any given language, which is what you're getting at by "create only the necessary states".

Every finite automaton has exactly one equivalent state-minimal DFA (see the Myhill-Nerode theorem [1], or Hopcroft & Ullman [2]). For your purposes, we can construct this minimal DFA directly using the Aho-Corasick algorithm [3].

To do this, we need a mapping from Unicode codepoints to their corresponding UTF8 encodings. There's no need to store all of these UTF8 byte sequences in advance; they can be encoded on the fly. The UTF8 encoding algorithm is well documented and I won't repeat it here.

Aho-Corasick works by first constructing a trie. In your case this would be a trie of each UTF8 sequence, added in turn. That trie is then annotated with transitions, turning it into a DAG, per the rest of the algorithm. There's a nice overview of the algorithm here, but I suggest reading the paper itself.

Pseudocode for this approach:

trie = empty
foreach codepoint in input_set:
   bytes[] = utf8_encode(codepoint)
   trie_add_key(bytes)
dfa = add_failure_edges(trie) # per the rest of AC
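
As a rough C# sketch of the trie-building step (only the trie construction; the failure-edge annotation follows the Aho-Corasick paper and is omitted, and the type and method names here are illustrative):

using System.Collections.Generic;

// Byte trie that shares common UTF8 prefixes among the encoded code points.
class TrieNode
{
    public readonly Dictionary<byte, TrieNode> Children = new Dictionary<byte, TrieNode>();
    public bool IsTerminal; // marks the end of one encoded code point
}

static TrieNode BuildTrie(IEnumerable<int> codepoints)
{
    var root = new TrieNode();
    foreach (int codepoint in codepoints)
    {
        // utf8_encode from the pseudocode, e.g. EncodingHelper.EncodeCharacter
        byte[] bytes = EncodingHelper.EncodeCharacter(codepoint);
        TrieNode node = root;
        foreach (byte b in bytes)
        {
            if (!node.Children.TryGetValue(b, out TrieNode next))
            {
                next = new TrieNode();
                node.Children[b] = next;
            }
            node = next;
        }
        node.IsTerminal = true; // a complete UTF8 sequence ends here
    }
    return root;
}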

This approach (forming a trie of UTF8-encoded sequences, then Aho-Corasick, then rendering out the DFA) is the approach taken in my regexp and finite state machine libraries, where I do exactly this to construct Unicode character classes.

Other approaches (mentioned in other answers to this question) include working with codepoints and expressing ranges of codepoints, rather than spelling out every byte sequence.

[1] Nerode, Anil (1958). "Linear Automaton Transformations". Proceedings of the AMS, 9. JSTOR 2033204.
[2] Hopcroft & Ullman (1979), Section 3.4, Theorem 3.10, p. 67.
[3] Aho, Alfred V.; Corasick, Margaret J. (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM, 18 (6): 333–340.

There are a number of ways to handle it. They all boil down to treating sets of characters at a time in the data structures, instead of ever enumerating the entire alphabet. It's also how you make scanners for Unicode in a reasonable amount of memory.

You have many choices for how to represent and process sets of characters. I'm presently working with a solution that keeps an ordered list of boundary conditions and corresponding target states. You can process operations on these lists much faster than you could if you had to scan the entire alphabet at each juncture. In fact, it's fast enough that it runs in Python at acceptable speed.
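
To illustrate the idea (a sketch only; the field and method names are made up, not taken from the answerer's code), the transitions of a single state might be stored like this:

using System;

// One state's transitions, stored as sorted, non-overlapping code point ranges
// instead of a full 0x110000-entry table: range i covers [Lower[i], Upper[i]]
// and leads to Target[i]; everything else is a dead transition (-1).
class RangeTransitions
{
    public int[] Lower;   // sorted range starts
    public int[] Upper;   // matching range ends
    public int[] Target;  // target state per range

    public int Next(int codepoint)
    {
        int idx = Array.BinarySearch(Lower, codepoint);
        if (idx < 0) idx = ~idx - 1;  // last range starting at or below codepoint
        return (idx >= 0 && codepoint <= Upper[idx]) ? Target[idx] : -1;
    }
}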

Have a look at what regular expression libraries like Google RE2 and TRE are doing.

I had the same problem with my scanner generator, so I came up with the idea of replacing intervals with their IDs, which are determined using an interval tree. For instance, the a..z range in a DFA can be represented as: 97, 98, 99, ..., 122; instead, I represent ranges as [97, 122], then build an interval tree structure out of them, so in the end they are represented as IDs that refer into the interval tree. Given the following RE: a..z+, we end up with such a DFA:

0 -> a -> 1
0 -> b -> 1
0 -> c -> 1
0 -> ... -> 1
0 -> z -> 1

1 -> a -> 1
1 -> b -> 1
1 -> c -> 1
1 -> ... -> 1
1 -> z -> 1
1 -> E -> ACCEPT

Now compress intervals:

0 -> a..z -> 1

1 -> a..z -> 1
1 -> E -> ACCEPT

Extract all intervals from your DFA and build an interval tree out of them:

{
    "left": null,
    "middle": {
        id: 0,
        interval: [a, z],
    },
    "right": null
}

Replace the actual intervals with their IDs:

0 -> 0 -> 1
1 -> 0 -> 1
1 -> E -> ACCEPT
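
A small C# sketch of that lookup (using a sorted list of non-overlapping intervals with binary search in place of a full interval tree; the names are illustrative):

using System;

// Maps a code point to the id of the interval containing it, so the DFA can
// use interval ids as its (much smaller) input alphabet.
class IntervalAlphabet
{
    private readonly int[] starts;  // sorted interval starts, e.g. { 97 }
    private readonly int[] ends;    // matching interval ends,  e.g. { 122 }

    public IntervalAlphabet(int[] starts, int[] ends)
    {
        this.starts = starts;
        this.ends = ends;
    }

    // Returns the interval id, or -1 if the code point lies in no interval.
    public int IdOf(int codepoint)
    {
        int idx = Array.BinarySearch(starts, codepoint);
        if (idx < 0) idx = ~idx - 1;
        return (idx >= 0 && codepoint <= ends[idx]) ? idx : -1;
    }
}

With the single interval [97, 122] from the example, every input 'a'..'z' maps to id 0, so each state needs only one transition entry for the whole range.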

In this library (http://mtimmerm.github.io/dfalex/) I do it by putting a range of consecutive characters on each transition, instead of single characters. This is carried through all the steps of NFA construction, NFA->DFA conversion, DFA minimization, and optimization.

It's quite compact, but it adds code complexity to every step.
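
For illustration, a range-labeled transition can be as simple as the following sketch (not the library's actual types):

// An automaton edge that matches every character in [Min, Max] rather than a
// single character; '.' over the full Unicode range then becomes one edge per
// state instead of 1,114,112 edges.
struct RangeTransition
{
    public int Min;     // inclusive lower bound of the character range
    public int Max;     // inclusive upper bound
    public int Target;  // index of the target state

    public bool Matches(int codepoint) => Min <= codepoint && codepoint <= Max;
}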
