
Building composable directed graphs (Thompson's construction algorithm for scanner generator)

I am currently writing a scanner generator based on Thompson's construction algorithm to turn regular expressions into NFAs. Basically, I need to parse an expression and create a directed graph from it. I usually store my digraphs as adjacency lists, but this time I need to be able to combine existing digraphs into a new digraph very efficiently. I can't afford to copy my adjacency lists every time I read a new character.

I was considering creating a very lightweight NFA struct that wouldn't own its nodes/states.

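#include <vector>

struct State;  // forward declaration: Transition refers to State before it is defined

struct Transition {
  State* next_state;       // non-owning
  char transition_symbol;  // '\0' stands for ε in the functions below
};

struct State {
  std::vector<Transition> transitions;
};

// Non-owning view of a sub-machine: two pointers into the shared pool of states.
struct NFA {
  State* start_state;
  State* accepting_state;
};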

That would allow me to simply reassign pointers to create new NFAs. All my states would be stored in a central location (NFABuilder?). The composition would be done through external functions like so:

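NFA create_trivial_nfa(char symbol) {
  State* start_state = new State();
  State* accepting_state = new State();
  // Transition is an aggregate with no constructor, so build it with braces;
  // emplace_back(accepting_state, symbol) only compiles from C++20 onwards.
  start_state->transitions.push_back(Transition{accepting_state, symbol});
  // Something must own start_state and accepting_state
  return NFA{start_state, accepting_state};
}

NFA concatenate_nfas(NFA&& nfa0, NFA&& nfa1) {
  // '\0' is used as the ε label linking the two sub-machines.
  nfa0.accepting_state->transitions.push_back(Transition{nfa1.start_state, '\0'});
  return NFA{nfa0.start_state, nfa1.accepting_state};
}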

Here, I would use move semantics to make it clear that nfa0 and nfa1 are no longer to be used as standalone NFAs (since I modified their internal states).

Does this approach make sense, or is there a problem I have not yet anticipated? If it does make sense, what should be the owner of all these states? I am also anticipating a padding issue with my transitions. When packed in a vector, a Transition will have a size of 16 bytes instead of 9 (on a 64-bit architecture). Is this something I should worry about, or is it just noise in the grand scheme of things? (This is my first compiler. I am following Engineering a Compiler by Cooper & Torczon.)

The essence of Thompson's construction is that it creates an NFA with the following characteristics:

  1. There are at most 2|R| states, where |R| is the length of the regex.

  2. Every state has either exactly one out-transition labeled with a character or at most two ε transitions. (That is, no state has both a labeled transition and an ε transition.)

The latter fact suggests that representing a state as

struct State {
  std::vector<std::tuple<char, State*>> transitions;
};

(which is a slight abbreviation of your code) is a very high-overhead representation, where the overhead has a lot more to do with the std::vector used to hold exactly one or two transitions than with the padding of a single transition. Moreover, the above representation does not provide a clear technique for representing ε transitions, unless the intention was to reserve one character code for ε (and thereby make it impossible to use that character code in a regular expression).

A more practical representation might be

enum class StateType { EPSILON, IMPORTANT };
struct State {
  StateType type;
  char      label;
  State*    next[2];
};

(That formulation doesn't store the number of transitions in next, on the assumption that we can use a sentinel value to indicate that next[1] doesn't apply. Alternatively, we could just set next[1] = next[0]; in such a case. Remember that it only matters for ε states.)
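To put rough numbers on the overhead comparison, here is a small sketch that places the two layouts side by side. The names VectorState, CompactState and vector_state_bytes are made up for this illustration, and every size is implementation-defined, so treat the figures as typical rather than guaranteed.

```cpp
#include <cstddef>
#include <vector>

struct VectorState;  // forward declaration

struct Transition {
  VectorState* next_state;
  char transition_symbol;
};

// Layout from the question: one std::vector per state.
struct VectorState {
  std::vector<Transition> transitions;
};

enum class StateType { EPSILON, IMPORTANT };

// Layout from the answer: fixed size, no heap allocation per state.
struct CompactState {
  StateType type;
  char label;
  CompactState* next[2];
};

// Bytes used by a vector-based state holding n transitions. This counts the
// vector object plus its heap buffer and ignores allocator bookkeeping and
// spare capacity, so it is a lower bound.
constexpr std::size_t vector_state_bytes(std::size_t n) {
  return sizeof(VectorState) + n * sizeof(Transition);
}
```

On a common 64-bit ABI, sizeof(Transition) is 16 and sizeof(std::vector<Transition>) is typically 24, so a state with a single transition costs at least 40 bytes plus a heap allocation, against 24 bytes for CompactState. The 7 bytes of padding per transition are the smaller part of that difference.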

Moreover, since we know there are no more than 2|R| State objects in the NFA, we could replace the State* pointers with small integers. That will set some sort of limit on the size of a regular expression which could be handled, but it's pretty uncommon to encounter gigabyte regexes. Using consecutive integers rather than pointers will also make certain graph algorithms more manageable, in particular the transitive closure algorithm which is fundamental to the subset construction.
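A minimal sketch of the integer-index idea, assuming a 32-bit StateId, a NONE sentinel for unused slots, and an NFABuilder that owns every state in one std::vector (all of these names and choices are assumptions made for the example, not part of the original code). It also answers the ownership question in passing: the vector is the owner, and an NFA is just a pair of indices. The ε-closure at the end shows why dense indices are convenient, since a plain std::vector<bool> can serve as the visited set.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using StateId = std::uint32_t;
constexpr StateId NONE = UINT32_MAX;  // sentinel: no transition in this slot

enum class StateType : std::uint8_t { EPSILON, IMPORTANT };

struct State {
  StateType type;
  char label;       // meaningful only for IMPORTANT states
  StateId next[2];  // IMPORTANT uses next[0]; EPSILON may use both
};

struct NFA { StateId start; StateId accept; };

// Owns every state; sub-machines are pairs of indices into `states`.
struct NFABuilder {
  std::vector<State> states;

  StateId add(StateType type, char label = 0) {
    states.push_back(State{type, label, {NONE, NONE}});
    return static_cast<StateId>(states.size() - 1);
  }

  NFA literal(char c) {
    StateId s = add(StateType::IMPORTANT, c);
    StateId f = add(StateType::EPSILON);
    states[s].next[0] = f;
    return {s, f};
  }

  NFA concatenate(NFA a, NFA b) {
    assert(states[a.accept].next[0] == NONE);  // accept has no out-edge yet
    states[a.accept].next[0] = b.start;        // ε link, no copying
    return {a.start, b.accept};
  }

  NFA alternate(NFA a, NFA b) {
    StateId s = add(StateType::EPSILON);
    StateId f = add(StateType::EPSILON);
    states[s].next[0] = a.start;
    states[s].next[1] = b.start;
    states[a.accept].next[0] = f;
    states[b.accept].next[0] = f;
    return {s, f};
  }
};

// ε-closure of one state: everything reachable through ε transitions only.
std::vector<bool> epsilon_closure(const NFABuilder& b, StateId from) {
  std::vector<bool> seen(b.states.size(), false);
  std::vector<StateId> stack{from};
  seen[from] = true;
  while (!stack.empty()) {
    StateId s = stack.back();
    stack.pop_back();
    if (b.states[s].type != StateType::EPSILON) continue;  // labeled edge: stop
    for (StateId t : b.states[s].next) {
      if (t != NONE && !seen[t]) {
        seen[t] = true;
        stack.push_back(t);
      }
    }
  }
  return seen;
}
```

Because indices stay valid when the vector reallocates, composing two sub-machines never copies or relinks existing states; it only writes one or two integers.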

Another interesting fact about the NFA constructed by the Thompson algorithm is that the in-degree of States is also limited to 2 (and again, if there are two in-transitions, both will be ε transitions). This allows us to avoid prematurely creating the sub-machines' final states (which will not be needed if the sub-machine is the left-hand argument to concatenation). Instead, we can represent the sub-machine with just three indices: the index of the start state, and the indices of the at most two internal states which will have transitions to the final state once it is added.
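One way the three-index representation could look is sketched below. This is an illustration under assumptions, not Thompson's actual code: Fragment, Builder, the PENDING sentinel and the matches helper are all invented for the example. A slot marked PENDING belongs to a state that is waiting for a final state which does not exist yet; concatenation fills those slots with the right-hand start state, so the left-hand final state is never created.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using StateId = std::uint32_t;
constexpr StateId NONE = UINT32_MAX;         // slot is unused
constexpr StateId PENDING = UINT32_MAX - 1;  // slot waits for a future state

enum class StateType : std::uint8_t { EPSILON, IMPORTANT };
struct State { StateType type; char label; StateId next[2]; };

// A sub-machine without a final state: its start plus the (at most two)
// states that still hold a PENDING slot.
struct Fragment { StateId start; StateId out[2]; };

struct Builder {
  std::vector<State> states;

  StateId add(StateType t, char label, StateId n0, StateId n1) {
    states.push_back(State{t, label, {n0, n1}});
    return static_cast<StateId>(states.size() - 1);
  }
  // Point every PENDING slot of the fragment at `target`.
  void patch(const Fragment& f, StateId target) {
    for (StateId o : f.out) {
      if (o == NONE) continue;
      for (StateId& slot : states[o].next)
        if (slot == PENDING) slot = target;
    }
  }
  Fragment literal(char c) {
    StateId s = add(StateType::IMPORTANT, c, PENDING, NONE);
    return {s, {s, NONE}};
  }
  Fragment concatenate(const Fragment& a, const Fragment& b) {
    patch(a, b.start);  // a's final state is never materialised
    return {a.start, {b.out[0], b.out[1]}};
  }
  Fragment alternate(const Fragment& a, const Fragment& b) {
    StateId s = add(StateType::EPSILON, 0, a.start, b.start);
    StateId fa = add(StateType::EPSILON, 0, PENDING, NONE);  // a's final state
    StateId fb = add(StateType::EPSILON, 0, PENDING, NONE);  // b's final state
    patch(a, fa);
    patch(b, fb);
    return {s, {fa, fb}};
  }
  Fragment star(const Fragment& a) {
    StateId fa = add(StateType::EPSILON, 0, a.start, PENDING);  // loop or leave
    patch(a, fa);
    StateId s = add(StateType::EPSILON, 0, a.start, PENDING);   // enter or skip
    return {s, {s, fa}};
  }
  // Add the one real final state of the whole machine and return its index.
  StateId finish(const Fragment& f) {
    StateId fin = add(StateType::EPSILON, 0, NONE, NONE);
    patch(f, fin);
    return fin;
  }
};

// Simulate the finished NFA on `input`; `fin` is the index from finish().
bool matches(const Builder& b, StateId start, StateId fin, const std::string& input) {
  assert(start < b.states.size() && fin < b.states.size());
  auto closure = [&](std::vector<StateId> work) {
    std::vector<bool> seen(b.states.size(), false);
    for (StateId s : work) seen[s] = true;
    while (!work.empty()) {
      StateId s = work.back();
      work.pop_back();
      if (b.states[s].type != StateType::EPSILON) continue;
      for (StateId t : b.states[s].next)
        if (t < PENDING && !seen[t]) { seen[t] = true; work.push_back(t); }
    }
    return seen;
  };
  std::vector<bool> current = closure({start});
  for (char c : input) {
    std::vector<StateId> moved;
    for (StateId s = 0; s < b.states.size(); ++s)
      if (current[s] && b.states[s].type == StateType::IMPORTANT && b.states[s].label == c)
        moved.push_back(b.states[s].next[0]);
    current = closure(moved);
  }
  return current[fin];
}
```

In this sketch a literal costs one state instead of two, and the regex (a|b)*c ends up with 9 states in total. The simulation helper is only there to check the wiring; a scanner generator would feed the same table into the subset construction instead.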

I think the above is reasonably close to Thompson's original implementation, although I'm sure he used a lot more optimisation tricks. But it's worth reading Section 3.9 of Aho, Lam, Sethi & Ullman (the "Dragon Book"), which describes ways of optimising construction of state machines.

Independently of the theoretical reductions, it's worth noting that aside from the trie of keyword patterns, the majority of state transitions in lexical analysis involve character sets rather than individual characters, and often these sets are quite large, particularly if the unit of lexical analysis is a Unicode codepoint rather than an ASCII character. Using character sets instead of characters does complicate the subset construction algorithm, but it will normally dramatically reduce the state count.
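As a rough illustration of labeling a transition with a character set, the sketch below stores a set as sorted, non-overlapping codepoint ranges and tests membership by binary search. Range and CharSet are hypothetical names, and this is only one of several plausible encodings (bitmaps or equivalence classes are common alternatives). It does not attempt the range splitting that the subset construction would additionally need.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Range { char32_t lo, hi; };  // inclusive on both ends

class CharSet {
 public:
  // Ranges may arrive in any order and may overlap; they are normalised
  // into sorted, disjoint, non-adjacent ranges.
  explicit CharSet(std::vector<Range> ranges) {
    std::sort(ranges.begin(), ranges.end(),
              [](const Range& a, const Range& b) { return a.lo < b.lo; });
    for (const Range& r : ranges) {
      assert(r.lo <= r.hi);
      if (!ranges_.empty() &&
          (ranges_.back().hi >= r.lo || ranges_.back().hi + 1 == r.lo)) {
        // Overlapping or adjacent: extend the previous range.
        ranges_.back().hi = std::max(ranges_.back().hi, r.hi);
      } else {
        ranges_.push_back(r);
      }
    }
  }

  bool contains(char32_t c) const {
    // First range whose lower bound is greater than c; the candidate is
    // the range just before it.
    auto it = std::upper_bound(
        ranges_.begin(), ranges_.end(), c,
        [](char32_t v, const Range& r) { return v < r.lo; });
    return it != ranges_.begin() && c <= (it - 1)->hi;
  }

  std::size_t range_count() const { return ranges_.size(); }

 private:
  std::vector<Range> ranges_;
};
```

With such a label, an identifier-character transition like [A-Za-z0-9_] is a single edge holding four ranges rather than 63 separate single-character edges, which is where the reduction in state count comes from.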
