
Building composable directed graphs (Thompson's construction algorithm for scanner generator)

I am currently writing a scanner generator based on Thompson's construction algorithm to turn regular expressions into NFAs. Basically, I need to parse an expression and create a directed graph from it. I usually store my diGraphs as adjacency lists, but this time, I need to be able to combine existing diGraphs into a new diGraph very efficiently. I can't afford to copy my adjacency lists every time I read a new character.

I was considering creating a very lightweight NFA struct that wouldn't own its own nodes/states.

#include <vector>

struct State;  // forward declaration: a Transition points at a State defined below

struct Transition {
  State* next_state;       // target state of this transition
  char transition_symbol;  // '\0' is used below as a stand-in for ε
};

struct State {
  std::vector<Transition> transitions;
};

struct NFA {
  State* start_state;
  State* accepting_state;
};

That would allow me to simply reassign pointers to create new NFAs. All my states would be stored in a central location (NFABuilder?). The composition would be done through external functions like so:

NFA create_trivial_nfa(char symbol) {
  State* start_state = new State();
  State* accepting_state = new State();
  start_state->transitions.push_back({accepting_state, symbol});
  // Something must own start_state and accepting_state
  return NFA{start_state, accepting_state};
}

NFA concatenate_nfas(NFA&& nfa0, NFA&& nfa1) {
  nfa0.accepting_state->transitions.push_back({nfa1.start_state, '\0'});  // '\0' marks an ε transition
  return NFA{nfa0.start_state, nfa1.accepting_state};
}

Here, I would use move semantics to make it clear that nfa0 and nfa1 are no longer to be used as standalone NFAs (since I modified their internal states).

Does this approach make sense, or is there a problem I have not yet anticipated? If it does make sense, what should be the owner of all these states? I am also anticipating a padding issue with my transitions. When packed in a vector, a Transition will have a size of 16 bytes instead of 9 (on a 64-bit architecture). Is this something I should worry about, or is it just noise in the grand scheme of things? (This is my first compiler; I am following Engineering a Compiler by Cooper & Torczon.)
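
For illustration, this is roughly what I have in mind for that central location (just a sketch; I picked std::deque only because appending new states does not invalidate pointers to existing ones):

#include <deque>

class NFABuilder {
 public:
  State* new_state() {
    states_.emplace_back();   // std::deque keeps pointers to existing elements valid
    return &states_.back();
  }

 private:
  std::deque<State> states_;  // owns every state for the lifetime of the build
};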

The essence of Thompson's construction is that it creates an NFA with the following characteristics:

  1. There are at most 2|R| states, where |R| is the length of the regex.

  2. Every state has either exactly one outgoing transition labeled with a character, or at most two ε transitions. (That is, no state has both a labeled transition and an ε transition.)

The latter fact suggests that representing a state as

struct State {
  std::vector<std::tuple<char, State*>> transitions;
};

(which is a slight abbreviation of your code) is a very high-overhead representation, and the cost has much more to do with the std::vector used to hold exactly one or two transitions than with the padding of a single transition. Moreover, the above representation does not provide a clear technique for representing ε transitions, unless the intention was to reserve one character code for ε (and thereby make it impossible to use that character code in a regular expression).

A more practical representation might be

enum class StateType { EPSILON, IMPORTANT };
struct State {
  StateType type;     // EPSILON: up to two ε transitions; IMPORTANT: one labeled transition
  char      label;    // only meaningful for IMPORTANT states
  State*    next[2];  // successor states
};

(That formulation doesn't store the number of transitions in next, on the assumption that we can use a sentinel value to indicate that next[1] doesn't apply. Alternatively, we could just set next[1] = next[0]; in such a case. Remember that it only matters for ε states.)

Moreover, since we know there are no more than 2|R| State objects in the NFA, we could replace the State* pointers with small integers. That will set some limit on the size of regular expression that can be handled, but it's pretty uncommon to encounter gigabyte regexes. Using consecutive integers rather than pointers will also make certain graph algorithms more manageable, in particular the transitive closure algorithm, which is fundamental to the subset construction.
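
A minimal sketch of that index-based variant (StateId, StatePool and kNone are names I've chosen for illustration, not part of any particular implementation):

#include <cstdint>
#include <vector>

using StateId = std::uint32_t;         // small integer instead of State*
constexpr StateId kNone = UINT32_MAX;  // sentinel: "no successor"

enum class StateType { EPSILON, IMPORTANT };

struct State {
  StateType type;
  char      label;    // only used when type == IMPORTANT
  StateId   next[2];  // indices into the pool below
};

// All states live in one contiguous pool; an NFA is then just a pair of indices.
struct StatePool {
  std::vector<State> states;

  StateId add(StateType type, char label = '\0') {
    states.push_back({type, label, {kNone, kNone}});
    return static_cast<StateId>(states.size() - 1);
  }
};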

Another interesting fact about the NFA constructed by the Thompson algorithm is that the in-degree of States is also limited to 2 (and again, if there are two in-transitions, both will be ε transitions). This allows us to avoid prematurely creating the sub-machines' final states (which will not be needed if the sub-machine is the left-hand argument to concatenation). Instead, we can represent the sub-machine with just three indices: the index of the start state, and the indices of the at most two internal states which will have transitions to the final state once it is added.
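
For illustration, building on the indexed sketch above, such a sub-machine might be a start index plus up to two dangling indices that get patched when the next piece is attached (Fragment and concatenate here are only a sketch of the idea, not a definitive implementation):

// A sub-machine whose final state has not been materialized yet: we only
// remember which states still need a transition to the eventual final state.
struct Fragment {
  StateId start;
  StateId dangling[2];   // states with an unfilled next[] slot (kNone if unused)
};

// Concatenation: point the left fragment's dangling transitions at the right
// fragment's start state, so the left sub-machine's final state is never created.
Fragment concatenate(StatePool& pool, Fragment a, Fragment b) {
  for (StateId s : a.dangling) {
    if (s == kNone) continue;
    State& st = pool.states[s];
    if (st.next[0] == kNone) st.next[0] = b.start;
    else                     st.next[1] = b.start;
  }
  return Fragment{a.start, {b.dangling[0], b.dangling[1]}};
}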

I think the above is reasonably close to Thompson's original implementation, although I'm sure he used a lot more optimisation tricks. But it's worth reading Section 3.9 of Aho, Lam, Sethi & Ullman (the "Dragon Book"), which describes ways of optimising construction of state machines.

Independent of the theoretical reductions, it's worth noting that aside from the trie of keyword patterns, the majority of state transitions in lexical analysis involve character sets rather than individual characters, and often these sets are quite large, particularly if the unit of lexical analysis is a Unicode codepoint rather than an ASCII character. Using character sets instead of characters does complicate the subset construction algorithm, but it will normally dramatically reduce the state count.
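
As a rough illustration of what a set-labeled transition could look like with byte-oriented input (CharSet and make_range are just illustrative names):

#include <bitset>

// A transition labeled with a set of characters rather than a single character,
// e.g. [A-Za-z_] for the start of an identifier.
using CharSet = std::bitset<256>;

CharSet make_range(unsigned char lo, unsigned char hi) {
  CharSet s;
  for (unsigned c = lo; c <= hi; ++c) s.set(c);
  return s;
}

// One transition covers every identifier-start character, instead of
// 53 separate single-character transitions.
const CharSet ident_start =
    make_range('a', 'z') | make_range('A', 'Z') | CharSet().set('_');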
