
Communication between lexer and parser

Every time I write a simple lexer and parser, I stumble upon the same question: how should the lexer and the parser communicate? I see four different approaches:

  1. The lexer eagerly converts the entire input string into a vector of tokens. Once this is done, the vector is fed to the parser which converts it into a tree. This is by far the simplest solution to implement, but since all tokens are stored in memory, it wastes a lot of space.

  2. Each time the lexer finds a token, it invokes a function on the parser, passing the current token. In my experience, this only works if the parser can naturally be implemented as a state machine, as LALR parsers can be. By contrast, I don't think it would work at all for recursive descent parsers.

  3. Each time the parser needs a token, it asks the lexer for the next one. This is very easy to implement in C# thanks to the yield keyword, but quite hard in C++, which doesn't have it. (A minimal sketch of this pull model follows the list.)

  4. The lexer and parser communicate through an asynchronous queue. This is commonly known as "producer/consumer", and it should simplify the communication between the lexer and the parser a lot. Does it also outperform the other solutions on multicore machines? Or is lexing too trivial?
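
To make option 3 concrete, here is a minimal, hypothetical sketch of the pull model in C++; the token kinds and lexing rules are simplified placeholders rather than anything taken from a real compiler:

    // Option 3 (pull model): the parser asks the lexer for one token at a time.
    #include <cctype>
    #include <cstddef>
    #include <string>

    enum class TokenKind { Number, Plus, End, Error };

    struct Token {
        TokenKind   kind;
        std::string text;
    };

    class Lexer {
    public:
        explicit Lexer(std::string source) : src_(std::move(source)) {}

        // Called by the parser whenever it needs the next token.
        Token next() {
            while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_])))
                ++pos_;
            if (pos_ == src_.size())
                return {TokenKind::End, ""};
            char c = src_[pos_];
            if (c == '+') { ++pos_; return {TokenKind::Plus, "+"}; }
            if (std::isdigit(static_cast<unsigned char>(c))) {
                std::size_t start = pos_;
                while (pos_ < src_.size() && std::isdigit(static_cast<unsigned char>(src_[pos_])))
                    ++pos_;
                return {TokenKind::Number, src_.substr(start, pos_ - start)};
            }
            ++pos_;                                   // skip an unexpected character
            return {TokenKind::Error, std::string(1, c)};
        }

    private:
        std::string src_;
        std::size_t pos_ = 0;
    };

    // A recursive descent parser holds a Lexer and calls next() on demand,
    // so no token vector, queue, or coroutine machinery is needed.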

Is my analysis sound? Are there other approaches I haven't thought of? What is used in real-world compilers? It would be really cool if compiler writers like Eric Lippert could shed some light on this issue.

While I wouldn't classify much of the above as incorrect, I do believe several items are misleading.

  1. Lexing an entire input before running a parser has many advantages over other options. Implementations vary, but in general the memory required for this operation is not a problem, especially when you consider the type of information that you'd like to have available for reporting compilation errors.

    • Benefits
      • Potentially more information available during error reporting.
      • Languages written in a way that allows lexing to occur before parsing are easier to specify and write compilers for.
    • Drawbacks
      • Some languages require context-sensitive lexers that simply cannot operate before the parsing phase.

    Language implementation note: This is my preferred strategy, as it results in separable code and is best suited for translation to implementing an IDE for the language.

    Parser implementation note: I experimented with ANTLR v3 regarding memory overhead with this strategy. The C target uses over 130 bytes per token, and the Java target uses around 44 bytes per token. With a modified C# target, I showed it's possible to fully represent the tokenized input with only 8 bytes per token, making this strategy practical even for quite large source files (one possible 8-byte layout is sketched after this list).

    Language design note: I encourage people designing a new language to do so in a way that allows this parsing strategy, whether or not they end up choosing it for their reference compiler.

  2. It appears you've described a "push" version of what I generally see described as a "pull" parser, which is what you have in #3. My work emphasis has always been on LL parsing, so this wasn't really an option for me. I would be surprised if there are benefits to this over #3, but cannot rule them out.

  3. The most misleading part of this is the statement about C++. Proper use of iterators in C++ makes it exceptionally well suited to this type of behavior.

  4. A queue seems like a rehash of #3 with a middleman. While abstracting independent operations has many advantages in areas like modular software development, a lexer/parser pair for a distributable product offering is highly performance-sensitive, and this type of abstraction removes the ability to do certain types of optimization regarding data structure and memory layout. I would encourage the use of option #3 over this.

    As an additional note on multi-core parsing: The initial lexer/parser phases of compilation for a single compilation unit generally cannot be parallelized, nor do they need to be considering how easy it is to simply run parallel compilation tasks on different compilation units (eg one lexer/parser operation on each source file, parallelizing across the source files but only using a single thread for any given file).
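
As a concrete illustration of the memory point in item 1, here is one hypothetical 8-byte token layout. It is my own sketch, not the representation used by any actual ANTLR target: store only a byte offset, a length, and a type id, and recompute line/column information from the offset when a diagnostic is actually reported.

    #include <cstdint>
    #include <vector>

    // Hypothetical 8-byte token: enough to recover everything else lazily.
    struct CompactToken {
        std::uint32_t start;    // byte offset of the token in the source buffer
        std::uint16_t length;   // token length in bytes (unusually long tokens need special handling)
        std::uint16_t type;     // token type id
    };
    static_assert(sizeof(CompactToken) == 8, "stays within the 8-bytes-per-token budget");

    // The token text is a view into the source buffer kept alongside this vector,
    // and line/column for diagnostics are recomputed from 'start' only when an
    // error is actually reported. A million tokens then adds roughly 8 MB.
    std::vector<CompactToken> tokens;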

Regarding other options: For a compiler intended for widespread use (commercial or otherwise), generally implementers choose a parsing strategy and implementation which provides the best performance under the constraints of the target language. Some languages (eg Go) can be parsed exceptionally quickly with a simple LR parsing strategy, and using a "more powerful" parsing strategy (read: unnecessary features) would only serve to slow things down. Other languages (eg C++) are extremely challenging or impossible to parse with typical algorithms, so slower but more powerful/flexible parsers are employed.

I think there is no golden rule here. Requirements vary from one case to another, so reasonable solutions can differ as well. Let me comment on your options from my own experience.

  1. "Vector of tokens". This solution may have big memory footprint. Imagine compiling source file with a lot of headers. Storing the token itself is not enough. Error message should contain context with the file name and the line number. It may happen that lexer depends on the parser. Reasonable example: ">>" - is this a shift operator or this is closing of 2 layers of template instantiations? I would down vote this option.

  2. (Options 2 and 3) "One part calls another". My impression is that the more complex system should call the less complex one, and I consider the lexer to be the simpler of the two. This means the parser should call the lexer. I do not see why C# would be better than C++ here: I implemented a C/C++ lexer as a subroutine (in reality a complex class) that is called from a grammar-based parser, and there were no problems in this implementation.

  3. "Communicating processes". This seems to me an overkill. There is nothing wrong in this approach, but maybe it is better to keep the things simple? Multicore aspect. Compiling single file is a relatively rare case. I would recommend to load each core with its own file.

I do not see other reasonable ways of combining the lexer and parser together.

I wrote these notes with compiling the sources of a software project in mind. Parsing a short query request is a completely different thing, and the reasoning can differ significantly. My answer is based on my own experience; other people may see this differently.

The lexer-parser relationship is simpler than the most general case of coroutines, because in general the communication is one-way; the parser does not need to send information back to the lexer. This is why the method of eager generation works (with some penalty, although it does mean that you can discard the input earlier).

As you've observed, if either the lexer or the parser can be written in a reinvocable style then the other can be treated as a simple subroutine. This can always be implemented as a source code transformation, with local variables translated to object slots.

Although C++ doesn't have language support for coroutines, it is possible to make use of library support, in particular fibers. The Unix setcontext family is one option; another is to use multithreading but with a synchronous queue (essentially single-threading but switching between two threads of control).
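
As a rough sketch of the multithreaded variant with a synchronous queue, the following uses a one-slot blocking channel between a lexer thread and the parser; the Token type and the channel are illustrative assumptions, not a specific library's API:

    #include <condition_variable>
    #include <mutex>
    #include <optional>
    #include <string>
    #include <thread>

    struct Token { std::string text; };   // an empty text acts as the end-of-input marker here

    // One-slot synchronous queue: the producer blocks until the consumer has
    // taken the previous token, so the two threads effectively take turns.
    class TokenChannel {
    public:
        void put(Token t) {
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !slot_.has_value(); });
            slot_ = std::move(t);
            cv_.notify_all();
        }
        Token take() {
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return slot_.has_value(); });
            Token t = std::move(*slot_);
            slot_.reset();
            cv_.notify_all();
            return t;
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::optional<Token> slot_;
    };

    int main() {
        TokenChannel channel;
        std::thread lexer([&] {
            for (std::string word : {"int", "x", ";", ""})   // "" signals end of input
                channel.put(Token{word});
        });
        // The "parser" pulls tokens exactly as it would from a coroutine-style lexer.
        for (Token t = channel.take(); !t.text.empty(); t = channel.take()) {
            // ... consume the token ...
        }
        lexer.join();
        return 0;
    }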

Also consider, for #1, that you lex tokens that may never be needed, for example if there is an error early on; in addition, you may run low on memory or I/O bandwidth. I believe that the best solution is the one employed by parsers generated by tools like Bison, where the parser calls the lexer to get the next token. This minimizes the space and memory bandwidth requirements.

#4 is just not going to be worth it. Lexing and parsing are inherently synchronous; there's just not enough processing going on to justify the costs of communication. In addition, you typically parse/lex multiple files concurrently, and this can already max out all your cores at once.
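
As an illustration of that last point, a per-file driver can be as simple as the following sketch, where parse_file is a hypothetical placeholder for an ordinary single-threaded lexer/parser pass over one file:

    #include <future>
    #include <string>
    #include <vector>

    struct ParseResult { /* AST, diagnostics, ... */ };

    // Placeholder: runs one ordinary single-threaded lexer+parser over one file.
    ParseResult parse_file(const std::string& path) { (void)path; return {}; }

    // Parallelize across files, not within a file: one front-end job per source file.
    std::vector<ParseResult> parse_all(const std::vector<std::string>& paths) {
        std::vector<std::future<ParseResult>> jobs;
        jobs.reserve(paths.size());
        for (const std::string& path : paths)
            jobs.push_back(std::async(std::launch::async, parse_file, path));

        std::vector<ParseResult> results;
        results.reserve(jobs.size());
        for (auto& job : jobs)
            results.push_back(job.get());             // collect results in input order
        return results;
    }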

The way I handle it in my toy buildsystem project in progress is by having a "file reader" class with a function bool next_token(std::string&, const std::set<char>&). This class holds one line of input at a time, together with the line number, for error reporting purposes. The function accepts a std::string reference to put the token in and a std::set<char> which contains the "token-ending" characters. My input class is both parser and lexer, but you could easily split it up if you need more fanciness. The parsing functions just call next_token and can do their thing, including very detailed error output.

If you need to keep the verbatim input, you'll need to store each line that's read in a vector<string> or something, but not store each token separately and/or duplicated.
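
For readers who want the flavor of that interface without opening the repository, here is a rough reconstruction; it is only a sketch of the idea described above, not the actual Ambrosia code:

    #include <cstddef>
    #include <istream>
    #include <set>
    #include <string>

    class file_reader {
    public:
        explicit file_reader(std::istream& input) : input_(input) {}

        // Writes the next token into 'token'; returns false when input is exhausted.
        // 'delimiters' holds the "token-ending" characters for the current context.
        bool next_token(std::string& token, const std::set<char>& delimiters) {
            token.clear();
            while (true) {
                if (column_ >= current_line_.size()) {
                    if (!token.empty())
                        return true;                 // end of line also ends a token
                    if (!std::getline(input_, current_line_))
                        return false;                // no more input
                    ++line_number_;
                    column_ = 0;
                    continue;
                }
                char c = current_line_[column_++];
                if (delimiters.count(c) != 0) {
                    if (!token.empty())
                        return true;                 // a delimiter ends the current token
                    continue;                        // skip leading delimiters
                }
                token += c;
            }
        }

        std::size_t line() const { return line_number_; }   // for error messages

    private:
        std::istream& input_;
        std::string   current_line_;
        std::size_t   column_      = 0;
        std::size_t   line_number_ = 0;
    };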

The code I'm talking about is located here:

https://github.com/rubenvb/Ambrosia/blob/master/libAmbrosia/Source/nectar_loader.cpp

(search for ::next_token; the extract_nectar function is where it all begins)
