
Understanding Double Buffering in Lexical Analyzer

Original post 2021-12-12 07:18:10 6 1 compiler-construction

I'm reading the "Purple Dragon Book" on compilers as part of my Compiler Construction course at university. I'm having trouble understanding some things about double buffering while scanning input as part of lexical analysis.

Here's the text from the book:

"Two pointers to the input are maintained:

Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent we are attempting to determine.
Pointer forward scans ahead until a pattern match is found; the exact strategy whereby this determination is made will be covered in the balance of this chapter."

So, correct me if I'm wrong: one buffer is read, and when the input in that buffer is exhausted, the other buffer is filled with new data from the source file, and the buffers are swapped. The forward and beginning pointers now point to the freshly filled buffer.

My question is, what if part of a lexeme is at the end of the current buffer? Then when the buffers switch, half of the lexeme will be at the end of one buffer and half at the start of the new buffer. The pointers will move to the new buffer, so how do we know that the other half was left in the other buffer?

Sorry if the question is vague, but I've been agonizing for quite some time over how this scenario is handled. I think the same problem will occur with a single buffer.

1 Answer

The way Flex handles end of buffer is to relocate the text scanned since lexemeBegin to the start of the buffer, and then fill the buffer from the relocated value of forward to the end of the buffer. So there's only one buffer, unless the buffer has to be expanded because the incomplete token completely fills it (in which case there is a brief period when a short buffer is being copied from and a long buffer is being copied to).
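The relocate-then-refill step described above can be sketched roughly as follows. This is not flex's actual code: the `Scanner` struct and the `scanner_init`, `scanner_refill`, and `scanner_next` names are hypothetical, and offsets are used instead of raw pointers so that growing the buffer doesn't invalidate them.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical scanner state; names are illustrative, not flex's own. */
typedef struct {
    FILE  *in;
    char  *buf;          /* the single input buffer */
    size_t cap;          /* allocated size of buf */
    size_t len;          /* number of valid bytes in buf */
    size_t lexeme_begin; /* offset of the start of the current lexeme */
    size_t forward;      /* offset of the next character to scan */
} Scanner;

/* Returns nonzero on success. */
static int scanner_init(Scanner *s, FILE *in, size_t cap) {
    assert(cap > 0);
    s->in = in;
    s->buf = malloc(cap);
    s->cap = cap;
    s->len = s->lexeme_begin = s->forward = 0;
    return s->buf != NULL;
}

/* Called when forward has reached the end of the valid data.
   Returns the number of new bytes read; 0 means end of input
   (or, in this sketch, an allocation failure). */
static size_t scanner_refill(Scanner *s) {
    assert(s->forward == s->len);

    /* Relocate the partly scanned lexeme to the start of the buffer. */
    size_t keep = s->len - s->lexeme_begin;
    memmove(s->buf, s->buf + s->lexeme_begin, keep);
    s->forward -= s->lexeme_begin;
    s->lexeme_begin = 0;
    s->len = keep;

    /* The incomplete token fills the whole buffer: expand it. */
    if (s->len == s->cap) {
        char *bigger = realloc(s->buf, s->cap * 2);
        if (bigger == NULL) return 0;
        s->buf = bigger;
        s->cap *= 2;
    }

    /* Fill from the relocated forward position to the end of the buffer. */
    size_t got = fread(s->buf + s->len, 1, s->cap - s->len, s->in);
    s->len += got;
    return got;
}

/* Returns the next character, or EOF once the input is exhausted. */
static int scanner_next(Scanner *s) {
    if (s->forward == s->len && scanner_refill(s) == 0) return EOF;
    return (unsigned char)s->buf[s->forward++];
}
```

A caller would set `lexeme_begin = forward` after accepting each token; the current lexeme is then always the contiguous range `buf[lexeme_begin .. forward)`, even if a refill happened in the middle of it.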

That's not the only way to do it, but it seems to work pretty well in practice.

The original lex used a different strategy; it used a token buffer and read input one character at a time with fgetc until the end of the token was discovered. Each character was added to the token buffer as it was read. I'm pretty sure it backed up if necessary by copying to the beginning of the token buffer and then rescanning to get the new state, but I'm just working from memory.

The advantage of the flex architecture is that it reads a bit faster, since it's not working a character at a time, and it can handle big tokens, albeit inefficiently. Lex couldn't handle big tokens because the token buffer (yytext) was a static array, not dynamically allocated, and therefore couldn't be grown.
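To make the fixed-size limitation concrete, here is a minimal character-at-a-time sketch. It is only an illustration of the general idea, not lex's real implementation: `TOKEN_MAX`, `token_buf`, and `read_word` are made-up names, and the standard `ungetc` stands in for whatever backup mechanism lex actually used.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define TOKEN_MAX 16  /* illustrative limit; a static array cannot grow */

static char token_buf[TOKEN_MAX];

/* Reads one run of letters into token_buf, one character at a time.
   Returns the token length (0 if the next character is not a letter),
   or -1 if the token does not fit in the static array. */
static int read_word(FILE *in) {
    int len = 0;
    int c;
    while ((c = fgetc(in)) != EOF && isalpha(c)) {
        if (len == TOKEN_MAX - 1) return -1;  /* no room left, cannot expand */
        token_buf[len++] = (char)c;
    }
    if (c != EOF) ungetc(c, in);  /* back up over the one-character lookahead */
    assert(len < TOKEN_MAX);
    token_buf[len] = '\0';
    return len;
}
```

Because every character is copied into the token buffer as it is read, there is no end-of-buffer problem to solve, but a token longer than `TOKEN_MAX - 1` characters simply fails.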

I don't know of a scanner generator which double buffers, but I'll take a look at the Dragon Book in the morning.
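For what it's worth, here is a rough sketch of how a buffer pair can cope with the scenario in the question. The key point is that only forward moves into the freshly filled half; lexemeBegin stays where it is, so a lexeme may straddle the boundary. This works as long as the lexeme plus any lookahead is no longer than one half, since otherwise the reload would overwrite the start of the lexeme. All names here (`PairScanner`, `HALF`, `pair_*`) are hypothetical, and the sketch tracks a byte count per half rather than the sentinel characters the book uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define HALF 8  /* size of each half; tiny on purpose, for illustration */

typedef struct {
    FILE  *in;
    char   buf[2 * HALF]; /* two halves, treated as one circular array */
    size_t filled[2];     /* valid bytes currently in each half */
    size_t lexeme_begin;  /* index into buf of the current lexeme's start */
    size_t forward;       /* index into buf of the next character to scan */
} PairScanner;

static void pair_load(PairScanner *s, size_t h) {
    /* Reloading the half that holds lexeme_begin would destroy the lexeme. */
    assert(s->lexeme_begin / HALF != h);
    s->filled[h] = fread(s->buf + h * HALF, 1, HALF, s->in);
}

static void pair_init(PairScanner *s, FILE *in) {
    s->in = in;
    s->forward = 0;
    s->lexeme_begin = HALF;  /* parked in the other half until the first load */
    s->filled[1] = 0;
    pair_load(s, 0);
    s->lexeme_begin = 0;
}

/* Character at forward, or EOF if that position holds no data. */
static int pair_peek(const PairScanner *s) {
    if (s->forward % HALF >= s->filled[s->forward / HALF]) return EOF;
    return (unsigned char)s->buf[s->forward];
}

/* Moves forward one position; entering a half reloads that half.
   lexeme_begin is deliberately left alone. */
static void pair_advance(PairScanner *s) {
    s->forward = (s->forward + 1) % (2 * HALF);
    if (s->forward % HALF == 0) pair_load(s, s->forward / HALF);
}

/* Copies the lexeme [lexeme_begin, forward) out, in up to two pieces. */
static size_t pair_lexeme(const PairScanner *s, char *out, size_t out_cap) {
    size_t b = s->lexeme_begin, f = s->forward;
    size_t tail = (b <= f) ? f - b : 2 * HALF - b; /* up to forward or array end */
    size_t head = (b <= f) ? 0 : f;                /* wrapped part at array start */
    assert(tail + head < out_cap);
    memcpy(out, s->buf + b, tail);
    memcpy(out + tail, s->buf, head);
    out[tail + head] = '\0';
    return tail + head;
}

/* Skips spaces, scans one space-delimited word, and copies it to out. */
static size_t pair_next_word(PairScanner *s, char *out, size_t out_cap) {
    int c;
    for (;;) {
        s->lexeme_begin = s->forward;  /* nothing before forward is needed */
        if (pair_peek(s) != ' ') break;
        pair_advance(s);
    }
    while ((c = pair_peek(s)) != EOF && c != ' ') pair_advance(s);
    return pair_lexeme(s, out, out_cap);
}
```

With the input `alpha beta gamma` and `HALF` set to 8, the word `beta` starts at index 6 of the first half and ends at index 9 in the second half; it is still recovered intact because `lexeme_begin` never left the first half while `forward` crossed the boundary.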