Difficulties with my lexical analyzer

Question

I'm trying to write a lexical analyzer for a standard C translation unit, so I've divided the possible tokens into 6 groups. For each group there's a regular expression, which will be converted to a DFA:

Keywords - (will have a symbol table containing "goto", "int"....)
Identifiers - [a-zA-Z][a-zA-Z0-9]*
Numeric Constants - [0-9]+\.?[0-9]*
String Constants - ""[EVERY_ASCII_CHARACTER]*""
Special Symbols - (will have a symbol table containing ";", "(", "{"....)
Operators - (will have a symbol table containing "+", "-"....)

My analyzer's input is a stream of bytes/ASCII characters. My algorithm is the following:

assuming there's a stream of characters, x1...xN
 foreach i=1, i<=n, i++
    if x1...xI accepts one or more of the 6 group's DFA
    {
       take the longest-token
       add x1...xI to token-linked-list
       delete x1...xI from input
    }

However, this algorithm will assume that every letter it is given is an identifier, since after just 1 character of input the identifier DFA ([a-zA-Z][a-zA-Z0-9]*) is already in an accepting state.

Another possible problem: for the input "intx;", my algorithm will tokenize the stream into "int", "x", ";", which of course is an error.

I'm trying to think of a new algorithm, but I keep failing. Any suggestions?

Answer 1

Code your scanner so that it treats identifiers and keywords the same until the reading is finished.

When you have the complete token, look it up in the keyword table; designate it a keyword if you find it and an identifier if you don't. This deals with the intx problem immediately: the scanner reads intx, and that's not a keyword, so it must be an identifier.
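A minimal sketch of that idea in Python, assuming the question's identifier pattern and a deliberately partial keyword table. The names KEYWORDS, WORD_CHARS and scan_word are made up for illustration:

```python
import string

# Hypothetical, partial keyword table; a real C lexer would list every keyword.
KEYWORDS = {"goto", "int", "return", "while"}
# Characters allowed inside a word, per the question's [a-zA-Z][a-zA-Z0-9]*.
WORD_CHARS = set(string.ascii_letters + string.digits)


def scan_word(text, start):
    """Read one identifier-shaped token starting at text[start].

    Assumes text[start] is a letter. Returns (kind, lexeme, next_index).
    """
    end = start
    # Keep consuming until a character that cannot be part of a word.
    while end < len(text) and text[end] in WORD_CHARS:
        end += 1
    lexeme = text[start:end]
    # Classify only once the whole word has been read.
    kind = "KEYWORD" if lexeme in KEYWORDS else "IDENTIFIER"
    return kind, lexeme, end
```

With this, scan_word("intx;", 0) yields an IDENTIFIER with lexeme intx, while scan_word("int x;", 0) yields the KEYWORD int.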

I note that your identifiers don't allow underscores. That's not necessarily a problem, but many languages, C included, do allow underscores in identifiers.

Answer 2

Tokenizers generally FIRST split the input stream into tokens, based on rules that dictate what constitutes the END of a token, and only later decide what kind of token each one is (or report an error otherwise). Typical ends of token are things like whitespace (when not part of a string literal), operators, special delimiters, etc.
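A rough sketch of this two-phase approach in Python. It is simplified on purpose: it ignores string literals and multi-character operators, and KEYWORDS, DELIMITERS, split_tokens and classify are hypothetical names chosen for the example:

```python
KEYWORDS = {"goto", "int"}        # hypothetical, partial table
DELIMITERS = set(";(){}+-*/=,")   # hypothetical single-character tokens


def split_tokens(text):
    """Phase 1: cut the input into lexemes at whitespace and delimiters."""
    lexemes, current = [], ""
    for ch in text:
        if ch.isspace() or ch in DELIMITERS:
            if current:            # a pending word ends here
                lexemes.append(current)
                current = ""
            if ch in DELIMITERS:   # a delimiter is a token of its own
                lexemes.append(ch)
        else:
            current += ch
    if current:
        lexemes.append(current)
    return lexemes


def classify(lexeme):
    """Phase 2: decide what kind of token a finished lexeme is."""
    if lexeme in KEYWORDS:
        return "KEYWORD"
    if lexeme in DELIMITERS:
        return "SYMBOL"
    if lexeme.isascii() and lexeme[0].isalpha() and lexeme.isalnum():
        return "IDENTIFIER"
    if lexeme.isascii() and lexeme.isdigit():
        return "NUMBER"
    return "ERROR"
```

Because the split happens before classification, intx; is cut into intx and ; and the keyword int is never considered for it.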

Answer 3

It seems you are missing the greediness aspect of competing DFAs. Greedy matching (left-most longest match) is usually the most useful because it solves the problem of how to choose between competing DFAs. Once you've matched int, you have another node in the IDENTIFIER DFA that advances to intx. Your finite automaton doesn't exit until it reaches something it can't consume, and if it isn't in a valid accept state at the end of input, or at the point where another DFA is accepting, it is pruned and the other DFA is matched.

Flex, for example, defaults to greedy matching.

In other words, your proposed problem of intx isn't a problem...

If you have 2 rules that compete for int:

rule 1 is the token "int"
rule 2 is IDENTIFIER

When we reach

i n t

we don't immediately ACCEPT int, because we see another rule (rule 2) where further input x advances the automaton to a NEXT state:

i n t x

If rule 2 is in an ACCEPT state at that point, then rule 1 is discarded by definition. But if rule 2 is still not in an ACCEPT state, we must keep rule 1 around while we examine more input to see if we could eventually reach an ACCEPT state in rule 2 that is longer than rule 1. If we receive some other character that matches neither rule, we check whether the rule 2 automaton is in an ACCEPT state for intx; if so, it is the match. If not, it is discarded and the longest previous match (rule 1) is accepted. In this case, however, rule 2 is in an ACCEPT state and matches intx.
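The bookkeeping described above (run the competing automata side by side, remember the last position at which any of them accepted, and fall back to it once none can advance) could be sketched in Python like this. The DFA representation and the names literal_dfa, identifier_dfa and longest_match are assumptions made for the example, not how Flex is implemented:

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)


def literal_dfa(word):
    """DFA for one fixed word: state k means the first k characters matched."""
    def step(state, ch):
        return state + 1 if state < len(word) and word[state] == ch else None
    return {"start": 0, "accepting": {len(word)}, "step": step}


def identifier_dfa():
    """DFA for [a-zA-Z][a-zA-Z0-9]*: state 0 is the start, state 1 accepts."""
    def step(state, ch):
        if state == 0:
            return 1 if ch in LETTERS else None
        return 1 if ch in LETTERS or ch in DIGITS else None
    return {"start": 0, "accepting": {1}, "step": step}


def longest_match(rules, text, pos):
    """Run every rule's DFA in parallel from text[pos].

    rules is a list of (name, dfa) pairs in priority order.
    Returns (name, lexeme) for the longest accepted prefix, or None.
    """
    live = [(name, dfa, dfa["start"]) for name, dfa in rules]
    last_accept = None  # (name, end) of the longest match seen so far
    i = pos
    while live and i < len(text):
        advanced = []
        for name, dfa, state in live:
            nxt = dfa["step"](state, text[i])
            if nxt is not None:  # rules that cannot advance are pruned
                advanced.append((name, dfa, nxt))
        live = advanced
        i += 1
        # Record the first (highest-priority) rule accepting at this length.
        for name, dfa, state in live:
            if state in dfa["accepting"]:
                last_accept = (name, i)
                break
    if last_accept is None:
        return None
    name, end = last_accept
    return name, text[pos:end]
```

With rule 1 as literal_dfa("int") and rule 2 as identifier_dfa(), the input intx; matches intx under rule 2. To see the fallback, pair literal_dfa("int") with literal_dfa("integer") and feed it integral: the second automaton advances past int without ever accepting, so the remembered match int is returned.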

In the case that 2 rules reach an ACCEPT or EXIT state simultaneously, precedence is used (the order of the rules in the grammar). Generally you put your keywords first so that IDENTIFIER doesn't match first.
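To illustrate how rule order breaks ties, here is a small sketch that uses Python's re module as a stand-in for the generated DFAs. The rule list only loosely imitates a Flex rules section, and RULES and tokenize are hypothetical names:

```python
import re

# Hypothetical rule list in priority order: keywords come before IDENTIFIER.
RULES = [
    ("KEYWORD", re.compile(r"goto|int")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("NUMBER", re.compile(r"[0-9]+(\.[0-9]*)?")),
    ("SYMBOL", re.compile(r"[;(){}]")),
    ("OPERATOR", re.compile(r"[+\-*/=]")),
]


def tokenize(text, rules=RULES):
    """Longest match wins; on equal length the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None  # (length, name) of the best candidate so far
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current candidate,
            # so a tie keeps whichever rule was listed first.
            if m and m.end() > pos and (best is None or m.end() - pos > best[0]):
                best = (m.end() - pos, name)
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        length, name = best
        tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens
```

For int x; both KEYWORD and IDENTIFIER match three characters, and KEYWORD wins only because it is listed first. Swapping the two rules would turn int into an IDENTIFIER, which is why keywords usually go at the top.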

Answer 1
2 2014-10-16 02:29:24

Answer 2
1 2014-10-16 01:47:43

Answer 3
1 2014-10-16 02:12:40
