
How to find tokens from a C file?

I am trying to generate tokens from a C source file. I have split the C file into an array of lines called line, and stored the words of the entire file in an array called words.

The problem is with the strtok() function, which splits each line on whitespace characters. Because of this, I am not getting certain delimiters, such as parentheses and brackets, as separate tokens, because there is no whitespace between them and the neighbouring tokens.

How do I determine which one is an identifier and which one is an operator?

Code so far:

int main()
{
    /* ... */

    char line[300][200];
    char delim[]=" \n\t";
    char *words[1000];
    char *token;

    while (i < 300 && fgets(line[i], sizeof line[i], fp1) != NULL)
    {
        token = strtok(&line[i][0], delim);

        while (token != NULL)
        {
            words[j++] = token;
            token = strtok(NULL, delim);
        }

        i++;
    }

    for(i = 0; i < j; i++)
    {
        printf("%s\n", words[i]);
    }

    return 0;
}

This is a tricky question, one that probably needs more depth than a StackOverflow answer allows. I'll try, nonetheless.

Tokenizing the input is the first part of the compilation process. The objective is to simplify the task of the parser, which is going to build an abstract syntax tree from the contents of the file. How do we simplify this? The lexer recognizes the tokens that have a special meaning, as well as identifiers, operators, and so on. C is indeed a tricky, complex language, so let's simplify the language to tokenize: we'll start with a typical calculator.

An input example would be:

( 4 +5)* 2

When the syntax is free-form, you can add or skip spaces, so, as you have already found out, splitting on whitespace is not an option.

The tokenized output for the example above would be: LPAR, LIT, OP, LIT, RPAR, OP, LIT. The meaning of each token type is as follows:

LPAR: Left parenthesis
RPAR: Right parenthesis
LIT:  Literal (a number)
OP:   Operator (say: +, -, * and /).

The complete output, including the data attached to each token, would therefore be:

{ LPAR, LIT(4), OP('+'), LIT(5), RPAR, OP('*'), LIT(2) }

Your lexer basically has to advance through the input string, character by character, using a state machine. For example, when you read a digit, you enter the "input literal" state, in which only further digits and '.' are allowed.
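As a rough illustration of that state machine, here is a minimal sketch of a lexer for the calculator language only (not for C). The names Token, TokenType and tokenize are made up for this example. It also makes some simplifying assumptions: a literal must start with a digit, at most one '.' is accepted, and every '-' is reported as an operator.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token kinds from the calculator example (hypothetical names). */
typedef enum { TOK_LPAR, TOK_RPAR, TOK_LIT, TOK_OP } TokenType;

typedef struct {
    TokenType type;
    double value; /* meaningful only for TOK_LIT */
    char op;      /* meaningful only for TOK_OP  */
} Token;

/*
 * Scans src character by character and writes at most max tokens into out.
 * Returns the number of tokens written, or -1 on an unexpected character
 * or when out is too small.
 */
int tokenize(const char *src, Token *out, int max)
{
    int n = 0;

    assert(src != NULL && out != NULL);

    while (*src != '\0') {
        unsigned char c = (unsigned char)*src;
        Token t = { TOK_LIT, 0.0, '\0' };

        if (isspace(c)) {            /* whitespace is never a token: skip it */
            src++;
            continue;
        }
        if (n >= max)
            return -1;

        if (c == '(') {
            t.type = TOK_LPAR;
            src++;
        } else if (c == ')') {
            t.type = TOK_RPAR;
            src++;
        } else if (c == '+' || c == '-' || c == '*' || c == '/') {
            t.type = TOK_OP;
            t.op = (char)c;
            src++;
        } else if (isdigit(c)) {
            /* "input literal" state: only digits and a single '.' allowed */
            double scale = 1.0;
            int seen_dot = 0;

            while (isdigit((unsigned char)*src) || (*src == '.' && !seen_dot)) {
                if (*src == '.') {
                    seen_dot = 1;
                } else {
                    t.value = t.value * 10.0 + (*src - '0');
                    if (seen_dot)
                        scale *= 10.0;
                }
                src++;
            }
            t.value /= scale;
        } else {
            return -1;               /* character not in the calculator language */
        }
        out[n++] = t;
    }
    return n;
}
```

Calling tokenize("( 4 +5)* 2", tokens, 16) on a Token tokens[16] array yields the seven tokens listed above, regardless of where the spaces are. A lexer for C would need more states (identifiers, multi-character operators, string literals, comments), but the structure stays the same.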

Now the parser has an easier task. If you feed it the previous tokens, it does not have to skip spaces or distinguish between a negative number and a minus operator; it can just advance through a list or array. It can act according to the type of each token, and some tokens carry associated data, as you can see.
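To show what "advancing through an array" can look like, here is a small sketch of a consumer for those tokens. Instead of building a syntax tree, it evaluates the expression directly with a recursive-descent parser, which is just one possible approach. The Token layout and the names Parser and evaluate are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Same hypothetical token layout as a calculator lexer would produce. */
typedef enum { TOK_LPAR, TOK_RPAR, TOK_LIT, TOK_OP } TokenType;
typedef struct { TokenType type; double value; char op; } Token;

/* Parser state: the token array and the current position in it. */
typedef struct { const Token *toks; int count; int pos; int error; } Parser;

static double parse_expr(Parser *p);

/* True if the current token is the operator a or the operator b. */
static int at_op(const Parser *p, char a, char b)
{
    return p->pos < p->count && p->toks[p->pos].type == TOK_OP
        && (p->toks[p->pos].op == a || p->toks[p->pos].op == b);
}

/* factor := LIT | LPAR expr RPAR */
static double parse_factor(Parser *p)
{
    if (p->pos < p->count && p->toks[p->pos].type == TOK_LIT)
        return p->toks[p->pos++].value;

    if (p->pos < p->count && p->toks[p->pos].type == TOK_LPAR) {
        double v;
        p->pos++;                      /* consume LPAR */
        v = parse_expr(p);
        if (p->pos < p->count && p->toks[p->pos].type == TOK_RPAR)
            p->pos++;                  /* consume RPAR */
        else
            p->error = 1;              /* missing closing parenthesis */
        return v;
    }
    p->error = 1;                      /* unexpected token or end of input */
    return 0.0;
}

/* term := factor { ('*' | '/') factor } */
static double parse_term(Parser *p)
{
    double v = parse_factor(p);
    while (!p->error && at_op(p, '*', '/')) {
        char op = p->toks[p->pos++].op;
        double rhs = parse_factor(p);
        v = (op == '*') ? v * rhs : v / rhs;
    }
    return v;
}

/* expr := term { ('+' | '-') term } */
static double parse_expr(Parser *p)
{
    double v = parse_term(p);
    while (!p->error && at_op(p, '+', '-')) {
        char op = p->toks[p->pos++].op;
        double rhs = parse_term(p);
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}

/* Returns 0 and stores the value in *result, or -1 on a syntax error. */
int evaluate(const Token *toks, int count, double *result)
{
    Parser p;
    double v;

    assert(toks != NULL && result != NULL);
    p.toks = toks;
    p.count = count;
    p.pos = 0;
    p.error = 0;
    v = parse_expr(&p);
    if (p.error || p.pos != count)
        return -1;
    *result = v;
    return 0;
}
```

Note that the parser never looks at a character: it only inspects the type of the current token and, where relevant, the data attached to it.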

This is only an introduction to the introduction, anyway. Information about the whole compilation process could fill a book, and there are actually many books devoted to this topic, such as the famous "Dragon book" by Aho, Sethi & Ullman. A more recent one is the "Tiger book".

Finally, lexers are quite similar to one another, so it is possible to find generic lexers out there. You can even find the C grammar for that kind of tool.

Hope this (somehow) helps.

