Lexical Analyzer C program for identifying tokens

Question

I wrote a small C program for a lexical analyzer that identifies keywords, identifiers and constants. I take a string (C source code as a string) and split it into words.

#include <stdio.h>
#include <conio.h>
#include <string.h>

char symTable[5][7] = { "int", "void", "float", "char", "string" };

int main() {
    int i, j, k = 0, flag = 0;
    char string[7];
    char str[] = "int main(){printf(\"Hello\");return 0;}";
    char *ptr;
    printf("Splitting string \"%s\" into tokens:\n", str);
    ptr = strtok(str, " (){};""");
    printf("\n\n");
    while (ptr != NULL) {
        printf ("%s\n", ptr);

        for (i = k; i < 5; i++) {
            memset(&string[0], 0, sizeof(string));
            for (j = 0; j < 7; j++) {
                string[j] = symTable[i][j];
            }

            if (strcmp(ptr, string) == 0) {
                printf("Keyword\n\n");
                break;
            } else
            if (string[j] == 0 || string[j] == 1 || string[j] == 2 ||
                string[j] == 3 || string[j] == 4 || string[j] == 5 ||
                string[j] == 6 || string[j] == 7 || string[j] == 8 ||
                string[j] == 9) {
                printf("Constant\n\n");
                break;
            } else {
                printf("Identifier\n\n");
                break;
            }
        }
        ptr = strtok(NULL, " (){};""");
        k++;
    }
    _getch();
    return 0;
}

With the above code, I am able to identify keywords and identifiers, but I cannot get a result for numbers. I tried using strspn(), but to no avail. I even replaced 0, 1, 2, ..., 9 with '0', '1', ..., '9'.

Any help would be appreciated.

Answer 1

Here are some problems in your parser:

The test string[j] == 0 does not test whether string[j] is the digit 0. The digit characters are written '0' through '9'; their values are 48 to 57 in ASCII and UTF-8. Furthermore, after the copy loop j is 7, so string[j] is outside the array. You should be testing *ptr instead of string[j] to check whether the token starts with a digit, which indicates the start of a number.
Splitting the string with strtok() is not a good idea: it modifies the string and overwrites the first separator character after each token with '\0', which prevents you from matching operators and punctuation such as ( and ).
The string " (){};""" is exactly the same as " (){};". To escape a " inside a string, you must write \".
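The two strtok() points above can be checked with a small experiment. This is only a sketch: the helper names same_delimiters, escaped_quote_is_delimiter and strtok_erases_separator are made up for illustration and are not part of the question's code.

```c
#include <string.h>

/* Adjacent string literals are concatenated, so the trailing ""
   only appends an empty string to the delimiter set. */
int same_delimiters(void)
{
    return strcmp(" (){};""", " (){};") == 0;
}

/* With \" the double quote really is part of the delimiter set. */
int escaped_quote_is_delimiter(void)
{
    return strchr(" (){};\"", '"') != NULL;
}

/* strtok() writes '\0' over the separator that ends each token,
   so the '(' after "main" is gone once the first token is returned. */
int strtok_erases_separator(void)
{
    char src[] = "main(){return 0;}";
    char *tok = strtok(src, " (){};");   /* tok points at "main" */

    return tok != NULL && strcmp(tok, "main") == 0 && src[4] == '\0';
}
```

All three helpers return 1, which is why the lexer never gets a chance to see ( or ) as tokens when the input is split this way.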

To write a lexer for C, you should switch on the first character and check the following characters depending on its value:

if you have white space, skip it.
if you have // , it is a line comment: skip all characters up to the newline.
if you have /* , it is a block comment: skip all characters until you get the pair */ .
if you have a ' , you have a character constant: parse the characters, handling escape sequences, until you get a closing ' .
if you have a " , you have a string literal: do the same as for character constants.
if you have a digit, consume all subsequent digits: you have an integer. Parsing the full number syntax requires much more code: leave that for later.
if you have a letter or an underscore, consume all subsequent letters, digits and underscores, then compare the word with the set of predefined keywords. You have either a keyword or an identifier.
otherwise, you have an operator: check if the next characters are part of a 2 or 3 character operator, such as == and >>= .

That's about it for a simple C parser. The full syntax requires more work, but you will get there one step at a time.
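The steps listed above could be sketched roughly as follows. Everything here is an assumption made for illustration: the token_kind names, the struct token layout, the helper functions, the short keyword table and the partial operator list. Numbers are plain digit runs only, and malformed input (unterminated comments or literals) is simply consumed up to the end of the string.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token kinds and names below are assumptions made for this sketch. */
enum token_kind { T_EOF, T_KEYWORD, T_IDENTIFIER, T_NUMBER, T_CHAR, T_STRING, T_OPERATOR };

struct token {
    enum token_kind kind;
    const char *start;   /* first character of the token in the source */
    size_t len;          /* number of characters in the token */
};

static int is_keyword(const char *s, size_t len)
{
    static const char *const keywords[] = { "int", "void", "float", "char", "return" };
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strlen(keywords[i]) == len && strncmp(keywords[i], s, len) == 0)
            return 1;
    return 0;
}

/* Skip white space, // line comments and block comments. */
static const char *skip_blanks(const char *p)
{
    for (;;) {
        if (isspace((unsigned char)*p)) {
            p++;
        } else if (p[0] == '/' && p[1] == '/') {
            while (*p != '\0' && *p != '\n')
                p++;
        } else if (p[0] == '/' && p[1] == '*') {
            p += 2;
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/'))
                p++;
            if (*p != '\0')
                p += 2;          /* step over the closing pair */
        } else {
            return p;
        }
    }
}

/* p points at an opening ' or "; return a pointer just past the closing one. */
static const char *scan_quoted(const char *p)
{
    char quote = *p++;
    while (*p != '\0' && *p != quote) {
        if (*p == '\\' && p[1] != '\0')
            p++;                 /* skip the backslash of an escape sequence */
        p++;
    }
    return (*p == quote) ? p + 1 : p;
}

/* Length of the operator at p: 3, 2 or 1 characters (partial operator list). */
static size_t operator_length(const char *p)
{
    static const char *const ops[] = { "<<=", ">>=", "==", "!=", "<=", ">=", "&&",
                                       "||", "++", "--", "+=", "-=", "<<", ">>", "->" };
    size_t i;
    for (i = 0; i < sizeof ops / sizeof ops[0]; i++)
        if (strncmp(p, ops[i], strlen(ops[i])) == 0)
            return strlen(ops[i]);   /* longer operators are listed first */
    return 1;
}

/* Read the token at *pp and advance *pp past it. */
struct token next_token(const char **pp)
{
    const char *p = skip_blanks(*pp);
    struct token t = { T_EOF, p, 0 };

    if (*p == '\0') {
        /* end of input: keep T_EOF */
    } else if (*p == '\'' || *p == '"') {
        t.kind = (*p == '"') ? T_STRING : T_CHAR;
        p = scan_quoted(p);
    } else if (isdigit((unsigned char)*p)) {
        t.kind = T_NUMBER;
        while (isdigit((unsigned char)*p))
            p++;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        t.kind = is_keyword(t.start, (size_t)(p - t.start)) ? T_KEYWORD : T_IDENTIFIER;
    } else {
        t.kind = T_OPERATOR;
        p += operator_length(p);
    }
    t.len = (size_t)(p - t.start);
    *pp = p;
    return t;
}
```

To use it, point a const char * at the source text and call next_token(&p) in a loop until the returned kind is T_EOF. Because the source string is never modified, punctuation such as ( and ) comes back as ordinary one-character operator tokens.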

Answer 2

When you write a lexer, always create a dedicated function that finds your tokens (the name yylex is the one used by the Lex tool, which is why I used it). Writing the lexer in main is not a smart idea, especially if you want to do syntax and semantic analysis later on.

From your question it is not clear whether you just want to recognize number tokens, or whether you also want to fetch the number's value along with the token. I will assume the former.

Here is example code that finds whole numbers:

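#include <ctype.h>
#include <stdio.h>

int yylex(void){

    /* We read one char from standard input; it is stored in an int
       (not a char) so that EOF can be told apart from real characters */
    int c = getchar();

    /* If we read a new line or reach end of file, we return the end of input token */
    if(c == '\n' || c == EOF)
        return EOI;

    /* If we see a digit on input, we cannot return the number token right away:
       the remaining digits must be consumed first.
       Note that input such as 123a is a lexical error, which this code does not detect */
    if(isdigit(c)){

        while(isdigit(c = getchar()))
            ;

        /* Push back the first character that is not part of the number */
        ungetc(c,stdin);
        return NUM;
    }

    /* Additional code for keywords, identifiers, errors, etc. */

    /* Any other character is returned as its own token, e.g. ')' */
    return c;
}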

Tokens EOI, NUM, etc. should be defined at the top. Later on, when you write the syntax analysis, you use these tokens to check whether the code conforms to the language syntax. In lexical analysis, single-character tokens usually need no definition at all: the lexer function simply returns the character itself, ')' for example. For that reason, named tokens should be given values above 255. For example:

#define EOI 256
#define NUM 257
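To illustrate why the named tokens start above 255, here is a slightly extended sketch of the same idea. It is not the code above: the function name yylex_from, its FILE * parameter, and the ID and ERR tokens are assumptions added for illustration (reading from a FILE * instead of stdin just makes it easier to try on a file).

```c
#include <ctype.h>
#include <stdio.h>

#define EOI 256
#define NUM 257
#define ID  258   /* assumed extra token: identifier */
#define ERR 259   /* assumed extra token: lexical error such as 123a */

/* Hypothetical variant of yylex() that reads from any stream. */
int yylex_from(FILE *in)
{
    int c = getc(in);

    /* Skip blanks between tokens */
    while (c == ' ' || c == '\t')
        c = getc(in);

    if (c == EOF || c == '\n')
        return EOI;

    if (isdigit(c)) {
        while (isdigit(c = getc(in)))
            ;
        if (isalpha(c) || c == '_') {
            /* Letters glued to a number: swallow the rest of the word */
            while (isalnum(c = getc(in)) || c == '_')
                ;
            if (c != EOF)
                ungetc(c, in);
            return ERR;
        }
        if (c != EOF)
            ungetc(c, in);
        return NUM;
    }

    if (isalpha(c) || c == '_') {
        while (isalnum(c = getc(in)) || c == '_')
            ;
        if (c != EOF)
            ungetc(c, in);
        return ID;
    }

    /* Single-character tokens are their own code (0..255),
       so they can never collide with EOI, NUM, ID or ERR. */
    return c;
}
```

A caller would typically loop on yylex_from(stdin) until it returns EOI, switching on the returned value: case NUM, case ID, case ')' and so on can all live in the same switch because their values never overlap.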

If you have any further questions, feel free to ask.

Answer 3

string[j]==1

This test is wrong (1) (on all C implementations I have heard of), since string[j] is some char encoded using, e.g., ASCII (or UTF-8, or even the old EBCDIC used on IBM mainframes), and the encoding of the digit character 1 is not the number 1. On my Linux/x86-64 machine (and on most machines using ASCII or UTF-8, i.e. almost all of them), the character 1 is encoded as the byte of code 49 (that is, (char)49 == '1').

You probably want

string[j]=='1'

and you should consider using the standard isdigit (and related) functions.
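A tiny sketch of the difference between a digit character and its numeric value. The helper name digit_value is made up for illustration.

```c
#include <ctype.h>

/* Hypothetical helper: numeric value of a digit character, or -1
   if ch is not a digit. */
int digit_value(char ch)
{
    /* The cast avoids undefined behavior for negative char values */
    if (!isdigit((unsigned char)ch))
        return -1;

    /* C guarantees that '0'..'9' are contiguous, so this works
       with ASCII, UTF-8 and EBCDIC alike */
    return ch - '0';
}
```

So digit_value('1') is 1, while the expression '1' == 1 is false: the character's code is not the digit's value.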

Be aware that UTF-8 is used practically everywhere, but it is a multi-byte encoding (of displayable characters). See this answer.

Note (1): the string[j]==1 test is probably misplaced too! Perhaps you should test isdigit(*ptr) at some better place.
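One possible "better place" is a small classification function called once per word returned by strtok(). This is a sketch under assumptions: the names word_kind and classify_word are invented here, and the keyword table is copied from the question as-is (including "string", which is not actually a C keyword). Unlike the loop in the question, it compares the word against every table entry before deciding.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum word_kind { WORD_KEYWORD, WORD_CONSTANT, WORD_IDENTIFIER };

/* Keyword table taken from the question */
static const char *const symTable[] = { "int", "void", "float", "char", "string" };

/* Hypothetical helper: classify one word as returned by strtok(). */
enum word_kind classify_word(const char *ptr)
{
    size_t i;

    /* Check the whole table, not just one entry */
    for (i = 0; i < sizeof symTable / sizeof symTable[0]; i++)
        if (strcmp(ptr, symTable[i]) == 0)
            return WORD_KEYWORD;

    /* A word that starts with a digit character is a numeric constant */
    if (isdigit((unsigned char)*ptr))
        return WORD_CONSTANT;

    return WORD_IDENTIFIER;
}
```

In the question's loop, the body of the while could then shrink to a single call to classify_word(ptr) followed by a switch that prints "Keyword", "Constant" or "Identifier".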

PS. Please get into the habit of compiling with all warnings and debug info (e.g. with gcc -Wall -Wextra -g if using GCC) and use the debugger (e.g. gdb). You would have found your bug in less time than it took you to get an answer here.

3 answers

Answer 1
3 2017-07-10 02:05:13

Answer 2
0 2016-07-13 09:10:05

Answer 3
0 2017-06-24 07:39:05
