
Creating a Lexical Analyzer in C

I am trying to create a lexical analyzer in C. The program reads another program as input to convert it into tokens, and the source code is here:

#include <stdio.h>
#include <conio.h>
#include <string.h>

int main()  {
    FILE *fp;
    char read[50];
    char seprators [] = "\n";
    char *p;
    fp=fopen("C:\\Sum.c", "r");

    clrscr();

    while ( fgets(read, sizeof(read)-1, fp) !=NULL )    {
        //Get the first token
        p=strtok(read, seprators);

        //Get and print other tokens
        while (p!=NULL) {
            printf("%s\n", p);
            p=strtok(NULL, seprators);
        }
    }

    return 0;
}

And the contents of Sum.c are:

#include <stdio.h>

int main()  {
    int x;
    int y;
    int sum;

    printf("Enter two numbers\n");
    scanf("%d%d", &x, &y);

    sum=x+y;

    printf("The sum of these numbers is %d", sum);

    return 0;
}

I am not getting the correct output; I only see a blank screen where the output should be.

Can anybody please tell me where I am going wrong? Thank you so much in advance.

You've asked a few questions since this one, so I guess you've moved on. Still, there are a few things worth noting about your problem and your start at a solution that can help others tackling a similar problem. You'll also find that people can often be slow at answering things that are obviously homework. We often wait until homework deadlines have passed. :-)

First, I noted you used a few features specific to the Borland C compiler which are non-standard and would not make the solution portable or generic. You could solve the problem without them just fine, and that is usually a good choice. For example, you used #include <conio.h> just to clear the screen with clrscr(), which is probably unnecessary and not relevant to the lexer problem.

I tested the program, and as written it works! It transcribes all the lines of the file Sum.c to stdout. If you only saw a blank screen, it is because it could not find the file: either you did not write it to your C:\ directory or it had a different name. As already mentioned by @WhozCraig, you need to check that the file was found and opened properly.
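For example, a minimal check right after the fopen call might look like this (a sketch; the error handling and message are my own choices, not from your program):

fp = fopen("C:\\Sum.c", "r");
if (fp == NULL) {
    perror("C:\\Sum.c");   /* explains why the file could not be opened */
    return 1;              /* stop instead of reading from a NULL FILE* */
}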

I see you are using the C function strtok to divide the input up into tokens. There are some nice examples of using it in the documentation that you could include in your code, which do more than your simple case. As mentioned by @Grijesh Chauhan, there are more separators to consider than \n, or end-of-line. What about spaces and tabs, for example?
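For instance, extending the separator string to cover spaces and tabs as well as newlines would look like this (keeping the variable name from your code):

char seprators[] = " \t\n";   /* space, tab and newline */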

However, in programs, things are not always separated by spaces and lines. Take this example:

result=(number*scale)+total;

If we only used white space as a separator, it would not identify the individual words and would only pick up the whole expression, which is obviously not tokenization. We could add these symbols to the separator list:

char seprators [] = "\n=(*)+;";

Then your code would pick out those words too. There is still a flaw in that strategy, because in programming languages those symbols are also tokens that need to be identified. The problem with programming language tokenization is that there are no clear separators between tokens.
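You can see the flaw by feeding that expression to strtok with the extended separator list. This small stand-alone demonstration (my own, not part of your original program) prints the identifiers but silently discards the =, (, *, ), + and ; characters, which should have been tokens themselves:

#include <stdio.h>
#include <string.h>

int main()  {
    char line[] = "result=(number*scale)+total;";
    char seprators[] = "\n=(*)+;";
    char *p = strtok(line, seprators);

    while (p != NULL) {
        printf("%s\n", p);   /* prints: result, number, scale, total */
        p = strtok(NULL, seprators);
    }
    return 0;
}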

There is a lot of theory behind this, but basically we have to write down the patterns that form the basis of the tokens we want to recognise, and not look at the gaps between them, because as has been shown, there aren't any! These patterns are normally written as regular expressions. Computer Science theory tells us that we can use finite state automata to match these regular expressions. Writing a lexer involves a particular style of coding, which looks like this:

while ( NOT <<EOF>> ) {
    switch ( next_symbol() ) {

        case state_symbol[1]:
            ....
            break;

        case state_symbol[2]:
            ....
            break;

        default:
            error(diagnostic);
    }
}
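To make that style concrete, here is a minimal, self-contained sketch of such a scanner. The token class names and the sample input are illustrative assumptions, not something from your assignment; it recognises identifiers, integer literals and a few single-character symbols by classifying characters, which is exactly the pattern-matching idea described above:

#include <stdio.h>
#include <ctype.h>

int main()  {
    const char *src = "sum=x+y;";   /* illustrative input, not your file */
    const char *p = src;

    while (*p != '\0') {                            /* NOT <<EOF>> */
        if (isspace((unsigned char)*p)) {
            p++;                                    /* skip gaps between tokens */
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;                  /* identifier: [A-Za-z_][A-Za-z0-9_]* */
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
            printf("IDENT  %.*s\n", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {
            const char *start = p;                  /* number: [0-9]+ */
            while (isdigit((unsigned char)*p))
                p++;
            printf("NUMBER %.*s\n", (int)(p - start), start);
        } else {
            switch (*p) {                           /* single-character tokens */
            case '=': case '+': case '-': case '*':
            case '(': case ')': case ';':
                printf("SYMBOL %c\n", *p);
                break;
            default:
                printf("ERROR  unexpected '%c'\n", *p);
                break;
            }
            p++;
        }
    }
    return 0;
}

For the input sum=x+y; it prints one token per line; a full lexer grows from here by adding more patterns (keywords, multi-character operators, comments) and, eventually, the explicit states of the automaton.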

So, now, perhaps the value of the academic assignment becomes clearer.
