Lexical analysis using lex

Before I show what I have done, here is the assignment I have tried to do (I'm new, so I'm not really sure if I'm doing it all right).

1. Implement a lexical analyzer (using FLEX), as follows:
- The lexical analyzer supplies the services next_token() and back_token().
- The lexical analyzer reads text from the input file and identifies tokens. This
   happens when the function next_token() is called.
- When a token is identified in the input text, it should be stored in a data
   structure. For each token, the following attributes are saved:
   * token type
   * token lexeme
   * number of the line in the input text in which this token was found
- Blanks, tabs, and new lines are not tokens and should be ignored.
- For each token, print (on a separate line) its type (e.g. rel_op, number, etc.)
   and lexeme.
- Each operation, keyword, separation sign, and each type of number should be
   implemented as a token of a different kind.
- Kinds of tokens are coded with integer numbers, for example:
        #define ID_tok 1
        #define COMMA_tok 2

Using Flex, I have written this:

%{
#include <stdio.h>
#include <stdlib.h>   /* for atoi() */

int line = 1;          /* current input line, used when reporting tokens */

#define ID_tok       1
#define COMMA_tok    2
#define REL_OP_tok   3
#define NUMBER_tok   4
#define KEYWORDS_tok 5
%}

binary_ar_op  "*"|"/"|"+"|"-"
rel_op        "=="|"!="|">"|"<"|"<="|">="
id            [a-z][a-z0-9]*
number        [+-]?[0-9]+("."[0-9]+)?
keywords      "prabegin"|"parend"|"task"|"begin"|"end"|"integer"|"real"|"do"|"until"|"od"|"send"|"accept"

%%
\n              { line++; printf("\n%d:", line); }

[ \t]+          { /* blanks and tabs are not tokens: ignore them */ }

{keywords}      { /* must come before the id rule, or keywords match as ids */
                  printf("A keyword: %s at line %d\n", yytext, line); }

{binary_ar_op}  { printf("A binary_ar_op: %s at line %d\n", yytext, line); }

{rel_op}        { printf("A rel_op: %s at line %d\n", yytext, line); }

{id}            { printf("An id: %s at line %d\n", yytext, line); }

{number}        { printf("A number: %s (value %d) at line %d\n", yytext, atoi(yytext), line); }
%%

int yywrap(void)
{
    return 1;
}

int main(void)
{
    printf("Enter a string of data\n");
    yylex();
    return 0;
}

As you can see, I have already defined all the types I was supposed to. I do not understand how to implement next_token() and back_token(); some guidance would be great. I also saved the line number, but I guess they mean something other than what I wrote, and I also don't understand the #define part they wrote about.

I know that this question seems odd, but I got this assignment without any guidance or explanation, so I'm learning it on my own and I don't really understand all of it (although I know the theory). Thank you!

I did something very similar in our company project(s).

About tokens

I make enumerations for them and...
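As an illustration only (these names are invented for this sketch, not taken from the actual project), such an enumeration could look like:

// Hypothetical token kinds, playing the role of the #define codes in the assignment.
enum class TokenKind {
    Id,
    Comma,
    RelOp,
    BinaryArOp,
    Number,
    Keyword,
    EndOfInput
};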

About next_token()

My intention was to store the whole token-related information in an object with:

  • the actual token (an enumeration value)
  • the lexeme (a std::string)
  • the file position (consisting of a file name pointer, the line, and the column)

Additionally, I wanted to use smart pointers with these generated objects; not to mention, they should be C++ objects.
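A minimal sketch of such a token object, assuming std::shared_ptr as the smart pointer and the TokenKind enumeration sketched above (the real project code certainly looks different):

#include <memory>
#include <string>

// Hypothetical file position: file name pointer, line, and column.
struct FilePosition {
    const char *fileName = nullptr;
    int         line     = 0;
    int         column   = 0;
};

// Hypothetical token object carrying kind, lexeme, and position.
struct Token {
    TokenKind    kind;
    std::string  lexeme;
    FilePosition pos;
};

// Smart pointer to a token, in the spirit of the RefToken used further below.
using RefToken = std::shared_ptr<Token>;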

This is what I realized:

  1. It is easy to redefine the yylex() function. Thus, you can even rename it and change its signature.

  2. It is very difficult (if not impossible) to put this together with yacc/bison. The main issue was that data is passed from lex (generated code) to yacc/bison (generated code) using a C union (%union, if I remember right). A C union and C++ objects do not work well together. (One object in a C union may work, but multiple definitely not.)

Luckily for me, the second issue actually did not exist, because I use flex but write (or meanwhile generate) recursive descent parsers directly in C++.

So, how to solve the first issue? This is from my code:

/// redefines proto for generated yylex function.
#define YY_DECL \
  RF::YAMS::Sim::ActionScript::RefToken \
  RF::YAMS::Sim::ActionScript::Compiler::lex(yyscan_t yyscanner)

The documentation for this is in the flex manual. To find an explanation of how to redefine the yylex function, search it for "YY_DECL".

My parser calls lex() whenever it needs a new token.

Notes:

  1. In my case, I renamed yylex() and even made it a method of my parser class. (I did this to simplify the lexer's access to private parser variables.)

  2. I provided full scope operators because the generated lex code does not consider any namespace I use in my C++ code.

  3. The yyscan_t yyscanner parameter has to be there because I generate re-entrant scanners. You have to decide whether or not it should be there. (Instead, you could also provide other arguments.)

  4. The returned RefToken is a smart pointer to the produced token. (Smart pointers make it very easy to produce and consume tokens in different "places" without the danger of memory leaks.)

If the generated lexer shall be combined with a bison-generated parser, it is probably not as easy. In this case, I would use static variables and organize queues to pass values from the lexer to the parser. This would work, but is, of course, not as elegant and safe as the above method.
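A rough sketch of that fallback, assuming the RefToken type from above and invented helper names (emit_token(), take_token()): the lex actions queue the full token and hand bison only an integer code, and the bison actions drain the queue again.

#include <queue>

// Hypothetical file-static queue shared by the flex actions and the bison actions.
static std::queue<RefToken> pending_tokens;

// Called at the end of a lex rule action: remember the full token,
// return only the integer code that bison expects from yylex().
static int emit_token(const RefToken &tok, int bison_code)
{
    pending_tokens.push(tok);
    return bison_code;
}

// Called inside a bison action: fetch the full token belonging to that code.
static RefToken take_token()
{
    RefToken tok = pending_tokens.front();
    pending_tokens.pop();
    return tok;
}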

About back_token()

Once you have a parser which consumes the tokens, you may do with them whatever you want. In my case, one of the requirements was an option for back-tracking. Thus, from time to time I have to push tokens back to the input. For this, I simply stack them in the parser. If a new token is required, I first check whether this stack is empty. If it is not, the topmost token is popped; otherwise the lex() method is called to obtain a really new token. I guess a similar solution could be used to implement back_token() in your case.
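Transferred to the assignment, a hedged sketch of next_token()/back_token() along those lines could look like this (the free-standing lex() declaration and the global stack are assumptions about how the surrounding code is organized; RefToken is the token pointer sketched above):

#include <stack>

// Assumed interface to the flex-generated scanner (see the YY_DECL trick above);
// in a re-entrant scanner it would take the yyscan_t handle as an argument.
RefToken lex();

// Hypothetical pushback stack; in my code it lives inside the parser.
static std::stack<RefToken> pushed_back;

// Return the next token: prefer a pushed-back one, otherwise ask the lexer.
RefToken next_token()
{
    if (!pushed_back.empty()) {
        RefToken tok = pushed_back.top();
        pushed_back.pop();
        return tok;
    }
    return lex();
}

// Give a token back so that the following next_token() call returns it again.
void back_token(const RefToken &tok)
{
    pushed_back.push(tok);
}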

About blanks

There are actually two types of rules (i.e. rule actions) in my lexer:

  1. actions which end up with return new Token(...);

  2. actions which end up with break;

The latter I use to consume separators (i.e. blanks etc.) and even comments (the parser does not even see them). This works because the lexer is actually nothing else than a switch() wrapped in a for() loop. (I learnt this "trick" from the flex doc, where it is mentioned explicitly somewhere.)
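In flex terms, that split looks roughly like this (a sketch, not my actual rules; make_token() is a hypothetical helper that builds a RefToken for the kinds sketched above):

[ \t\r]+        { /* separator: consume it, produce nothing */ break; }
"//".*          { /* comment: the parser never sees it either */ break; }
[0-9]+          { return make_token(TokenKind::Number, yytext); }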

What else...

Besides YY_DECL, I also redefined YY_INPUT. I did this to use the lexer with a C++ std::istream (instead of yyin).
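A hedged sketch of that redefinition, assuming a file-static std::istream pointer as the input source, placed in the definitions section of the .l file:

%{
#include <istream>

// Hypothetical stream the scanner reads from instead of yyin.
static std::istream *input_stream = nullptr;

// Fill flex's buffer from the C++ stream; a result of 0 signals end of input.
#define YY_INPUT(buf, result, max_size)                      \
    {                                                        \
        input_stream->read((buf), (max_size));               \
        (result) = static_cast<int>(input_stream->gcount()); \
    }
%}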

IMHO, flex does provide a very comprehensive manual. However, whenever I'm in doubt, I look into the C file generated by flex. There are these horrible int arrays for the finite automaton, which I usually simply ignore. The rest is the infrastructure around them, and you will find your C actions (written in the lex rules) embedded somewhere. Examining the code around them may make things clearer.
