
Implementing a lexical analyzer

I have an assignment to implement a lexical analyzer for a language called c--. We must translate the c-- code into a series of tokens that are represented internally as integers, since those are easier to manipulate. Some lexical conventions of the language: there are keywords such as double, else, if, int, return, void, and while. There are also special symbols: + - * / < <= > >= == != = ; , . ( ) [ ] { } /* */ //. Identifiers begin with a letter or underscore, followed by any combination of letters, digits, and underscores. Whitespace separates tokens and is ignored. Numbers can be integers or decimals, and both line comments and block comments are allowed.

import java.io.*;
public class Lex {

    public static boolean contains(char[] a, char b){
        for (int i = 0; i < a.length; i++) {
            if(b == a[i])
                return true;
        }
        return false;
    } 
    public static void main(String args[]) throws FileNotFoundException, IOException{

        // Token values as constant integers, numbered consecutively so they
        // line up with the expected output below (17 is '=', 18 is ';', 19 is ',').
        final int T_DOUBLE = 0;
        final int T_ELSE = 1;
        final int T_IF = 2;
        final int T_INT = 3;
        final int T_RETURN = 4;
        final int T_VOID = 5;
        final int T_WHILE = 6;
        final int T_PLUS = 7;
        final int T_MINUS = 8;
        final int T_MULTIPLICATION = 9;
        final int T_DIVISION = 10;
        final int T_LESS = 11;
        final int T_LESSEQUAL = 12;
        final int T_GREATER = 13;
        final int T_GREATEREQUAL = 14;
        final int T_EQUAL = 15;
        final int T_NOTEQUAL = 16;
        final int T_ASSIGNOP = 17;
        final int T_SEMICOLON = 18;
        final int T_COMMA = 19;
        final int T_PERIOD = 20;
        final int T_LEFTPAREN = 21;
        final int T_RIGHTPAREN = 22;
        final int T_LEFTBRACKET = 23;
        final int T_RIGHTBRACKET = 24;
        final int T_LEFTBRACE = 25;
        final int T_RIGHTBRACE = 26;
        final int T_ID = 27;
        final int T_NUM = 28;
        char[] letters_ = {'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D',
            'E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','_'};
        char[] numbers = {'0','1','2','3','4','5','6','7','8','9'};
        // The language uses ';' as a special symbol, not ':'.
        char[] symbols = {'+','-','*','/','<','>','!','=',';',',','.','(',')','[',']','{','}'};
        FileInputStream fstream = new FileInputStream("src\\testCode.txt");
        DataInputStream in = new DataInputStream(fstream);
        BufferedReader br = new BufferedReader(new InputStreamReader(in));
        BufferedWriter bw1 = new BufferedWriter(new FileWriter(new File("src\\output.txt"), true));
        BufferedWriter bw2 = new BufferedWriter(new FileWriter(new File("src\\output2.txt"), true));
        String scanner;
        String temp = "";
        int n = 0;
        while((scanner = br.readLine()) != null){
            for (int i = 0; i < scanner.length(); i++) {
                for (int j = 0; j < scanner.length(); j++) {
                    if(contains(letters_,scanner.charAt(i)) || contains(numbers,scanner.charAt(i)) || contains(symbols,scanner.charAt(i))){
                        j++;
                        n++;
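                        // j was just advanced, so check the bound before reading the character.
                        if(j < scanner.length() && (scanner.charAt(j) == ' ' || scanner.charAt(j) == '\n' || scanner.charAt(j) == '\t')){
                            // A token ends here, but nothing records or classifies it yet.
                        }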
                    }

                }

            }
        }

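        // Closing the reader also closes the underlying streams; the writers must be closed too.
        br.close();
        bw1.close();
        bw2.close();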


    }

}

This is our test code:

int fact(int x) {
// recursive factorial function 
   if (x>1) 
      return x * fact(x-1);
   else return 1;
}

void main(void) {
/* CS 311 project 2
A lexical analyzer */
   int x, y, z;
   double _funny;
   x = get_integer();
   _Funny = get_double();
   if (x>0) 
      print_line(fact(x));
   else if (_funny != 3.14) 
      print_line(x*_funny);
}

This should be our output:

3 27 21 3 27 22 25 2 21 27 13 28 22 4 27 9 27 21 27 8 28 22 18 1 4 28 18 26 5 27 21 5 22 25 3 27 19 27 19 27 18 0 27 18 27 17 27 21 22 18 27 17 27 21 22 18 2 21 27 13 28 22 27 21 27 21 27 22 22 18 1 2 21 27 12 28 22 27 21 27 9 27 22 18 26

INT id leftparen INT id rightparen leftbrace IF leftparen id greater num rightparen RETURN id multiplication id leftparen id minus num rightparen semicolon ELSE RETURN num semicolon rightbrace VOID id leftparen VOID rightparen leftbrace INT id comma id comma id semicolon DOUBLE id semicolon id assignop id leftparen rightparen semicolon id assignop id leftparen rightparen semicolon IF leftparen id greater num rightparen id leftparen id leftparen id rightparen rightparen semicolon ELSE IF leftparen id notequal num rightparen id leftparen id multiplication id rightparen semicolon rightbrace

OK, I've written some code based on user John's suggestion. I'm still confused about how this will work. When I iterate the second loop to find whitespace or a symbol, how do I know what type of token came before the whitespace or symbol? I've tried to put the characters I skip into a string and use a case statement to determine the type, but I think it writes the whole file into the string, so my tokens never match. Also, how can the method find comments and safely ignore them?

There are a couple of different ways to approach this program. Without writing the code, I will try to explain what you need to do.

From the example you have submitted, your instructor has given you the key to the program: he has given you the output, and from it you can construct a state table.

You can either go through the output and do this by hand to check your answer, or write a small program to do it for you.

This is a table with the state number on the left and the corresponding word on the right.

         3  int
         27 id
         21 leftparen
         22 rightparen
         25 leftbrace
         2  if
         13 greater

and so on.
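If you would rather keep the table in code than on paper, something like the following could work. It is only a sketch: the class name `TokenTable` and the method `nameOf` are made up for illustration, and the numbering (0 to 6 for keywords, 7 to 26 for symbols, 27 for id, 28 for num) is my reading of the expected output in the question, not something the assignment text confirms.

```java
class TokenTable {
    // Index = token code, value = the word printed in the second output.
    // Assumed numbering, inferred from the expected output in the question.
    static final String[] NAMES = {
        "DOUBLE", "ELSE", "IF", "INT", "RETURN", "VOID", "WHILE",            // 0-6
        "plus", "minus", "multiplication", "division",                        // 7-10
        "less", "lessequal", "greater", "greaterequal",                       // 11-14
        "equal", "notequal", "assignop", "semicolon", "comma", "period",      // 15-20
        "leftparen", "rightparen", "leftbracket", "rightbracket",             // 21-24
        "leftbrace", "rightbrace", "id", "num"                                // 25-28
    };

    // Looks up the word for a token code, rejecting codes outside the table.
    static String nameOf(int code) {
        if (code < 0 || code >= NAMES.length) {
            throw new IllegalArgumentException("Unknown token code: " + code);
        }
        return NAMES[code];
    }

    public static void main(String[] args) {
        int[] codes = {3, 27, 21, 3, 27, 22, 25}; // codes for "int fact(int x) {"
        StringBuilder words = new StringBuilder();
        for (int code : codes) {
            words.append(nameOf(code)).append(' ');
        }
        System.out.println(words.toString().trim());
    }
}
```

Running the first few numbers of the expected output through `nameOf` and comparing with the word output is a quick way to check that the table is right.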

You will need to create:
1 input buffer
2 output buffers
2 loops, one outer and one inner
1 case statement that corresponds to the state table
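To give an idea of what the case statement might look like, here is a rough sketch. The class `TokenClassifier` and the method `classify` are hypothetical names, the token numbers are again assumed from the expected output, and it only produces the number; the word buffer would be filled in the same way.

```java
class TokenClassifier {
    static final int T_ID = 27;
    static final int T_NUM = 28;

    // Maps one lexeme (already cut out of the input) to its token code.
    // The numbering is an assumption based on the expected output.
    static int classify(String lexeme) {
        if (lexeme.isEmpty()) {
            throw new IllegalArgumentException("Empty lexeme");
        }
        switch (lexeme) {
            case "double": return 0;
            case "else":   return 1;
            case "if":     return 2;
            case "int":    return 3;
            case "return": return 4;
            case "void":   return 5;
            case "while":  return 6;
            case "+":  return 7;
            case "-":  return 8;
            case "*":  return 9;
            case "/":  return 10;
            case "<":  return 11;
            case "<=": return 12;
            case ">":  return 13;
            case ">=": return 14;
            case "==": return 15;
            case "!=": return 16;
            case "=":  return 17;
            case ";":  return 18;
            case ",":  return 19;
            case ".":  return 20;
            case "(":  return 21;
            case ")":  return 22;
            case "[":  return 23;
            case "]":  return 24;
            case "{":  return 25;
            case "}":  return 26;
            default:
                // Not a keyword or symbol: a leading digit means a number,
                // anything else is treated as an identifier.
                return Character.isDigit(lexeme.charAt(0)) ? T_NUM : T_ID;
        }
    }

    public static void main(String[] args) {
        StringBuilder numbers = new StringBuilder(); // the number buffer
        for (String lexeme : new String[] {"int", "fact", "(", "int", "x", ")", "{"}) {
            numbers.append(classify(lexeme)).append(' ');
        }
        System.out.println(numbers.toString().trim());
    }
}
```

Because keywords are matched as whole strings, a name such as `integer` falls through to the default branch and becomes an identifier rather than `int`.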

When you go through the input buffer, initialize the outer loop and the inner loop, then compare the first character and determine whether it is a valid character. If it is not, increment the loops until you find a valid character.

Once you find a valid character, it is the beginning of a token. Then find the end of the token by incrementing the inner loop until you reach whitespace or a special symbol. Then use a case statement to output the number to one buffer and the corresponding word to the second buffer.
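The search for the start and end of a token could be sketched roughly as below. `LexemeSplitter`, `split`, and `isWordChar` are names I made up, and the handling of two-character operators and decimals is my own assumption about what the assignment wants. Comments are not handled here.

```java
import java.util.ArrayList;
import java.util.List;

class LexemeSplitter {
    static final String SYMBOLS = "+-*/<>!=;,.()[]{}";

    // Letters, digits and underscore can appear inside identifiers and numbers.
    static boolean isWordChar(char ch) {
        return ch == '_' || (ch >= 'a' && ch <= 'z')
                || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9');
    }

    // Cuts one line into lexemes; 'outer' marks the start, 'inner' the end (exclusive).
    static List<String> split(String line) {
        List<String> lexemes = new ArrayList<>();
        int outer = 0;
        while (outer < line.length()) {
            char c = line.charAt(outer);
            boolean symbol = SYMBOLS.indexOf(c) >= 0;
            if (!symbol && !isWordChar(c)) { // whitespace or an invalid character
                outer++;
                continue;
            }
            int inner = outer + 1;
            if (symbol) {
                // Two-character operators: <= >= == !=
                if ("<>=!".indexOf(c) >= 0 && inner < line.length()
                        && line.charAt(inner) == '=') {
                    inner++;
                }
            } else {
                while (inner < line.length() && isWordChar(line.charAt(inner))) {
                    inner++;
                }
                // A number may continue with ".digits" even though '.' is a symbol.
                if (Character.isDigit(c) && inner + 1 < line.length()
                        && line.charAt(inner) == '.'
                        && Character.isDigit(line.charAt(inner + 1))) {
                    inner++;
                    while (inner < line.length() && Character.isDigit(line.charAt(inner))) {
                        inner++;
                    }
                }
            }
            lexemes.add(line.substring(outer, inner));
            outer = inner; // 'inner' is exclusive, so the next token starts right here
        }
        return lexemes;
    }

    public static void main(String[] args) {
        System.out.println(split("return x * fact(x-1);"));
        // prints [return, x, *, fact, (, x, -, 1, ), ;]
    }
}
```

Each lexeme is a fresh substring of the current line, so the buffer never accumulates the whole file, which seems to be the problem described in the question.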

Then print out the number buffer. Then print out the word buffer.

Then set the outer loop to inner loop + 1 and make the inner loop equal to the outer loop.

Continue until you reach the end of the file. If the buffers match your teacher's output, you are finished. If not, you have a logic error; check which value is wrong and look at the part of the program that produces it.
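Finding the wrong value by eye is tedious with this many numbers, so a tiny helper can do the comparison. This is just a sketch with invented names (`OutputCheck`, `firstMismatch`); it assumes both outputs are whitespace-separated lists like the ones in the question.

```java
class OutputCheck {
    // Returns the position (0-based) of the first token that differs,
    // or -1 if both outputs contain exactly the same tokens.
    static int firstMismatch(String expected, String actual) {
        String[] e = expected.trim().split("\\s+");
        String[] a = actual.trim().split("\\s+");
        int shared = Math.min(e.length, a.length);
        for (int i = 0; i < shared; i++) {
            if (!e[i].equals(a[i])) {
                return i;
            }
        }
        // Same prefix: either identical, or one output is longer than the other.
        return e.length == a.length ? -1 : shared;
    }

    public static void main(String[] args) {
        String expected = "3 27 21 3 27 22 25";
        String actual = "3 27 21 3 27 21 25"; // deliberately wrong at position 5
        System.out.println(firstMismatch(expected, actual)); // prints 5
    }
}
```

The returned position tells you which token to look up in the test code, and from there which branch of the case statement to inspect.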

It's been 20 years, guys.

A pretty familiar task, except that I was writing an LL(k) analyzer... In your case, try looking into formal grammars and the like; your step is pretty much the required step before performing analysis with those grammars. Maybe some working open-source tools like lex and flex would help.

IMHO, the easiest way is to read the input file character by character into a string and check whether this string matches one of your regexps. If it does, write the appropriate code to the output and clear the string you are using as a buffer. There are two problems with this approach. First, it runs in O(n*m) in the worst case, where n is the length of your text and m is the number of regular expressions you have. Second, you must not use prefixed expressions: I mean that no expression may have another expression as its prefix (beginning), or the longer expression would be unreachable.
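A variation on this idea, sketched below, tries the patterns at the current position instead of growing a buffer one character at a time, and tries longer operators before shorter ones to get around the prefix problem. Everything here is an assumption for illustration: the class `RegexLexer`, the token numbers (taken from the expected output), and the choice to match operators with plain string comparison rather than regexps. Comments and whitespace are simply one more pattern whose match is thrown away.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLexer {
    static final int T_ID = 27, T_NUM = 28;
    static final String[] KEYWORDS =
        {"double", "else", "if", "int", "return", "void", "while"}; // codes 0-6
    // Longer operators come first so "<=" is not read as "<" followed by "=".
    static final String[] OPERATORS = {"<=", ">=", "==", "!=", "+", "-", "*", "/",
        "<", ">", "=", ";", ",", ".", "(", ")", "[", "]", "{", "}"};
    static final int[] OPERATOR_CODES = {12, 14, 15, 16, 7, 8, 9, 10,
        11, 13, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26};

    // Whitespace, line comments and block comments are matched and discarded.
    static final Pattern SKIPPED = Pattern.compile("\\s+|//[^\\n]*|(?s:/\\*.*?\\*/)");
    static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?");
    static final Pattern WORD = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Returns the end of a match that starts exactly at pos, or -1 if there is none.
    static int matchEnd(Pattern p, String src, int pos) {
        Matcher m = p.matcher(src);
        m.region(pos, src.length());
        return m.lookingAt() ? m.end() : -1;
    }

    static int keywordOrId(String word) {
        for (int i = 0; i < KEYWORDS.length; i++) {
            if (KEYWORDS[i].equals(word)) return i;
        }
        return T_ID;
    }

    static List<Integer> tokenize(String src) {
        List<Integer> out = new ArrayList<>();
        int pos = 0;
        while (pos < src.length()) {
            int end = matchEnd(SKIPPED, src, pos);
            if (end > pos) { pos = end; continue; }
            end = matchEnd(NUMBER, src, pos);
            if (end > pos) { out.add(T_NUM); pos = end; continue; }
            end = matchEnd(WORD, src, pos);
            if (end > pos) { out.add(keywordOrId(src.substring(pos, end))); pos = end; continue; }
            int found = -1;
            for (int i = 0; i < OPERATORS.length && found < 0; i++) {
                if (src.startsWith(OPERATORS[i], pos)) found = i;
            }
            if (found < 0) {
                throw new IllegalArgumentException("Unexpected character at position " + pos);
            }
            out.add(OPERATOR_CODES[found]);
            pos += OPERATORS[found].length();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/* a block\ncomment */ x <= 3.14; // done"));
        // prints [27, 12, 28, 18]
    }
}
```

Note that this version reads the whole source as one string rather than line by line, which is what lets the block-comment pattern span several lines.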
