
"Ad Hoc" lexical analyzer

For a project I am trying to create a simple lexical analyzer for a made-up programming language that is read in from a file. I asked a question earlier in the week about how to implement such a program and received an answer telling me to: create an input buffer and two output buffers; initialize two loop indices and increment them until I find the start of a token; once I have found the start, increment the second index until I find a whitespace character or a symbol, and then use a case statement to write to the two output files; finally, set the outer index equal to the inner one and continue scanning. I have done some research, and this method resembles the loop-and-switch, or "ad hoc", approach.

import java.io.*;

public class Lex {

    public static boolean contains(char[] a, char b){
        for (int i = 0; i < a.length; i++) {
            if(b == a[i])
                return true;
        }
        return false;
    } 
    public static void main(String args[]) throws FileNotFoundException, IOException{

        //Declaring token values as constant integers.
        final int T_DOUBLE = 0; 
        final int T_ELSE = 1;
        final int T_IF = 2; 
        final int T_INT = 3;
        final int T_RETURN = 4; 
        final int T_VOID = 5;
        final int T_WHILE = 6; 
        final int T_PLUS = 7;
        final int T_MINUS = 8; 
        final int T_MULTIPLICATION = 9;
        final int T_DIVISION = 10; 
        final int T_LESS = 11;
        final int T_LESSEQUAL = 12; 
        final int T_GREATER = 13;
        final int T_GREATEREQUAL = 14; 
        final int T_EQUAL = 16;
        final int T_NOTEQUAL = 17;
        final int T_ASSIGNOP = 18; 
        final int T_SEMICOLON = 19;
        final int T_PERIOD = 20; 
        final int T_LEFTPAREN = 21;
        final int T_RIGHTPAREN = 22; 
        final int T_LEFTBRACKET = 23;
        final int T_RIGHTBRACKET = 24; 
        final int T_LEFTBRACE = 25;
        final int T_RIGHTBRACE = 26; 
        final int T_ID = 27;
        final int T_NUM = 28;
        char[] letters_ = {'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D',
            'E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','_'};
        char[] numbers = {'0','1','2','3','4','5','6','7','8','9'};
        char[] symbols = {'+','-','*','/','<','>','!','=',':',',','.','(',')','[',']','{','}'};
        FileInputStream fstream = new FileInputStream("src\\testCode.txt");
        DataInputStream in = new DataInputStream(fstream);
        BufferedReader br = new BufferedReader(new InputStreamReader(in));
        BufferedWriter bw1 = new BufferedWriter(new FileWriter(new File("src\\output.txt"), true));
        BufferedWriter bw2 = new BufferedWriter(new FileWriter(new File("src\\output2.txt"), true));
        String scanner;
        String temp = "";
        int n = 0;
        while((scanner = br.readLine()) != null){
            for (int i = 0; i < scanner.length(); i++) {
                for (int j = i; j < scanner.length(); j++) {
                    if(contains(letters_,scanner.charAt(i)) || contains(numbers,scanner.charAt(i)) || contains(symbols,scanner.charAt(i))){
                        n++;
                        // avoid charAt() running past the end of the line when peeking ahead
                        if(j + 1 < scanner.length() &&
                                (scanner.charAt(j + 1) == ' ' || scanner.charAt(j + 1) == '\t')){
                            // end of a lexeme: the characters from i to j need to be
                            // classified and written to bw1 and bw2 here
                        }
                    }
                }

            }
        }

        in.close();
        bw1.close();
        bw2.close();


    }

}

My question is: how can I determine which token to assign to a word after I find a whitespace character or a symbol? Can I collect each character before the whitespace or symbol into a string and compare it that way? I have tried something similar, but it wrote my whole input file into the string, so my tokens would not match in my switch statement. Also, using this method, how can I safely ignore comments and comment blocks, since they should not be tokenized?

The classical approach to building lexers is a switch statement inside a loop. The basic idea is to process each character exactly once rather than rescanning it. The cases for A to Z and a to z can begin an identifier, so those cases must consume all the possible identifier characters until you hit one that isn't, assemble them into an identifier token, and return IDENTIFIER to the caller. Similarly, the cases 0 to 9 can begin a number, so you consume the number and return INTEGER or DOUBLE or whatever it was. The cases for space, tab, newline, form feed, and so on are whitespace, so you consume all the whitespace and continue the outer loop without returning at all. Everything else is punctuation, so you consume it, distinguishing the one-character operators from the two-character ones, and typically return the character value itself for the one-character operators and a special token value for the others. Don't forget to handle EOF correctly :-) Adjust the cases and rules to suit the language you are analyzing.
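The approach above can be sketched in Java as follows. The token spellings, the keyword list, and the sample input here are illustrative, not taken from the question's language:

```java
// Minimal loop-and-switch lexer sketch. Each character is examined once;
// identifier, number, and punctuation cases each consume a whole lexeme.
public class SwitchLexer {
    private final String src;
    private int pos = 0;

    SwitchLexer(String src) { this.src = src; }

    // Returns the next token as "TYPE:lexeme", or null at end of input.
    String nextToken() {
        while (pos < src.length()) {
            char c = src.charAt(pos);
            switch (c) {
                case ' ': case '\t': case '\n': case '\r':
                    pos++;                      // whitespace: consume and keep looping
                    continue;
                default:
                    if (Character.isLetter(c) || c == '_') {
                        int start = pos;
                        while (pos < src.length() &&
                               (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '_'))
                            pos++;              // consume the whole identifier
                        String word = src.substring(start, pos);
                        // keywords are just identifiers with reserved spellings
                        if (word.equals("if") || word.equals("while") || word.equals("int"))
                            return "KEYWORD:" + word;
                        return "ID:" + word;
                    }
                    if (Character.isDigit(c)) {
                        int start = pos;
                        while (pos < src.length() && Character.isDigit(src.charAt(pos)))
                            pos++;              // consume the whole number
                        return "NUM:" + src.substring(start, pos);
                    }
                    // punctuation: check two-char operators before one-char ones
                    if (c == '<' && pos + 1 < src.length() && src.charAt(pos + 1) == '=') {
                        pos += 2;
                        return "LESSEQUAL:<=";
                    }
                    pos++;
                    return "PUNCT:" + c;
            }
        }
        return null;                            // EOF
    }

    public static void main(String[] args) {
        SwitchLexer lex = new SwitchLexer("while (x <= 10)");
        for (String t = lex.nextToken(); t != null; t = lex.nextToken())
            System.out.println(t);
    }
}
```

Comment handling fits the same shape: on seeing `/`, peek one character ahead; if it is `/` or `*`, consume to end of line or to `*/` without returning a token.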

It depends on how complex you need your lexer to be. If you are, as you are now, splitting on whitespace, you could simply compare each lexeme against a series of regular expressions to see which one matches it. This is a simple way of doing it and not very efficient, but that might not factor into your decision.
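That idea can be sketched like this; the patterns below cover only a few token classes and are illustrative rather than a full grammar:

```java
import java.util.regex.Pattern;

// Classify each whitespace-split lexeme by trying regexes in order.
public class RegexLexer {
    static final Pattern KEYWORD = Pattern.compile("if|else|while|int|double|return|void");
    static final Pattern ID = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
    static final Pattern NUM = Pattern.compile("[0-9]+(\\.[0-9]+)?");

    static String classify(String lexeme) {
        if (KEYWORD.matcher(lexeme).matches()) return "KEYWORD"; // keywords first:
        if (ID.matcher(lexeme).matches()) return "ID";           // "if" also matches ID
        if (NUM.matcher(lexeme).matches()) return "NUM";
        return "SYMBOL";
    }

    public static void main(String[] args) {
        for (String lexeme : "while count 42 <=".split("\\s+"))
            System.out.println(lexeme + " -> " + classify(lexeme));
    }
}
```

Note the order of the tests matters: every keyword also matches the identifier pattern, so the keyword pattern must be tried first.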

A "real" lexer usually works as a finite automaton. If you know how to construct an automaton that recognizes a regular expression, you can combine several of these into a larger automaton that recognizes all of the expressions in a single pass, doing O(1) work per character. I have written a series of articles on this subject, if that is of interest. It's a complex but rewarding task.
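To make the automaton idea concrete, here is a tiny hand-built DFA that distinguishes identifiers from integers in one pass. A real lexer generator derives such transition functions automatically from the regular expressions; the state numbering here is an arbitrary choice for illustration:

```java
// states: 0 = start, 1 = in identifier, 2 = in number, -1 = reject
public class TinyDfa {
    static int step(int state, char c) {
        boolean letter = Character.isLetter(c) || c == '_';
        boolean digit = Character.isDigit(c);
        switch (state) {
            case 0: return letter ? 1 : digit ? 2 : -1;
            case 1: return (letter || digit) ? 1 : -1;  // identifiers may contain digits
            case 2: return digit ? 2 : -1;              // numbers may not contain letters
            default: return -1;
        }
    }

    // Run the DFA over the lexeme; the final state decides the token class.
    static String classify(String lexeme) {
        int state = 0;
        for (char c : lexeme.toCharArray()) {
            state = step(state, c);
            if (state == -1) return "REJECT";
        }
        return state == 1 ? "ID" : state == 2 ? "NUM" : "REJECT";
    }

    public static void main(String[] args) {
        System.out.println(classify("count1"));
        System.out.println(classify("42"));
        System.out.println(classify("1abc"));
    }
}
```

Each character causes exactly one table lookup, which is where the O(1) per-character cost comes from.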
