
“Ad Hoc” lexical analyzer

So for a project I am trying to create a simple lexical analyzer for a fake programming language that is read in from a file. I asked a question earlier in the week about how to implement such a program and received an answer telling me to: create an input buffer and two output buffers; initialize two loop counters and increment them until I find the start of a token; once I have found the start, increment the second counter until I find a whitespace character or a symbol; then use a case statement to write to the two output files, set the outer counter equal to the inner one, and continue scanning. I've done some research, and this method is similar to the loop-and-switch or "ad hoc" method.

import java.io.*;

public class Lex {

    // Returns true if character b appears anywhere in array a.
    public static boolean contains(char[] a, char b){
        for (int i = 0; i < a.length; i++) {
            if(b == a[i])
                return true;
        }
        return false;
    } 
    public static void main(String[] args) throws IOException {

        //Declaring token values as constant integers.
        final int T_DOUBLE = 0; 
        final int T_ELSE = 1;
        final int T_IF = 2; 
        final int T_INT = 3;
        final int T_RETURN = 4; 
        final int T_VOID = 5;
        final int T_WHILE = 6; 
        final int T_PLUS = 7;
        final int T_MINUS = 8; 
        final int T_MULTIPLICATION = 9;
        final int T_DIVISION = 10; 
        final int T_LESS = 11;
        final int T_LESSEQUAL = 12; 
        final int T_GREATER = 13;
        final int T_GREATEREQUAL = 14; 
        final int T_EQUAL = 16;
        final int T_NOTEQUAL = 17;
        final int T_ASSIGNOP = 18; 
        final int T_SEMICOLON = 19;
        final int T_PERIOD = 20; 
        final int T_LEFTPAREN = 21;
        final int T_RIGHTPAREN = 22; 
        final int T_LEFTBRACKET = 23;
        final int T_RIGHTBRACKET = 24; 
        final int T_LEFTBRACE = 25;
        final int T_RIGHTBRACE = 26; 
        final int T_ID = 27;
        final int T_NUM = 28;
        // Characters that can appear in identifiers, the decimal digits, and the
        // symbols/punctuation recognized by the language.
        char[] letters_ = {'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D',
            'E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','_'};
        char[] numbers = {'0','1','2','3','4','5','6','7','8','9'};
        char[] symbols = {'+','-','*','/','<','>','!','=',':',',','.','(',')','[',']','{','}'};
        // Open the source file for reading and two output files (appending).
        BufferedReader br = new BufferedReader(new FileReader("src\\testCode.txt"));
        BufferedWriter bw1 = new BufferedWriter(new FileWriter(new File("src\\output.txt"), true));
        BufferedWriter bw2 = new BufferedWriter(new FileWriter(new File("src\\output2.txt"), true));
        String scanner;
        String temp = "";
        int n = 0;
        while((scanner = br.readLine()) != null){
            for (int i = 0; i < scanner.length(); i++) {
                // i marks the start of a candidate token; j scans ahead from i
                // looking for the whitespace or symbol that ends it.
                for (int j = i; j < scanner.length(); j++) {
                    if(contains(letters_,scanner.charAt(i)) || contains(numbers,scanner.charAt(i)) || contains(symbols,scanner.charAt(i))){
                        j++;
                        n++;
                        // readLine() strips the '\n', so only a space or tab can end a token
                        // here; also guard against stepping past the end of the line.
                        if(j >= scanner.length() || scanner.charAt(j) == ' ' || scanner.charAt(j) == '\t'){
                            // TODO: the lexeme scanner.substring(i, j) ends here;
                            // classify it and write its token to bw1/bw2.
                        }
                    }
                }
            }
        }

        br.close();
        bw1.close();
        bw2.close();


    }

}

My question is: how can I determine which token to assign to a word once I find a whitespace character or symbol? Can I put each character before the whitespace or symbol into a string and compare it like that? I've tried something similar, but it wrote my whole input file into the string, so my tokens did not match in my switch statement. Also, using this method, how can I safely ignore comments and comment blocks, since they should not be tokenized?
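To make the "put the characters in a string and compare" idea concrete, here is a minimal, self-contained sketch (the class name, the keywords map, and the classify helper are all invented for illustration, reusing the token numbering from the code above). It also shows one simple way to ignore // line comments: cut the line off at the // before scanning it; a /* ... */ block comment would need a flag that stays set across lines until the closing */ is seen.

import java.util.HashMap;
import java.util.Map;

public class Classify {

    // Hypothetical keyword table mapping reserved words to the token codes above.
    static final Map<String, Integer> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("double", 0);   // T_DOUBLE
        KEYWORDS.put("else",   1);   // T_ELSE
        KEYWORDS.put("if",     2);   // T_IF
        KEYWORDS.put("int",    3);   // T_INT
        KEYWORDS.put("return", 4);   // T_RETURN
        KEYWORDS.put("void",   5);   // T_VOID
        KEYWORDS.put("while",  6);   // T_WHILE
    }

    // Classify one lexeme that was delimited by whitespace or a symbol.
    static int classify(String lexeme) {
        Integer keyword = KEYWORDS.get(lexeme);
        if (keyword != null) return keyword;                  // reserved word
        if (lexeme.matches("[0-9]+(\\.[0-9]+)?")) return 28;  // T_NUM
        return 27;                                            // T_ID
    }

    public static void main(String[] args) {
        // A // comment is ignored by cutting the line off before scanning it.
        String line = "int count = 42; // not tokenized";
        int comment = line.indexOf("//");
        if (comment >= 0) line = line.substring(0, comment);
        for (String lexeme : line.split("[\\s;=]+")) {
            if (!lexeme.isEmpty())
                System.out.println(lexeme + " -> token " + classify(lexeme));
        }
    }
}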

The classical approach to building lexers is a switch statement inside a loop. The basic idea is to process each character exactly once rather than rescanning it. The cases for A to Z and a to z can begin an identifier, so those cases suck in all the possible identifier characters until you hit one that isn't, assemble them into an identifier token, and return IDENTIFIER to the caller. Similarly, the cases for 0 to 9 can begin a number, so you suck in the number and return INTEGER or DOUBLE or whatever it was. The cases for space, tab, newline, form feed, etc., are whitespace, so you suck up all the whitespace and continue the outer loop without returning at all. All the others are punctuation, so you suck them up, sorting the one-character tokens from the two-character ones, and typically return the character value itself for the one-character tokens and a special token value for the others. Don't forget to handle EOF correctly :-) Adjust the cases and rules to suit the language you are analyzing.
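As a rough sketch of that loop-and-switch shape (the class, the token values, and the nextToken method below are made up for illustration, not taken from the question's code):

import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// A loop-and-switch scanner: every character is read exactly once and
// dispatched on; multi-character tokens "suck in" their remaining characters.
class SwitchLexer {
    // Token codes above the char range so they cannot collide with
    // single-character tokens, which are returned as the character itself.
    static final int T_EOF = -1, T_ID = 256, T_NUM = 257, T_LESSEQUAL = 258;

    private final PushbackReader in;
    String lexeme;                        // text of the last T_ID / T_NUM returned

    SwitchLexer(String source) {
        in = new PushbackReader(new StringReader(source));
    }

    int nextToken() throws IOException {
        for (;;) {
            int c = in.read();
            switch (c) {
                case -1:
                    return T_EOF;
                case ' ': case '\t': case '\n': case '\r': case '\f':
                    continue;             // whitespace: keep looping, return nothing
                case '<': {               // sort one-char from two-char punctuation
                    int next = in.read();
                    if (next == '=') return T_LESSEQUAL;
                    if (next != -1) in.unread(next);
                    return '<';
                }
                default:
                    if (Character.isLetter(c) || c == '_') {      // identifier
                        StringBuilder sb = new StringBuilder();
                        while (c != -1 && (Character.isLetterOrDigit((char) c) || c == '_')) {
                            sb.append((char) c);
                            c = in.read();
                        }
                        if (c != -1) in.unread(c);
                        lexeme = sb.toString();
                        return T_ID;      // a keyword-table lookup could refine this
                    }
                    if (Character.isDigit(c)) {                   // number
                        StringBuilder sb = new StringBuilder();
                        while (c != -1 && Character.isDigit((char) c)) {
                            sb.append((char) c);
                            c = in.read();
                        }
                        if (c != -1) in.unread(c);
                        lexeme = sb.toString();
                        return T_NUM;
                    }
                    return c;             // any other punctuation, as itself
            }
        }
    }
}

The caller simply loops on nextToken() until it returns T_EOF; single-character punctuation comes back as the character's own value, which keeps the list of named token constants short.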

It depends on how complex you need your lexer to be. If you are, as you are now, splitting on whitespace, you could simply compare each lexeme against a series of regular expressions to see which one matches it. This is a simple way of doing it and not very efficient, but that might not factor into your decision.
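A minimal sketch of that split-on-whitespace, match-against-regexes approach (the pattern names and the sample input are made up for illustration); the first pattern that matches a lexeme decides its kind:

import java.util.LinkedHashMap;
import java.util.Map;

// Split the input on whitespace and test each lexeme against a list of
// regular expressions; the first pattern that matches wins. Simple, but it
// re-examines every lexeme against every pattern, so it is not the fastest.
public class RegexLexer {
    public static void main(String[] args) {
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("KEYWORD", "if|else|while|return|int|double|void");
        patterns.put("NUM",     "[0-9]+(\\.[0-9]+)?");
        patterns.put("ID",      "[A-Za-z_][A-Za-z0-9_]*");
        patterns.put("SYMBOL",  "[+\\-*/<>=!;.,(){}\\[\\]]+");

        String input = "while ( x <= 10 ) x = x + 1 ;";
        for (String lexeme : input.trim().split("\\s+")) {
            String kind = "UNKNOWN";
            for (Map.Entry<String, String> p : patterns.entrySet()) {
                if (lexeme.matches(p.getValue())) { kind = p.getKey(); break; }
            }
            System.out.println(lexeme + " -> " + kind);
        }
    }
}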

A "real" lexer usually works as a finite automata. If you know how to construct an automata that can recognize a regular expression you can combine several of these into a larger automata which recognizes several expressions in O(1) complexity. I have written a series of articles on this subject, if that is of interest. It's a complex but rewarding task.
