
How to deal with operators while tokenizing in Java (StreamTokenizer)

I'm writing a tokenizer in Java that has to deal with operators, and whitespace between tokens is not required.

I need to recognize something like "<=" as a token, while also recognizing "<" and "=".

Right now I have:

if (token == '<') {
    if (nextToken == '=') {
        this.tokenList.add(27); // <=
    } else {
        // add the two tokens separately
    }
}

Is there any way for StreamTokenizer to do this on its own? I've read through the API, but I don't see anything.

Can I specify a combination of tokens that should be counted as one? Ideally, getNextToken would consume both tokens at once.

Thanks!

What StreamTokenizer gives you is the functionality of a basic lexer. You have to build your higher-level version on top of it.

You have to use nextToken() and pushBack() judiciously. For example, the code below takes care of <, << and <=. When you see the operator <, look ahead in the stream for the next token; if it is not a following < or =, push the look-ahead token back into the stream.

>> Sample Code

import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class LexerTest 
{
    private StringReader r;

    public LexerTest(StringReader stringReader) {
        r = stringReader;
    }

    public static void main(String[] args) throws IOException 
    {
        String s = "test = test1 + (test2 * test3 * (test4 - 2);";
        new LexerTest(new StringReader(s)).printTokens();

        System.out.println("\n### Test 2 ###\n");
        s = "test = if(test1 < test2){ test3 = (test4 - 2);}";
        new LexerTest(new StringReader(s)).printTokens();

        System.out.println("\n### Test 3 ###\n");
        s = "test = if(test1 <= test2){ test3 = (test4 - 2);}";
        new LexerTest(new StringReader(s)).printTokens();

        System.out.println("\n### Test 4 ###\n");
        s = "test = if(test1 < test2){ test3 = (test4 << 2);}";
        new LexerTest(new StringReader(s)).printTokens();
    }

    private void printTokens() throws IOException 
    {
        StreamTokenizer st = new StreamTokenizer(r);
        st.eolIsSignificant(true);

        int token;
        // Read inside the loop condition so that the first token is not skipped.
        while ((token = st.nextToken()) != StreamTokenizer.TT_EOF) 
        {
            switch (token) 
            {
            case StreamTokenizer.TT_NUMBER:
                double num = st.nval;
                System.out.println("Number found: " + num);
                break;
            case StreamTokenizer.TT_WORD:
                String word = st.sval;
                System.out.println("Word found: " + word);
                break;
            case '+':
                break;
            case '-':
                break;
            case '/':
                break;
            case '*':
                break;
            case '<':
            {
                int t = st.nextToken();
                switch(t)
                {
                case '=':
                    System.out.println("<=");
                    break;
                case '<':
                    System.out.println("<<");
                    break;
                default:
                    // Not part of a two-character operator: return it to the stream.
                    st.pushBack();
                    System.out.println("<");
                    break;
                }
            }
            }
        }

    }
}

Hope this will help.

This isn't a typical scenario for the provided tokenizer classes. It is more like something that a full-blown parser has to handle. Even if you need to build such a tokenizer by hand, you may find it educational to study code produced by parser generators such as javacc or antlr. Focus on how they handle "lookahead", which is what you are asking about here.

Unless this is a homework problem where you aren't allowed to use a parser generator, you will get better results by using one.

nextToken() skips whitespace, so "++" and "+ +" will be recognized as the same!
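To see this for yourself, here is a small sketch. The class name WhitespaceDemo and the dump helper are made up for illustration. It prints the raw token stream for "a <= b" and "a < = b" with the default settings, and then again after calling ordinaryChar(' '), which makes the tokenizer report each space as its own token so you can tell whether two operator characters were really adjacent.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class WhitespaceDemo {

    // Hypothetical helper: renders every raw token as "[token]".
    static String dump(String input, boolean keepSpaces) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        if (keepSpaces) {
            st.ordinaryChar(' '); // a space is now returned as a token instead of being skipped
        }
        StringBuilder out = new StringBuilder();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.append('[').append(st.sval).append(']');
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.append('[').append(st.nval).append(']');
                } else {
                    out.append('[').append((char) st.ttype).append(']');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("a <= b", false));  // [a][<][=][b]
        System.out.println(dump("a < = b", false)); // [a][<][=][b]
        System.out.println(dump("a <= b", true));   // [a][ ][<][=][ ][b]
        System.out.println(dump("a < = b", true));  // [a][ ][<][ ][=][ ][b]
    }
}
```

Whether "< =" should count as "<=" depends on your grammar; if it should not, keeping spaces as tokens is one way to detect the difference.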

StreamTokenizer is a very basic tool for handling this.

You can create your own lookAhead function to serve your purpose.

When you read a '<', call your lookahead and act according to whether or not a '=' follows.

You could use a stack to save your previous state.

PS: This gets much more complicated with bigger expressions. If you want more functionality, you should definitely delve into lexers and parsers.
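A minimal sketch of such a lookahead function, assuming a single token of lookahead is enough. The names LookAheadDemo, matchNext and operators are invented here. StreamTokenizer.pushBack() only remembers one token, so for deeper lookahead you would need your own stack or queue of saved tokens.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class LookAheadDemo {

    // Hypothetical helper: consumes the next token only if it is the expected character.
    static boolean matchNext(StreamTokenizer st, int expected) throws IOException {
        if (st.nextToken() == expected) {
            return true;
        }
        st.pushBack(); // not a match: the next nextToken() call returns this token again
        return false;
    }

    // Collects only the '<' family of operators, to keep the example short.
    static List<String> operators(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> found = new ArrayList<>();
        try {
            int token;
            while ((token = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (token == '<') {
                    if (matchNext(st, '=')) {
                        found.add("<=");
                    } else if (matchNext(st, '<')) {
                        found.add("<<");
                    } else {
                        found.add("<");
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(operators("a <= b < c << d")); // [<=, <, <<]
    }
}
```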

It looks like StreamTokenizer is a bit on the basic side.

I'd recommend you build a lexer on top of StreamTokenizer. This lexer would give you a stream of actual tokens in the usual sense. That is, <= would be given as a single token, not as two separate tokens.
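Roughly what I have in mind, as a sketch only. OperatorLexer, its operator table and the choice to return tokens as plain strings are my own assumptions, not anything StreamTokenizer provides. Note that it also turns '/' and '-' into ordinary characters, because by default StreamTokenizer treats '/' as a comment start and '-' as part of a number.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class OperatorLexer {
    // Hypothetical operator table; extend it to match your grammar.
    private static final Set<String> TWO_CHAR_OPS = Set.of("<=", ">=", "==", "!=", "<<", ">>");

    private final StreamTokenizer st;

    public OperatorLexer(String input) {
        st = new StreamTokenizer(new StringReader(input));
        st.ordinaryChar('/'); // otherwise '/' starts a comment
        st.ordinaryChar('-'); // otherwise '-' is glued onto numbers and words
    }

    /** Returns the next token as text, or null at end of input. */
    public String next() throws IOException {
        int t = st.nextToken();
        switch (t) {
            case StreamTokenizer.TT_EOF:
                return null;
            case StreamTokenizer.TT_WORD:
                return st.sval;
            case StreamTokenizer.TT_NUMBER:
                return String.valueOf(st.nval);
            default:
                String first = String.valueOf((char) t);
                int u = st.nextToken(); // one token of lookahead
                if (u > 0 && TWO_CHAR_OPS.contains(first + (char) u)) {
                    return first + (char) u;
                }
                st.pushBack(); // the lookahead token belongs to the next call
                return first;
        }
    }

    /** Convenience wrapper that collects every token of the input. */
    public static List<String> tokenize(String input) {
        OperatorLexer lexer = new OperatorLexer(input);
        List<String> tokens = new ArrayList<>();
        try {
            for (String tok = lexer.next(); tok != null; tok = lexer.next()) {
                tokens.add(tok);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x <= y")); // [x, <=, y]
    }
}
```

Numbers come back as doubles from StreamTokenizer, so 2 shows up as "2.0" here. Also, since whitespace is skipped underneath, "< =" is merged into "<=" as well.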

Better still, bin StreamTokenizer and write a lexer that just looks at the characters directly. StreamTokenizer does too little to be useful for parsing an advanced grammar.
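For the character-level route, something along these lines could be a starting point. It is a sketch under my own assumptions: the class name CharLexer, the operator list, and treating letters, digits and '_' as one kind of word token are all illustrative choices. The key idea is to try longer operators before their prefixes, so "<=" wins over "<".

```java
import java.util.ArrayList;
import java.util.List;

public class CharLexer {
    // Hypothetical operator list; longer operators must come before their prefixes.
    private static final String[] OPERATORS = {
        "<<", "<=", ">>", ">=", "==", "!=",
        "<", ">", "=", "+", "-", "*", "/", "(", ")", "{", "}", ";"
    };

    public static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace only separates tokens
                continue;
            }
            if (isWordChar(c)) {
                int start = i;
                while (i < src.length() && isWordChar(src.charAt(i))) {
                    i++;
                }
                tokens.add(src.substring(start, i));
                continue;
            }
            String op = matchOperator(src, i);
            if (op == null) {
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at index " + i);
            }
            tokens.add(op);
            i += op.length();
        }
        return tokens;
    }

    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Returns the first operator in the list that starts at pos, or null if none does.
    private static String matchOperator(String src, int pos) {
        for (String op : OPERATORS) {
            if (src.startsWith(op, pos)) {
                return op;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("test3=(test4<<2);")); // [test3, =, (, test4, <<, 2, ), ;]
    }
}
```

Because this looks at characters directly, "a<=b" gives a single "<=" token while "a< =b" gives "<" and "=" separately.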
