Design Pattern For Making An Assembler

Question

I'm making an 8051 assembler.

Before everything is a tokenizer which reads next tokens, sets error flags, recognizes EOF, etc.
Then there is the main loop of the compiler, which reads next tokens and check for valid mnemonics:

mnemonic= NextToken();
if (mnemonic.Error)
{
    //throw some error
}
else if (mnemonic.Text == "ADD")
{
    ...
}
else if (mnemonic.Text == "ADDC")
{
    ...
}

And it continues to several cases. Worse than that is the code inside each case, which checks for valid parameters then converts it to compiled code. Right now it looks like this:

if (mnemonic.Text == "MOV")
{
    arg1 = NextToken();
    if (arg1.Error) { /* throw error */ break; }
    arg2 = NextToken();
    if (arg2.Error) { /* throw error */ break; }

    if (arg1.Text == "A")
    {
        if (arg2.Text == "B")
            output << 0x1234; //Example compiled code
        else if (arg2.Text == "@B")
            output << 0x5678; //Example compiled code
        else
            /* throw "Invalid parameters" */
    }
    else if (arg1.Text == "B")
    {
        if (arg2.Text == "A")
            output << 0x9ABC; //Example compiled code
        else if (arg2.Text == "@A")
            output << 0x0DEF; //Example compiled code
        else
            /* throw "Invalid parameters" */
    }
}

For each of the mnemonics I have to check for valid parameters then create the correct compiled code. Very similar codes for checking the valid parameters for each mnemonic repeat in each case.

So is there a design pattern for improving this code?
Or simply a simpler way to implement this?

Edit: I accepted plinth's answer, thanks to him. Still if you have ideas on this, i will be happy to learn them. Thanks all.

Answer 1

I've written a number of assemblers over the years doing hand parsing and frankly, you're probably better off using a grammar language and a parser generator.

Here's why - a typical assembly line will probably look something like this:

[label:] [instruction|directive][newline]

and an instruction will be:

plain-mnemonic|mnemonic-withargs

and a directive will be:

plain-directive|directive-withargs

etc.

With a decent parser generator like Gold , you should be able to knock out a grammar for 8051 in a few hours. The advantage to this over hand parsing is that you will be able to have complicated enough expressions in your assembly code like:

.define kMagicNumber 0xdeadbeef
CMPA #(2 * kMagicNumber + 1)

which can be a real bear to do by hand.

If you want to do it by hand, make a table of all your mnemonics which will also include the various allowable addressing modes that they support and for each addressing mode, the number of bytes that each variant will take and the opcode for it. Something like this:

enum {
    Implied = 1, Direct = 2, Extended = 4, Indexed = 8 // etc
} AddressingMode; 

/* for a 4 char mnemonic, this struct will be 5 bytes.  A typical small processor
 * has on the order of 100 instructions, making this table come in at ~500 bytes when all
 * is said and done.
 * The time to binary search that will be, worst case 8 compares on the mnemonic.
 * I claim that I/O will take way more time than look up.
 * You will also need a table and/or a routine that given a mnemonic and addressing mode
 * will give you the actual opcode.
 */

struct InstructionInfo {
    char Mnemonic[4];
    char AddessingMode;
}

/* order them by mnemonic */
static InstructionInfo instrs[] = {
    { {'A', 'D', 'D', '\0'}, Direct|Extended|Indexed },
    { {'A', 'D', 'D', 'A'}, Direct|Extended|Indexed },
    { {'S', 'U', 'B', '\0'}, Direct|Extended|Indexed },
    { {'S', 'U', 'B', 'A'}, Direct|Extended|Indexed }
}; /* etc */

static int nInstrs = sizeof(instrs)/sizeof(InstrcutionInfo);

InstructionInfo *GetInstruction(char *mnemonic) {
   /* binary search for mnemonic */
}

int InstructionSize(AddressingMode mode)
{
    switch (mode) {
    case Inplied: return 1;
    / * etc */
    }
 }

Then you will have a list of every instruction which in turn contains a list of all the addressing modes.

So your parser becomes something like this:

char *line = ReadLine();
int nextStart = 0;
int labelLen;
char *label = GetLabel(line, &labelLen, nextStart, &nextStart); // may be empty
int mnemonicLen;
char *mnemonic = GetMnemonic(line, &mnemonicLen, nextStart, &nextStart); // may be empty
if (IsOpcode(mnemonic, mnemonicLen)) {
    AddressingModeInfo info = GetAddressingModeInfo(line, nextStart, &nextStart);
    if (IsValidInstruction(mnemonic, info)) {
        GenerateCode(mnemonic, info);
    }
    else throw new BadInstructionException(mnemonic, info);
}
else if (IsDirective()) { /* etc. */ }

Answer 2

Yes. Most assemblers use a table of data which describes the instructions: mnemonic, op code, operands forms etc.

I suggest looking at the source code for as . ~~I'm having some trouble finding it though.~~ Look here . (Thanks to Hossein.)

Answer 3

I think you should look into the Visitor pattern. It might not make your code that much simpler, but will reduce coupling and increase reusability. SableCC is a java framework to build compilers that uses it extensively.

Answer 4

Have you looked at the "Command Dispatcher" pattern?

http://en.wikipedia.org/wiki/Command_pattern

The general idea would be to create an object that handles each instruction (command), and create a look-up table that maps each instruction to the handler class. Each command class would have a common interface (Command.Execute( *args ) for example) which would definitely give you a cleaner / more flexible design than your current enormous switch statement.

Answer 5

When I was playing with a Microcode emulator tool, I converted everything into descendants of an Instruction class. From Instruction were category classes, such as Arithmetic_Instruction and Branch_Instruction . I used a factory pattern to create the instances.

Your best bet may be to get a hold of the assembly language syntax specification. Write a lexer to convert to tokens (**please, don't use if-elseif-else ladders). Then based on semantics, issue the code.

Long time ago, assemblers were a minimum of two passes: The first to resolve constants and form the skeletal code (including symbol tables). The second pass was to generate more concrete or absolute values.

Have you read the Dragon Book lately?

Design Pattern For Making An Assembler

Question

5 answers

solution1
8 ACCPTED 2011-04-07 20:11:01

solution2
4 2011-04-07 19:54:34

solution3
0 2011-04-07 19:55:47

solution4
0 2011-04-07 19:55:48

solution5
0 2011-04-07 20:09:47

Design Pattern For Making An Assembler

Question

5 answers

solution1 8 ACCPTED 2011-04-07 20:11:01

solution2 4 2011-04-07 19:54:34

solution3 0 2011-04-07 19:55:47

solution4 0 2011-04-07 19:55:48

solution5 0 2011-04-07 20:09:47

solution1
8 ACCPTED 2011-04-07 20:11:01

solution2
4 2011-04-07 19:54:34

solution3
0 2011-04-07 19:55:47

solution4
0 2011-04-07 19:55:48

solution5
0 2011-04-07 20:09:47