
Lexical analyzer (Java) for HTML Markdown source code

I do not even know where to begin writing the character-by-character lexical analyzer. I wrote BNF grammar rules for a Markdown language (specifically, one that maps to HTML) based on the rules and specifics I was given, so no rules should need to be added. I now have to design and implement a character-by-character lexical analyzer that partitions the lexemes of a source file in my Markdown language into tokens. Here is my BNF grammar:

Terminals:

#DOCUMENT BEGIN,
#DOCUMENT END
#HEAD BEGIN,
#HEAD END,
#TITLE BEGIN,
#TITLE END,
#PARAGRAPH BEGIN,
#PARAGRAPH END,
#BOLD BEGIN,
#BOLD END,
#ITALICS BEGIN,
#ITALICS END,
#LIST BEGIN,
#LIST END,
#ITEM BEGIN,
#ITEM END,
#LINK BEGIN,
#TEXT,
#ADDRESS,
#LINK END,
#DEFINE BEGIN,
#NAME,
#VALUE,
#DEFINE END,
#USE BEGIN,
#USE END

Note that these terminals are not case sensitive.

Non-Terminals:

<document> ::= #DOCUMENT BEGIN <macro-define> <head> <body> #DOCUMENT END

<head> ::= #HEAD BEGIN <title> #HEAD END | ε

<title> ::= #TITLE BEGIN <text> #TITLE END | ε

<body> ::= <inner-text> <body>
           | <paragraph> <body>
           | <bold> <body>
           | <italics> <body>
           | <list> <body>
           | ε

<paragraph> ::= #PARAGRAPH BEGIN <macro-define> <inner-paragraph> #PARAGRAPH END

<inner-paragraph> ::= <inner-text> <inner-paragraph>
                      | <bold> <inner-paragraph>
                      | <italics> <inner-paragraph>
                      | <list> <inner-paragraph>
                      | ε

<inner-text> ::= <macro-use> <inner-text>
                 | <text> <inner-text>
                 | ε

<macro-define> ::= #DEFINE BEGIN #NAME <text> #VALUE <body> #DEFINE END <macro-define>
                   | ε

<macro-use> ::= #USE BEGIN <text> #USE END | ε

<bold> ::= #BOLD BEGIN <macro-define> <inner-text> #BOLD END

<italics> ::= #ITALICS BEGIN <macro-define> <inner-text> #ITALICS END

<link> ::= #LINK BEGIN #TEXT <text> #ADDRESS <text> #LINK END

<list> ::= #LIST BEGIN #ITEM BEGIN <macro-define> <inner-list> #ITEM END <list-items> #LIST END

<list-items> ::= #ITEM BEGIN <macro-define> <inner-list> #ITEM END <list-items> | ε

<inner-list> ::= <bold> <inner-list>
                 | <italics> <inner-list>
                 | <list> <inner-list>
                 | <inner-text> <inner-list>
                 | ε

<text> ::= Any plain text | ε

We can assume that HTML characters such as "<", ">", "&", and "/" do not appear in any of the text in the source file. We can also assume that "#" only appears before one of our Markdown annotations (e.g., #DOCUMENT). I think it would be best to have separate Java classes to represent token objects, such as DocumentBegin, DocumentEnd, ParagraphBegin, ParagraphEnd, etc. Any lexical error encountered (e.g., #DOC BEGIN) should be reported to the console with as much error information as possible. The compiler should exit after the first error is encountered, and if an error is encountered, no output file should be created.
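For the token classes themselves, this sketch is roughly what I have in mind (just my own guess at a design, not working lexer code; all names are mine):

```java
// Sketch of the token-class idea: a common base class plus one small
// subclass per annotation. Shown in one file for brevity, so the classes
// are package-private here.
abstract class Token {
    final int line;      // line where the lexeme started, for error messages
    final String image;  // the exact characters that were matched

    Token(int line, String image) {
        this.line = line;
        this.image = image;
    }

    @Override
    public String toString() {
        return getClass().getSimpleName() + " at line " + line;
    }
}

class DocumentBegin extends Token {
    DocumentBegin(int line) { super(line, "#DOCUMENT BEGIN"); }
}

class DocumentEnd extends Token {
    DocumentEnd(int line) { super(line, "#DOCUMENT END"); }
}
// ... and so on for ParagraphBegin, ParagraphEnd, Text, etc.
```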

My problem is, I know what a lexical analyzer is supposed to do, but honestly I have no clue where to begin coding or implementing one. If you need more explanation of what the problem is asking, please just ask and I will do my best to explain. This was one part of a big project we had due for my class. I was unable to complete this part and lost a lot of points, but now I just need to understand it so that when we are tested on it, I won't be as lost.

Ok, this is quite a bit late, but here we go.

A lexical analyzer is often associated with grammars (and BNF notation), but the two are actually a bit different.

Lexical analyzers turn characters into Tokens, which are somewhat processed "atoms" of a grammar, while parsers turn the tokens into some intermediate structure (usually a tree). Focusing just on the lexical analyzer part, you can think of it as a low-level first pass over the input, much as we group letters into words before reading a sentence.

Since you already have the BNF grammar, you already know all the Tokens (the end "words") you are going to use, so make them into a list. The question is how to decide quickly which series of characters maps to which item in the list. For example:

#, D, E, F, I, N, E, <whitespace> => #DEFINE
#, D, O, C, U, M, E, N, T, <whitespace> => #DOCUMENT
B, E, G, I, N, <whitespace> => BEGIN
E, N, D, <whitespace> => END
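One simple way to code that match is to scan the whole '#' word up to the next whitespace and look it up, rather than comparing character by character. A rough sketch (the class and method names are mine, not from any library):

```java
import java.util.Set;

// Scan the annotation word starting at '#' and test it against the set of
// known annotation words. BEGIN and END can be scanned the same way.
class KeywordScanner {
    static final Set<String> KEYWORDS = Set.of(
            "#DOCUMENT", "#HEAD", "#TITLE", "#PARAGRAPH", "#BOLD",
            "#ITALICS", "#LIST", "#ITEM", "#LINK", "#TEXT", "#ADDRESS",
            "#DEFINE", "#NAME", "#VALUE", "#USE");

    // Returns the keyword starting at 'start' (which must point at '#'),
    // or null if the characters do not match any known annotation.
    static String scan(String input, int start) {
        int end = start;
        while (end < input.length() && !Character.isWhitespace(input.charAt(end))) {
            end++;
        }
        String word = input.substring(start, end).toUpperCase(); // terminals are case-insensitive
        return KEYWORDS.contains(word) ? word : null;
    }
}
```

A null result is exactly the lexical-error case from the question (e.g., #DOC), so the caller can report the offending word and its line number and stop.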

There are a few problems that creep up in tokenizing:

First, you have a lot of comparisons to do. The first character read in might be a '#', and if it is, there are still over 20 items it might match. That means you have to continue the match into the next character, which, if it were a 'D', would still leave two possible matches: '#DEFINE' and '#DOCUMENT'.

Second, if you have words like '#BEGIN' and '#BEGINNING', then after you have processed '#BEGIN' you can't decide between the two until you grab the next character. Grabbing the next character in a system that treats that as "consumption" of the character complicates the processing of the next Token. Peeking or look-ahead may be required, but both add complexity to the logic that decides which tokens to generate.

Third, you have a wildcard 'text' token. That token could match nearly anything, so you need to check it against all your other tokens to make sure your Token-generation logic always knows which Token it should generate. (In your case this is made easier by the stated guarantee that '#' only appears before an annotation, so a single character is enough to tell plain text apart from a keyword.)

Ideally, the Token generator (the Lexer) doesn't depend on any parsing to "know" the next token; however, there are languages just complicated enough that the parser has to give "hints" to the Lexer. Avoiding these kinds of systems makes for cleaner compiler implementations; unfortunately, in some existing languages it is not always possible to build things this way.

So, now that you have an idea of what to do (which you probably already had in some sense), how do you go about it?

Well, you need some sort of index to keep track of the characters you have consumed (that is, have fully translated into Tokens) so you don't accidentally let a character contribute to the Token stream twice. You need a second pointer for "look-ahead" if you are going to look ahead, and odds are you will want to limit the amount of look-ahead (to make the logic less difficult).
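A minimal sketch of that bookkeeping, assuming you lex from a plain String (the class and all its names are my own invention):

```java
// 'pos' marks the first character not yet consumed; peek() looks at it
// without moving, consume() returns it and advances past it.
class CharReader {
    private final String input;
    private int pos = 0;    // index of the first unconsumed character
    private int line = 1;   // current line number, for error reporting

    CharReader(String input) { this.input = input; }

    boolean atEnd() { return pos >= input.length(); }

    char peek() {           // look ahead one character, consume nothing
        return input.charAt(pos);
    }

    char consume() {        // take the character and advance
        char c = input.charAt(pos++);
        if (c == '\n') line++;
        return c;
    }

    int line() { return line; }
}
```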

Then you need some number of data structures (called Tokens). While it is not always necessary to do so, I recommend keeping track of the starting line number, the starting character index, the ending line number, and the ending character index in the Token; it makes debugging a lot easier. In addition, it is a good idea to "capture" the matched substring within the Token. You can call this what you will, but some people call it an "image" of the Token.

Naturally, if your parser has to differentiate between Tokens of different types, then you should store the type of the token in (or with) the token by some means. Occasionally there is also a concept of the "value" of a token, and that may be stored too.
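Combining the last two paragraphs, a Token might look something like this (a sketch only; the field names and the single-class-plus-type-tag design are my choices, and the subclass-per-token design from the question works just as well):

```java
// A type tag, the captured "image", and start/end positions for debugging.
enum TokenType {
    DOCUMENT_BEGIN, DOCUMENT_END, HEAD_BEGIN, HEAD_END, /* ... */ TEXT
}

class Token {
    final TokenType type;
    final String image;          // the exact substring that was matched
    final int startLine, startColumn;
    final int endLine, endColumn;

    Token(TokenType type, String image,
          int startLine, int startColumn, int endLine, int endColumn) {
        this.type = type;
        this.image = image;
        this.startLine = startLine;
        this.startColumn = startColumn;
        this.endLine = endLine;
        this.endColumn = endColumn;
    }

    @Override
    public String toString() {
        return type + " \"" + image + "\" (line " + startLine + ")";
    }
}
```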

After some effort, you should be able to push a string of characters into the Lexer and have a stream of Tokens come out.
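Concretely, the goal is a driver loop like this (Lexer and nextToken are my own assumed names for the pieces sketched above, not given code):

```java
public class Main {
    public static void main(String[] args) throws java.io.IOException {
        String source = new String(java.nio.file.Files.readAllBytes(
                java.nio.file.Paths.get(args[0])));
        Lexer lexer = new Lexer(source);
        Token token;
        while ((token = lexer.nextToken()) != null) {  // null marks end of input
            System.out.println(token);
        }
        // On a lexical error, the Lexer reports the line and lexeme to the
        // console and the program exits, so no output file is produced.
    }
}
```

Good luck.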

The best (a.k.a. the only one I know) lexical analyzer generator I have found for doing this in Java is called JFlex. We used it at university to tokenise languages, and I have used it commercially to create syntax highlighting for domain-specific languages in applications.

JFlex Lexical Analyzer

http://jflex.de/

Cup Parser

http://www2.cs.tum.edu/projects/cup/

A little bit about LALR(1) Parsers

http://en.wikipedia.org/wiki/LALR_parser

If you need examples (i.e., example code), message me and I'll send you some notes. A quick Google search did not turn up anything too useful, although I'm sure some of the university sites (e.g., Princeton) might have something.

Cheers,

John
