I am hand-writing a parser for a simple regular expression engine.
The engine supports a .. z
|
*
and concatenation and parentheses
Here is the CFG I made:
exp = concat factor1
factor1 = "|" exp | e
concat = term factor2
factor2 = concat | e
term = element factor3
factor3 = * | e
element = (exp) | a .. z
which is equal to
S = T X
X = "|" S | E
T = F Y
Y = T | E
F = U Z
Z = *| E
U = (S) | a .. z
For alternation and closure, I can easily handle them by looking ahead and choose a production based on the token. However, there is no way to handle concatenation by looking ahead cause it is implicit.
I am wondering how can I handle concatenation or is there anything wrong with my grammar?
And this is my OCaml code for parsing:
type regex =
| Closure of regex
| Char of char
| Concatenation of regex * regex
| Alternation of regex * regex
(*| Epsilon*)
exception IllegalExpression of string
type token =
| End
| Alphabet of char
| Star
| LParen
| RParen
| Pipe
let rec parse_S (l : token list) : (regex * token list) =
let (a1, l1) = parse_T l in
let (t, rest) = lookahead l1 in
match t with
| Pipe ->
let (a2, l2) = parse_S rest in
(Alternation (a1, a2), l2)
| _ -> (a1, l1)
and parse_T (l : token list) : (regex * token list) =
let (a1, l1) = parse_F l in
let (t, rest) = lookahead l1 in
match t with
| Alphabet c -> (Concatenation (a1, Char c), rest)
| LParen ->
(let (a, l1) = parse_S rest in
let (t1, l2) = lookahead l1 in
match t1 with
| RParen -> (Concatenation (a1, a), l2)
| _ -> raise (IllegalExpression "Unbalanced parentheses"))
| _ ->
let (a2, rest) = parse_T l1 in
(Concatenation (a1, a2), rest)
and parse_F (l : token list) : (regex * token list) =
let (a1, l1) = parse_U l in
let (t, rest) = lookahead l1 in
match t with
| Star -> (Closure a1, rest)
| _ -> (a1, l1)
and parse_U (l : token list) : (regex * token list) =
let (t, rest) = lookahead l in
match t with
| Alphabet c -> (Char c, rest)
| LParen ->
(let (a, l1) = parse_S rest in
let (t1, l2) = lookahead l1 in
match t1 with
| RParen -> (a, l2)
| _ -> raise (IllegalExpression "Unbalanced parentheses"))
| _ -> raise (IllegalExpression "Unknown token")
For a LL grammar the FIRST sets are the tokens that are allowed as first token for a rule. To can construct them iteratively till you reach a fixed point.
Start with step 1 and then repeat steps 2 and 3 until the FIRST sets reach a fixed point (don't change). Now you have the true FIRST sets for your grammar and can decide every rule using the lookahead.
Note: In your code the parse_T function doesn't match the FIRST(T) set. If you look at for example 'a|b' then is enters parse_T and the 'a' is matched by the parse_F call. The lookahead then is '|' which matches epsilon in your grammar but not in your code.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.