
Represent regular expression as context free grammar

I am hand-writing a parser for a simple regular expression engine.

The engine supports the characters a .. z, alternation (|), closure (*), concatenation, and parentheses.

Here is the CFG I made (e denotes the empty string):

 exp = concat factor1
 factor1 = "|" exp | e
 concat = term factor2
 factor2 = concat | e
 term = element factor3
 factor3 = "*" | e
 element = "(" exp ")" | a .. z

which is equivalent to

 S = T X
 X = "|" S | E
 T = F Y 
 Y = T | E
 F = U Z
 Z = "*" | E
 U = "(" S ")" | a .. z

For alternation and closure, I can easily handle them by looking ahead and choosing a production based on the next token. However, I see no way to handle concatenation by looking ahead, because it is implicit: there is no operator token for it.

How can I handle concatenation? Or is there something wrong with my grammar?

And this is my OCaml code for parsing:

type regex = 
  | Closure of regex
  | Char of char
  | Concatenation of regex * regex
  | Alternation of regex * regex
  (*| Epsilon*)


exception IllegalExpression of string

type token = 
  | End
  | Alphabet of char
  | Star
  | LParen
  | RParen
  | Pipe

let rec parse_S (l : token list) : (regex * token list) = 
  let (a1, l1) = parse_T l in
  let (t, rest) = lookahead l1 in 
  match t with
  | Pipe ->                                   
      let (a2, l2) = parse_S rest in
      (Alternation (a1, a2), l2)
  | _ -> (a1, l1)                             

and parse_T (l : token list) : (regex * token list) = 
  let (a1, l1) = parse_F l in
  let (t, rest) = lookahead l1 in 
  match t with
  | Alphabet c -> (Concatenation (a1, Char c), rest)
  | LParen -> 
     (let (a, l1) = parse_S rest in
      let (t1, l2) = lookahead l1 in
      match t1 with
      | RParen -> (Concatenation (a1, a), l2)
      | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ -> 
      let (a2, rest) = parse_T l1 in
      (Concatenation (a1, a2), rest)


and parse_F (l : token list) : (regex * token list) = 
  let (a1, l1) = parse_U l in 
  let (t, rest) = lookahead l1 in 
  match t with
  | Star -> (Closure a1, rest)
  | _ -> (a1, l1)

and parse_U (l : token list) : (regex * token list) = 
  let (t, rest) = lookahead l in
  match t with
  | Alphabet c -> (Char c, rest)
  | LParen -> 
     (let (a, l1) = parse_S rest in
      let (t1, l2) = lookahead l1 in
      match t1 with
      | RParen -> (a, l2)
      | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ -> raise (IllegalExpression "Unknown token")

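For an LL grammar, the FIRST set of a rule contains the tokens that are allowed as the first token of that rule. You can construct the FIRST sets iteratively until you reach a fixed point.

  1. a rule starting with a token has that token in its FIRST set
  2. a rule starting with a nonterminal has the FIRST set of that nonterminal in its FIRST set (if that nonterminal can derive the empty string, the FIRST set of whatever follows it is included as well)
  3. a rule T = A | B has the union of FIRST(A) and FIRST(B) as its FIRST set

Start with step 1 and then repeat steps 2 and 3 until the FIRST sets reach a fixed point (they no longer change). Now you have the true FIRST sets for your grammar and can decide every rule using the lookahead. Concatenation is no exception: although it has no operator token, Y = T | E chooses T exactly when the lookahead is in FIRST(T), and E otherwise.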
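As a rough sketch of this fixed-point iteration, the OCaml below computes the FIRST sets for the S/X/T/Y/F/Z/U grammar from the question. The `symbol` type, the string-keyed `grammar` table, the terminal name `"a..z"`, and the use of `""` to stand for epsilon are all assumptions made for illustration, not part of the original code.

```ocaml
module SS = Set.Make (String)

(* Grammar symbols: T is a terminal (token), N is a nonterminal (rule name). *)
type symbol = T of string | N of string

(* Each nonterminal maps to its alternatives; [] is the epsilon production. *)
let grammar = [
  "S", [[N "T"; N "X"]];
  "X", [[T "|"; N "S"]; []];
  "T", [[N "F"; N "Y"]];
  "Y", [[N "T"]; []];
  "F", [[N "U"; N "Z"]];
  "Z", [[T "*"]; []];
  "U", [[T "("; N "S"; T ")"]; [T "a..z"]];
]

let first_sets grammar =
  let lookup tbl n = try List.assoc n tbl with Not_found -> SS.empty in
  (* FIRST of a sequence of symbols; "" stands for epsilon. *)
  let rec first_of_seq tbl = function
    | [] -> SS.singleton ""
    | T t :: _ -> SS.singleton t
    | N n :: rest ->
        let f = lookup tbl n in
        if SS.mem "" f
        then SS.union (SS.remove "" f) (first_of_seq tbl rest)
        else f
  in
  (* One pass over all rules: union in the FIRST set of every alternative. *)
  let step tbl =
    List.map
      (fun (n, alts) ->
        let add acc alt = SS.union acc (first_of_seq tbl alt) in
        (n, List.fold_left add (lookup tbl n) alts))
      grammar
  in
  (* Repeat until no set changes. *)
  let rec fix tbl =
    let tbl' = step tbl in
    if List.for_all2 (fun (_, a) (_, b) -> SS.equal a b) tbl tbl'
    then tbl
    else fix tbl'
  in
  fix (List.map (fun (n, _) -> (n, SS.empty)) grammar)
```

With this table, FIRST(T) comes out as { "(", "a..z" } and FIRST(Y) as the same set plus epsilon, which is what lets a parser decide whether another concatenated term follows.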

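Note: In your code the parse_T function doesn't match the FIRST(T) set. Take 'a|b' as an example: it enters parse_T and the 'a' is matched by the parse_F call. The lookahead is then '|', which matches epsilon in your grammar but not in your code. Your fallback branch calls parse_T again instead, which reaches parse_U with '|' as the next token and raises "Unknown token". parse_T should recurse only when the lookahead is a letter or '(' (the tokens in FIRST(T)), without consuming that token itself, and return what parse_F produced in every other case.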
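One way parse_T could be rewritten along these lines is sketched below, together with the rest of the parser so that the block compiles on its own. The `lookahead` helper is not shown in the question, so the version here (returning `End` on an empty list) is an assumption, and the `parse` wrapper is a hypothetical addition for checking that all input was consumed.

```ocaml
type regex =
  | Closure of regex
  | Char of char
  | Concatenation of regex * regex
  | Alternation of regex * regex

exception IllegalExpression of string

type token = End | Alphabet of char | Star | LParen | RParen | Pipe

(* Assumed helper: peek at the next token and return the remaining list. *)
let lookahead = function
  | [] -> (End, [])
  | t :: rest -> (t, rest)

(* S = T X,  X = "|" S | E *)
let rec parse_S (l : token list) : regex * token list =
  let (a1, l1) = parse_T l in
  match lookahead l1 with
  | (Pipe, rest) ->
      let (a2, l2) = parse_S rest in
      (Alternation (a1, a2), l2)
  | _ -> (a1, l1)

(* T = F Y,  Y = T | E.  Choose Y = T only when the lookahead is in
   FIRST(T) = { letter, "(" }; the token is not consumed here. *)
and parse_T (l : token list) : regex * token list =
  let (a1, l1) = parse_F l in
  let (t, _) = lookahead l1 in
  match t with
  | Alphabet _ | LParen ->
      let (a2, l2) = parse_T l1 in
      (Concatenation (a1, a2), l2)
  | _ -> (a1, l1)  (* epsilon: "|", ")", or end of input follows *)

(* F = U Z,  Z = "*" | E *)
and parse_F (l : token list) : regex * token list =
  let (a1, l1) = parse_U l in
  match lookahead l1 with
  | (Star, rest) -> (Closure a1, rest)
  | _ -> (a1, l1)

(* U = "(" S ")" | a .. z *)
and parse_U (l : token list) : regex * token list =
  match lookahead l with
  | (Alphabet c, rest) -> (Char c, rest)
  | (LParen, rest) ->
      let (a, l1) = parse_S rest in
      (match lookahead l1 with
       | (RParen, l2) -> (a, l2)
       | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ -> raise (IllegalExpression "Unknown token")

(* Hypothetical entry point: reject input with leftover tokens. *)
let parse (tokens : token list) : regex =
  match parse_S tokens with
  | (ast, ([] | [End])) -> ast
  | _ -> raise (IllegalExpression "Trailing input")
```

Because parse_T recurses on the right, a run such as 'abc' is built as Concatenation (a, Concatenation (b, c)). Since concatenation is associative, this does not change the language matched.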
