Represent regular expression as context free grammar
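I am writing a parser for a simple regular expression engine.
The engine supports the characters a .. z, alternation (|), closure (*), concatenation, and parentheses.
This is the CFG I came up with: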
exp = concat factor1
factor1 = "|" exp | e
concat = term factor2
factor2 = concat | e
term = element factor3
factor3 = * | e
element = (exp) | a .. z
which is equivalent to
S = T X
X = "|" S | E
T = F Y
Y = T | E
F = U Z
Z = * | E
U = (S) | a .. z
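For alternation and closure, I can handle them easily by looking ahead and choosing the production based on the token. However, there is no way to handle concatenation by looking ahead, because it is implicit.
I am wondering how I can handle concatenation, or is there something wrong with my grammar?
This is my OCaml code for parsing: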
type regex =
  | Closure of regex
  | Char of char
  | Concatenation of regex * regex
  | Alternation of regex * regex
  (*| Epsilon*)

exception IllegalExpression of string

type token =
  | End
  | Alphabet of char
  | Star
  | LParen
  | RParen
  | Pipe

let rec parse_S (l : token list) : (regex * token list) =
  let (a1, l1) = parse_T l in
  let (t, rest) = lookahead l1 in
  match t with
  | Pipe ->
    let (a2, l2) = parse_S rest in
    (Alternation (a1, a2), l2)
  | _ -> (a1, l1)

and parse_T (l : token list) : (regex * token list) =
  let (a1, l1) = parse_F l in
  let (t, rest) = lookahead l1 in
  match t with
  | Alphabet c -> (Concatenation (a1, Char c), rest)
  | LParen ->
    (let (a, l1) = parse_S rest in
     let (t1, l2) = lookahead l1 in
     match t1 with
     | RParen -> (Concatenation (a1, a), l2)
     | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ ->
    let (a2, rest) = parse_T l1 in
    (Concatenation (a1, a2), rest)

and parse_F (l : token list) : (regex * token list) =
  let (a1, l1) = parse_U l in
  let (t, rest) = lookahead l1 in
  match t with
  | Star -> (Closure a1, rest)
  | _ -> (a1, l1)

and parse_U (l : token list) : (regex * token list) =
  let (t, rest) = lookahead l in
  match t with
  | Alphabet c -> (Char c, rest)
  | LParen ->
    (let (a, l1) = parse_S rest in
     let (t1, l2) = lookahead l1 in
     match t1 with
     | RParen -> (a, l2)
     | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ -> raise (IllegalExpression "Unknown token")

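For an LL grammar, the FIRST set of a rule is the set of tokens that are allowed as the first token of that rule. You can construct these sets iteratively until you reach a fixed point.
Start with step 1, then repeat steps 2 and 3 until the FIRST sets reach a fixed point (that is, they no longer change). You then have the true FIRST sets for your grammar and can decide every rule using the lookahead.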
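The fixed-point iteration can be sketched in OCaml for the grammar in the question. This is only an illustration under the usual construction, which is assumed here: a production that starts with a token contributes that token, a production that starts with a nonterminal contributes that nonterminal's FIRST set (moving on to the next symbol when it can derive epsilon), and alternatives are unioned. The names `symbol`, `grammar`, `first_seq`, `step`, `fixpoint`, and `first`, as well as the token names "letter" (any of a .. z) and "eps" (epsilon), are made up for this sketch.

```ocaml
module SS = Set.Make (String)

(* Hypothetical representation: a symbol is a terminal or a nonterminal. *)
type symbol = Term of string | Nonterm of string

let eps = "eps"

(* The grammar from the question; [] is the epsilon production and
   "letter" stands for any of a .. z. *)
let grammar = [
  "S", [ [Nonterm "T"; Nonterm "X"] ];
  "X", [ [Term "|"; Nonterm "S"]; [] ];
  "T", [ [Nonterm "F"; Nonterm "Y"] ];
  "Y", [ [Nonterm "T"]; [] ];
  "F", [ [Nonterm "U"; Nonterm "Z"] ];
  "Z", [ [Term "*"]; [] ];
  "U", [ [Term "("; Nonterm "S"; Term ")"]; [Term "letter"] ];
]

(* Current FIRST set of a nonterminal (empty if not yet known). *)
let first_of table n =
  match List.assoc_opt n table with Some s -> s | None -> SS.empty

(* FIRST of a sequence of symbols under the current table. *)
let rec first_seq table = function
  | [] -> SS.singleton eps
  | Term t :: _ -> SS.singleton t
  | Nonterm n :: rest ->
    let f = first_of table n in
    if SS.mem eps f
    then SS.union (SS.remove eps f) (first_seq table rest)
    else f

(* One pass over all rules: union the FIRST sets of every alternative. *)
let step table =
  List.map
    (fun (n, alts) ->
       let s =
         List.fold_left
           (fun acc alt -> SS.union acc (first_seq table alt))
           (first_of table n) alts
       in
       (n, s))
    grammar

(* Repeat the pass until no set changes. *)
let rec fixpoint table =
  let table' = step table in
  if List.for_all2 (fun (_, a) (_, b) -> SS.equal a b) table table'
  then table
  else fixpoint table'

let first_sets = fixpoint (List.map (fun (n, _) -> (n, SS.empty)) grammar)

(* FIRST set of a nonterminal as a sorted list. *)
let first n = SS.elements (List.assoc n first_sets)
```

With this sketch, `first "T"` contains only "(" and "letter", while `first "Y"` additionally contains "eps". That is the information parse_T needs: continue with another T only when the lookahead is a letter or an opening parenthesis, and otherwise take the epsilon alternative.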
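Note: in your code, the parse_T function does not match the FIRST(T) set. Take "a|b" as an example: it enters parse_T, and the "a" is matched by the parse_F call. The lookahead is then "|", which matches epsilon in your grammar but not in your code: the catch-all branch calls parse_T again, and parse_U then raises IllegalExpression "Unknown token".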
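One way to bring parse_T in line with the grammar is sketched below. It is a possible fix, not the only one. The types and the other parse functions follow the question; `lookahead` is not shown in the question, so the version here is an assumption (it returns the next token and the remaining list, or End on an empty list), and `parse` is a hypothetical wrapper added so the sketch can be run on its own.

```ocaml
type regex =
  | Closure of regex
  | Char of char
  | Concatenation of regex * regex
  | Alternation of regex * regex

exception IllegalExpression of string

type token = End | Alphabet of char | Star | LParen | RParen | Pipe

(* Assumed helper: next token plus the rest of the input. *)
let lookahead (l : token list) : token * token list =
  match l with
  | [] -> (End, [])
  | t :: rest -> (t, rest)

(* S = T X, X = "|" S | E *)
let rec parse_S l =
  let (a1, l1) = parse_T l in
  match lookahead l1 with
  | (Pipe, rest) ->
    let (a2, l2) = parse_S rest in
    (Alternation (a1, a2), l2)
  | _ -> (a1, l1)

(* T = F Y, Y = T | E *)
and parse_T l =
  let (a1, l1) = parse_F l in
  match lookahead l1 with
  | ((Alphabet _ | LParen), _) ->
    (* The lookahead is in FIRST(T), so Y = T. Do not consume the token
       here; the recursive call starts from l1. *)
    let (a2, l2) = parse_T l1 in
    (Concatenation (a1, a2), l2)
  | _ -> (a1, l1) (* anything else: Y = E *)

(* F = U Z, Z = * | E *)
and parse_F l =
  let (a1, l1) = parse_U l in
  match lookahead l1 with
  | (Star, rest) -> (Closure a1, rest)
  | _ -> (a1, l1)

(* U = (S) | a .. z *)
and parse_U l =
  match lookahead l with
  | (Alphabet c, rest) -> (Char c, rest)
  | (LParen, rest) ->
    let (a, l1) = parse_S rest in
    (match lookahead l1 with
     | (RParen, l2) -> (a, l2)
     | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ -> raise (IllegalExpression "Unknown token")

(* Hypothetical entry point: reject input left over after S. *)
let parse tokens =
  match parse_S tokens with
  | (r, ([] | [End])) -> r
  | _ -> raise (IllegalExpression "Trailing input")
```

The implicit concatenation is decided purely by the lookahead: a letter or "(" means another factor follows, and every other token (including "|", ")" and End) means the concatenation has ended. Under these assumptions, `parse [Alphabet 'a'; Pipe; Alphabet 'b'; End]` yields `Alternation (Char 'a', Char 'b')`.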