
Parsing elements/fields without end marker, problems with non-greedy regex, usage of a custom lexer

I want to be able to parse files in the Textile markup language ( https://textile-lang.com/ ) in order to convert them to LaTeX. The files I have are a slight extension of Textile, in that they add fields and footnotes. An example file is given below.

test.textile

#[contents]#
p. This is a paragraph.

With some bullet points:

* Bullet 1
* Bullet 2
* Bullet 3

And a code block:

bc.. # Program to display the Fibonacci sequence up to n-th term

"""\
* Program to display the Fibonacci sequence up to n-th term
"""

search_string = r"<5>"

nterms = int(input("How many terms? "))

# first two terms
n1, n2 = 0, 1
count = 0

p. And after the block of code another paragraph, with a footnote<1>.

bc. fn1. This is the footnote contents

p. And here is another paragraph

#[second_field]#
Some more contents

To parse the file, I have the following parser.

parser.py

from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements')
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree)

And the following grammar. This grammar does not yet parse bullet points and footnotes, because I ran into another problem first.

grammar.lark

elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: FIELD_NAME
content: contents*
?contents: paragraph
    | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? STR

FIELD_START: /(\A|[\r\n]{2,})#\[/
FIELD_NAME: /[^\]]+/
FIELD_END: /\]#[\r\n]/
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
STR: /.+/s

When I run the parser, I get the following output.

output

Tree(Token('RULE', 'elements'), [
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('FIELD_NAME', 'contents')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\nbc.. # Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"<5>"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\np. And after the block of code another paragraph, with a footnote<1>.\n\nbc. fn1. This is the footnote contents\n\np. And here is another paragraph\n\n#[second_field]#\nSome more contents\n')])])])])

The whole rest of the file is parsed as a single paragraph, which is correct behaviour, since /.+/s can match anything. So I changed the definition of STR to /.+?/s to make it non-greedy, but now the output is as follows (pretty-printed):

output

elements
  element
    field
      #[
      field_name        contents
      ]#

    content
      paragraph
        p.
        T
      paragraph h
      paragraph i
      paragraph s
      paragraph
      paragraph i
      paragraph s
      paragraph
      paragraph a
      paragraph
--snip--
      paragraph

      paragraph #
      paragraph [
      paragraph s
      paragraph e
      paragraph c
      paragraph o
      paragraph n
      paragraph d
      paragraph _
      paragraph f
      paragraph i
      paragraph e
      paragraph l
      paragraph d
      paragraph ]
      paragraph #

It parses every single character as a separate paragraph, and it still parses the entire file into paragraph elements.
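This is expected regex behaviour: the greedy /.+/s consumes everything up to the end of the input, while the non-greedy /.+?/s matches as little as it can, namely a single character, after which the parser simply starts the next paragraph. A quick illustration with Python's re module:

import re

text = "p. line1\nline2"
print(repr(re.match(r"(?s).+", text).group()))   # 'p. line1\nline2'
print(repr(re.match(r"(?s).+?", text).group()))  # 'p'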

My first solution to the problem was to create a lexer that produces tokens for FIELD_START, FIELD_END, PARAGRAPH_START, CODE_BLOCK_START, and the footnote-related markers.

My lexer looks as follows:

from lark import Lark
from lark.lexer import Lexer, Token
import re

class MyLexer(Lexer):
    def __init__(self, *args, **kwargs):
        pass

    def lex(self, data):
        # Terminal patterns; also checked in this order when classifying a chunk.
        tokens = {
            "FIELD_START": r"(?:\A|[\r\n]{2,})#\[",
            "FIELD_END": r"\]#[\r\n]",
            "FOOTNOTE_ANCHOR": r"<\d>",
            "FOOTNOTE_START": r"bc. fn\d. ",
            "PARAGRAPH_START": r"p\. ",
            "CODE_BLOCK_START": r"bc\.\.? ",
        }
        # One big alternation; the capture groups make re.split keep the
        # matched delimiters in the result list.
        regex = '|'.join([f"({r})" for r in tokens.values()])
        for x in re.split(regex, data):
            if not x:
                continue  # skip empty strings and the Nones of non-matching groups
            for token_type, token_regex in tokens.items():
                if re.match(token_regex, x):
                    yield Token(token_type, x)
                    break
            else:
                # anything that is not a known marker becomes plain text
                yield Token("STR", x)

parser = Lark(grammar, lexer=MyLexer, start='elements')

It builds one regular expression from the given token patterns, splits the whole string on that regex, and returns everything as tokens, either one of the defined token types or STR.
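For illustration, lexing a short sample line yields the token stream below (expected output in the comments):

for token in MyLexer().lex("p. Some text<1>.\n"):
    print(token.type, repr(token.value))
# PARAGRAPH_START 'p. '
# STR 'Some text'
# FOOTNOTE_ANCHOR '<1>'
# STR '.\n'

The new grammar looks like this: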

elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: STR
content: contents*
?contents: STR 
    | paragraph 
    | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? paragraph_contents+
?paragraph_contents: STR 
    | FOOTNOTE_ANCHOR
    | footnote
footnote: FOOTNOTE_START STR

%declare FIELD_START FIELD_END FOOTNOTE_ANCHOR FOOTNOTE_START PARAGRAPH_START CODE_BLOCK_START STR

The output of the parser is as follows:

Tree(Token('RULE', 'elements'), [
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('STR', 'contents')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\n')]), 
            Tree(Token('RULE', 'code_block'), [
                Token('CODE_BLOCK_START', 'bc.. '), 
                Token('STR', '# Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"')]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('FOOTNOTE_ANCHOR', '<5>'), 
                Token('STR', '"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\n')]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'And after the block of code another paragraph, with a footnote'), 
                Token('FOOTNOTE_ANCHOR', '<1>'), 
                Token('STR', '.\n\n'), 
                Tree(Token('RULE', 'footnote'), [
                    Token('FOOTNOTE_START', 'bc. fn1. '), 
                    Token('STR', 'This is the footnote contents\n\n')])]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'And here is another paragraph')])])]), 
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '\n\n#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('STR', 'second_field')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Token('STR', 'Some more contents\n')])])])

This correctly parses the different fields and the footnotes; however, the code block is broken up by a falsely detected FOOTNOTE_ANCHOR. Because the lexer knows nothing about context, it also tries to tokenize footnote anchors inside code. The same problem will occur when trying to parse bullet points.
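For example, feeding only the code line with the raw string through the lexer splits it apart (token stream in the comments):

for token in MyLexer().lex('search_string = r"<5>"\n'):
    print(token.type, repr(token.value))
# STR 'search_string = r"'
# FOOTNOTE_ANCHOR '<5>'
# STR '"\n'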

What is the best solution to this problem? Do I really need a lexer? Is my lexer implemented correctly? (I could find only very few examples of how to use a custom lexer.) Can I lex only some of the tokens and leave the rest to a "parent" lexer?

Based on Identifying multi-line sections using lark grammar, I was able to find a solution.

The important part is to not match multiple lines with /.+/s, because then the parser never gets a chance to match the other tokens. It is better to match line by line, so that the parser can try a new rule for every line. I also switched the parser to "lalr"; the grammar does not work with the default (Earley) parser.

parser.py

from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements', parser="lalr")
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree.pretty())

grammar.lark

elements: element+
?element: field content
?field: NEWLINE* FIELD_START field_name FIELD_END NEWLINE
field_name: FIELD_NAME
content: contents*
?contents: paragraph
    | code_block
code_block: CODE_BLOCK_START (LINE NEWLINE)+
paragraph: PARAGRAPH_START? (paragraph_line | bullets | footnote)+
bullets: (BULLET paragraph_line)+
footnote: FOOTNOTE_START LINE NEWLINE
paragraph_line: (PARAGRAPH_LINE | FOOTNOTE_ANCHOR)+ NEWLINE

FIELD_START: "#["
FIELD_NAME: /[^\]]+/
FIELD_END: "]#"
FOOTNOTE_ANCHOR: /<\d>/
FOOTNOTE_START: /bc\. fn\d\. /
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
LINE.-1: /.+/
BULLET.-2: "*"
PARAGRAPH_LINE.-3: /.+?(?=(<\d>|\r|\n))/

%import common.NEWLINE

output:

elements
  element
    field
      #[
      field_name        contents
      ]#


    content
      paragraph
        p.
        paragraph_line
          This is a paragraph.



        paragraph_line
          With some bullet points:



        bullets
          *
          paragraph_line
             Bullet 1


          *
          paragraph_line
             Bullet 2


          *
          paragraph_line
             Bullet 3



        paragraph_line
          And a code block:



      code_block
        bc..
        # Program to display the Fibonacci sequence up to n-th term



        """\


        * Program to display the Fibonacci sequence up to n-th term


        """



        search_string = r"<5>"



        nterms = int(input("How many terms? "))



        # first two terms


        n1, n2 = 0, 1


        count = 0



      paragraph
        p.
        paragraph_line
          And after the block of code another paragraph, with a footnote
          <1>
          .



        footnote
          bc. fn1.
          This is the footnote contents



      paragraph
        p.
        paragraph_line
          And here is another paragraph



  element
    field
      #[
      field_name        second_field
      ]#


    content
      paragraph
        paragraph_line
          Some more contents

Note that the parser now also correctly handles the bullet points and footnotes. To parse footnote anchors inside a line, I made a special PARAGRAPH_LINE terminal that stops at the first footnote anchor it encounters, or at the end of the line. Also note that LINE has priority over BULLET, so that bullets are not matched inside code blocks (which look for plain lines), only inside paragraphs.
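The effect of the PARAGRAPH_LINE terminal can be tested in isolation; the non-greedy match stops right before the first anchor or line break:

import re

line = 'with a footnote<1>.'
print(re.match(r'.+?(?=(<\d>|\r|\n))', line).group())  # 'with a footnote'

From here, converting the tree to LaTeX can be done with a lark Transformer. The sketch below is only an illustration of what that could look like (the \footnotemark and itemize mapping is my own assumption, not fixed output) and covers just the paragraph-related rules; fields, code blocks, and footnote bodies would need similar methods:

from lark import Transformer, Token

class ToLatex(Transformer):
    def paragraph_line(self, items):
        # all children are tokens; replace anchors and drop the trailing NEWLINE
        return ''.join(
            r'\footnotemark' if t.type == 'FOOTNOTE_ANCHOR' else str(t)
            for t in items if t.type != 'NEWLINE'
        )

    def bullets(self, items):
        # BULLET tokens remain; the lines were already reduced to plain strings
        lines = [i.strip() for i in items if not isinstance(i, Token)]
        return '\\begin{itemize}\n' + '\n'.join('\\item ' + l for l in lines) + '\n\\end{itemize}'

    def paragraph(self, items):
        # keep only the parts already reduced to plain strings
        # (drops the optional 'p. ' token; footnotes still need their own method)
        return '\n\n'.join(i for i in items if type(i) is str)

print(ToLatex().transform(tree).pretty())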
