Parsing elements/fields without end marker, problems with non-greedy regex, usage of a custom lexer
I want to be able to parse files in the Textile markup language (https://textile-lang.com/) so that I can convert them to LaTeX. The files I have are a slight extension of Textile, in that they add fields and footnotes. An example file is given below.
test.textile
#[contents]#
p. This is a paragraph.
With some bullet points:
* Bullet 1
* Bullet 2
* Bullet 3
And a code block:
bc.. # Program to display the Fibonacci sequence up to n-th term
"""\
* Program to display the Fibonacci sequence up to n-th term
"""
search_string = r"<5>"
nterms = int(input("How many terms? "))
# first two terms
n1, n2 = 0, 1
count = 0
p. And after the block of code another paragraph, with a footnote<1>.
bc. fn1. This is the footnote contents
p. And here is another paragraph
#[second_field]#
Some more contents
To parse the file I have the following parser.
parser.py
from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements')
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree)
And the following grammar. This grammar does not parse the bullets and footnotes yet, because I already ran into another problem without them.
grammar.lark
elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: FIELD_NAME
content: contents*
?contents: paragraph
         | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? STR
FIELD_START: /(\A|[\r\n]{2,})#\[/
FIELD_NAME: /[^\]]+/
FIELD_END: /\]#[\r\n]/
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
STR: /.+/s
When I run the parser I get the following output.
output
Tree(Token('RULE', 'elements'), [
  Tree(Token('RULE', 'element'), [
    Tree(Token('RULE', 'field'), [
      Token('FIELD_START', '#['),
      Tree(Token('RULE', 'field_name'), [
        Token('FIELD_NAME', 'contents')]),
      Token('FIELD_END', ']#\n')]),
    Tree(Token('RULE', 'content'), [
      Tree(Token('RULE', 'paragraph'), [
        Token('PARAGRAPH_START', 'p. '),
        Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\nbc.. # Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"<5>"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\np. And after the block of code another paragraph, with a footnote<1>.\n\nbc. fn1. This is the footnote contents\n\np. And here is another paragraph\n\n#[second_field]#\nSome more contents\n')])])])])
The whole rest of the file is parsed as a single paragraph, which is correct behaviour, because /.+/s can match anything. So I changed the definition of STR to /.+?/s to make it non-greedy, but now the output is as follows (pretty-printed):
output
elements
  element
    field
      #[
      field_name contents
      ]#
    content
      paragraph
        p.
        T
      paragraph h
      paragraph i
      paragraph s
      paragraph
      paragraph i
      paragraph s
      paragraph
      paragraph a
      paragraph
      --snip--
      paragraph
      paragraph #
      paragraph [
      paragraph s
      paragraph e
      paragraph c
      paragraph o
      paragraph n
      paragraph d
      paragraph _
      paragraph f
      paragraph i
      paragraph e
      paragraph l
      paragraph d
      paragraph ]
      paragraph #
It parses every single character as a paragraph, and it still parses the whole file as paragraph elements.
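This behaviour can be reproduced in isolation. Here is a minimal sketch (the toy grammar and input are invented for illustration), relying on the fact that Python's regex engine returns the shortest possible text for a non-greedy pattern:

from lark import Lark

# Toy grammar (not from my real files): one non-greedy catch-all terminal.
demo = Lark(r'''
start: WORD+
WORD: /.+?/s
''')

# The lexer takes the shortest match for /.+?/s, i.e. exactly one
# character, so every character becomes its own WORD token.
print(demo.parse("abc").pretty())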
My first solution to this problem was to create a lexer that creates tokens for FIELD_START, FIELD_END, PARAGRAPH_START, CODE_BLOCK_START, and the footnote-related tokens. My lexer looks as follows:
from lark import Lark
from lark.lexer import Lexer, Token
import re

class MyLexer(Lexer):
    def __init__(self, *args, **kwargs):
        pass

    def lex(self, data):
        # Regexes for the tokens that have explicit markers; anything
        # between them becomes a generic STR token.
        tokens = {
            "FIELD_START": r"(?:\A|[\r\n]{2,})#\[",
            "FIELD_END": r"\]#[\r\n]",
            "FOOTNOTE_ANCHOR": r"<\d>",
            "FOOTNOTE_START": r"bc\. fn\d\. ",
            "PARAGRAPH_START": r"p\. ",
            "CODE_BLOCK_START": r"bc\.\.? ",
        }
        # Wrap every pattern in a capturing group so that re.split
        # keeps the matched delimiters in its result.
        regex = '|'.join([f"({r})" for r in tokens.values()])
        for x in re.split(regex, data):
            if not x:
                continue
            for token_type, token_regex in tokens.items():
                if re.match(token_regex, x):
                    yield Token(token_type, x)
                    break
            else:
                yield Token("STR", x)

parser = Lark(grammar, lexer=MyLexer, start='elements')
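The core of this lexer is re.split with capturing groups; a quick standalone check of the behaviour it relies on (the example input is invented):

import re

# Capturing groups make re.split keep the delimiters in the result.
# Unmatched groups come back as None and adjacent matches produce empty
# strings, which is why lex() skips falsy entries.
print(re.split(r"(p\. )|(<\d>)", "p. text<1> more"))
# ['', 'p. ', None, 'text', None, '<1>', ' more']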
It creates one regex from the given tokens, splits the whole string by that regex, and returns everything as tokens, either one of the defined token types or STR. The new grammar looks as follows:
elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: STR
content: contents*
?contents: STR
         | paragraph
         | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? paragraph_contents+
?paragraph_contents: STR
                   | FOOTNOTE_ANCHOR
                   | footnote
footnote: FOOTNOTE_START STR
%declare FIELD_START FIELD_END FOOTNOTE_ANCHOR FOOTNOTE_START PARAGRAPH_START CODE_BLOCK_START STR
The output of the parser is as follows:
Tree(Token('RULE', 'elements'), [
  Tree(Token('RULE', 'element'), [
    Tree(Token('RULE', 'field'), [
      Token('FIELD_START', '#['),
      Tree(Token('RULE', 'field_name'), [
        Token('STR', 'contents')]),
      Token('FIELD_END', ']#\n')]),
    Tree(Token('RULE', 'content'), [
      Tree(Token('RULE', 'paragraph'), [
        Token('PARAGRAPH_START', 'p. '),
        Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\n')]),
      Tree(Token('RULE', 'code_block'), [
        Token('CODE_BLOCK_START', 'bc.. '),
        Token('STR', '# Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"')]),
      Tree(Token('RULE', 'paragraph'), [
        Token('FOOTNOTE_ANCHOR', '<5>'),
        Token('STR', '"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\n')]),
      Tree(Token('RULE', 'paragraph'), [
        Token('PARAGRAPH_START', 'p. '),
        Token('STR', 'And after the block of code another paragraph, with a footnote'),
        Token('FOOTNOTE_ANCHOR', '<1>'),
        Token('STR', '.\n\n'),
        Tree(Token('RULE', 'footnote'), [
          Token('FOOTNOTE_START', 'bc. fn1. '),
          Token('STR', 'This is the footnote contents\n\n')])]),
      Tree(Token('RULE', 'paragraph'), [
        Token('PARAGRAPH_START', 'p. '),
        Token('STR', 'And here is another paragraph')])])]),
  Tree(Token('RULE', 'element'), [
    Tree(Token('RULE', 'field'), [
      Token('FIELD_START', '\n\n#['),
      Tree(Token('RULE', 'field_name'), [
        Token('STR', 'second_field')]),
      Token('FIELD_END', ']#\n')]),
    Tree(Token('RULE', 'content'), [
      Token('STR', 'Some more contents\n')])])])
This parses the different fields and the footnote correctly; however, the code block is broken up by the FOOTNOTE_ANCHOR detected inside it. Since the lexer has no notion of context, it also lexes footnote anchors inside code. The same problem will occur when I try to parse the bullets.
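The problem is easy to see at the re.split level; a standalone check (the input is the line from the code block above):

import re

# The anchor pattern fires even inside code, because the splitting step
# sees only raw text and never knows it is inside a code block.
print(re.split(r"(<\d>)", 'search_string = r"<5>"'))
# ['search_string = r"', '<5>', '"']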
What is the best solution for this problem? Do I really need a lexer? Is my lexer implemented correctly? (I could find only very few examples of how to use a custom lexer.) Can I lex only some tokens and leave the rest to the "parent" lexer?
Based on "Identifying multi-line sections with lark grammar" I was able to find a solution. The important part is to not use /.+/s to match multiple lines, because then the parser never gets a chance to match the other tokens. It is better to match line by line, so that the parser can match a new rule for each line. I also switched the parser to "lalr"; the solution does not work with the standard Earley parser.
parser.py
from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements', parser="lalr")
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree.pretty())
grammar.lark
elements: element+
?element: field content
?field: NEWLINE* FIELD_START field_name FIELD_END NEWLINE
field_name: FIELD_NAME
content: contents*
?contents: paragraph
         | code_block
code_block: CODE_BLOCK_START (LINE NEWLINE)+
paragraph: PARAGRAPH_START? (paragraph_line | bullets | footnote)+
bullets: (BULLET paragraph_line)+
footnote: FOOTNOTE_START LINE NEWLINE
paragraph_line: (PARAGRAPH_LINE | FOOTNOTE_ANCHOR)+ NEWLINE
FIELD_START: "#["
FIELD_NAME: /[^\]]+/
FIELD_END: "]#"
FOOTNOTE_ANCHOR: /<\d>/
FOOTNOTE_START: /bc\. fn\d\. /
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
LINE.-1: /.+/
BULLET.-2: "*"
PARAGRAPH_LINE.-3: /.+?(?=(<\d>|\r|\n))/
%import common.NEWLINE
output:
elements
  element
    field
      #[
      field_name contents
      ]#
    content
      paragraph
        p.
        paragraph_line
          This is a paragraph.
        paragraph_line
          With some bullet points:
        bullets
          *
          paragraph_line
            Bullet 1
          *
          paragraph_line
            Bullet 2
          *
          paragraph_line
            Bullet 3
        paragraph_line
          And a code block:
      code_block
        bc..
        # Program to display the Fibonacci sequence up to n-th term
        """\
        * Program to display the Fibonacci sequence up to n-th term
        """
        search_string = r"<5>"
        nterms = int(input("How many terms? "))
        # first two terms
        n1, n2 = 0, 1
        count = 0
      paragraph
        p.
        paragraph_line
          And after the block of code another paragraph, with a footnote
          <1>
          .
        footnote
          bc. fn1.
          This is the footnote contents
      paragraph
        p.
        paragraph_line
          And here is another paragraph
  element
    field
      #[
      field_name second_field
      ]#
    content
      paragraph
        paragraph_line
          Some more contents
Note that the parser now also parses the bullets and footnotes correctly. To parse footnote anchors inside a line, I made a special PARAGRAPH_LINE terminal that stops at the first footnote anchor it encounters, or at the end of the line. Also note that LINE has a higher priority than BULLET (-1 vs. -2), so bullets are not matched inside code blocks (which look for plain lines), only inside paragraphs.
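A quick standalone check of that PARAGRAPH_LINE lookahead (the input string is invented):

import re

# The lookahead (?=(<\d>|\r|\n)) ends the match just before the first
# footnote anchor or line ending, without consuming either, so the
# anchor is left over to be lexed as its own FOOTNOTE_ANCHOR token.
m = re.match(r'.+?(?=(<\d>|\r|\n))', 'a paragraph with a footnote<1>.\n')
print(m.group(0))  # 'a paragraph with a footnote'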