Parsing elements/fields without end marker, problems with non-greedy regex, usage of a custom lexer

I want to be able to parse files in the Textile markup language (https://textile-lang.com/) in order to convert them to LaTeX. The file I have is a slight extension of Textile, since it adds fields and footnotes. An example file is given below.

test.textile

#[contents]#
p. This is a paragraph.

With some bullet points:

* Bullet 1
* Bullet 2
* Bullet 3

And a code block:

bc.. # Program to display the Fibonacci sequence up to n-th term

"""\
* Program to display the Fibonacci sequence up to n-th term
"""

search_string = r"<5>"

nterms = int(input("How many terms? "))

# first two terms
n1, n2 = 0, 1
count = 0

p. And after the block of code another paragraph, with a footnote<1>.

bc. fn1. This is the footnote contents

p. And here is another paragraph

#[second_field]#
Some more contents

To parse the file I have the following parser.

parser.py

from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements')
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree)

And the following grammar. This grammar does not yet parse bullet points and footnotes, because I already run into another problem.

grammar.lark

elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: FIELD_NAME
content: contents*
?contents: paragraph
    | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? STR

FIELD_START: /(\A|[\r\n]{2,})#\[/
FIELD_NAME: /[^\]]+/
FIELD_END: /\]#[\r\n]/
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
STR: /.+/s

When I run the parser, I get the following output.

output

Tree(Token('RULE', 'elements'), [
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('FIELD_NAME', 'contents')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\nbc.. # Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"<5>"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\np. And after the block of code another paragraph, with a footnote<1>.\n\nbc. fn1. This is the footnote contents\n\np. And here is another paragraph\n\n#[second_field]#\nSome more contents\n')])])])])

The whole rest of the file is parsed as the paragraph, which is sort of correct, since /.+/s can match anything. So, I changed the definition of STR to /.+?/s to make it non-greedy, but now the output is as follows (pretty-printed):

output

elements
  element
    field
      #[
      field_name        contents
      ]#

    content
      paragraph
        p.
        T
      paragraph h
      paragraph i
      paragraph s
      paragraph
      paragraph i
      paragraph s
      paragraph
      paragraph a
      paragraph
--snip--
      paragraph

      paragraph #
      paragraph [
      paragraph s
      paragraph e
      paragraph c
      paragraph o
      paragraph n
      paragraph d
      paragraph _
      paragraph f
      paragraph i
      paragraph e
      paragraph l
      paragraph d
      paragraph ]
      paragraph #

It parses each character as a paragraph, and it still parses the whole file as paragraph elements.
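
This is expected from the regex alone: a lazy quantifier is satisfied by the shortest possible match, which here is a single character. A minimal check with plain re (a sketch, not part of the original post) shows it:

import re

# A lazy /.+?/s is already satisfied by a single character, so the parser
# is free to close an STR token after every character.
print(re.match(r".+?", "This is a paragraph", re.S).group(0))  # prints 'T'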

My first solution to this problem was to create a lexer, which creates tokens for FIELD_START, FIELD_END, PARAGRAPH_START, CODE_BLOCK_START and footnote-related tokens.

My lexer looks as follows:

from lark import Lark
from lark.lexer import Lexer, Token
import re

class MyLexer(Lexer):
    def __init__(self, *args, **kwargs):
        pass

    def lex(self, data):
        tokens = {
            "FIELD_START": r"(?:\A|[\r\n]{2,})#\[",
            "FIELD_END": r"\]#[\r\n]",
            "FOOTNOTE_ANCHOR": r"<\d>",
            "FOOTNOTE_START": r"bc\. fn\d\. ",
            "PARAGRAPH_START": r"p\. ",
            "CODE_BLOCK_START": r"bc\.\.? ",
        }
        # Split on any of the token regexes; the capturing groups keep the
        # matched delimiters in the result list.
        regex = '|'.join([f"({r})" for r in tokens.values()])
        for x in re.split(regex, data):
            if not x:
                continue
            for token_type, token_regex in tokens.items():
                if re.match(token_regex, x):
                    yield Token(token_type, x)
                    break
            else:
                yield Token("STR", x)

parser = Lark(grammar, lexer=MyLexer, start='elements')

It creates a regex based on the given tokens, then splits the whole string with that regex and returns everything as a token, either one of the defined token types or "STR". The new grammar looks as follows:

elements: element+
?element: field content
?field: FIELD_START field_name FIELD_END
field_name: STR
content: contents*
?contents: STR 
    | paragraph 
    | code_block
code_block: CODE_BLOCK_START STR
paragraph: PARAGRAPH_START? paragraph_contents+
?paragraph_contents: STR 
    | FOOTNOTE_ANCHOR
    | footnote
footnote: FOOTNOTE_START STR

%declare FIELD_START FIELD_END FOOTNOTE_ANCHOR FOOTNOTE_START PARAGRAPH_START CODE_BLOCK_START STR
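
As a quick standalone check of the lexer (a small sketch, not part of the original post), feeding it a one-line snippet shows how the input is split into tokens:

# Assumes the MyLexer class defined above.
for token in MyLexer().lex("p. A paragraph with a footnote<1>.\n"):
    print(token.type, repr(token.value))
# PARAGRAPH_START 'p. '
# STR 'A paragraph with a footnote'
# FOOTNOTE_ANCHOR '<1>'
# STR '.\n'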

The output of the parser is as follows:

Tree(Token('RULE', 'elements'), [
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('STR', 'contents')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'This is a paragraph.\n\nWith some bullet points:\n\n* Bullet 1\n* Bullet 2\n* Bullet 3\n\nAnd a code block:\n\n')]), 
            Tree(Token('RULE', 'code_block'), [
                Token('CODE_BLOCK_START', 'bc.. '), 
                Token('STR', '# Program to display the Fibonacci sequence up to n-th term\n\n"""\\\n* Program to display the Fibonacci sequence up to n-th term\n"""\n\nsearch_string = r"')]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('FOOTNOTE_ANCHOR', '<5>'), 
                Token('STR', '"\n\nnterms = int(input("How many terms? "))\n\n# first two terms\nn1, n2 = 0, 1\ncount = 0\n\n')]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'And after the block of code another paragraph, with a footnote'), 
                Token('FOOTNOTE_ANCHOR', '<1>'), 
                Token('STR', '.\n\n'), 
                Tree(Token('RULE', 'footnote'), [
                    Token('FOOTNOTE_START', 'bc. fn1. '), 
                    Token('STR', 'This is the footnote contents\n\n')])]), 
            Tree(Token('RULE', 'paragraph'), [
                Token('PARAGRAPH_START', 'p. '), 
                Token('STR', 'And here is another paragraph')])])]), 
    Tree(Token('RULE', 'element'), [
        Tree(Token('RULE', 'field'), [
            Token('FIELD_START', '\n\n#['), 
            Tree(Token('RULE', 'field_name'), [
                Token('STR', 'second_field')]), 
            Token('FIELD_END', ']#\n')]), 
        Tree(Token('RULE', 'content'), [
            Token('STR', 'Some more contents\n')])])])

This correctly parses the different fields and footnotes; however, the code block is interrupted by a detected FOOTNOTE_ANCHOR. Because the lexer has no notion of context, it tries to match footnote anchors inside code blocks as well. The same problem would occur when trying to parse bullet points.

What is the best solution to this problem? Do I really need a lexer? Is my lexer implemented correctly? (I can find very few examples of how to use a custom lexer for text.) Can I maybe lex only some tokens and leave the rest to a "parent" lexer?

Based on recognizing multi-line sections with lark grammar I was able to find a solution.

The important part is not to use /.+/s to match multiple lines, because then the parser never gets the opportunity to match other tokens. It is better to match line by line, so that the parser can start a new rule on each line. I also switched the parser to "lalr"; the grammar did not work with the default parser.
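
To illustrate the idea in isolation, here is a minimal, self-contained sketch (not the final grammar): a low-priority per-line LINE terminal lets the LALR parser decide at every line break whether a new block starts, with the higher-priority block-start terminals winning at the beginning of a line.

from lark import Lark

# Minimal sketch: each block consumes LINE NEWLINE pairs, so the lexer
# emits one token per line instead of one token for the rest of the file.
demo_grammar = r"""
start: (paragraph | code_block)+
paragraph: PARAGRAPH_START (LINE NEWLINE)+
code_block: CODE_BLOCK_START (LINE NEWLINE)+
PARAGRAPH_START: /p\. /
CODE_BLOCK_START: /bc\.\.? /
LINE.-1: /.+/
%import common.NEWLINE
"""

demo_parser = Lark(demo_grammar, parser="lalr")
print(demo_parser.parse("p. one line\nbc. code line\n").pretty())

The full solution applies the same idea: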

parser.py

from lark import Lark

def read_file(filename):
    with open(filename) as f:
        return f.read()

grammar = read_file('grammar.lark')
parser = Lark(grammar, start='elements', parser="lalr")
textile = read_file('test.textile')
tree = parser.parse(textile)
print(tree.pretty())

grammar.lark

elements: element+
?element: field content
?field: NEWLINE* FIELD_START field_name FIELD_END NEWLINE
field_name: FIELD_NAME
content: contents*
?contents: paragraph
    | code_block
code_block: CODE_BLOCK_START (LINE NEWLINE)+
paragraph: PARAGRAPH_START? (paragraph_line | bullets | footnote)+
bullets: (BULLET paragraph_line)+
footnote: FOOTNOTE_START LINE NEWLINE
paragraph_line: (PARAGRAPH_LINE | FOOTNOTE_ANCHOR)+ NEWLINE

FIELD_START: "#["
FIELD_NAME: /[^\]]+/
FIELD_END: "]#"
FOOTNOTE_ANCHOR: /<\d>/
FOOTNOTE_START: /bc\. fn\d\. /
CODE_BLOCK_START: /bc\.\.? /
PARAGRAPH_START: /p\. /
LINE.-1: /.+/
BULLET.-2: "*"
PARAGRAPH_LINE.-3: /.+?(?=(<\d>|\r|\n))/

%import common.NEWLINE

The output:

elements
  element
    field
      #[
      field_name        contents
      ]#


    content
      paragraph
        p.
        paragraph_line
          This is a paragraph.



        paragraph_line
          With some bullet points:



        bullets
          *
          paragraph_line
             Bullet 1


          *
          paragraph_line
             Bullet 2


          *
          paragraph_line
             Bullet 3



        paragraph_line
          And a code block:



      code_block
        bc..
        # Program to display the Fibonacci sequence up to n-th term



        """\


        * Program to display the Fibonacci sequence up to n-th term


        """



        search_string = r"<5>"



        nterms = int(input("How many terms? "))



        # first two terms


        n1, n2 = 0, 1


        count = 0



      paragraph
        p.
        paragraph_line
          And after the block of code another paragraph, with a footnote
          <1>
          .



        footnote
          bc. fn1.
          This is the footnote contents



      paragraph
        p.
        paragraph_line
          And here is another paragraph



  element
    field
      #[
      field_name        second_field
      ]#


    content
      paragraph
        paragraph_line
          Some more contents

Note that the parser also correctly parses bullets and footnotes. In order to parse footnote anchors within a line, I have made a special PARAGRAPH_LINE terminal, which stops at the first footnote anchor it encounters or at the end of the line. Also note that LINE has precedence over BULLET, so bullets will not be matched in a code block (since it looks for a normal line), only in a paragraph.
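
To see how that terminal behaves on its own, here is its regex in plain re (a small sketch, not part of the original post):

import re

# PARAGRAPH_LINE: lazy match up to, but not including, the first footnote
# anchor or the end of the line, thanks to the lookahead.
PARAGRAPH_LINE = re.compile(r".+?(?=(<\d>|\r|\n))")

line = "And after the block of code another paragraph, with a footnote<1>.\n"
print(PARAGRAPH_LINE.match(line).group(0))
# And after the block of code another paragraph, with a footnote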
