
How to parse a markdown file to JSON in Python?

I have many markdown files with titles, subheadings, sub-subheadings, etc.

I'm interested in parsing them into JSON that separates, for each heading, the text and the subheadings under it.

For example, I want the following markdown file:

outer1
outer2

# title 1
text1.1

## title 1.1
text1.1.1

# title 2
text 2.1

to be parsed into:

{
  "text": [
    "outer1",
    "outer2"
  ],
  "inner": [
    {
      "section": [
        {
          "title": "title 1",
          "inner": [
            {
              "text": [
                "text1.1"
              ],
              "inner": [
                {
                  "section": [
                    {
                      "title": "title 1.1",
                      "inner": [
                        {
                          "text": [
                            "text1.1.1"
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        },
        {
          "title": "title 2",
          "inner": [
            {
              "text": [
                "text 2.1"
              ]
            }
          ]
        }
      ]
    }
  ]
}

To further illustrate the need, notice how the inner heading is nested inside the outer one, whereas the second outer heading is not.

I tried using pyparsing to solve this, but it seems unable to achieve it: to get section "title 2" onto the same level as "title 1", I need some sort of "counting logic" to check that the number of "#" characters in a new header is less than or equal to the current level, and that is something I can't seem to express.

Is this an issue with the expressiveness of pyparsing? Is there another kind of parser that could achieve this?

I could implement this in pure Python, but I wanted to do something better.


Here is my current pyparsing implementation, which doesn't work for the reason explained above:

import pyparsing as pp

section = pp.Forward()("section")
inner_block = pp.Forward()("inner")

# a "line" is one or more printable words up to the end of the line
line = pp.Combine(
    pp.OneOrMore(pp.Word(pp.unicode.Latin1.printables), stop_on=pp.LineEnd()),
    join_string=' ', adjacent=False)

start_section = pp.OneOrMore(pp.Word("#"))
title = start_section.suppress() + line('title')

text = ~title + pp.OneOrMore(line, stop_on=(pp.LineEnd() + pp.FollowedBy("#")))

inner_block <<= pp.Group(section | (text('text') + pp.Optional(section)))

section <<= pp.Group(title + pp.Optional(inner_block))

markdown = pp.OneOrMore(inner_block)


test = """\
out1
out2

# title 1
text1.1

# title 2
text2.1

"""

res = markdown.parse_string(test, parse_all=True).as_dict()
test_eq(res, dict(  # test_eq: simple equality assertion (e.g. fastcore.test.test_eq)
    inner=[
        dict(
            text = ["out1", "out2"],
            section=[
                dict(title="title 1", inner=[
                    dict(
                        text=["text1.1"]
                    ),
                ]),
                dict(title="title 2", inner=[
                    dict(
                        text=["text2.1"]
                    ),
                ]),
            ]
        )
    ]
))

I took a slightly different approach to this problem, using scan_string instead of parse_string, and doing more of the data-structure management and storage in a scan_string loop instead of in the parser itself with parse actions.

scan_string scans the input and, for each match found, returns the matched tokens as a ParseResults, plus the start and end locations of the match in the source string.

Starting with an import, I define an expression for a title line:

import pyparsing as pp

# define a pyparsing expression that will match a line with leading '#'s
title = pp.AtLineStart(pp.Word("#")) + pp.rest_of_line
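To see what scan_string returns for this expression, here is a quick self-contained check (the sample text is made up for illustration; it repeats the definitions above so it can run on its own):

```python
import pyparsing as pp

# same title expression as above
title = pp.AtLineStart(pp.Word("#")) + pp.rest_of_line

# a tiny made-up sample, just to show the (tokens, start, end) triples
demo = "# one\nbody\n## two\n"
matches = [(t.as_list(), start, end) for t, start, end in title.scan_string(demo)]
for tokens, start, end in matches:
    print(tokens, start, end)
```

Each match carries the '#' marker, the remainder of the line (with its leading whitespace, hence the lstrip calls below), and the offsets of the match in the source string.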

To get ready to gather data by title, I define a title_stack list, and a last_end int to keep track of the end of the last title found (so we can slice out the contents of the last title that was parsed). I initialize this stack with a fake entry representing the start of the file:

title_stack = []
last_end = 0

# initialize title_stack with a level-0 title at the start of the file
title_stack.append([0, '<start of file>'])

Here is the scan loop using scan_string:

for t, start, end in title.scan_string(sample):
    # save content since last title in the last item in title_stack
    title_stack[-1].append(sample[last_end:start].lstrip("\n"))

    # add a new entry to title_stack
    marker, title_content = t
    level = len(marker)
    title_stack.append([level, title_content.lstrip()])

    # update last_end to the end of the current match
    last_end = end

# add trailing text to the final parsed title
title_stack[-1].append(sample[last_end:])

At this point, title_stack contains a list of 3-element lists: the title level, the title text, and the body text for that title. Here is the result for your sample markdown:

[[0, '<start of file>', 'outer1\nouter2\n\n'],
 [1, 'title 1', 'text1.1\n\n'],
 [2, 'title 1.1', 'text1.1.1\n\n'],
 [1, 'title 2', 'text 2.1']]

From here, you should be able to walk this list and convert it into your desired tree structure.
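One possible sketch of that final walk (the build_tree helper below is my own illustration, not part of the answer's code): keep a stack of (level, node) pairs, and on each title pop back to the nearest shallower ancestor, mirroring the "counting logic" the question asked about.

```python
def build_tree(title_stack):
    # hypothetical helper: fold the flat [level, title, body] rows into the
    # nested {"text": ..., "inner": [{"section": [...]}]} shape from the question
    def make_node(body):
        lines = [ln for ln in body.splitlines() if ln.strip()]
        return {"text": lines} if lines else {}

    root = make_node(title_stack[0][2])       # level-0 "<start of file>" entry
    stack = [(0, root)]                       # (heading level, node), deepest last
    for level, title, body in title_stack[1:]:
        # pop back to the nearest ancestor with a shallower heading level
        while stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1]
        # same-level sections share one "section" list under the parent node
        sections = parent.setdefault("inner", [{"section": []}])[0]["section"]
        node = make_node(body)
        sections.append({"title": title, "inner": [node]})
        stack.append((level, node))
    return root

rows = [
    [0, '<start of file>', 'outer1\nouter2\n\n'],
    [1, 'title 1', 'text1.1\n\n'],
    [2, 'title 1.1', 'text1.1.1\n\n'],
    [1, 'title 2', 'text 2.1'],
]
tree = build_tree(rows)
```

For the sample rows above this reproduces the nesting from the question: "title 1.1" ends up inside "title 1", while "title 2" pops back to the top level.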

