Parse Docx file content w.r.t. headings

I want to parse the structure of a docx file and its content using python-docx. The file is structured using 'Heading 1' to 'Heading 6'. Under any heading, the content may take the form of a table element.

I understand how to extract the headings and the tables independently of each other using python-docx:

    from docx import Document

    # `result` and `_is_content` are my own helpers (not shown here)
    doc = Document("file.docx")
    for paragraph in doc.paragraphs:
        if paragraph.style == doc.styles['Heading 1']:
            indent = 1
            result.append('- %s' % paragraph.text.strip())
        elif paragraph.style == doc.styles['Heading 2']:
            indent = 2
            result.append('  ' * indent + '- %s:' % paragraph.text.strip())
        elif paragraph.style == doc.styles['Heading 3']:
            indent = 3
            result.append('  ' * indent + '- %s:' % paragraph.text.strip())
        [...]
        else:
            [...]

    # doc.tables only knows the tables, not the heading each one sits under
    for table in doc.tables:
        if _is_content(table.row_cells(0)[0].text):
            result.add_table(table)

My problem is preserving the structure. How do I find out under which heading a table sits in the source document?

You can extract the structural information from the docx file using its XML. Try this:

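    from docx import Document
    from docx.text.paragraph import Paragraph

    doc = Document("file.docx")
    headings = []  # extract only the heading texts with your own code
    tables = []    # extract the tables with your own code
    tags = []      # one tag per body element, in document order
    all_text = []
    schema = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    # The children of <w:body> are the top-level paragraphs and tables,
    # in the same order as they appear in the document.
    for child in doc.element.body.iterchildren():
        if child.tag == schema + 'tbl':
            tags.append('table')
        elif child.tag == schema + 'p':
            # Wrap the raw XML element to get the full paragraph text
            node_text = Paragraph(child, doc).text
            if node_text:
                if node_text in headings:
                    tags.append('heading')
                else:
                    tags.append('text')
                all_text.append(node_text)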

After running the code above, the tags list describes the structure of the document (headings, text, and tables) in document order, so you can map each entry back to the corresponding data in your lists.
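The mapping step can be sketched roughly as follows. This is only an illustration, not part of the original answer: the function name map_tables_to_headings is hypothetical, and it assumes that all_text holds one entry per 'heading' or 'text' tag and that tables holds every top-level table, unfiltered and in document order (for example list(doc.tables)). Plain strings stand in for the table objects so the sketch runs without a docx file.

```python
def map_tables_to_headings(tags, all_text, tables):
    """Pair each table with the most recent heading that precedes it.

    Hypothetical helper: assumes `all_text` has one entry per 'heading'
    or 'text' tag and `tables` has one entry per 'table' tag, both in
    document order.
    """
    pairs = []
    current_heading = None  # stays None for tables before the first heading
    text_idx = 0
    table_idx = 0
    for tag in tags:
        if tag == 'table':
            pairs.append((current_heading, tables[table_idx]))
            table_idx += 1
        else:
            if tag == 'heading':
                current_heading = all_text[text_idx]
            # both 'heading' and 'text' consume one entry of all_text
            text_idx += 1
    return pairs


# Stand-in data; with a real document, `tables` would hold Table objects
tags = ['heading', 'text', 'table', 'heading', 'table', 'table']
all_text = ['Intro', 'Some paragraph', 'Details']
tables = ['table A', 'table B', 'table C']

pairs = map_tables_to_headings(tags, all_text, tables)
for heading, table in pairs:
    print(heading, '->', table)
```

If the tables are filtered first (as with _is_content in the question), the counts no longer line up with the 'table' tags, so under these assumptions the filtering would have to happen after the mapping.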
