Maintaining document heading hierarchy with python-docx
I am developing an algorithm for extracting sections of a .docx file while maintaining the document structure. I managed to get the headings, but how do I get the content between headings and keep the heading hierarchy? This is what I have done so far.
Sample code:
from docx import Document

document = Document('headerEX.docx')
paragraphs = document.paragraphs

def iter_headings(paragraphs):
    # Yield only the paragraphs whose style name starts with 'Heading'.
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Heading'):
            yield paragraph

for heading in iter_headings(document.paragraphs):
    print(heading.text)
Something like this should give you a start:
sections = []
section_heading = None
section_paragraphs = []

for paragraph in document.paragraphs:
    if paragraph.style.name.startswith('Heading'):
        # A new heading closes the section collected so far.
        section = {
            'heading': section_heading,
            'paragraphs': section_paragraphs
        }
        sections.append(section)
        section_heading = paragraph.text
        section_paragraphs = []
        continue
    # Body paragraph: attach it to the current section.
    section_paragraphs.append(paragraph)

for section in sections:
    print(section['heading'])
    for paragraph in section['paragraphs']:
        print(paragraph.text)
As written, this may give you an empty section as the first one (its heading is None), and it will not capture the last section, because a section is only appended when the next heading is encountered. I leave those details to you as an exercise to strengthen your coding skills :)
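For anyone who wants to compare notes afterwards, here is one possible sketch that covers both edge cases and also nests sections by heading level, which is what the question asks about. It assumes the headings use the built-in style names 'Heading 1' through 'Heading 9'. The helper names (heading_level, build_outline, print_outline, load_outline) and the dict layout are made up for this example and are not part of python-docx.

```python
import re

# Assumption: headings use the built-in style names 'Heading 1' .. 'Heading 9'.
HEADING_RE = re.compile(r"^Heading ([1-9])$")


def heading_level(paragraph):
    """Return 1-9 for a 'Heading N' style, or None for any other paragraph."""
    match = HEADING_RE.match(paragraph.style.name)
    return int(match.group(1)) if match else None


def build_outline(paragraphs):
    """Group paragraphs into a tree of sections nested by heading level."""
    # The level-0 root collects any text that appears before the first heading.
    root = {"heading": None, "level": 0, "paragraphs": [], "children": []}
    stack = [root]  # path from the root to the section currently being filled
    for paragraph in paragraphs:
        level = heading_level(paragraph)
        if level is None:
            # Body text belongs to the most recently opened section.
            stack[-1]["paragraphs"].append(paragraph.text)
            continue
        # Close every open section at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        section = {"heading": paragraph.text, "level": level,
                   "paragraphs": [], "children": []}
        stack[-1]["children"].append(section)
        stack.append(section)
    return root


def print_outline(section, indent=0):
    """Print headings and body text, indented by nesting depth."""
    pad = "  " * indent
    if section["heading"] is not None:
        print(pad + section["heading"])
    for text in section["paragraphs"]:
        print(pad + "  " + text)
    for child in section["children"]:
        print_outline(child, indent + 1)


def load_outline(path):
    """Build the outline for a .docx file (requires python-docx)."""
    from docx import Document  # imported here so the helpers above work without it
    return build_outline(Document(path).paragraphs)
```

With the file from the question, print_outline(load_outline('headerEX.docx')) should print the nested outline. Because every section is attached to its parent as soon as its heading is seen, the last section is not lost, and text before the first heading ends up on the root instead of in a section with an empty heading. Note that document.paragraphs only covers top-level body paragraphs, so text inside tables is not included.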