
python - how to identify page breaks within a docx file, and create a list of text within each page

I have the following code, which splits out each paragraph of a docx file and appends it to a list, but I need to identify the page breaks within the XML tree structure and create a list of text for each page. Happy to provide the exact namespaces if that would be helpful:

xml_content = document.read('word/document.xml')
tree = XML(xml_content)
aggText = []
# PARA and TEXT are the previously defined, namespace-qualified Word tag names
for paragraph in tree.getiterator(PARA):
    texts = [node.text
             for node in paragraph.getiterator(TEXT)
             if node.text]
    if texts:
        aggText.append(''.join(texts))
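
For readers who want to run the snippet above, here is a minimal, self-contained sketch of the setup it relies on. It assumes `document` is a `zipfile.ZipFile` opened on the .docx file and that `PARA` and `TEXT` are the namespace-qualified tags for `w:p` and `w:t`; the function name `read_paragraphs` is hypothetical. It uses `iter()` rather than `getiterator()`, since `getiterator()` was removed from `xml.etree.ElementTree` in Python 3.9.

```python
import zipfile
from xml.etree.ElementTree import XML

# WordprocessingML main namespace, written in ElementTree's {uri}tag notation
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def read_paragraphs(docx_file):
    """Return the non-empty paragraph texts of a .docx (path or file object)."""
    # A .docx file is a zip archive; the body lives in word/document.xml
    with zipfile.ZipFile(docx_file) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)
    agg_text = []
    for paragraph in tree.iter(PARA):
        # Join the text of every run in the paragraph
        texts = [node.text for node in paragraph.iter(TEXT) if node.text]
        if texts:
            agg_text.append(''.join(texts))
    return agg_text
```

Calling `read_paragraphs('example.docx')` (a placeholder path) would return one string per non-empty paragraph, with no page information yet.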

I imagine the updated loop will look something like the code below, but I'm unsure how to locate the page break within the XML tree structure:

aggText = []
for paragraph in tree.getiterator(PARA):
    texts = [node.text
             for node in paragraph.getiterator(TEXT)
             if node.text]
    # page breaks in the xml read 'w:lastRenderedPageBreak'
    # below doesn't work, need a way to search raw xml for the page break identifier
    if texts.count(lastRenderedPageBreak) > 0:
        pages = aggText.append(''.join(texts))
        texts = []

Any ideas would be greatly appreciated!

I created a Word doc in MS Word 2011 on a Mac.

The document, saved as a Word XML document, has 4 pages with the following content on each page:

  1. Page 1
  2. Page 2
  3. [empty on purpose]
  4. Page 4

The relevant XML is as follows:

<w:t>Page1</w:t></w:r></w:p><w:p w14:paraId="7DC7FC1F" w14:textId="77777777" w:rsidR="00147F82" w:rsidRDefault="00147F82"><w:r><w:br w:type="page"/></w:r></w:p><w:p w14:paraId="7C202865" w14:textId="77777777" w:rsidR="00E3126A" w:rsidRDefault="00147F82"><w:r><w:lastRenderedPageBreak/><w:t>Page2</w:t></w:r></w:p><w:p w14:paraId="78BAA3B3" w14:textId="77777777" w:rsidR="00E3126A" w:rsidRDefault="00E3126A"><w:r><w:br w:type="page"/></w:r></w:p><w:p w14:paraId="2B26F15B" w14:textId="77777777" w:rsidR="00E3126A" w:rsidRDefault="00E3126A"><w:r><w:br w:type="page"/></w:r></w:p><w:p w14:paraId="1005F61F" w14:textId="77777777" w:rsidR="00C66DE3" w:rsidRDefault="00E3126A"><w:r><w:t>Page4</w:t>

Between each pair of pages there is a <w:br w:type="page"/> element inside a run (<w:r>).
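
As a quick check of that observation, the sketch below counts both kinds of break marker in a simplified, well-formed stand-in for the fragment above. The `w14` and `rsid` attributes are dropped and a `w:body` wrapper is added, both of which are assumptions made only so the sample parses on its own; the helper name `is_page_break` is hypothetical.

```python
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

# Simplified stand-in for the fragment above (extra attributes dropped)
SAMPLE = (
    '<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:p><w:r><w:t>Page1</w:t></w:r></w:p>'
    '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'
    '<w:p><w:r><w:lastRenderedPageBreak/><w:t>Page2</w:t></w:r></w:p>'
    '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'
    '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'
    '<w:p><w:r><w:t>Page4</w:t></w:r></w:p>'
    '</w:body>'
)


def is_page_break(node):
    """True for an explicit <w:br w:type="page"/> element."""
    # Namespaced attributes use the same {uri}name notation as tags
    return node.tag == W + 'br' and node.get(W + 'type') == 'page'


tree = XML(SAMPLE)
explicit_breaks = [node for node in tree.iter() if is_page_break(node)]
rendered_breaks = list(tree.iter(W + 'lastRenderedPageBreak'))
print(len(explicit_breaks), len(rendered_breaks))  # prints: 3 1
```

In this sample the three explicit breaks account for all four pages, while `w:lastRenderedPageBreak` appears only once, so the two markers are not interchangeable.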

The solution is as follows. I also switched from getiterator() to iter(), since getiterator() is deprecated.

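NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = NAMESPACE + 'p'
TEXT = NAMESPACE + 't'
PAGE = NAMESPACE + 'lastRenderedPageBreak'

# 'tree' is the parsed word/document.xml from the question
pages = []
aggText = ''
for paragraph in tree.iter(PARA):
    # add the text of every run in this paragraph to the current page
    aggText += ''.join([node.text
                        for node in paragraph.iter(TEXT)
                        if node.text])
    # a paragraph containing a rendered page break closes the current page
    if aggText and [node for node in paragraph.iter(PAGE)]:
        pages.append(aggText)
        aggText = ''
# keep any text that follows the last break
if aggText != '':
    pages.append(aggText)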
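
The loop above keys on `w:lastRenderedPageBreak`, which Word records where a page boundary fell when the document was last rendered. As a possible alternative (not the method used in the answer), the sketch below splits on explicit `<w:br w:type="page"/>` elements instead, so that empty pages are preserved. The function name `split_pages` and the choice to join paragraphs with newlines are assumptions, and automatic page breaks caused by text flow are not detected by this approach.

```python
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def split_pages(tree):
    """Split text on explicit page breaks; paragraphs on a page are joined by newlines."""
    pages = []
    current = []  # paragraph texts of the page being built
    for paragraph in tree.iter(W + 'p'):
        buffer = ''
        # iter() walks the paragraph's descendants in document order
        for node in paragraph.iter():
            if node.tag == W + 't' and node.text:
                buffer += node.text
            elif node.tag == W + 'br' and node.get(W + 'type') == 'page':
                # Text before the break belongs to the page that is ending
                if buffer:
                    current.append(buffer)
                pages.append('\n'.join(current))
                current, buffer = [], ''
        if buffer:
            current.append(buffer)
    # Whatever follows the last break forms the final page
    pages.append('\n'.join(current))
    return pages
```

On a document shaped like the four-page sample above, this would yield one entry per page, with an empty string for the page that was left blank on purpose.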

