How to split text read from a docx file at page breaks using python3 docx

I have a Word document (.docx file) of 10 pages with one paragraph on each page, where each page/paragraph is separated by a page break. I want to read the text in the docx file and split it at the page breaks.

I am able to read the text with the python-docx library, but I am not sure how to split it at the page breaks. I can see a similar question, but its solution was proposed using the old python-docx library.

Here's the code for reading text from the docx file:

from docx import Document

# Open the document and join the text of all paragraphs, one per line
document = Document("ex.docx")
docText = '\n'.join(
    paragraph.text for paragraph in document.paragraphs
)

You can use a regex to search for the form feed character \f, I think.

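import re

# text is assumed to be the document string (e.g. docText from the question)
pattern = re.compile(r"\f")  # \f matches a form feed character
matches = pattern.finditer(text)
for match in matches:
    print(f"Page break occurs at character {match.start()}")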

If 'text' is your document string, this prints the position of each page break in the string. You could then break the string up using those indices.
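
As a rough sketch of that last step, the match positions can be used to slice the string into pages. This assumes the string really contains form feed characters; whether python-docx puts them into paragraph.text is not confirmed here. The function name split_at_form_feeds and the sample string are made up for illustration.

```python
import re


def split_at_form_feeds(text):
    """Split text into pages, cutting at every form feed character."""
    pages = []
    start = 0  # index where the current page begins
    for match in re.finditer(r"\f", text):
        # Everything from the previous cut up to this form feed is one page
        pages.append(text[start:match.start()])
        start = match.end()
    # The remainder after the last form feed is the final page
    pages.append(text[start:])
    return pages


# Made-up sample string with two form feeds
sample = "first page\fsecond page\fthird page"
pages = split_at_form_feeds(sample)
for number, page in enumerate(pages, start=1):
    print(number, repr(page))
```

If the indices themselves are not needed, text.split("\f") gives the same list in one call.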

This could probably be adapted to work on the Document object directly, but I'm not 100% sure how.
