How to identify page breaks in a .docx file using python-docx

I have several .docx files, each containing 300+ press releases of 1-2 pages each, that need to be separated into individual text files. The only consistent way to tell the articles apart is that there is always, and only, a page break between two articles.

However, I don't know how to find page breaks when converting the enclosing Word documents to text; with my current script, the page break information is lost in the conversion.

I want to know how to preserve hard page breaks when converting a .docx file to .txt. It doesn't matter what they look like in the text file, as long as they are uniquely identifiable when scanning the text file later.

Here is the script I am using to convert the .docx files to .txt:

import io
import os

from docx import opendocx, getdocumenttext  # legacy python-docx (0.2.x) API


def docx2txt(file_path):
    document = opendocx(file_path)
    # One unicode string per paragraph, in document order.
    paratextlist = getdocumenttext(document)
    # Write "<name>.txt" next to the source file, encoded as UTF-8.
    txt_path = os.path.splitext(file_path)[0] + ".txt"
    with io.open(txt_path, "w", encoding="utf-8") as text_file:
        text_file.write(u"\n\n".join(paratextlist))

A hard page break appears as a <w:br> element with w:type="page" inside a run element (<w:r>), something like this:

<w:p>
  <w:r>
    <w:t>some text</w:t>
    <w:br w:type="page"/>
  </w:r>
</w:p>
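
To make the structure concrete, here is a minimal sketch that counts such elements in a fragment of WordprocessingML. It uses only the standard library's xml.etree.ElementTree rather than the docx module, and the SAMPLE fragment and the count_page_breaks name are made up for illustration. Note that a <w:br> without w:type="page" is an ordinary line break and is not counted.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment: one hard page break and one plain line break.
SAMPLE = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t>some text</w:t><w:br w:type="page"/></w:r>'
    "<w:r><w:br/><w:t>more text</w:t></w:r>"
    "</w:p>" % W_NS
)


def count_page_breaks(xml_text):
    """Count <w:br> elements whose w:type attribute is "page"."""
    root = ET.fromstring(xml_text)
    type_attr = "{%s}type" % W_NS
    return sum(
        1
        for br in root.iter("{%s}br" % W_NS)
        if br.get(type_attr) == "page"
    )


print(count_page_breaks(SAMPLE))
```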

So one approach would be to replace all those occurrences with a distinctive string of text, such as "{{foobar}}", before extracting the text, so that the marker ends up in the .txt output.

An implementation of that would look something like this:

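from lxml import etree
from docx import nsprefixes  # legacy python-docx (0.2.x) API

W_NS = nsprefixes['w']

# `document` is the lxml element returned by opendocx(file_path).
page_br_elements = document.xpath(
    "//w:p/w:r/w:br[@w:type='page']", namespaces={'w': W_NS}
)
for br in page_br_elements:
    # lxml needs Clark notation ({namespace}tag); a literal 'w:t' is rejected.
    t = etree.Element('{%s}t' % W_NS, nsmap={'w': W_NS})
    t.text = '{{foobar}}'
    # Put the marker where the break was, then drop the break itself.
    br.addprevious(t)
    br.getparent().remove(br)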

I haven't had time to test this, so you might run into a missing import or similar, but everything you need should already be in the docx module. The rest is lxml method calls on _Element.
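
Once the marker is in the converted text, separating the articles is a plain string split. The following is a rough sketch of that last step, assuming the marker is "{{foobar}}" as above; the split_articles and write_articles helpers and the output file naming are my own choices, not part of the docx module.

```python
import os
import tempfile

MARKER = "{{foobar}}"  # the placeholder string suggested above


def split_articles(text, marker=MARKER):
    """Split converted text on the marker, dropping empty pieces."""
    return [part.strip() for part in text.split(marker) if part.strip()]


def write_articles(articles, out_dir, prefix="article"):
    """Write each article to its own numbered UTF-8 text file."""
    paths = []
    for index, article in enumerate(articles, start=1):
        path = os.path.join(out_dir, "%s_%03d.txt" % (prefix, index))
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(article + "\n")
        paths.append(path)
    return paths


# Made-up sample standing in for the converted .txt content.
sample = "First release.\n\nBody one.{{foobar}}\n\nSecond release.\n\nBody two."
articles = split_articles(sample)
paths = write_articles(articles, tempfile.mkdtemp())
print(len(articles), [os.path.basename(p) for p in paths])
```

Pick a marker that cannot occur in the press releases themselves, otherwise an article would be split in the wrong place.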

Let me know how you get on and I can tweak this if needed.
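
If you would rather not depend on the docx module at all, the same idea can be sketched with the standard library alone, since a .docx file is a zip archive whose main body lives in word/document.xml. This is an illustrative alternative, not the approach above: the function names are hypothetical, it only looks at explicit <w:br w:type="page"/> elements (breaks produced by a paragraph's "page break before" setting are not detected), and it does not handle paragraphs nested inside text boxes.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def split_body_on_page_breaks(document_xml):
    """Return one text string per page-break-delimited section."""
    root = ET.fromstring(document_xml)
    sections, paragraphs = [], []
    for para in root.iter(W + "p"):
        current = []
        for node in para.iter():
            if node.tag == W + "t" and node.text:
                current.append(node.text)
            elif node.tag == W + "br" and node.get(W + "type") == "page":
                # Hard page break: close the section collected so far.
                paragraphs.append("".join(current))
                sections.append("\n\n".join(p for p in paragraphs if p))
                paragraphs, current = [], []
        paragraphs.append("".join(current))
    sections.append("\n\n".join(p for p in paragraphs if p))
    return [s for s in sections if s]


def split_docx(path_or_file):
    """Read word/document.xml from a .docx archive and split it."""
    with zipfile.ZipFile(path_or_file) as archive:
        return split_body_on_page_breaks(archive.read("word/document.xml"))


# Build a tiny in-memory archive so the sketch runs without a real file.
demo_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    "<w:p><w:r><w:t>Release one</w:t></w:r></w:p>"
    '<w:p><w:r><w:t>Body one</w:t><w:br w:type="page"/></w:r></w:p>'
    "<w:p><w:r><w:t>Release two</w:t></w:r></w:p>"
    "</w:body></w:document>" % W_NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", demo_xml)
buffer.seek(0)
print(split_docx(buffer))
```

With a real file you would call split_docx("releases.docx") (a made-up file name) and write each returned string to its own text file.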
