How to use python-docx to replace text in a Word document and save

The oodocx module mentioned on the same page refers the user to an /examples folder that does not seem to be there.
I have read the documentation of python-docx 0.7.2, plus everything I could find on Stack Overflow on the subject, so please believe that I have done my "homework".

Python is the only language I know (beginner+, maybe intermediate), so please do not assume any knowledge of C, Unix, XML, etc.

Task: Open an MS Word 2007+ document with a single line of text in it (to keep things simple) and replace any "key" word in Dictionary that occurs in that line of text with its dictionary value. Then close the document, keeping everything else the same.

Line of text (for example): "We shall linger in the chambers of the sea."

from docx import Document

document = Document('/Users/umityalcin/Desktop/Test.docx')

Dictionary = {'sea': 'ocean'}

sections = document.sections
for section in sections:
    print(section.start_type)

#Now, I would like to navigate, focus on, get to, whatever to the section that has my
#single line of text and execute a find/replace using the dictionary above.
#then save the document in the usual way.

document.save('/Users/umityalcin/Desktop/Test.docx')

I am not seeing anything in the documentation that allows me to do this. Maybe it is there but I don't get it, because not everything is spelled out at my level.

I have followed other suggestions on this site and have tried to use earlier versions of the module ( https://github.com/mikemaccana/python-docx ) that is supposed to have "methods like replace, advReplace", as follows: I open the source code in the Python interpreter and add the following at the end (this is to avoid clashes with the already installed version 0.7.2):

document = opendocx('/Users/umityalcin/Desktop/Test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    if word in Dictionary.keys():
        print "found it", Dictionary[word]
        document = replace(document, word, Dictionary[word])
savedocx(document, coreprops, appprops, contenttypes, websettings,
    wordrelationships, output, imagefiledict=None) 

Running this produces the following error message:

NameError: name 'coreprops' is not defined

Maybe I am trying to do something that cannot be done, but I would appreciate your help if I am missing something simple.

If this matters, I am using the 64-bit version of Enthought's Canopy on OSX 10.9.3.

UPDATE: There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx.

  1. This one will replace a regex-match with a replacement str. The replacement string will appear formatted the same as the first character of the matched string.
  2. This one will isolate a run such that some formatting can be applied to that word or phrase, like highlighting each occurrence of "foobar" in the text or perhaps making it bold or appear in a larger font.

The current version of python-docx does not have a search() function or a replace() function. These are requested fairly frequently, but an implementation for the general case is quite tricky and it hasn't risen to the top of the backlog yet.

Several folks have had success though, getting done what they need, using the facilities already present. Here's an example. It has nothing to do with sections by the way :)

for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        paragraph.text = 'new text containing ocean'

To search in tables as well, you would need to use something like:

for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if 'sea' in paragraph.text:
                    paragraph.text = paragraph.text.replace("sea", "ocean")

If you pursue this path, you'll probably discover pretty quickly what the complexities are. If you replace the entire text of a paragraph, that will remove any character-level formatting, like a word or phrase in bold or italic.

By the way, the code from @wnnmaw's answer is for the legacy version of python-docx and won't work at all with versions after 0.3.0.

I needed something to replace regular expressions in docx. I took scanny's answer. To handle style I used the answer from "Python docx Replace string in paragraph while keeping style", added a recursive call to handle nested tables, and came up with something like this:

import re
from docx import Document

def docx_replace_regex(doc_obj, regex , replace):

    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if regex.search(inline[i].text):
                    text = regex.sub(replace, inline[i].text)
                    inline[i].text = text

    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex , replace)



regex1 = re.compile(r"your regex")
replace1 = r"your replace string"
filename = "test.docx"
doc = Document(filename)
docx_replace_regex(doc, regex1 , replace1)
doc.save('result1.docx')

To iterate over a dictionary:

for word, replacement in dictionary.items():
    word_re=re.compile(word)
    docx_replace_regex(doc, word_re , replacement)

Note that this solution will replace a regex match only if the whole match has the same style in the document.

Also, if the text was edited after saving, text with the same style might end up in separate runs. For example, if you open a document that has the string "testabcd", change it to "test1abcd" and save, then even though it is all the same style there are 3 separate runs, "test", "1", and "abcd"; in this case replacement of test1 won't work.

This is for tracking changes in the document. To merge it into one run, in Word you need to go to "Options", "Trust Center" and in "Privacy Options" untick "Store random numbers to improve combine accuracy" and save the document.
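
If changing the Word setting is not an option, here is a minimal sketch of one way to cope with a match that is split over several runs: search the concatenated text of all runs, put the replacement into the first run the match touches, and cut the matched part out of the runs that follow. The name replace_across_runs is hypothetical (it is not part of python-docx); it only assumes that each run has a writable .text attribute, as python-docx runs do, and SimpleNamespace objects stand in for real runs so it can be tried without a document. The replacement ends up with the formatting of the first run of the match.

```python
from types import SimpleNamespace


def replace_across_runs(runs, old, new):
    """Replace each occurrence of `old` in the combined text of `runs`,
    even when a match straddles several runs. Returns the number of
    replacements made. Hypothetical helper, not part of python-docx."""
    if not old:
        return 0
    count = 0
    search_from = 0
    while True:
        texts = [run.text for run in runs]
        full = "".join(texts)
        start = full.find(old, search_from)
        if start < 0:
            return count
        end = start + len(old)
        offset = 0
        first = True
        for run, text in zip(runs, texts):
            run_start, run_end = offset, offset + len(text)
            offset = run_end
            if run_end <= start or run_start >= end:
                continue  # this run does not overlap the match
            lo = max(start, run_start) - run_start
            hi = min(end, run_end) - run_start
            # The first overlapping run receives the replacement text;
            # later overlapping runs only lose their matched part.
            run.text = text[:lo] + (new if first else "") + text[hi:]
            first = False
        count += 1
        # Continue after the inserted text so `new` is never re-scanned.
        search_from = start + len(new)


# Stand-in runs mimicking the "test", "1", "abcd" split described above.
demo_runs = [SimpleNamespace(text=t) for t in ("test", "1", "abcd")]
replaced = replace_across_runs(demo_runs, "test1", "done")
print(replaced, [run.text for run in demo_runs])
```

With a real document this would presumably be called once per paragraph, e.g. replace_across_runs(paragraph.runs, "test1", "done"). Emptied runs are left in place rather than deleted.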

I got much help from the earlier answers, but for me the code below works the way the simple find-and-replace function in Word would. Hope this helps.

#!pip install python-docx
#start from here if python-docx is installed
from docx import Document
#open the document
doc=Document('./test.docx')
Dictionary = {"sea": "ocean", "find_this_text":"new_text"}
for i in Dictionary:
    for p in doc.paragraphs:
        if p.text.find(i)>=0:
            p.text=p.text.replace(i,Dictionary[i])
#save changed document
doc.save('./test.docx')

The above solution has limitations: 1) the paragraph containing "find_this_text" will become plain text without any formatting, 2) content controls that are in the same paragraph as "find_this_text" will be deleted, and 3) "find_this_text" inside content controls or tables will not be changed.
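
One more thing to watch for, separate from the formatting issues: str.replace() matches substrings, so the key "sea" would also be replaced inside "season". Below is a minimal sketch of a whole-word replacement for a dictionary of keys using the standard re module. The name replace_whole_words is hypothetical. It assumes every key starts and ends with a letter, digit or underscore, since \b only marks a boundary next to such characters (placeholders like ${NAME} would need a different pattern), and matching is case-sensitive.

```python
import re


def replace_whole_words(text, mapping):
    """Replace every whole-word occurrence of a key of `mapping` in `text`
    with its value. Hypothetical helper for illustration."""
    if not mapping:
        return text
    # Longest keys first, so "sea shell" wins over "sea" when both are keys.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape() keeps characters such as "." or "(" in a key literal.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(0)], text)


Dictionary = {"sea": "ocean"}
print(replace_whole_words("The season by the sea.", Dictionary))
```

In the loops above, the result could be assigned to run.text (to keep formatting) or to p.text (which still discards it).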

Sharing a small script I wrote. It helps me generate legal .docx contracts with variables while preserving the original style.

pip install python-docx

Example:

from docx import Document
import os


def main():
    template_file_path = 'employment_agreement_template.docx'
    output_file_path = 'result.docx'

    variables = {
        "${EMPLOEE_NAME}": "Example Name",
        "${EMPLOEE_TITLE}": "Software Engineer",
        "${EMPLOEE_ID}": "302929393",
        "${EMPLOEE_ADDRESS}": "דרך השלום מנחם בגין דוגמא",
        "${EMPLOEE_PHONE}": "+972-5056000000",
        "${EMPLOEE_EMAIL}": "example@example.com",
        "${START_DATE}": "03 Jan, 2021",
        "${SALARY}": "10,000",
        "${SALARY_30}": "3,000",
        "${SALARY_70}": "7,000",
    }

    template_document = Document(template_file_path)

    for variable_key, variable_value in variables.items():
        for paragraph in template_document.paragraphs:
            replace_text_in_paragraph(paragraph, variable_key, variable_value)

        for table in template_document.tables:
            for col in table.columns:
                for cell in col.cells:
                    for paragraph in cell.paragraphs:
                        replace_text_in_paragraph(paragraph, variable_key, variable_value)

    template_document.save(output_file_path)


def replace_text_in_paragraph(paragraph, key, value):
    if key in paragraph.text:
        inline = paragraph.runs
        for item in inline:
            if key in item.text:
                item.text = item.text.replace(key, value)


if __name__ == '__main__':
    main()

For the table case, I had to modify @scanny's answer to:

for table in doc.tables:
    for col in table.columns:
        for cell in col.cells:
            for p in cell.paragraphs:

to make it work. Indeed, this does not seem to work with the current state of the API:

for table in document.tables:
    for cell in table.cells:

Same problem with the code from here: https://github.com/python-openxml/python-docx/issues/30#issuecomment-38658149
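
Since several answers here repeat the table loops, here is a small sketch of a generator that collects the paragraphs of the document body and of tables at any nesting depth in one place. The name iter_paragraphs is hypothetical; it relies only on the .paragraphs, .tables, .rows and .cells attributes already used in the answers above, and the SimpleNamespace objects are stand-ins so it runs without a .docx file. Note that it yields body paragraphs first and table paragraphs afterwards, not in document order.

```python
from types import SimpleNamespace


def iter_paragraphs(container):
    """Yield the paragraphs of `container` (a document or a table cell),
    then those of its tables, descending into nested tables.
    Hypothetical helper, not part of python-docx."""
    for paragraph in container.paragraphs:
        yield paragraph
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from iter_paragraphs(cell)


# Stand-ins for a document holding one table with a nested table.
def _para(text):
    return SimpleNamespace(text=text)


def _cell(paragraphs, tables=()):
    return SimpleNamespace(paragraphs=paragraphs, tables=list(tables))


def _table(cells):
    return SimpleNamespace(rows=[SimpleNamespace(cells=cells)])


inner = _table([_cell([_para("inner cell")])])
outer = _table([_cell([_para("outer cell")], tables=[inner])])
demo_doc = SimpleNamespace(paragraphs=[_para("body")], tables=[outer])

for paragraph in iter_paragraphs(demo_doc):
    print(paragraph.text)
```

With a real document, the replacement code would then be a single loop: for p in iter_paragraphs(doc): ...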

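The Office Dev Center has an entry in which a developer has published (MIT licensed at this time) a description of a couple of algorithms that appear to suggest a solution for this (albeit in C#, so they would need porting): "MS Dev Centre posting".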

The problem with your second attempt is that you haven't defined the parameters that savedocx needs. You need to do something like this before you save:

relationships = docx.relationshiplist()
title = "Document Title"
subject = "Document Subject"
creator = "Document Creator"
keywords = []

coreprops = docx.coreproperties(title=title, subject=subject, creator=creator,
                       keywords=keywords)
app = docx.appproperties()
content = docx.contenttypes()
web = docx.websettings()
word = docx.wordrelationships(relationships)
output = r"path\to\where\you\want\to\save"

He changed the API in docx py again...

For the sanity of everyone coming here:

import datetime
import os
from decimal import Decimal
from typing import NamedTuple

from docx import Document
from docx.document import Document as nDocument


class DocxInvoiceArg(NamedTuple):
  invoice_to: str
  date_from: str
  date_to: str
  project_name: str
  quantity: float
  hourly: int
  currency: str
  bank_details: str


class DocxService():
  tokens = [
    '@INVOICE_TO@',
    '@IDATE_FROM@',
    '@IDATE_TO@',
    '@INVOICE_NR@',
    '@PROJECTNAME@',
    '@QUANTITY@',
    '@HOURLY@',
    '@CURRENCY@',
    '@TOTAL@',
    '@BANK_DETAILS@',
  ]

  def __init__(self, replace_vals: DocxInvoiceArg):
    total = replace_vals.quantity * replace_vals.hourly
    invoice_nr = replace_vals.project_name + datetime.datetime.strptime(replace_vals.date_to, '%Y-%m-%d').strftime('%Y%m%d')
    self.replace_vals = [
      {'search': self.tokens[0], 'replace': replace_vals.invoice_to },
      {'search': self.tokens[1], 'replace': replace_vals.date_from },
      {'search': self.tokens[2], 'replace': replace_vals.date_to },
      {'search': self.tokens[3], 'replace': invoice_nr },
      {'search': self.tokens[4], 'replace': replace_vals.project_name },
      {'search': self.tokens[5], 'replace': replace_vals.quantity },
      {'search': self.tokens[6], 'replace': replace_vals.hourly },
      {'search': self.tokens[7], 'replace': replace_vals.currency },
      {'search': self.tokens[8], 'replace': total },
      {'search': self.tokens[9], 'replace': 'asdfasdfasdfdasf'},
    ]
    self.doc_path_template = os.path.dirname(os.path.realpath(__file__))+'/docs/'
    self.doc_path_output = self.doc_path_template + 'output/'
    self.document: nDocument = Document(self.doc_path_template + 'invoice_placeholder.docx')


  def save(self):
    for p in self.document.paragraphs:
      self._docx_replace_text(p)
    tables = self.document.tables
    self._loop_tables(tables)
    self.document.save(self.doc_path_output + 'testiboi3.docx')

  def _loop_tables(self, tables):
    for table in tables:
      for index, row in enumerate(table.rows):
        for cell in table.row_cells(index):
          if cell.tables:
            self._loop_tables(cell.tables)
          for p in cell.paragraphs:
            self._docx_replace_text(p)

        # for cells in column.
        # for cell in table.columns:

  def _docx_replace_text(self, p):
    print(p.text)
    for el in self.replace_vals:
      if (el['search'] in p.text):
        inline = p.runs
        # Loop added to work with runs (strings with same style)
        for i in range(len(inline)):
          print(inline[i].text)
          if el['search'] in inline[i].text:
            text = inline[i].text.replace(el['search'], str(el['replace']))
            inline[i].text = text
        print(p.text)

Test case:

from django.test import SimpleTestCase
from docx.table import Table, _Rows

from toggleapi.services.DocxService import DocxService, DocxInvoiceArg


class TestDocxService(SimpleTestCase):

  def test_document_read(self):
    ds = DocxService(DocxInvoiceArg(invoice_to="""
    WAW test1
    Multi myfriend
    """,date_from="2019-08-01", date_to="2019-08-30", project_name='WAW', quantity=10.5, hourly=40, currency='USD',bank_details="""
    Paypal to:
    bippo@bippsi.com"""))

    ds.save()

Have the folders docs and docs/output/ in the same folder as DocxService.py.

Be sure to parameterize and replace stuff.

The library python-docx-template is pretty useful for this. It's perfect for editing Word documents and saving them back to .docx format.

As some fellow users shared above, one of the challenges in finding and replacing text in a Word document is retaining styles when the word spans multiple runs. This can happen if the word has several styles or if it was edited multiple times while the document was being created. So simple code that assumes a word will be found completely within a single run is generally not correct, and the python-docx based code shared above may not work in many scenarios.

You can try the following API:

https://rapidapi.com/more.sense.tech@gmail.com/api/document-filter1

This has generic code to deal with these scenarios. The API currently only handles paragraph text; tabular text is not supported yet, and I will try to add that soon.
