How to search and replace a word/text in a Word document using python-docx

For example, consider the paragraphs below in a Word document. The paragraphs are inside a table.

  1. Ok Guys Please get up
  2. Ok Guys Please getting up.

I'm trying to replace "get" with "wake", but only in paragraph 1, where "get" is a separate word. With the code given below, it gets replaced in both paragraphs, as shown in the result below. This behavior is the same for all paragraphs in a Word document. Please suggest a way to meet the above requirement.

Actual Result: 1. Ok Guys Please wake up. 2. Ok Guys Please waketing up.

import docx

doc = docx.Document("path/docss.docx")

def Search_replace_text():
    # `word` and `replace` are GUI input widgets defined elsewhere
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    for run in paragraph.runs:
                        # substring test: also matches "get" inside "getting"
                        if str(word.get()) in run.text:
                            text = run.text.split(str(word.get()))  # Gets input from GUI
                            if text[1] == " ":
                                run.text = text[0] + str(replace.get())  # Gets input from GUI
                                print(run.text)
                            else:
                                run.text = text[0] + str(replace.get()) + text[1]
    # save once, after every table has been processed
    doc.save("docss.docx")

I want the result as shown below:

  1. Ok Guys Please wake up.

  2. Ok Guys Please getting up.

Actual Result:

  1. Ok Guys Please wake up.

  2. Ok Guys Please waketing up.

Replace

if str(word.get()) in run.text:

with a little formatting

if ' {} '.format(str(word.get())) in run.text:

to search for the separate word (surrounded by two spaces).
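A space-padded test will not match a word that sits at the very start or end of a run, or next to punctuation. One alternative is a regular-expression word boundary. The following is a minimal sketch using only the standard `re` module; the helper name `replace_whole_word` is hypothetical, and the sketch assumes the search word begins and ends with a letter, digit, or underscore so that `\b` behaves as expected.

```python
import re


def replace_whole_word(text, word, replacement):
    """Replace `word` only where it stands alone, not inside a longer word."""
    # re.escape guards against regex metacharacters in the search word;
    # \b matches at a transition between word and non-word characters.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    # A callable replacement avoids backslash interpretation in `replacement`.
    return pattern.sub(lambda match: replacement, text)


print(replace_whole_word("Ok Guys Please get up", "get", "wake"))
print(replace_whole_word("Ok Guys Please getting up.", "get", "wake"))
```

Inside the run loop this could be applied as `run.text = replace_whole_word(run.text, str(word.get()), str(replace.get()))`, which leaves "getting" untouched as long as the word is not split across runs.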

The problem with replacing text in runs is that the text can be split over multiple runs, which means a simple find and replace on each run's text will not always work.

Adapting my answer to "Python docx Replace string in paragraph while keeping style":

The text to be replaced can be split over several runs, so it needs to be searched for by partial matching: identify which runs contain part of the text, then replace the text in the runs identified.
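The same idea can also be expressed by matching against the concatenated text of all runs and then mapping each match back onto the runs it touches. This is a sketch of an alternative approach, not the method used in this answer. The function name `replace_in_runs` and the `whole_word` flag are hypothetical; the only assumption about the runs is that each has a writable `.text` attribute, as `paragraph.runs` items do in python-docx, so plain stand-in objects are used here for the demonstration.

```python
import re
from types import SimpleNamespace


def replace_in_runs(runs, search_text, replace_text, whole_word=False):
    """Replace every match of search_text in the concatenated text of `runs`.

    Returns the number of matches replaced.
    """
    if not search_text:
        return 0
    pattern = re.escape(search_text)
    if whole_word:
        pattern = r"\b" + pattern + r"\b"
    full_text = "".join(run.text for run in runs)
    matches = list(re.finditer(pattern, full_text))
    # Work from the last match backwards so earlier offsets stay valid.
    for match in reversed(matches):
        start, end = match.span()
        offset = 0      # position of the current run within full_text
        first = True    # the replacement goes into the first run touched
        for run in runs:
            run_start, run_end = offset, offset + len(run.text)
            offset = run_end
            if run_end <= start or run_start >= end:
                continue  # this run lies entirely outside the match
            lo = max(start, run_start) - run_start
            hi = min(end, run_end) - run_start
            insert = replace_text if first else ""
            first = False
            run.text = run.text[:lo] + insert + run.text[hi:]
    return len(matches)


# Stand-in runs: "get" is split over three runs, "getting" sits in the last one.
runs = [SimpleNamespace(text="Please g"),
        SimpleNamespace(text="e"),
        SimpleNamespace(text="t up, getting up")]
replace_in_runs(runs, "get", "wake", whole_word=True)
print("".join(run.text for run in runs))
```

Because only the `.text` of existing runs is rewritten, each run keeps whatever formatting it already had; the replacement text takes on the style of the first run in the match.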

This function replaces strings and retains the original text styling. The process is the same whether or not the styling needs to be retained, because it is the styling that can cause text to be broken into multiple runs, even when the text looks visually unstyled.

The code

import docx


def docx_find_replace_text(doc, search_text, replace_text):
    paragraphs = list(doc.paragraphs)
    for t in doc.tables:
        for row in t.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    paragraphs.append(paragraph)
    for p in paragraphs:
        if search_text in p.text:
            inline = p.runs
            # Replace strings and retain the same style.
            # The text to be replaced can be split over several runs so
            # search through, identify which runs need to have text replaced
            # then replace the text in those identified
            started = False
            search_index = 0
            # found_runs is a list of (inline index, index of match, length of match)
            found_runs = list()
            found_all = False
            replace_done = False
            for i in range(len(inline)):

                # case 1: found in single run so short circuit the replace
                if search_text in inline[i].text and not started:
                    found_runs.append((i, inline[i].text.find(search_text), len(search_text)))
                    text = inline[i].text.replace(search_text, str(replace_text))
                    inline[i].text = text
                    replace_done = True
                    found_all = True
                    break

                if search_text[search_index] not in inline[i].text and not started:
                    # keep looking ...
                    continue

                # case 2: search for partial text, find first run
                if search_text[search_index] in inline[i].text and inline[i].text[-1] in search_text and not started:
                    # check sequence: the tail of this run, from the first
                    # candidate character on, must be a prefix of search_text
                    start_index = inline[i].text.find(search_text[search_index])
                    tail = inline[i].text[start_index:]
                    if not search_text.startswith(tail):
                        # no match so must be false positive
                        continue
                    started = True
                    chars_found = len(tail)
                    search_index += chars_found
                    found_runs.append((i, start_index, chars_found))
                    if search_index != len(search_text):
                        continue
                    else:
                        # found all chars in search_text
                        found_all = True
                        break

                # case 2: search for partial text, find subsequent run
                if search_text[search_index] in inline[i].text and started and not found_all:
                    # check sequence
                    chars_found = 0
                    check_length = len(inline[i].text)
                    for text_index in range(0, check_length):
                        # stop once all of search_text has been matched, so
                        # search_index never runs past the end of the string
                        if search_index == len(search_text):
                            break
                        if inline[i].text[text_index] == search_text[search_index]:
                            search_index += 1
                            chars_found += 1
                        else:
                            break
                    # no match so must be end
                    found_runs.append((i, 0, chars_found))
                    if search_index == len(search_text):
                        found_all = True
                        break

            if found_all and not replace_done:
                for i, item in enumerate(found_runs):
                    index, start, length = item
                    run_text = inline[index].text
                    # put the replacement in the first matched run and cut the
                    # matched characters out of the following runs; slicing by
                    # position avoids touching other occurrences in the run
                    new_part = str(replace_text) if i == 0 else ''
                    inline[index].text = run_text[:start] + new_part + run_text[start + length:]
            # print(p.text)


# sample usage as per example 

doc = docx.Document('find_replace_test_document.docx')
docx_find_replace_text(doc, 'Testing1', 'Test ')
docx_find_replace_text(doc, 'Testing2', 'Test ')
docx_find_replace_text(doc, 'rest', 'TEST')
doc.save('find_replace_test_result.docx')
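The paragraph-collecting step at the top of the function looks only at body paragraphs and top-level tables. If a document nests tables inside cells, a recursive generator is one way to reach those paragraphs too. This is a sketch under the assumption that the container exposes `.paragraphs` and `.tables`, which both a python-docx `Document` and a table cell do; the name `iter_paragraphs` is hypothetical, and stand-in objects are used so the example runs without a .docx file.

```python
from types import SimpleNamespace


def iter_paragraphs(container):
    """Yield every paragraph in `container`, descending into nested tables."""
    # Body (or cell) paragraphs first, then the paragraphs of each table,
    # so the order is not strict document order.
    yield from container.paragraphs
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from iter_paragraphs(cell)


# Stand-ins mimicking a document with a table nested inside a table cell.
inner_cell = SimpleNamespace(paragraphs=["nested"], tables=[])
inner_table = SimpleNamespace(rows=[SimpleNamespace(cells=[inner_cell])])
outer_cell = SimpleNamespace(paragraphs=["cell"], tables=[inner_table])
outer_table = SimpleNamespace(rows=[SimpleNamespace(cells=[outer_cell])])
fake_doc = SimpleNamespace(paragraphs=["body"], tables=[outer_table])

print(list(iter_paragraphs(fake_doc)))
```

With a real document, `paragraphs = list(iter_paragraphs(doc))` could stand in for the collection loop. Note that merged cells may be returned more than once by `row.cells`, so the same paragraph can be visited repeatedly; for a find and replace this is harmless, since the second visit finds nothing left to replace.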

Sample output

Here are a couple of screenshots showing a source document and the result after replacing the text:

'Testing1' -> 'Test '
'Testing2' -> 'Test '
'rest' -> 'TEST'

Source document: (screenshot)

Resultant document: (screenshot)

I hope this helps someone.
