
Search and replace for text within a pdf, in Python

I am writing mail-merge software as part of a Python web app.

I have a template called letter.pdf, which was generated from an MS Word file and includes the text {name} where the resident's name will go. I also have a list of c. 100 residents' names.

What I want to do is read in letter.pdf, search for "{name}" and replace it with the resident's name (for each resident), then write the result to another pdf. I then want to gather all these pdfs together into one big pdf (one page per letter), which my web app's users will print out to create their letters.

Are there any Python libraries that will do this? I've looked at pdfrw and pdfminer but I couldn't see where they would be able to do it.

(NB: I also have the MS Word file, so if there was another way of using that, and not going through a pdf, that would also do the job.)

This can be done with the PyPDF2 package. The implementation may depend on the structure of the original PDF template. But if the template is stable enough and isn't changed very often, the replacement code doesn't need to be generic and can stay fairly simple.

I did a small sketch of how you could replace the text inside a PDF file. It replaces all occurrences of the token PDF with DOC.

import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject


def replace_text(content, replacements = dict()):
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(object, replacements):
    data = object.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if object.decodedSelf is not None:
        object.decodedSelf.setData(encoded_data)
    else:
        object.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = { 'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):

        page = pdf.getPage(page_number)
        contents = page.getContents()

        if isinstance(contents, DecodedStreamObject) or isinstance(contents, EncodedStreamObject):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, DecodedStreamObject) or isinstance(obj, EncodedStreamObject):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)

The results are:

[Images: original PDF, replaced PDF]

UPDATE 2021-03-21:

Updated the code example to handle DecodedStreamObject and EncodedStreamObject, which actually contain the data stream with the text to update.
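One thing that may explain why a line-based replace like the one above works on some templates and not others: a PDF producer is free to split a run of text into several fragments inside a TJ array (for kerning), so "{name}" may never appear as one contiguous string. The following is only a rough diagnostic sketch, not part of the answer's code. The helper names shown_text and placeholder_status are made up for illustration, the sample streams are hand-written, and it assumes the decoded stream uses plain literal strings in parentheses (hex strings or custom font encodings would simply report "absent").

```python
import re

# Matches one PDF literal string such as (Hello); handles backslash escapes
# but not nested, unescaped parentheses.
LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")


def shown_text(line):
    """Join the literal-string fragments of one Tj/TJ line into plain text."""
    return "".join(m.group(0)[1:-1] for m in LITERAL.finditer(line))


def placeholder_status(stream_text, placeholder):
    """Classify how `placeholder` appears in a decoded content stream."""
    status = "absent"
    for line in stream_text.splitlines():
        if line[-2:].lower() != "tj":
            continue  # not a text-showing operator
        if placeholder in line:
            return "contiguous"  # a plain str.replace on this line will work
        if placeholder in shown_text(line):
            status = "split"  # fragments would have to be merged first
    return status


# Hand-written sample streams (assumptions, not taken from a real template)
sample_ok = "BT\n/F1 12 Tf\n(Dear {name},) Tj\nET"
sample_split = "BT\n/F1 12 Tf\n[(Dear {na) -3 (me},)] TJ\nET"

print(placeholder_status(sample_ok, "{name}"))
print(placeholder_status(sample_split, "{name}"))
```

If the status comes back as "split", regenerating the template so the placeholder is typed in one go with a single font, or working from the Word file instead, is likely to be easier than patching the stream.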

  1. Decompress the pdf to make parsing easier (this solves many of the issues in the previous answer). I use pdftk. (If this step fails, one hack to pre-process the pdf is to open it in OSX Preview, print it, and then choose "save as pdf" from the print menu. Then retry the command below.)
pdftk original.pdf output uncompressed.pdf uncompress
  2. Parse and replace using PyPDF2.
from PyPDF2 import PdfFileReader, PdfFileWriter

replacements = [
    ("old string", "new string")
]

pdf = PdfFileReader(open("uncompressed.pdf", "rb"))
writer = PdfFileWriter() 

for page in pdf.pages:
    contents = page.getContents().getData()
    for (a,b) in replacements:
        contents = contents.replace(a.encode('utf-8'), b.encode('utf-8'))
    page.getContents().setData(contents)
    writer.addPage(page)
    
with open("modified.pdf", "wb") as f:
    writer.write(f)
  3. [Optional] Re-compress the pdf.
pdftk modified.pdf output recompressed.pdf compress
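When the replacement value is a person's name rather than a fixed string, it is worth escaping it before splicing it into the stream: inside a PDF literal string, backslashes and parentheses have special meaning, so a name such as "Ann (Jr.)" could otherwise corrupt the page. Below is a minimal sketch of that idea that works on raw bytes. The function names escape_pdf_literal and fill_stream are hypothetical, and latin-1 is an assumption standing in for whatever encoding the template's font actually uses (bytes in a content stream are interpreted through the font's encoding, so UTF-8 is not guaranteed to render non-ASCII names correctly).

```python
def escape_pdf_literal(text):
    """Escape backslashes and parentheses for use inside a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def fill_stream(stream_bytes, placeholder, value, encoding="latin-1"):
    """Replace `placeholder` in raw content-stream bytes with an escaped value.

    `encoding` is an assumption; the right choice depends on the font.
    """
    safe_value = escape_pdf_literal(value).encode(encoding)
    return stream_bytes.replace(placeholder.encode(encoding), safe_value)


# Hand-written one-line stream for illustration
print(fill_stream(b"(Dear {name},) Tj", "{name}", "Ann (Jr.)"))
```

In the loop above, a call like fill_stream(contents, "{name}", resident_name) could take the place of the plain contents.replace(...) call, once per resident.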

If @Dmytrio's solution does not alter the final PDF

Dymitrio's updated code example, which handles DecodedStreamObject and EncodedStreamObject (the objects that actually contain the data stream with the text to update), ran fine, but with a file different from the example it was not able to alter the pdf text content.

According to EDIT 3 of "How to replace text in a PDF using Python?":

By inserting page[NameObject("/Contents")] = contents.decodedSelf before writer.addPage(page), we force PyPDF2 to update the content of the page object.

This way I was able to overcome the problem and replace text in the pdf file.

The final code should look like this:

import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject, NameObject


def replace_text(content, replacements = dict()):
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(object, replacements):
    data = object.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if object.decodedSelf is not None:
        object.decodedSelf.setData(encoded_data)
    else:
        object.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = { 'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):

        page = pdf.getPage(page_number)
        contents = page.getContents()

        if isinstance(contents, DecodedStreamObject) or isinstance(contents, EncodedStreamObject):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, DecodedStreamObject) or isinstance(obj, EncodedStreamObject):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

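        # Force content replacement, but only when a decoded copy of the stream
        # exists (arrays and already-decoded streams have none to assign)
        decoded = getattr(contents, "decodedSelf", None)
        if decoded is not None:
            page[NameObject("/Contents")] = decoded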
        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)

Important: note the additional import, from PyPDF2.generic import NameObject

Here is a solution using the MS Word source file.

As trying to edit the pdf itself turned out to be too complicated for me because of encoding errors, I went with the MS Word >> Pdf option.

  1. Prepare an MS Word template with {{input_fields}}
  2. Fill in the template with data
  3. Convert the filled-in MS Word file to PDF

The DocxTemplate module uses jinja-like syntax: {{variable_name}}

In my solution I use an intermediate temp file. I tried to get rid of this step by using BytesIO/StringIO to keep it in memory only, but haven't made that work yet.

Here is an easy and working solution to perform the required task:

import os
import comtypes.client
from pathlib import Path
from docxtpl import DocxTemplate
import random


# CFG
in_file_path = "files/template.docx"
temp_file_path = "files/"+str(random.randint(0,50))+".docx"
out_file_path = "files/output.pdf"


# Fill in text
data_to_fill = {'Field_name' : "John Tester",
                  'Field_ocupation' : "Test tester",
                  'Field_address' : "Test Address 123",
                  }

template = DocxTemplate(Path(in_file_path))
template.render(data_to_fill)
template.save(Path(temp_file_path))

# Convert to PDF
wdFormatPDF = 17

in_file = os.path.abspath(Path(temp_file_path))
out_file = os.path.abspath(Path(out_file_path))

word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()

# Get rid of the temp file
os.remove(Path(temp_file_path))
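One possible refinement, given that the question is about a web app: a temp file named with random.randint(0,50) can collide when two merges run at the same time, and it is left behind if the conversion raises. A small sketch using only the standard library is shown below. The helper name temp_docx_path is made up for illustration, and the docxtpl and Word calls are left as a comment because they are unchanged from the code above.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_docx_path(directory=None):
    """Yield a unique, closed temporary .docx path and delete it afterwards."""
    fd, path = tempfile.mkstemp(suffix=".docx", dir=directory)
    os.close(fd)  # only the path is needed; other code reopens the file
    try:
        yield path
    finally:
        # Clean up even if rendering or conversion raised an exception
        if os.path.exists(path):
            os.remove(path)


with temp_docx_path() as tmp:
    # template.save(tmp) and the Word-to-PDF conversion would go here
    print(tmp)
```

For c. 100 residents, the same context manager can be entered once per resident inside a loop, with a different output pdf path each time.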
