Search and replace for text within a pdf, in Python
I am writing mailmerge software as part of a Python web app.

I have a template called letter.pdf which was generated from an MS Word file and includes the text {name} where the resident's name will go. I also have a list of c. 100 residents' names.

What I want to do is read in letter.pdf, search for "{name}" and replace it with the resident's name (for each resident), then write the result to another pdf. I then want to gather all these pdfs together into one big pdf (one page per letter) which my web app's users will print out to create their letters.

Are there any Python libraries that will do this? I've looked at pdfrw and pdfminer but I couldn't see where they would be able to do it.

(NB: I also have the MS Word file, so if there was another way of using that, and not going through a pdf, that would also do the job.)

This can be done with the PyPDF2 package. The implementation may depend on the structure of the original PDF template. But if the template is stable enough and isn't changed very often, the replacement code doesn't need to be generic and can stay rather simple.

I did a small sketch of how you could replace the text inside a PDF file. It replaces all occurrences of the token PDF with DOC.

import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject


def replace_text(content, replacements=None):
    replacements = replacements or {}
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        # BT / ET mark the beginning and end of a text object
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            # Only lines ending in a text-showing operator (Tj or TJ) are edited
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(stream, replacements):
    data = stream.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    # An encoded stream keeps its decoded copy in decodedSelf; update that one
    if stream.decodedSelf is not None:
        stream.decodedSelf.setData(encoded_data)
    else:
        stream.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = {'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):

        page = pdf.getPage(page_number)
        contents = page.getContents()

        # /Contents is either a single stream or an array of streams
        if isinstance(contents, (DecodedStreamObject, EncodedStreamObject)):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, (DecodedStreamObject, EncodedStreamObject)):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)

The results are: [image]

UPDATE 2021-03-21:

Updated the code example to handle DecodedStreamObject and EncodedStreamObject, which actually contain the data stream with the text to update.
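
One caveat with process_data above: it decodes the content stream as UTF-8, but a content stream is just a byte sequence and is not guaranteed to be valid UTF-8, so data.decode('utf-8') can raise UnicodeDecodeError. Below is a minimal sketch of one possible workaround, assuming the placeholder and the replacement text are plain ASCII: decode with latin-1, which maps every byte to exactly one character and therefore round-trips without loss. The helper name replace_in_stream_bytes and the sample bytes are illustrative only and are not part of PyPDF2.

```python
def replace_in_stream_bytes(data: bytes, replacements: dict) -> bytes:
    """Replace ASCII text in raw content-stream bytes without losing any byte."""
    # latin-1 maps bytes 0..255 to code points 0..255, so decoding never fails
    # and encoding the result back restores every untouched byte exactly.
    text = data.decode("latin-1")
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text.encode("latin-1")


# Hypothetical content-stream fragment containing a byte that is invalid UTF-8.
sample = b"BT\n(Dear {name}) Tj\n(\xe9) Tj\nET\n"

try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

result = replace_in_stream_bytes(sample, {"{name}": "Alice"})
print(utf8_ok)
print(result)
```

This only sidesteps the decoding error. It does not help when the font uses a custom encoding, in which case the placeholder may not appear as readable ASCII in the stream at all.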

pdftk original.pdf output uncompressed.pdf uncompress

from PyPDF2 import PdfFileReader, PdfFileWriter

replacements = [
    ("old string", "new string")
]

pdf = PdfFileReader(open("uncompressed.pdf", "rb"))
writer = PdfFileWriter()

for page in pdf.pages:
    contents = page.getContents().getData()
    for (a, b) in replacements:
        contents = contents.replace(a.encode('utf-8'), b.encode('utf-8'))
    page.getContents().setData(contents)
    writer.addPage(page)

with open("modified.pdf", "wb") as f:
    writer.write(f)

pdftk modified.pdf output recompressed.pdf compress
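
The byte-level replacement above inserts the new string verbatim. In a PDF literal string, which is delimited by parentheses, a backslash or an unbalanced parenthesis in the replacement has to be escaped with a backslash, otherwise the string ends early and the page may not render. Here is a small sketch, assuming the text sits in a literal ( ... ) string and is plain ASCII; escape_pdf_literal is a hypothetical helper, not a PyPDF2 or pdftk function.

```python
def escape_pdf_literal(text: str) -> bytes:
    """Escape backslashes and parentheses for use inside a PDF literal string."""
    # Backslash must be handled first so the escapes added below are not doubled.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return escaped.encode("ascii")


# Illustrative content-stream bytes and a name that contains parentheses.
contents = b"BT (Dear {name},) Tj ET"
name = "Smith (Jr.)"
contents = contents.replace(b"{name}", escape_pdf_literal(name))
print(contents)
```

Balanced parentheses are allowed unescaped in a literal string, but escaping all of them is always valid and avoids having to check the balance.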

Dymitrio's updated code example, which handles DecodedStreamObject and EncodedStreamObject (the objects that actually contain the data stream with the text to update), ran fine, but with a file different from the example it was not able to alter the pdf text content.

According to EDIT 3 from "How to replace text in a PDF using Python?":

By inserting page[NameObject("/Contents")] = contents.decodedSelf before writer.addPage(page), we force pyPDF2 to update the content of the page object.

This way I was able to overcome this problem and replace text in the pdf file.

Final code should look like this:

import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject, NameObject


def replace_text(content, replacements=None):
    replacements = replacements or {}
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        # BT / ET mark the beginning and end of a text object
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            # Only lines ending in a text-showing operator (Tj or TJ) are edited
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(stream, replacements):
    data = stream.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    # An encoded stream keeps its decoded copy in decodedSelf; update that one
    if stream.decodedSelf is not None:
        stream.decodedSelf.setData(encoded_data)
    else:
        stream.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = {'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):

        page = pdf.getPage(page_number)
        contents = page.getContents()

        # /Contents is either a single stream or an array of streams
        if isinstance(contents, (DecodedStreamObject, EncodedStreamObject)):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, (DecodedStreamObject, EncodedStreamObject)):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

        # Force content replacement. The guard skips arrays and already-decoded
        # streams, which have no decoded copy to swap in.
        if getattr(contents, "decodedSelf", None) is not None:
            page[NameObject("/Contents")] = contents.decodedSelf
        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)

Important: this needs the extra import from PyPDF2.generic import NameObject
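
Even with the /Contents fix, a replacement can silently do nothing when the PDF producer splits the placeholder across several string fragments, for example inside a TJ array with kerning adjustments between the pieces. The sketch below uses two made-up content-stream lines to show the difference; placeholder_is_contiguous is a hypothetical helper for checking a template before relying on plain string replacement.

```python
def placeholder_is_contiguous(content: str, placeholder: str) -> bool:
    """Return True if the placeholder appears as one unbroken run of text."""
    return placeholder in content


# Illustrative content-stream lines (not taken from a real file).
simple_line = "(Dear {name},) Tj"
kerned_line = "[(Dear {na) -3 (me},)] TJ"

print(placeholder_is_contiguous(simple_line, "{name}"))
print(placeholder_is_contiguous(kerned_line, "{name}"))

# A plain replace only changes the first form; the second is left untouched.
replaced_simple = simple_line.replace("{name}", "Alice")
replaced_kerned = kerned_line.replace("{name}", "Alice")
print(replaced_simple)
print(replaced_kerned)
```

If the check fails for a given template, working from the MS Word source may be simpler than reassembling the fragments of the array.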

Here is a solution using the MS Word source file.

As trying to edit the pdf itself turned out to be too complicated for me because of the encoding errors, I went with the MS Word >> Pdf option.

The DocxTemplate module uses jinja-like syntax: {{variable_name}}

In my solution I use an intermediate temp file. I tried to get rid of this step by using BytesIO/StringIO to keep it in memory only, but haven't made that work yet.

Here is an easy and working solution to perform the required task:

import os
import random
from pathlib import Path

import comtypes.client
from docxtpl import DocxTemplate

# CFG
in_file_path = "files/template.docx"
temp_file_path = "files/" + str(random.randint(0, 50)) + ".docx"
out_file_path = "files/output.pdf"

# Fill in text
data_to_fill = {'Field_name': "John Tester",
                'Field_ocupation': "Test tester",
                'Field_address': "Test Address 123",
                }

template = DocxTemplate(Path(in_file_path))
template.render(data_to_fill)
template.save(Path(temp_file_path))

# Convert to PDF
wdFormatPDF = 17

in_file = os.path.abspath(Path(temp_file_path))
out_file = os.path.abspath(Path(out_file_path))

word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()

# Get rid of the temp file
os.remove(Path(temp_file_path))
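
The temp file name above is drawn from random.randint(0, 50), so two requests in a web app could pick the same name. Below is a minimal sketch of an alternative using the standard tempfile module, which creates a uniquely named file. The with_temp_docx helper and the fake_render callback are illustrative names; the callback stands in for template.save(...) followed by the Word conversion.

```python
import os
import tempfile


def with_temp_docx(write_func):
    """Create a unique temp .docx path, let write_func use it, then clean up."""
    # mkstemp returns an open OS-level handle plus an absolute path;
    # close the handle so another program (such as Word) can open the file.
    fd, temp_path = tempfile.mkstemp(suffix=".docx")
    os.close(fd)
    try:
        return write_func(temp_path)
    finally:
        # Remove the temp file even if write_func raises.
        os.remove(temp_path)


def fake_render(path):
    # Stand-in for template.save(path) followed by the Word conversion.
    with open(path, "wb") as f:
        f.write(b"placeholder bytes")
    return path, os.path.getsize(path)


used_path, size = with_temp_docx(fake_render)
print(size, os.path.exists(used_path))
```

Since mkstemp already returns an absolute path, it can be passed to word.Documents.Open without the os.path.abspath step.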