
How to replace/delete text from a PDF using Python?

I have code that hides parts of the PDF (by just covering them with a white polygon), but the issue is that the text is still there: if you press Ctrl-F you can still find it.

My goal is to actually remove the text from the PDF itself. Using pdfminer I managed to extract the text from the PDF, but I don't know if it's possible to actually "replace" the text with, say, just some empty spaces. Is such a thing possible using Python? Extracting it isn't enough. I need the text to be removed from the PDF.

This is somewhat memory-intensive, but you can copy everything in the PDF except the part you are removing, and then overwrite the file with the new version, which no longer contains that part. You can do this with PyPDF2 by retrieving a page's content stream, then finding and removing the relevant parts.

PyPDF2 docs: https://pythonhosted.org/PyPDF2/PageObject.html?highlight=getcontents#PyPDF2.pdf.PageObject.getContents

PDF standard: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf (pp. 78 and 81)
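As a rough illustration of what "finding and removing the relevant parts" of a content stream can mean, here is a minimal standard-library sketch. It assumes the stream is already decompressed and that the text is shown with simple literal strings and the Tj operator. Real PDFs often use compressed streams, TJ arrays, hex strings, or custom font encodings, none of which this handles. The names strip_text and TJ_PATTERN are made up for this example and are not part of PyPDF2.

```python
import re

# Matches a PDF literal string operand followed by the Tj (show text) operator,
# e.g. b"(Secret) Tj". Backslash escapes such as \( and \) are allowed inside.
TJ_PATTERN = re.compile(rb"\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj")


def strip_text(content, target):
    """Blank out every Tj string operand that contains `target` (both bytes)."""

    def blank(match):
        if target in match.group("text"):
            return b"() Tj"  # keep the operator, but show an empty string
        return match.group(0)  # leave unrelated text untouched

    return TJ_PATTERN.sub(blank, content)


# A tiny hand-written content stream, used here only as sample input.
stream = b"BT /F1 12 Tf 72 720 Td (Invoice 1234) Tj 0 -14 Td (Total due) Tj ET"
cleaned = strip_text(stream, b"Invoice")
print(cleaned)
```

The cleaned bytes would then have to be written back into the page and the file saved as a new PDF, which is the part a PDF library has to do for you.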

Is such a thing possible? Yes, although it is not recommended. In my opinion, your best bet is to open and read your existing file, convert it to an editable format, remove whatever text you don't want, and then convert it back.

However, you could extract the text with PyPDF2 and work on it in memory, starting with:

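# Note: this uses the legacy PyPDF2 API (before 3.0). In PyPDF2 3.x and pypdf the
# equivalents are PdfReader, len(reader.pages), reader.pages[0] and extract_text().
import PyPDF2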

# creating a pdf file object 
pdfFileObj = open('example.pdf', 'rb') 

# creating a pdf reader object 
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 

# printing number of pages in pdf file 
print(pdfReader.numPages) 

# creating a page object 
pageObj = pdfReader.getPage(0) 

# extracting text from page 
print(pageObj.extractText()) 

# closing the pdf file object 
pdfFileObj.close() 

Line by line, this program would:

pdfFileObj = open('example.pdf', 'rb'): Open example.pdf in binary read mode and store the file object as pdfFileObj.

pdfReader = PyPDF2.PdfFileReader(pdfFileObj): Create a PdfFileReader object from the PDF file object, which gives you a PDF reader object.

print(pdfReader.numPages): Print the number of pages.

pageObj = pdfReader.getPage(0): Create an object of the PageObject class. The PDF reader object has a getPage() function, which takes a page number (starting from index 0) as its argument and returns the page object.

print(pageObj.extractText()): Extract the text from the PDF page and print it.

pdfFileObj.close(): Close the PDF file object.

The replacement text would simply be "", as you want to remove all instances of a certain piece of text.
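To make the replacement step concrete, here is a small sketch that works on a plain string standing in for what extractText() might return. The helper name remove_all and the sample text are invented for this example. Note that this only changes the string in memory. The PDF file itself is untouched, so the text would still be found by a search in the original document.

```python
def remove_all(text, target):
    """Return `text` with every occurrence of `target` replaced by ""."""
    return text.replace(target, "")


# Stand-in for the text extracted from a page (not read from a real PDF).
page_text = "Invoice 1234\nInvoice date: 2020-01-01\nTotal due"
print(remove_all(page_text, "Invoice"))
```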

I used pdf-redactor in one of my projects and it works pretty well.

Here is an example of how to redact Social Security Numbers from the text layer.
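The linked example is not reproduced here. As a rough idea of the pattern-plus-replacement approach that this kind of redaction relies on, the sketch below uses only the standard re module on a plain string. It does not call pdf-redactor, and the simplified pattern and the names SSN_PATTERN and redact_ssns are assumptions made for illustration. A production pattern would need to be stricter.

```python
import re

# Simplified pattern: three digits, two digits, four digits,
# separated by a hyphen or a space. Not a full SSN validity check.
SSN_PATTERN = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")


def redact_ssns(text):
    """Replace anything that looks like an SSN with a fixed placeholder."""
    return SSN_PATTERN.sub("XXX-XX-XXXX", text)


sample = "Employee SSN: 123-45-6789, spouse SSN: 987 65 4321."
print(redact_ssns(sample))
```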

I know I am late, but for future readers, here is a workaround I found to resolve this using PyMuPDF. This solution successfully deletes text from the PDF.

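import fitz  # PyMuPDF

doc = fitz.open("example.pdf")  # input file name is assumed here
page = doc.load_page(0)

# find every occurrence of the text on the page
draft = page.search_for("Invoice")

# mark each hit for redaction
for rect in draft:
    page.add_redact_annot(rect)

# apply all redactions once; pass images=fitz.PDF_REDACT_IMAGE_NONE
# to leave overlapping images untouched
page.apply_redactions()

# then save the doc to a new PDF:
doc.save("new.pdf", garbage=3, deflate=True)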
