
Remove some images and text objects from pdf

I have a pdf page object with an image and a lot of text.

I want to remove that image and also remove some text objects based on their contents. That is, I want to read the contents of every text object and then remove the ones that satisfy a condition.

How can I do that with PyPDF2? Or is there another library that allows this?

To remove all images from a PDF file using PyPDF2 you can do:

from PyPDF2 import PdfFileWriter, PdfFileReader

# Uses the camelCase API of PyPDF2 releases before 3.0
with open("src.pdf", "rb") as inputStream, open("dst.pdf", "wb") as outputStream:
    src = PdfFileReader(inputStream)
    output = PdfFileWriter()

    # Copy every page into the writer, then strip the images from the copy
    for i in range(src.getNumPages()):
        output.addPage(src.getPage(i))
    output.removeImages()

    output.write(outputStream)

I'm adding an answer that should actually remove the images (and hopefully only the images, which is not the case with PyPDF2's removeImages, as noted above). This solution uses PyMuPDF instead.

import fitz  # PyMuPDF

pdf_filename = "src.pdf"     # placeholder input path
output_filename = "dst.pdf"  # placeholder output path

doc = fitz.open(pdf_filename)
for n in range(doc.page_count):
    page = doc[n]
    # get_image_rects returns a list of rectangles per image, so flatten the result
    rects = [rect
             for img in doc.get_page_images(n, True)
             for rect in page.get_image_rects(img)]
    for rect in rects:
        # Redact it with white. This will also delete the image
        page.add_redact_annot(rect, fill=(1, 1, 1))
    if rects:
        page.apply_redactions()
# Save once, after all pages are processed.
# deflate=True reduces the file size (the images are no longer embedded)
doc.save(output_filename, garbage=3, deflate=True)

This answer is based on this great answer on the PyMuPDF GitHub page. That answer, however, deletes one specific image, which is selected via page.getImageBbox (camel casing is now deprecated, so it became page.get_image_bbox). Looping through all images that way is inefficient and, perhaps more importantly, was crashing in my case because it assumed that all images were embedded under distinct names, which was not true for the PDF I was working with.

I suppose a similar approach could be used for the second part of the question, i.e. getting all boxes that contain text and deleting some of them, but the original question is 8 years old, so it might not be that relevant anymore ;-)
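
For what it's worth, here is a rough sketch of how that text part might look with PyMuPDF. It is untested against real documents. The names select_text_blocks, remove_matching_text and should_remove are made up for this illustration, and the sample tuples are hand-written. It assumes that page.get_text("blocks") yields tuples of the form (x0, y0, x1, y1, text, block_no, block_type), with block_type 0 meaning text. Keep in mind that a redaction removes all text overlapping the rectangle, so neighbouring text that intrudes into a block's box could be lost too.

```python
def select_text_blocks(blocks, should_remove):
    """Return the rectangles of text blocks whose content satisfies should_remove.

    `blocks` is expected in the tuple layout of page.get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type), where block_type 0 is text.
    """
    return [
        tuple(block[:4])
        for block in blocks
        if block[6] == 0 and should_remove(block[4])
    ]


def remove_matching_text(src_path, dst_path, should_remove):
    """Redact every text block for which should_remove(text) is true."""
    import fitz  # PyMuPDF; imported here so select_text_blocks works without it

    doc = fitz.open(src_path)
    for page in doc:
        rects = select_text_blocks(page.get_text("blocks"), should_remove)
        for rect in rects:
            page.add_redact_annot(fitz.Rect(rect), fill=(1, 1, 1))
        if rects:
            # Ask PyMuPDF to leave images alone while applying the redactions
            page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)
    # Save once, after all pages are processed
    doc.save(dst_path, garbage=3, deflate=True)


# The selection step can be tried on hand-written block tuples (no PDF needed)
sample_blocks = [
    (72, 72, 300, 90, "CONFIDENTIAL draft\n", 0, 0),
    (72, 100, 300, 140, "Regular paragraph\n", 1, 0),
    (72, 150, 200, 250, "<image placeholder>", 2, 1),
]
selected = select_text_blocks(sample_blocks, lambda text: "CONFIDENTIAL" in text)
print(selected)
```

With a real file you would call something like remove_matching_text("src.pdf", "dst.pdf", lambda text: "CONFIDENTIAL" in text). Blocks are fairly coarse, so if you need to remove single words or lines, the same idea should carry over to the "words" or "dict" output of get_text.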
