PyPDF2 compression

Question

I am struggling to compress my merged PDFs using the PyPDF2 module. This is my attempt, based on http://www.blog.pythonlibrary.org/2012/07/11/pypdf2-the-new-fork-of-pypdf/

import PyPDF2
path = open('path/to/hello.pdf', 'rb')
path2 = open('path/to/another.pdf', 'rb')
merger = PyPDF2.PdfFileMerger()
merger.append(fileobj=path2)
merger.append(fileobj=path)
pdf.filters.compress(merger)
merger.write(open("test_out2.pdf", 'wb'))

The error I receive is

TypeError: must be string or read-only buffer, not file

I have also tried compressing the PDF after the merge is complete. I judge the compression as failed by comparing the output with the file size I got from PDFSAM with compression enabled. Any thoughts? Thanks.

Answer 1

PyPDF2 doesn't have a reliable compression method. That said, there's a compress_content_streams() method with the following description:

Compresses the size of this page by joining all content streams and applying a FlateDecode filter.

However, it is possible that this function will perform no action if content stream compression becomes "automatic" for some reason.

Again, this won't make a difference in most cases, but you can try this code:

from PyPDF2 import PdfReader, PdfWriter


writer = PdfWriter()

for pdf in ["path/to/hello.pdf", "path/to/another.pdf"]:
    reader = PdfReader(pdf)
    for page in reader.pages:
        page.compress_content_streams()
        writer.add_page(page)

with open("test_out2.pdf", "wb") as f:
    writer.write(f)
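
One plausible reason the method often changes little: FlateDecode is the PDF name for zlib/deflate compression, and deflate only helps on data that is still compressible. The sketch below uses only the standard library, not PyPDF2. The sample content stream is made up for illustration, and random bytes stand in for payloads such as embedded images, which is an assumption about what dominates a typical PDF's size.

```python
import os
import zlib

# A made-up, highly repetitive, uncompressed content stream (illustration only).
raw_stream = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET\n" * 200

# FlateDecode is zlib/deflate, so zlib.compress approximates what the filter does.
deflated = zlib.compress(raw_stream)

# Random bytes stand in for data with no redundancy left (an assumption,
# e.g. image payloads that are already compressed).
incompressible = os.urandom(len(raw_stream))
deflated_random = zlib.compress(incompressible)

print("text-like stream:", len(raw_stream), "->", len(deflated), "bytes")
print("random payload:  ", len(incompressible), "->", len(deflated_random), "bytes")
```

The text-like stream shrinks a lot, while the random payload does not shrink at all. If most of a file's bytes are of the second kind, compressing the page content streams will barely move the total size.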

Answer 2

Your error says that the argument must be a string or read-only buffer, not a file.

So write the merger's output to an in-memory bytes buffer first and pass the buffer's contents to compress:

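import PyPDF2
from io import BytesIO

tmp = BytesIO()
merger = PyPDF2.PdfFileMerger()

# Keep the input files open until the merged output has been written.
with open('path/to/hello.pdf', 'rb') as path, open('path/to/another.pdf', 'rb') as path2:
    merger.append(fileobj=path2)
    merger.append(fileobj=path)
    merger.write(tmp)  # the merged PDF now lives in memory as bytes

# compress() receives bytes here, so it no longer raises the TypeError.
# Its return value is not used, so test_out2.pdf below is the plain merge.
PyPDF2.filters.compress(tmp.getvalue())

# Reuse the merged bytes instead of calling merger.write() a second time.
with open("test_out2.pdf", 'wb') as out:
    out.write(tmp.getvalue())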
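
To see where the TypeError comes from, here is a minimal standard-library sketch. It assumes that the compress helper in the PyPDF2 version from the question is a thin wrapper around zlib.compress; that is an assumption about that release, not something shown in the thread. The fake_pdf bytes are a placeholder, not a real PDF.

```python
import io
import zlib

# Stand-in for a merged PDF held in memory (placeholder bytes, not a valid PDF).
fake_pdf = b"%PDF-1.4\n" + b"% placeholder body\n" * 50 + b"%%EOF\n"
buffer = io.BytesIO(fake_pdf)

# Passing the file-like object itself fails: zlib wants bytes, not a file.
try:
    zlib.compress(buffer)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing the bytes works, but the result is a raw zlib stream, not a PDF.
compressed = zlib.compress(buffer.getvalue())
looks_like_pdf = compressed.startswith(b"%PDF")

print("TypeError:", error_message)
print("still starts with %PDF:", looks_like_pdf)
```

Under that assumption, handing the bytes to compress avoids the error, but the compressed bytes would have to be decompressed again before any PDF viewer could read them. Shrinking the PDF itself means compressing the streams inside it.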

Answer 3

The initial approach isn't that wrong. Just add the pages to your writer and compress them before writing to a file:

...

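# Copy every page from the reader into the writer.
for i in range(reader.numPages):
    writer.addPage(reader.getPage(i))
# Compress each page held by the writer, not just the last page read.
for i in range(writer.getNumPages()):
    writer.getPage(i).compressContentStreams()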

...

Answer 4

pypdf offers several ways to reduce the file size: https://pypdf.readthedocs.io/en/latest/user/file-size.html

compress_content_streams is one of them. Its only disadvantage is that it can take a long time, depending on the PDF (think of it as ZIP-for-PDF):

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

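for page in reader.pages:
    writer.add_page(page)

# Compress once the pages belong to the writer; some pypdf releases
# reject pages that are not part of a PdfWriter.
for page in writer.pages:
    page.compress_content_streams()  # This is CPU intensive!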

with open("out.pdf", "wb") as f:
    writer.write(f)
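
To check whether any of this helped, compare the file sizes before and after. The helper below is a hypothetical convenience function, not part of pypdf, and the demonstration uses two throwaway files in place of example.pdf and out.pdf.

```python
import os
import tempfile


def size_reduction_percent(original_path, compressed_path):
    """Return how much smaller compressed_path is than original_path, in percent."""
    original = os.path.getsize(original_path)
    compressed = os.path.getsize(compressed_path)
    if original == 0:
        raise ValueError("original file is empty")
    return 100.0 * (original - compressed) / original


# Demonstration with two throwaway files standing in for the real PDFs.
with tempfile.TemporaryDirectory() as tmp_dir:
    before = os.path.join(tmp_dir, "before.bin")
    after = os.path.join(tmp_dir, "after.bin")
    with open(before, "wb") as f:
        f.write(b"x" * 1000)
    with open(after, "wb") as f:
        f.write(b"x" * 750)
    demo_reduction = size_reduction_percent(before, after)
    print(f"{demo_reduction:.1f}% smaller")  # prints "25.0% smaller"
```

With real files, the call would be size_reduction_percent("example.pdf", "out.pdf"). A negative result means the output grew.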

4 answers

solution1
6 2018-03-26 04:57:34

solution2
0 2020-05-18 12:46:00

solution3
0 2021-10-20 06:39:47

solution4
0 2023-01-03 23:11:59
