
PyPDF2: Concatenating pdfs in memory

I wish to concatenate (append) a bunch of small PDFs together efficiently, in memory, in pure Python. Specifically, a typical case is 500 single-page PDFs, each about 400 kB in size, to be merged into one. Let's say the PDFs are available as an iterable in memory, say a list:

my_pdfs = [pdf1_fileobj, pdf2_fileobj, ..., pdfn_fileobj]  # type is BytesIO

where each pdf_fileobj is of type BytesIO. The base memory usage is then about 200 MB (500 PDFs, 400 kB each).

Ideally, I would want the following code to do the concatenation using no more than 400-500 MB of memory in total (including my_pdfs). However, that doesn't seem to be the case: the debugging statement on the last line reports a maximum memory usage of almost 700 MB, and the macOS resource monitor shows about 600 MB allocated when the last line is reached.

Running gc.collect() reduces this to 350 MB (almost too good?). Why do I have to run garbage collection manually to get rid of the merging garbage in this case? I have seen this (probably) causing memory build-up in a slightly different scenario that I'll skip for now.

import PyPDF2
import io
import resource  # For debugging

def merge_pdfs(iterable):
    ''' Merge pdfs in memory '''
    merger = PyPDF2.PdfFileMerger()
    for pdf_fileobj in iterable:
        merger.append(pdf_fileobj)

    myio = io.BytesIO()
    merger.write(myio)
    merger.close()

    myio.seek(0)
    return myio

my_concatenated_pdf = merge_pdfs(my_pdfs)

# Print the maximum memory usage
print('Memory usage: %s (kB)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
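
The manual collection mentioned above amounts to the following. Note that ru_maxrss is a high-water mark and never decreases, so to actually see the drop you have to watch the live resident set size (via the macOS monitor, or for example psutil, which is not part of my setup and is only shown here as one possible way to read it):

import gc

import psutil  # optional dependency, only used here to read the current RSS

gc.collect()  # manually reclaim PyPDF2's merging garbage
print('RSS after collect: %d MB' % (psutil.Process().memory_info().rss // 2**20))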

Question summary

  • Why does the code above need almost 700 MB of memory to merge 200 MB worth of pdfs? Shouldn't 400 MB + overhead be enough? How do I optimize it?
  • Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?
  • What about this general approach? Is BytesIO suitable to use in this case? merger.write(myio) does seem to run rather slowly, given that everything happens in RAM.

Thank you!

Q: Why does the code above need almost 700 MB of memory to merge 200 MB worth of pdfs? Shouldn't 400 MB + overhead be enough? How do I optimise it?

A: Because .append creates a new stream object for each input, and merger.write(myio) then creates yet another stream object for the output, while the original 200 MB of PDF files is still in memory. That adds up to roughly 3 * 200 MB.
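
One way to drop one of those three copies (my own suggestion, not something the answer above spells out) is to write the merged result straight to a file on disk instead of into a third BytesIO, so only the inputs stay in RAM:

import PyPDF2

def merge_pdfs_to_disk(iterable, out_path='merged.pdf'):
    ''' Same merge as in the question, but the output goes to disk
        instead of a third in-memory buffer. '''
    merger = PyPDF2.PdfFileMerger()
    for pdf_fileobj in iterable:
        merger.append(pdf_fileobj)

    with open(out_path, 'wb') as f:
        merger.write(f)  # only the inputs (~200 MB) remain in RAM
    merger.close()
    return out_path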


Q: Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?

A: It is a known issue in PyPDF2.


Q: What about this general approach? Is BytesIO suitable to use in this case?

A: Considering the memory issues, you might want to try a different approach. Maybe merge the PDFs one by one (or in chunks), temporarily saving the intermediate results to disk and clearing the already merged ones from memory. A sketch of that idea follows.
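
A rough sketch of that idea, under my own assumptions (the chunk size, the temp files and the helper name merge_in_chunks are made up for illustration; buffers is the list of BytesIO objects from the question):

import gc
import os
import tempfile

import PyPDF2

def merge_in_chunks(buffers, chunk_size=50, out_path='merged.pdf'):
    ''' Merge BytesIO PDFs chunk by chunk, spilling intermediate results to
        disk so only one chunk of sources has to stay alive at a time. '''
    tmp_paths = []
    for start in range(0, len(buffers), chunk_size):
        merger = PyPDF2.PdfFileMerger()
        for buf in buffers[start:start + chunk_size]:
            merger.append(buf)

        fd, path = tempfile.mkstemp(suffix='.pdf')
        with os.fdopen(fd, 'wb') as f:
            merger.write(f)
        merger.close()
        tmp_paths.append(path)

        # Drop this chunk's source buffers and reclaim them right away.
        for i in range(start, min(start + chunk_size, len(buffers))):
            buffers[i] = None
        gc.collect()

    # Second pass: merge the disk-backed intermediate files into one PDF.
    final = PyPDF2.PdfFileMerger()
    for path in tmp_paths:
        final.append(path)
    with open(out_path, 'wb') as f:
        final.write(f)
    final.close()

    for path in tmp_paths:
        os.remove(path)
    return out_path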

The PyMuPDF library is also a good replacement for PyPDF2's PdfFileMerger and sidesteps its performance problems.
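
A minimal sketch of that alternative, assuming PyMuPDF is installed (imported as fitz); Document.tobytes() needs a reasonably recent PyMuPDF release, older ones call it write():

import io

import fitz  # PyMuPDF

def merge_with_pymupdf(buffers):
    ''' Merge in-memory PDFs with PyMuPDF instead of PyPDF2. '''
    out = fitz.open()  # new, empty PDF document
    for buf in buffers:
        src = fitz.open(stream=buf.getvalue(), filetype='pdf')
        out.insert_pdf(src)  # append all pages of src to out
        src.close()
    data = out.tobytes()  # serialize the merged PDF to bytes
    out.close()
    return io.BytesIO(data)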
