pyPdf PdfFileReader vs PdfFileWriter

Question

I have the following code:

import os
from pyPdf import PdfFileReader, PdfFileWriter

path = "C:/Real Python/Course materials/Chapter 12/Practice files"

input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))
output_PDF = PdfFileWriter()

for page_num in range(1, 4):
    output_PDF.addPage(input_file.getPage(page_num))

output_file_name = os.path.join(path, "Output/portion.pdf")
output_file = file(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()

Until now I was only reading from PDFs, and later learned to write from a PDF to a .txt file. But now this: why does PdfFileReader differ so much from PdfFileWriter?

Can someone explain this? I would expect something like:

import os
from pyPdf import PdfFileReader, PdfFileWriter

path = "C:/Real Python/Course materials/Chapter 12/Practice files"

input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))

output_file_name = os.path.join(path, "out Pride and Prejudice.pdf")
output_file = PdfFileWriter(file(output_file_name, "wb"))

for page_num in range(1,4):
    page = input_file.getPage(page_num)
    output_file.addPage(page_num)
    output_file.write(page)

Any help? Thanks.

EDIT 0: What does .addPage() do?

for page_num in range(1, 4):
        output_PDF.addPage(input_file.getPage(page_num))

Does it just create 3 blank pages?

EDIT 1: Can someone explain what happens when:

1) output_PDF = PdfFileWriter()

2) output_PDF.addPage(input_file.getPage(page_num))

3) output_PDF.write(output_file)

The 3rd one passes a JUST CREATED(!) object to output_PDF. Why?

Answer 1

The issue is basically the PDF Cross-Reference table.

It's a somewhat tangled spaghetti monster of references to pages, fonts, objects, elements, and these all need to link together to allow for random access.

Each time a file is updated, this table needs to be rebuilt. The file is created in memory first so that the rebuild only has to happen once, which also decreases the chances of torching your file.
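To make the cross-reference point concrete, here is a rough sketch of a toy writer using only the standard library. TinyPdfWriter and add_object are made-up names for illustration, not part of pyPdf, and this is not claimed to be a complete or spec-checked PDF writer. It only shows that each object's byte offset is known once everything is laid out, which is why building in memory and writing once is convenient.

```python
import io


class TinyPdfWriter:
    """Hypothetical toy writer: collects object bodies in memory, serializes once."""

    def __init__(self):
        self._objects = []  # raw object bodies, in object-number order

    def add_object(self, body):
        """Store a body (bytes) and return its 1-based object number."""
        self._objects.append(body)
        return len(self._objects)

    def write(self, stream):
        """Serialize all objects, then the cross-reference table and trailer."""
        out = bytearray(b"%PDF-1.4\n")
        offsets = []
        for number, body in enumerate(self._objects, start=1):
            offsets.append(len(out))  # byte offset is only known at this point
            out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
        xref_start = len(out)
        size = len(self._objects) + 1  # +1 for the mandatory free entry 0
        out += b"xref\n0 %d\n" % size
        out += b"0000000000 65535 f \n"
        for offset in offsets:
            out += b"%010d 00000 n \n" % offset  # each entry is 20 bytes
        out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % size
        out += b"startxref\n%d\n%%%%EOF\n" % xref_start
        stream.write(bytes(out))
        return offsets


writer = TinyPdfWriter()
writer.add_object(b"<< /Type /Catalog /Pages 2 0 R >>")
writer.add_object(b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>")
writer.add_object(b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>")

buffer = io.BytesIO()  # stands in for a file opened with "wb"
offsets = writer.write(buffer)
data = buffer.getvalue()
```

Adding or removing an object shifts every offset after it, so a writer that touched the disk on each change would have to rewrite the table every time.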

output_PDF = PdfFileWriter()

This creates the space in memory for the new PDF to go into (its pages will be pulled from your old PDF).

output_PDF.addPage(input_file.getPage(page_num))

This adds the page you want from your input PDF to the PDF being built in memory.

output_PDF.write(output_file)

Finally, this writes the object stored in memory to a file, building the header and cross-reference table and linking everything together, all hunky-dory.

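Edit: The "JUST CREATED" object is simply the empty output file you opened for writing. It is the destination: write() serializes the in-memory document into it, building the appropriate tables and linking things together as it goes.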
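The three steps can be mimicked with a small stand-in, which may help with the "blank pages" worry. ToyReader and ToyWriter are hypothetical classes that only borrow the method names from the question. They are not pyPdf, and the "pages" are plain strings. This is a sketch of the pattern under those assumptions.

```python
import io


class ToyReader:
    """Stands in for PdfFileReader: holds already-parsed pages."""

    def __init__(self, pages):
        self._pages = list(pages)

    def getPage(self, index):
        return self._pages[index]  # zero-based index into the source pages


class ToyWriter:
    """Stands in for PdfFileWriter: an in-memory list of pages."""

    def __init__(self):
        self.pages = []  # step 1: empty document, nothing on disk yet

    def addPage(self, page):
        self.pages.append(page)  # step 2: keep the existing page, not a blank one

    def write(self, stream):
        # step 3: serialize everything in one pass into the given stream
        for page in self.pages:
            stream.write(page.encode("utf-8") + b"\n")


reader = ToyReader(["title page", "chapter 1", "chapter 2", "chapter 3", "chapter 4"])
writer = ToyWriter()
for page_num in range(1, 4):
    writer.addPage(reader.getPage(page_num))

destination = io.BytesIO()  # plays the role of the freshly opened output file
writer.write(destination)
```

Note that range(1, 4) with a zero-based getPage picks the second through fourth pages, so the first page of the source is skipped.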

--

In response to why this differs from .txt and CSV:

When you're copying from a text or CSV file, there are no existing data structures that have to be understood and carried over to make sure things like formatting, image placement, and form data (input sections, etc.) are preserved and recreated properly.

Answer 2

Most likely, it's done because PDFs aren't exactly linear: the "header" (the cross-reference table and trailer that index the document) is actually at the end of the file.

If the file were written to disk every time a change was made, your computer would need to keep pushing that data around on the disk. Instead, the module (probably) stores the information about the document in an object (PdfFileWriter), and then converts that data into your actual PDF file when you request it.
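As a rough illustration of the index living at the end, the sketch below looks at the tail of some PDF-like bytes for the startxref keyword, which stores the byte offset of the cross-reference table. find_xref_offset is a made-up helper, the sample bytes are hand-built for the demo, and real files (for example ones with several incremental updates) can be more involved.

```python
import re


def find_xref_offset(pdf_bytes, tail_size=1024):
    """Return the byte offset stored after the last 'startxref' keyword."""
    tail = pdf_bytes[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found near the end of the data")
    return int(matches[-1])


# Hand-built sample: one object, then the table, trailer, and end marker.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(body)  # the table starts right after the last object
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
)

found = find_xref_offset(sample)
```

A reader starts from the end, jumps to that offset, and from there can reach any object directly. A writer therefore cannot finish the file until it knows where everything else ended up.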

2 answers

solution1
1 ACCEPTED 2015-02-20 21:32:28

solution2
0 2015-02-17 23:48:31