pyPdf PdfFileReader vs PdfFileWriter

I have the following code:

import os
from pyPdf import PdfFileReader, PdfFileWriter

path = "C:/Real Python/Course materials/Chapter 12/Practice files"

input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))
output_PDF = PdfFileWriter()

for page_num in range(1, 4):
    output_PDF.addPage(input_file.getPage(page_num))

output_file_name = os.path.join(path, "Output/portion.pdf")
output_file = file(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()

Until now I was only reading from PDFs, and later I learned to write from PDF to txt. But now this: why does PdfFileReader differ so much from PdfFileWriter?

Can someone explain this? I would expect something like:

import os
from pyPdf import PdfFileReader, PdfFileWriter

path = "C:/Real Python/Course materials/Chapter 12/Practice files"

input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))

output_file_name = os.path.join(path, "out Pride and Prejudice.pdf")
output_file = PdfFileWriter(file(output_file_name, "wb"))

for page_num in range(1,4):
    page = input_file.getPage(page_num)
    output_file.addPage(page_num)
    output_file.write(page)

Any help? Thanks.

EDIT 0: What does .addPage() do?

for page_num in range(1, 4):
        output_PDF.addPage(input_file.getPage(page_num))

Does it just create 3 BLANK pages?

EDIT 1: Can someone explain what happens when:

1) output_PDF = PdfFileWriter()

2) output_PDF.addPage(input_file.getPage(page_num))

3) output_PDF.write(output_file)

The 3rd one passes a JUST CREATED(!) object to output_PDF. Why?

The issue is basically the PDF cross-reference table.

It's a somewhat tangled spaghetti monster of references to pages, fonts, objects, and elements, and these all need to link together to allow for random access.

Each time a file is updated, this table needs to be rebuilt. The file is created in memory first so that this only has to happen once, which also decreases the chances of torching your file.

output_PDF = PdfFileWriter()

This creates the space in memory for the new PDF, whose pages will be pulled from your old PDF.

output_PDF.addPage(input_file.getPage(page_num))

This adds the page you want from your input PDF to the PDF being built in memory. It is the real page from the input file, not a blank one.

output_PDF.write(output_file)

Finally, this writes the object stored in memory to a file, building the header and the cross-reference table and linking everything together, all hunky-dory.

Edit: The just-created object in step 3 is the empty output file, opened in "wb" mode. write() takes it as the destination, and presumably that call is what makes pyPdf start building the appropriate tables and linking things together.
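
To make the three steps concrete, here is a minimal sketch of the same collect-in-memory, write-once pattern using only the standard library. TinyPdfWriter, add_page and the page layout are hypothetical names invented for illustration. This is not pyPdf's actual implementation, and it assumes plain ASCII page text without parentheses.

```python
import io


class TinyPdfWriter:
    """Toy stand-in for a PDF writer (hypothetical, not pyPdf's code)."""

    def __init__(self):
        # Step 1: nothing touches the disk yet; pages accumulate in memory.
        self._pages = []

    def add_page(self, text):
        # Step 2: remember the page content; still no file involved.
        self._pages.append(text)

    def write(self, stream):
        # Step 3: lay out every object, record where each one starts,
        # then emit the cross-reference table and trailer in a single pass.
        n = len(self._pages)
        kids = " ".join("%d 0 R" % (4 + 2 * i) for i in range(n))
        objects = [
            b"<< /Type /Catalog /Pages 2 0 R >>",
            ("<< /Type /Pages /Kids [%s] /Count %d >>" % (kids, n)).encode("ascii"),
            b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        ]
        for i, text in enumerate(self._pages):
            content = ("BT /F1 24 Tf 72 720 Td (%s) Tj ET" % text).encode("ascii")
            page = ("<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
                    "/Resources << /Font << /F1 3 0 R >> >> "
                    "/Contents %d 0 R >>" % (5 + 2 * i))
            objects.append(page.encode("ascii"))
            header = ("<< /Length %d >>\nstream\n" % len(content)).encode("ascii")
            objects.append(header + content + b"\nendstream")

        out = bytearray(b"%PDF-1.4\n")
        offsets = []
        for number, body in enumerate(objects, start=1):
            offsets.append(len(out))  # byte offset the xref table will point at
            out += ("%d 0 obj\n" % number).encode("ascii") + body + b"\nendobj\n"

        xref_position = len(out)
        out += ("xref\n0 %d\n" % (len(objects) + 1)).encode("ascii")
        out += b"0000000000 65535 f \n"
        for offset in offsets:
            out += ("%010d 00000 n \n" % offset).encode("ascii")
        out += ("trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
                % (len(objects) + 1, xref_position)).encode("ascii")
        stream.write(bytes(out))


writer = TinyPdfWriter()
for text in ["page one", "page two", "page three"]:
    writer.add_page(text)

buffer = io.BytesIO()  # stands in for the freshly opened output file
writer.write(buffer)
pdf_bytes = buffer.getvalue()
```

The byte offsets in the table depend on the final position of every object, which is why the sketch cannot emit the table until all pages are known. The stream handed to write() is nothing more than an empty destination.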

-- -

In response to the question of why this differs from .txt and CSV:

When you're copying from a text or CSV file, there are no existing data structures to comprehend and move in order to make sure that things like formatting, image placement, and form data (input sections, etc.) are preserved and created properly.

Most likely, it's done this way because PDFs aren't exactly linear: the "header" (the trailer that points to the cross-reference table) is actually at the end of the file.

If the file were written to disk every time a change was made, your computer would need to keep pushing that data around on the disk. Instead, the module (probably) stores the information about the document in an object (PdfFileWriter), and then converts that data into your actual PDF file when you request it.
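
By contrast, the write-as-you-go pattern the question expected does work for plain text, because no line refers to any other line. A small sketch, where copy_lines is a hypothetical helper and io.StringIO objects stand in for the real files:

```python
import io


def copy_lines(source, destination, start, stop):
    """Stream lines with index in [start, stop) from source to destination."""
    copied = 0
    for index, line in enumerate(source):
        if index >= stop:
            break
        if index >= start:
            # Each line can be written immediately; nothing has to be
            # revisited or re-linked after later lines arrive.
            destination.write(line)
            copied += 1
    return copied


source = io.StringIO("line 0\nline 1\nline 2\nline 3\nline 4\n")
destination = io.StringIO()
count = copy_lines(source, destination, 1, 4)
```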
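
As a rough illustration of that layout, a reader typically starts from the tail of the file: the startxref keyword near the end gives the byte offset of the cross-reference table. The helper name find_xref_offset and the hand-built sample bytes below are assumptions made for this sketch, and the sample is a fragment rather than a complete, renderable PDF.

```python
import re


def find_xref_offset(pdf_bytes, tail_size=1024):
    """Return the byte offset stored after the last 'startxref' keyword."""
    tail = pdf_bytes[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found near the end of the data")
    return int(matches[-1])


# Hand-built fragment: one object, then the table, then the trailer.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
sample = (
    body
    + b"xref\n0 2\n"
    + b"0000000000 65535 f \n"
    + b"0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n" + str(len(body)).encode("ascii") + b"\n%%EOF\n"
)

xref_offset = find_xref_offset(sample)
```

Because that offset is only known once everything before it has been laid out, a writer has to finish assembling the document before it can write the ending.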
