Looping through a page range in PyPDF PdfFileReader gives strange loops
I have a 4-page PDF that I want to split into separate single-page documents, each named after its page number. I loop over the pages with for page in range(0, pdfReader.numPages), but instead of stopping after the last page the loop keeps going and creates duplicates. I added a print(page) to see what was going on and got:
0 1 2 3 0 0 0 0 0 1 2 3 0 0 0 0
Switching the range to range(1, pdfReader.numPages) makes the loop run over 1, 2, 3 and skips the first page. Using range(0, pdfReader.numPages + 1) produces the correct output files but raises IndexError: list index out of range.
import os, PyPDF2, re, tika, time
from tika import parser

def split_pdf_pages(root_directory, extract_to_folder):
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            if extension == ".pdf":
                fullpath = root + "\\" + basename + extension
                pdfFileObj = open(fullpath, "rb")
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
                for page in range(0, pdfReader.numPages):
                    print(page)
                    pdfWriter = PyPDF2.PdfFileWriter()
                    pageObj = pdfReader.getPage(page)
                    outputpdf = extract_to_folder + "\\" + basename + "-{}.pdf".format(page+1)
                    pdfWriter.addPage(pageObj)
                    with open(outputpdf, "wb") as f:
                        pdfWriter.write(f)
                pdfFileObj.close()
I expect to get files named filename-1.pdf, filename-2.pdf, etc., but instead I get filename-1, filename-1-1, filename-2, filename-2-2, etc., unless I use range(1, pdfReader.numPages), which works correctly but skips the first page! It's driving me mad, please help.
I've finally figured it out (sorry, I'm just a hobbyist coder, so it wasn't evident at first!). The program loops through every PDF in the directory, and that directory also contains the extracted and renamed single-page documents. With range(1, pdfReader.numPages) it ignored all of these newly created documents because each one is only 1 page long, so range(1, 1) is empty. With the range starting at 0 it picked up the newly created files as well and duplicated them.
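The difference between the two ranges can be checked without any PDF library. This is only a small illustration: page_indices is a made-up helper name, and the page counts (4 for the original document, 1 for each extracted file) are taken from the question.

```python
def page_indices(num_pages, start=0):
    """Return the indices that a loop over range(start, num_pages) visits."""
    return list(range(start, num_pages))


# The original 4-page document
print(page_indices(4))           # [0, 1, 2, 3]
print(page_indices(4, start=1))  # [1, 2, 3]  (first page skipped)

# A 1-page file written by an earlier split
print(page_indices(1))           # [0]  (split again, which creates a duplicate)
print(page_indices(1, start=1))  # []   (silently ignored)
```

So starting at 1 only hid the problem: the extracted files were still being opened, they just produced no output.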
All I had to do was move the extracted and renamed files to a different directory, outside the one being scanned. Feels really obvious now that I've done it! I also removed the pdfFileObj = open(fullpath, "rb") line, since PdfFileReader apparently opens the file itself when given a path, and everything works now!
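If the output folder has to stay inside the scanned directory, another option is to tell os.walk to skip it. The following is a rough sketch, not the exact code I ended up with: iter_source_pdfs is a hypothetical helper, it assumes the output folder is a subfolder of (not the same as) root_directory, and it uses the same legacy PyPDF2 names as the question (PdfFileReader, numPages, getPage, addPage), which newer PyPDF2/pypdf releases have renamed.

```python
import os


def iter_source_pdfs(root_directory, extract_to_folder):
    """Yield paths of PDFs under root_directory, skipping extract_to_folder."""
    output_dir = os.path.normcase(os.path.abspath(extract_to_folder))
    for root, dirs, files in os.walk(root_directory):
        # Prune the output folder in place so os.walk never descends into it.
        dirs[:] = [
            d for d in dirs
            if os.path.normcase(os.path.abspath(os.path.join(root, d))) != output_dir
        ]
        for filename in files:
            # Case-insensitive match, unlike the == ".pdf" check in the question.
            if filename.lower().endswith(".pdf"):
                yield os.path.join(root, filename)


def split_pdf_pages(root_directory, extract_to_folder):
    """Write every page of every source PDF to its own file."""
    # Third-party import kept local so the helper above works without PyPDF2.
    import PyPDF2

    os.makedirs(extract_to_folder, exist_ok=True)
    for fullpath in iter_source_pdfs(root_directory, extract_to_folder):
        basename = os.path.splitext(os.path.basename(fullpath))[0]
        with open(fullpath, "rb") as pdf_file:
            reader = PyPDF2.PdfFileReader(pdf_file)
            for page in range(reader.numPages):
                writer = PyPDF2.PdfFileWriter()
                writer.addPage(reader.getPage(page))
                out_name = "{}-{}.pdf".format(basename, page + 1)
                with open(os.path.join(extract_to_folder, out_name), "wb") as out_file:
                    writer.write(out_file)
```

It would be called the same way as the original function, for example split_pdf_pages(r"C:\scans", r"C:\scans\split") (both paths are made up). Using os.path.join instead of gluing "\\" into the path also keeps it working outside Windows.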