
Looping through a page range in PyPDF PdfFileReader gives strange loops

I have a PDF that consists of 4 pages, and I want to split it into separate documents named by page number. I loop over the pages with for page in range(0, pdfReader.numPages), but instead of stopping after the last page the loop keeps going and creates duplicates. I added a print(page) to see what was going on and got:

0 1 2 3 0 0 0 0 0 1 2 3 0 0 0 0

Switching to range(1, pdfReader.numPages) makes the loop print 1, 2, 3 and skips the first page. Using range(0, pdfReader.numPages + 1) produces the correct output files but raises IndexError: list index out of range.

import os, PyPDF2, re, tika, time
from tika import parser

def split_pdf_pages(root_directory, extract_to_folder):
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            basename, extension = os.path.splitext(filename)

            if extension == ".pdf":
                fullpath = root + "\\" + basename + extension
                pdfFileObj = open(fullpath, "rb")
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

                for page in range(0, pdfReader.numPages):
                    print(page)
                    pdfWriter = PyPDF2.PdfFileWriter()
                    pageObj = pdfReader.getPage(page)

                    outputpdf = extract_to_folder + "\\" + basename + "-{}.pdf".format(page+1)
                    pdfWriter.addPage(pageObj)

                    with open(outputpdf, "wb") as f:

                        pdfWriter.write(f)

                pdfFileObj.close()

I expect to get files named filename-1.pdf, filename-2.pdf, etc., but instead I get filename-1, filename-1-1, filename-2, filename2-2, etc., unless I use range(1, pdfReader.numPages), which works correctly but skips the first page! It's driving me mad, please help.

I've finally figured it out (sorry, I'm just a hobbyist coder, so it wasn't evident at first!). The program loops through every PDF in the directory, and that directory also contains the extracted and renamed single-page documents. With range(1, pdfReader.numPages), those newly created documents were ignored because each is only 1 page long, so range(1, 1) is empty. With the range starting at 0, the loop also processed the newly created files and duplicated them.
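To make the range behaviour concrete, here is a small sketch that needs no PDF library. The helper name pages_visited is made up for illustration; it only mirrors the range() call in the loop for a 4-page source file and a 1-page extracted file.

```python
def pages_visited(num_pages, start):
    """Page indices the loop would visit for a PDF with num_pages pages."""
    return list(range(start, num_pages))

# Compare the 4-page source PDF with a 1-page extracted PDF.
for num_pages in (4, 1):
    print(num_pages, pages_visited(num_pages, 0), pages_visited(num_pages, 1))

# Prints:
# 4 [0, 1, 2, 3] [1, 2, 3]
# 1 [0] []
```

Starting at 1 yields an empty range for every single-page file, which is why the extracted files seemed to be left alone, at the cost of dropping page 0 of the real document.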

All I had to do was move the extracted and renamed files to a different directory. It feels really obvious now that I've done it! I also removed the pdfFileObj = open(fullpath, "rb") line, since the reader apparently opens the file itself when given a path, and everything works now!
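If the output folder has to stay inside the tree being walked, one alternative is to skip it during the walk. The following is only a sketch of that idea, not the code I ended up with: find_source_pdfs is a hypothetical helper, and it assumes extract_to_folder is a different folder from root_directory itself.

```python
import os

def find_source_pdfs(root_directory, extract_to_folder):
    """Return paths of PDFs under root_directory, skipping extract_to_folder."""
    output_dir = os.path.abspath(extract_to_folder)
    sources = []
    for root, dirs, files in os.walk(root_directory):
        # Prune the output folder in place so os.walk never descends into it.
        dirs[:] = [d for d in dirs
                   if os.path.abspath(os.path.join(root, d)) != output_dir]
        for filename in files:
            if os.path.splitext(filename)[1].lower() == ".pdf":
                sources.append(os.path.join(root, filename))
    return sources
```

The returned list can then be fed to the page-splitting loop, so that only the original documents are read and the single-page files written to extract_to_folder are never picked up again.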
