Iterating over a page range in PyPDF PdfFileReader produces a strange loop

Question

I have a PDF that consists of 4 pages, and I want to split it into separate documents named after their page numbers. The problem is that I loop through each page with for page in range(0, pdfReader.numPages), but every time the loop should end it keeps going and creates duplicates. I added a print(page) to see what was going on and got:

0 1 2 3 0 0 0 0 0 1 2 3 0 0 0 0

Switching the range to range(1, pdfReader.numPages) makes the loop print 1, 2, 3 and skips the first page. Using range(0, pdfReader.numPages + 1) produces the correct output files but raises IndexError: list index out of range.

import os, PyPDF2, re, tika, time
from tika import parser

def split_pdf_pages(root_directory, extract_to_folder):
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            basename, extension = os.path.splitext(filename)

            if extension == ".pdf":
                fullpath = root + "\\" + basename + extension
                pdfFileObj = open(fullpath, "rb")
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

                for page in range(0, pdfReader.numPages):
                    print(page)
                    pdfWriter = PyPDF2.PdfFileWriter()
                    pageObj = pdfReader.getPage(page)

                    outputpdf = extract_to_folder + "\\" + basename + "-{}.pdf".format(page+1)
                    pdfWriter.addPage(pageObj)

                    with open(outputpdf, "wb") as f:

                        pdfWriter.write(f)

                pdfFileObj.close()

I expect to get files named filename-1.pdf, filename-2.pdf, etc., but instead I get filename-1, filename-1-1, filename-2, filename-2-2, etc., unless I use range(1, pdfReader.numPages), which works correctly but skips the first page! It's driving me mad, please help.

Answer 1

I've finally figured it out (sorry, I'm just a hobbyist coder, so it wasn't evident at first!). The program loops through every PDF in the directory, which also contains the extracted and renamed single-page documents. With range(1, pdfReader.numPages), it ignored all of these newly created documents because each one is only 1 page long, so the range was empty. With the range starting at 0, it picked up the newly created documents and duplicated them.

All I had to do was move the folder for the extracted and renamed files to a different directory, outside the one being walked. Feels really obvious now that I've done it! I also removed pdfFileObj = open(fullpath, "rb"), since the reader apparently opens the file automagically when given the path, and it all works now!
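
If moving the output folder is not convenient, another option is to tell os.walk to skip it. The following is only a minimal sketch of that idea, not the code I actually ran: iter_source_pdfs is a hypothetical helper name, and it assumes the output folder is a subdirectory somewhere under the directory being walked. It uses only the standard library; the page-splitting step from the question stays the same.

```python
import os


def iter_source_pdfs(root_directory, extract_to_folder):
    """Yield paths of PDFs under root_directory, skipping the output folder.

    Hypothetical helper: assumes extract_to_folder is a subdirectory
    somewhere below root_directory.
    """
    output_dir = os.path.abspath(extract_to_folder)
    for root, dirs, files in os.walk(root_directory):
        # Prune the output folder in place so os.walk never descends into it
        # and never sees the single-page files written there.
        dirs[:] = [
            d for d in dirs
            if os.path.abspath(os.path.join(root, d)) != output_dir
        ]
        for filename in sorted(files):
            if filename.lower().endswith(".pdf"):
                yield os.path.join(root, filename)
```

Each yielded path can then be handed to the reader and split page by page as in the question. Building paths with os.path.join also avoids hard-coding "\\" as the separator.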
