Error in my python script produces 2 - 3 times too many jpgs (pdf2image) sometimes, but not always

I am using pdf2image to change pdfs to jpgs in about 1600 folders. I have looked around and adapted code from many SO answers, but this one section seems to be overproducing jpgs in certain folders (hard to tell which).

In one particular case, converting the pdfs with an Adobe Acrobat tool creates 447 jpgs (the correct amount), but my script makes 1059. Looking through the output, I found that some pdf pages are saved as jpgs multiple times and inserted into the page sequences of other pdf files.

For example: PDF A has 1 page and creates PDFA_page_1.jpg. PDF B has 44 pages and creates PDFB_page_1.jpg through ....page_45.jpg because PDF A shows up again as page_10.jpg. If this is confusing, let me know.

I have tried changing the index portion of the loop (specifically, taking the +1 away, using pages instead of page, and storing the file name in a variable rather than building it directly in the .save and .move calls).

I also tried using the fmt='jpg' parameter in pdf2image.py, but was unable to produce the correct naming scheme because I am unsure how to iterate over the page numbers without the for page in pages loop.

for pdf_file in os.listdir(pdf_dir):

    if pdf_file.endswith(".pdf") and pdf_file.startswith("602024"):
        # Convert function from pdf2image
        pages = convert_from_path(pdf_file, 72, output_folder=final_directory)
        print(pages)
        pdf_file = pdf_file[:-4]

        for page in pages:
            # Save with designated naming scheme <pdf file name> + page index
            jpg_name = "%s-page_%d.jpg" % (pdf_file, pages.index(page) + 1)
            page.save(jpg_name, "JPEG")
            # Moves jpg to the mini_jpg folder
            shutil.move(jpg_name, 'mini_jpg')
            # no_Converted += 1

# Delete ppm files
dir_name = final_directory
ppm_remove_list = os.listdir(dir_name)

for ppm_file in ppm_remove_list:
    if ppm_file.endswith(".ppm"):
        os.remove(os.path.join(dir_name, ppm_file))

There are no error messages, just 2 - 3 times as many jpgs as I expected in just SOME cases. Folders with many single-page pdfs do not experience this problem, nor do folders with a single multi-page pdf. Some folders with multiple multi-page pdfs also function correctly.
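On the fmt parameter mentioned above: one possible way to combine it with the naming scheme is to let pdf2image write JPEGs into a scratch folder and then rename them in page order. The following is only a sketch under assumptions: convert_pdf_to_jpgs and jpg_name are hypothetical helper names, and it assumes that convert_from_path with paths_only=True returns the written file paths in page order. It is not claimed to fix the duplicate-page problem.

```python
import os
import shutil
import tempfile


def jpg_name(pdf_stem, page_number):
    """Build the question's naming scheme: <pdf file name>-page_<n>.jpg."""
    return "%s-page_%d.jpg" % (pdf_stem, page_number)


def convert_pdf_to_jpgs(pdf_path, dest_dir, dpi=72):
    """Hypothetical helper: write one JPEG per page of pdf_path into dest_dir."""
    # Imported lazily so jpg_name() stays usable without pdf2image installed.
    from pdf2image import convert_from_path

    pdf_stem = os.path.splitext(os.path.basename(pdf_path))[0]
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    # A separate scratch folder per PDF, removed automatically afterwards.
    with tempfile.TemporaryDirectory() as scratch:
        # Assumption: the returned paths are in page order.
        paths = convert_from_path(
            pdf_path, dpi=dpi, output_folder=scratch,
            fmt="jpeg", paths_only=True,
        )
        for number, path in enumerate(paths, start=1):
            target = os.path.join(dest_dir, jpg_name(pdf_stem, number))
            shutil.move(path, target)
            written.append(target)
    return written
```

A call might look like convert_pdf_to_jpgs(os.path.join(pdf_dir, pdf_file), 'mini_jpg'). Because the files are written as JPEG directly, no .ppm cleanup step is needed in this variant.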

I am not sure I understand how that could happen. If you can create a reproducible example, feel free to open an issue on the official repository: https://github.com/Belval/pdf2image

Please provide example PDFs; otherwise I can't test.
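To narrow down which PDFs to attach, it may help to count the generated jpgs per source file. The sketch below assumes the file names follow the question's <pdf file name>-page_<n>.jpg scheme; summarize_jpgs and JPG_PATTERN are hypothetical names introduced for illustration.

```python
import re
from collections import defaultdict

# Matches the naming scheme used in the question: <stem>-page_<n>.jpg
JPG_PATTERN = re.compile(r"^(?P<stem>.+)-page_(?P<page>\d+)\.jpg$")


def summarize_jpgs(filenames):
    """Return {stem: (file_count, highest_page_number)} for matching names."""
    pages_by_stem = defaultdict(list)
    for name in filenames:
        match = JPG_PATTERN.match(name)
        if match:
            pages_by_stem[match.group("stem")].append(int(match.group("page")))
    # Names that do not match the scheme are ignored.
    return {stem: (len(pages), max(pages)) for stem, pages in pages_by_stem.items()}
```

Running summarize_jpgs(os.listdir('mini_jpg')) and comparing each highest page number with the real page count of that PDF should point to the files that gained extra pages.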

As an aside, instead of pages.index, use for i, page in enumerate(pages); the page number will then be i + 1.
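A small illustration of why enumerate is the safer choice: list.index returns the position of the first element that compares equal, so two equal items get the same number. The strings below are stand-ins for page objects; whether any pages in the asker's PDFs actually compare equal is an assumption, and this effect would produce repeated names rather than extra files, so it is not offered as the explanation for the surplus jpgs.

```python
def names_with_index(stem, pages):
    """Name pages via list.index, as in the question's loop."""
    # list.index finds the FIRST element equal to page, not this page's position.
    return ["%s-page_%d.jpg" % (stem, pages.index(page) + 1) for page in pages]


def names_with_enumerate(stem, pages):
    """Name pages via enumerate, as suggested in the answer."""
    return ["%s-page_%d.jpg" % (stem, i + 1) for i, page in enumerate(pages)]


# Stand-in "pages": the first and third compare equal.
pages = ["blank", "text", "blank"]

print(names_with_index("doc", pages))
# ['doc-page_1.jpg', 'doc-page_2.jpg', 'doc-page_1.jpg']
print(names_with_enumerate("doc", pages))
# ['doc-page_1.jpg', 'doc-page_2.jpg', 'doc-page_3.jpg']
```

enumerate also avoids rescanning the list for every page, which pages.index does on each iteration.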
