Extract pdfs from a directory and output images to a different directory with pdf2image

Question

I'm trying to read in some PDFs located in one directory and output images of their pages to a different directory.

(I'm seeking to learn how this code works and I am hoping there's a cleaner way to specify an output directory for my image files.)

What I've done works, but I think it is just bouncing back and forth between my save directory and my pdf directory.

This doesn't feel like a clean approach. Is there a better option, which preserves the existing code and accomplishes what my added lines do?

import os
from pdf2image import convert_from_path

pdf_dir = r"mydirectorypathwithPDFs"
save_dir = 'mydirectorypathforimages'

os.chdir(pdf_dir)

for pdf_file in os.listdir(pdf_dir):
    os.chdir(pdf_dir) #I added this, change back to the pdf directory
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)
        pdf_file = pdf_file[:-4]
        for page in pages:
            os.chdir(save_dir) #I added this, change to the save directory
            page.save("%s-page%d.jpg" % (pdf_file,pages.index(page)), "JPEG")

The code I slightly modified was created by @photek1944 and found here: https://stackoverflow.com/a/53463015/10216912

Answer 1

This might go a little beyond the scope of exactly what you asked, but anytime someone's looking to streamline code involving os for manipulating paths and files, I always like to recommend Python's pathlib module, because it is awesome. Here's how I personally would implement your program:

from pathlib import Path
from pdf2image import convert_from_path

# Use forward slashes here, even if you're on Windows.
pdf_dir = Path('my/directory/path/with/PDFs')
save_dir = Path('my/directory/path/for/images')

for pdf_file in pdf_dir.glob('*.pdf'):
    pages = convert_from_path(pdf_file, 300)
    for num, page in enumerate(pages, start=1):
        page.save(save_dir / f'{pdf_file.stem}-page{num}.jpg', 'JPEG')

pathlib automatically handles providing the right separator (\ on Windows and / mostly everywhere else), it lets you add onto paths with / as an operator, and it makes searching through a folder particularly convenient with the glob method. It also exposes properties like name (blah.pdf), stem (blah), and suffix (.pdf) to more easily access the parts of the path and file name.
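To make those path parts concrete, here is a small sketch that needs no real files. It uses PurePosixPath so the printed separators are the same on every platform; the file name blah.pdf and the directory names are placeholders, not anything from your setup. Note that pathlib calls the extension suffix.

```python
from pathlib import PurePosixPath

# Placeholder paths for illustration only; nothing is read from disk.
pdf_file = PurePosixPath('my/directory/path/with/PDFs') / 'blah.pdf'
save_dir = PurePosixPath('my/directory/path/for/images')

name = pdf_file.name      # file name with extension
stem = pdf_file.stem      # file name without extension
suffix = pdf_file.suffix  # the extension, including the dot

# The / operator joins a directory and a file name into a new path.
target = save_dir / f'{stem}-page1.jpg'

print(name, stem, suffix)
print(target)
```

In the real script you would keep using Path rather than PurePosixPath, since Path can also touch the file system (glob, exists, and so on).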

I'm also using an f-string for more readable formatting, and enumerate to track the page numbers. (I've set it to start at 1; your original code numbers the first page as 0, because pages.index(page) is zero-based.)
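One more reason to prefer enumerate over pages.index(page): list.index returns the position of the first element that compares equal, so if two pages ever compare equal they would get the same number and the second file would overwrite the first. The sketch below uses plain strings as stand-ins for page images; whether two rendered pages actually compare equal depends on the image objects, so treat that part as an assumption.

```python
# Strings stand in for page images; 'blank' appears twice on purpose.
pages = ['cover', 'blank', 'text', 'blank']

# Numbering via list.index: zero-based, and duplicates share a number.
by_index = [f'doc-page{pages.index(page)}.jpg' for page in pages]

# Numbering via enumerate: one distinct number per position, starting at 1.
by_enumerate = [f'doc-page{num}.jpg' for num, page in enumerate(pages, start=1)]

print(by_index)
print(by_enumerate)
```

enumerate also avoids rescanning the list for every page, which index has to do.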
