pdf2image - convert_from_path returns an empty image for pdfs with colour

Question

I have a collection of PDFs, each containing a scan of an A4 page, and the scans differ in size. I would like to convert each one to an image with a fixed output resolution.

My code to convert to jpg (without resizing):

from pdf2image import convert_from_path

filename_in = 'myfile.pdf'
filename_out = 'myfile.jpg'

# convert_from_path returns a list of PIL images, one per page
jpeg = convert_from_path(filename_in)
# Save the first page as a JPEG
jpeg[0].save(filename_out, 'JPEG')

If the PDF I am trying to convert has any colour in it, the above does not work: the output image is completely white (with non-zero dimensions). Is this a known problem, and does a solution exist?

I am using Python 3.7.3.

I am unable to share the pdf files as they contain private information.

Answer 1

Instead of rendering the PDF pages, you can try extracting the embedded images and correcting their resolution.

Try pdfreader. Here is sample code that extracts all images (both inline and XObject) from a document:

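from pdfreader import SimplePDFViewer, PageDoesNotExist

# your_pdf_file_name is a placeholder for the path to your PDF
fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

images = []
try:
    while True:
        # Render the current page, then collect its inline and XObject images
        viewer.render()
        images.extend(viewer.canvas.inline_images)
        images.extend(viewer.canvas.images.values())
        viewer.next()
except PageDoesNotExist:
    # No more pages left in the document
    pass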

Then you can convert the images to PIL/Pillow objects and save them (or do whatever else you need):

for i, img in enumerate(images):
    img.to_Pillow().save("{}.png".format(i))
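Since the goal is a fixed output resolution, one possible next step is to resize each extracted Pillow image to the pixel size of an A4 page at a chosen DPI before saving. The following is only a minimal sketch: the helper names a4_pixel_size and resize_to_a4 and the default of 300 DPI are assumptions made for illustration, not part of pdfreader or Pillow. It also assumes every scan is a portrait A4 page, so the resize does not visibly distort it.

```python
A4_MM = (210.0, 297.0)  # ISO 216 A4 page, width x height in millimetres
MM_PER_INCH = 25.4


def a4_pixel_size(dpi):
    """Return (width, height) in pixels of a portrait A4 page at `dpi`."""
    return tuple(round(mm / MM_PER_INCH * dpi) for mm in A4_MM)


def resize_to_a4(img, dpi=300):
    """Resize a Pillow image to the A4 pixel size at `dpi`.

    Hypothetical helper: `img` is expected to be a Pillow image (anything
    with a Pillow-style resize() method works). The aspect ratio is not
    preserved, so this assumes the scan is already close to portrait A4.
    """
    return img.resize(a4_pixel_size(dpi))
```

In the loop above, this could be used as resize_to_a4(img.to_Pillow(), dpi=300).save("{}.png".format(i)), so that every output image has the same pixel dimensions regardless of the size of the original scan.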
